Local file recognition

    Available in Classic and VPC

    Recognize a locally stored audio/video file and convert it to text.

    Request

    The request format for the endpoint is as follows.

    Method | URI
    POST | /recognizer/upload

    Request headers

    For headers common to all CLOVA Speech APIs, see Common CLOVA Speech headers.

    Request body

    The following describes the request body.

    Field | Type | Required | Description
    media | File | Required | Local audio/video file
    • Supported file formats
      • Audio: MP3, AAC, AC3, OGG, FLAC, WAV, M4A
      • Video: AVI, MP4, MOV, WMV, FLV, MKV
    params | Object | Required | Parameter details
    params.language | String | Required | Text recognition language
    • ko-KR (default) | en-US | enko | ja | zh-cn | zh-tw
      • ko-KR: Korean
      • en-US: English
      • enko: Korean/English simultaneous recognition
      • ja: Japanese
      • zh-cn: Chinese (Simplified)
      • zh-tw: Chinese (Traditional)
    params.completion | String | Optional | Response method after recognition request
    • sync | async (default)
      • sync: Return results in JSON format
      • async: Return results through the callback URL or resultToObs (Object Storage)
    params.callback | String | Conditional | Callback URL
    • If completion is async, at least one of callback or resultToObs must be specified
    params.wordAlignment | Boolean | Optional | Whether to output speech and text alignment of recognition results
    • true (default) | false
      • true: output
      • false: no output
    params.fullText | Boolean | Optional | Whether to output full recognition result text
    • true (default) | false
      • true: output
      • false: no output
    params.resultToObs | Boolean | Conditional | Whether to save results in Object Storage
    • Applies only if completion is async
    • true | false (default)
      • true: results saved
      • false: results not saved
    params.noiseFiltering | Boolean | Optional | Noise filtering
    • true (default) | false
      • true: filtered
      • false: not filtered
    params.boostings | Array | Optional | Keyword boosting details
    • List of keywords to boost speech recognition for
    • Can't be used concurrently with useDomainBoostings
    • Up to 1000 entries allowed
    • Only available in Korean and English
      • English: text is converted to lowercase by default; keywords requested for boosting are capitalized
    • No boosting for single-syllable words due to risk of misidentification
      • Example: yes, yeah, no
    • Boosting is applied regardless of spacing
      • Example: Request boosting for only one of "CLOVA Speech" and "CLOVASpeech"
    • There is no limit on keyword length, but a multi-word phrase is boosted only when that exact phrase occurs
      • Example: If you boost the keyword "CLOVA Speech," all sentences containing "CLOVA Speech" will be affected by boosting
      • Example: If you boost a long keyword such as "CLOVA Speech's media speech recognition technology," sentences that contain only "CLOVA Speech" are unlikely to be affected by boosting
    params.useDomainBoostings | Boolean | Optional | Whether to use domain boosting
    • true | false (default)
      • true: boosting used
      • false: boosting not used
    • Can't be used concurrently with boostings
    params.forbiddens | String | Optional | Sensitive keywords
    • List of keywords whose recognition rate should be lowered (use for words you don't want to appear in the recognition results)
    • No limit on the number and length of keywords
    • Spaces and capitalization must match exactly
    params.diarization | Object | Optional | Speaker recognition details
    params.diarization.enable | Boolean | Optional | Whether to recognize speaker
    • true (default) | false
      • true: speaker recognized
      • false: speaker not recognized
    sed | Object | Optional | Event detection result details
    sed.enable | Boolean | Optional | Whether to detect events
    • true | false (default)
      • true: event detected
      • false: event not detected
    format | String | Optional | Response result return format
    • JSON (default) | SRT | SMI
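
    Several constraints in the table above interact: completion defaults to async, an async request needs a callback URL or resultToObs, and boostings can't be combined with useDomainBoostings. The following Python sketch checks a params object against those rules before it is sent. The helper name validate_params and the wording of its messages are illustrative assumptions, not part of the CLOVA Speech API.

```python
# Values listed for params.language in the request body table.
ALLOWED_LANGUAGES = {"ko-KR", "en-US", "enko", "ja", "zh-cn", "zh-tw"}


def validate_params(params: dict) -> list[str]:
    """Return a list of problems found in a params object (empty if none)."""
    problems = []

    # Only check the language when it is present; the table lists ko-KR as the default.
    language = params.get("language", "ko-KR")
    if language not in ALLOWED_LANGUAGES:
        problems.append(f"unsupported language: {language}")

    # completion defaults to async according to the table.
    completion = params.get("completion", "async")
    if completion not in ("sync", "async"):
        problems.append(f"unsupported completion: {completion}")

    # An async request needs somewhere to deliver the result.
    if completion == "async":
        has_callback = bool(params.get("callback"))
        to_obs = bool(params.get("resultToObs", False))
        if not (has_callback or to_obs):
            problems.append("async requires callback or resultToObs")

    # Keyword boosting and domain boosting are mutually exclusive.
    if params.get("boostings") and params.get("useDomainBoostings", False):
        problems.append("boostings can't be used with useDomainBoostings")

    return problems
```

    For example, validate_params({"language": "ko-KR"}) reports one problem, because the request falls back to async without a callback URL or resultToObs.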

    params.boostings

    The following describes params.boostings.

    Field | Type | Required | Description
    words | String | Optional | List of keywords to boost
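
    Because boosting is applied regardless of spacing, "CLOVA Speech" and "CLOVASpeech" count as the same keyword and only one of them needs to be requested. The following Python sketch prepares a boostings array with such duplicates removed. It assumes that words holds a comma-separated keyword string, as the "Hello, test" value in the sync response sample suggests; build_boostings is a hypothetical helper.

```python
def build_boostings(keywords: list[str]) -> list[dict]:
    """Build a params.boostings value, dropping keywords that differ only in spacing."""
    seen = set()
    kept = []
    for keyword in keywords:
        # Compare keywords with all whitespace removed.
        key = "".join(keyword.split())
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(keyword.strip())
    # Assumption: one entry whose words field is a comma-separated string.
    return [{"words": ", ".join(kept)}] if kept else []
```

    With the input ["CLOVA Speech", "CLOVASpeech", "test"], the helper keeps the first spelling and returns a single entry whose words value is "CLOVA Speech, test".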
    Note

    When completion is set to async, where the recognition result is returned depends on whether a callback URL is specified and whether resultToObs (Object Storage) is enabled, as follows.

    Callback URL | resultToObs (Object Storage) | Result
    URL address exists | True | Return results to both callback URL and Object Storage
    URL address exists | False | Return results only to the callback URL
    URL address doesn't exist | True | Return results only to Object Storage
    URL address doesn't exist | False | Return an error
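
    The four rows above reduce to a small decision function. The following Python sketch is only an illustration of the table; the function name and the destination labels "callback" and "object_storage" are assumptions introduced here.

```python
def async_result_destinations(callback: str, result_to_obs: bool) -> list[str]:
    """Return where an async recognition result is delivered, per the table above."""
    destinations = []
    if callback:
        destinations.append("callback")
    if result_to_obs:
        destinations.append("object_storage")
    if not destinations:
        # Fourth row: no callback URL and resultToObs is false.
        raise ValueError("async request needs a callback URL or resultToObs")
    return destinations
```

    Running this check on the client side makes the fourth case fail before the file is uploaded rather than after.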

    Request example

    The following is a sample request.

    curl --location --request POST 'https://clovaspeech-gw.ncloud.com/external/v1/8881/5f7e1b4c866f1c605946c9236f9***********/recognizer/upload' \
    --header 'Content-Type: multipart/form-data' \
    --header 'X-CLOVASPEECH-API-KEY: {Secret key issued when registering the app}' \
    --form 'media=@"{media}"' \
    --form 'params="{\"language\":\"ko-KR\",\"completion\":\"sync\",\"callback\":\"\",\"fullText\":true}"' \
    --form 'type="application/json"'
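
    Escaping the params JSON by hand, as in the curl command above, is easy to get wrong. A minimal Python sketch that serializes the same object with the standard json module is shown below; build_params_field is a hypothetical helper, and sending the resulting string as the params form field is left to whichever HTTP client you use.

```python
import json


def build_params_field(language: str, completion: str = "sync", **options) -> str:
    """Serialize the params object into the JSON string sent as the params form field."""
    params = {"language": language, "completion": completion, **options}
    return json.dumps(params)


# Produces a JSON string equivalent to the params value in the curl sample.
field = build_params_field("ko-KR", callback="", fullText=True)
```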
    

    Response

    The following describes the response format.

    Response body

    The following describes the response body.

    Field | Type | Required | Description
    result | String | - | Response code
    message | String | - | Response message
    token | String | - | Result token
    version | String | - | Engine version
    params | Object | - | Parameter details
    params.service | String | - | Service code
    params.domain | String | - | Domain type
    • Used when calling the engine
    • general
    params.lang | String | - | Recognition language
    • ko | en | enko | ja | zh-cn | zh-tw
      • ko: Korean
      • en: English
      • enko: Korean/English simultaneous recognition
      • ja: Japanese
      • zh-cn: Chinese (Simplified)
      • zh-tw: Chinese (Traditional)
    params.completion | String | - | Response method after recognition request
    • sync: Return results in JSON format
    • async: Return results through the callback URL or resultToObs (Object Storage)
    params.callback | String | - | Callback URL
    params.diarization | Object | - | Speaker recognition (separation) details
    params.diarization.enable | Boolean | - | Whether to recognize (separate) speaker
    • true | false
      • true: speaker recognized
      • false: speaker not recognized
    params.diarization.speakerCountMin | Integer | - | Minimum number of speakers
    params.diarization.speakerCountMax | Integer | - | Maximum number of speakers
    params.sed | Object | - | Event detection result
    params.sed.enable | Boolean | - | Whether to detect events
    • true | false (default)
      • true: event detected
      • false: event not detected
    params.boostings | Array | - | Keyword boosting details
    params.forbiddens | String | - | Sensitive keywords
    params.wordAlignment | Boolean | - | Whether to output speech and text alignment of recognition results
    • true (default) | false
      • true: output
      • false: no output
    params.fullText | Boolean | - | Whether to output full recognition result text
    • true (default) | false
      • true: output
      • false: no output
    params.noiseFiltering | Boolean | - | Noise filtering
    • true (default) | false
      • true: filtered
      • false: not filtered
    params.resultToObs | Boolean | - | Whether to save results in Object Storage
    • Applies only if completion is async
    • true | false (default)
      • true: results saved
      • false: results not saved
    params.priority | Integer | - | Priority
    • 0 - 4
    • The lower the number, the higher the priority
    params.userdata | Object | - | User data details
    params.userdata._ncp_DomainCode | String | - | Domain code
    • long-speech | short-speech
      • long-speech: long sentence recognition
      • short-speech: short sentence recognition
    params.userdata._ncp_DomainId | Integer | - | Domain ID
    params.userdata._ncp_TaskId | Integer | - | Task ID
    • Use to track specific recognition tasks
    params.userdata._ncp_TraceId | String | - | Trace ID
    • Use to track logs
    progress | Integer | - | Recognition progress
    segments | Array | - | segments details
    text | String | - | Overall text
    confidence | Double | - | Overall accuracy
    speakers | Array | - | All speaker details
    events | Array | - | Event details
    eventTypes | Array | - | Details of all recognized events

    params.boostings

    The following describes params.boostings.

    Field | Type | Required | Description
    words | String | - | List of keywords to boost

    segments

    The following describes segments.

    Field | Type | Required | Description
    start | Long | - | Analysis start time (ms)
    end | Long | - | Analysis end time (ms)
    text | String | - | Analyzed text
    confidence | Double | - | Analysis accuracy
    • 0.0 - 1.0
    diarization | Object | - | Recognized speaker details
    diarization.label | String | - | Recognized speaker's number
    speaker | Object | - | Changed speaker's details
    speaker.label | String | - | Changed speaker's number
    speaker.name | String | - | Changed speaker's name
    speaker.edited | Boolean | - | Whether speaker is changed
    • true | false (default)
      • true: speaker changed
      • false: speaker same
    words | Array<Long, Long, String> | - | List of recognized words
    words.[0] | Long | - | Segment start time (ms)
    words.[1] | Long | - | Segment end time (ms)
    words.[2] | String | - | Segment text
    textEdited | String | - | Modification details
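
    Each words entry is a positional triple of start time, end time, and text rather than an object, so it helps to give the positions names when reading a response. The following Python sketch flattens the words of all segments into named tuples; Word and iter_words are illustrative names, not part of the API.

```python
from typing import Iterator, NamedTuple


class Word(NamedTuple):
    start_ms: int  # words.[0]
    end_ms: int    # words.[1]
    text: str      # words.[2]


def iter_words(segments: list[dict]) -> Iterator[Word]:
    """Yield every [start, end, text] triple from a list of segments as a Word."""
    for segment in segments:
        # words may be absent, for example when wordAlignment is false.
        for start, end, text in segment.get("words", []):
            yield Word(start, end, text)
```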

    speakers

    The following describes speakers.

    Field | Type | Required | Description
    label | String | - | Numbers of all speakers
    name | String | - | Names of all speakers
    edited | Boolean | - | Whether speaker is changed
    • true | false (default)
      • true: speaker changed
      • false: speaker same

    events

    The following describes events.

    Field | Type | Required | Description
    type | String | - | Event type
    label | String | - | Event name
    labelEdited | String | - | Event change name
    start | Long | - | Event start time
    end | Long | - | Event end time
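
    As one way to use the events array, the following Python sketch totals the detected time per event label. The table does not state the unit of start and end; the sketch assumes they use the same unit as the segment times (milliseconds). event_durations is a hypothetical helper.

```python
from collections import defaultdict


def event_durations(events: list[dict]) -> dict[str, int]:
    """Sum end - start for each event label (unit assumed to match segments)."""
    totals: dict[str, int] = defaultdict(int)
    for event in events:
        totals[event["label"]] += event["end"] - event["start"]
    return dict(totals)
```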

    eventTypes

    The following describes eventTypes.

    Field | Type | Required | Description
    label | String | - | Recognized event

    Response status codes

    For response status codes common to all CLOVA Speech APIs, see Common CLOVA Speech response status codes.

    Response example

    The following are sample responses.

    Request with async and return in JSON

    The following is a sample response requested with async and returned in JSON format.

    {
        "token": "*****f6a1015466bae2c926177f26310",
        "result": "SUCCEEDED",
        "message": "Succeeded"
    }
    

    Request with sync and return in JSON

    The following is a sample response requested with sync and returned in JSON format.

    {
        "result": "COMPLETED",
        "message": "Succeeded",
        "token": "*****166039e486abbb90e4a84c3b3a5",
        "version": "ncp_v2_v2.3.0-aa6cd8d-20231205_231211-3cf30bfc_v0.0.0_",
        "params": {
            "service": "ncp",
            "domain": "general",
            "lang": "enko",
            "completion": "sync",
            "callback": "",
            "diarization": {
                "enable": true,
                "speakerCountMin": -1,
                "speakerCountMax": -1
            },
            "sed": {
                "enable": true
            },
            "boostings": [
                {
                    "words": "Hello, test"
                }
            ],
            "forbiddens": "",
            "wordAlignment": true,
            "fullText": true,
            "noiseFiltering": true,
            "resultToObs": false,
            "priority": 0,
            "userdata": {
                "_ncp_DomainCode": "NEST",
                "_ncp_DomainId": 1,
                "_ncp_TaskId": **442,
                "_ncp_TraceId": "*****ce98ec342d8a8c8fe9191cec343",
                "id": 1
            }
        },
        "progress": 100,
        "keywords": {},
        "segments": [
            {
                "start": 5870,
                "end": 8160,
                "text": "This is the Seoul swimming pool.",
                "confidence": 0.9626975,
                "diarization": {
                    "label": "2"
                },
                "speaker": {
                    "label": "2",
                    "name": "B",
                    "edited": false
                },
                "words": [
                    [
                        5871,
                        6730,
                        "This is the Seoul"
                    ],
                    [
                        6860,
                        7530,
                        "swimming pool."
                    ]
                ],
                "textEdited": "This is the Seoul swimming pool."
            },
            {
                "start": 8160,
                "end": 12950,
                "text": "How much is the entry fee? It's 5000 KRW. Thank you.",
                "confidence": 0.8835926,
                "diarization": {
                    "label": "1"
                },
                "speaker": {
                    "label": "1",
                    "name": "A",
                    "edited": false
                },
                "words": [
                    [
                        8161,
                        9220,
                        "How much is"
                    ],
                    [
                        9390,
                        10020,
                        "the entry fee?"
                    ],
                    [
                        10410,
                        10640,
                        "It's 5000"
                    ],
                    [
                        10710,
                        11140,
                        "KRW."
                    ],
                    [
                        11910,
                        12500,
                        "Thank you."
                    ]
                ],
                "textEdited": "How much is the entry fee? It's 5000 KRW. Thank you."
            }
        ],
        "text": "This is the Seoul swimming pool. How much is the entry fee? It's 5000 KRW. Thank you.",
        "confidence": 0.9071357,
        "speakers": [
            {
                "label": "1",
                "name": "A",
                "edited": false
            },
            {
                "label": "2",
                "name": "B",
                "edited": false
            }
        ],
        "events": [
            {
                "type": "music",
                "label": "music",
                "labelEdited": "music",
                "start": 1400,
                "end": 5000
            }
        ],
        "eventTypes": [
            "music"
        ]
    }
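
    A common use of the sync JSON response is a speaker-labelled transcript. The following Python sketch reads the segments array and emits one line per segment in the "name: text" layout. It assumes the response has already been parsed into a dict; speaker_transcript is an illustrative name.

```python
def speaker_transcript(response: dict) -> list[str]:
    """Return one 'speaker: text' line per segment of a parsed sync JSON response."""
    lines = []
    for segment in response.get("segments", []):
        speaker = segment.get("speaker") or {}
        # Prefer the speaker name, fall back to the label, then to a placeholder.
        name = speaker.get("name") or speaker.get("label") or "?"
        lines.append(f"{name}: {segment.get('text', '')}")
    return lines
```

    Applied to the sample above, the first line would be "B: This is the Seoul swimming pool.".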
    

    Request with sync and return in SRT

    The following is a sample response requested with sync and returned in SRT format.

    1
    00:00:00,000 --> 00:00:01,425
    A: Not long ago,
    
    2
    00:00:02,533 --> 00:00:11,550
    A: I had some corn. It was really sweet and delicious, but I thought it was the name of a neighborhood.
    
    3
    00:00:11,550 --> 00:00:19,025
    A: I didn't know it was "cho" from "Chosaier" and "dang" which meant sweet. I didn't know. I thought chodang was the same word used for Chodang tofu.
    
    4
    00:00:19,025 --> 00:00:26,317
    C: You thought of saccharin, a bit. You had it super sweet.
    
    5
    00:00:26,317 --> 00:00:28,240
    A: Is it corn?
    
    6
    00:00:28,240 --> 00:00:35,318
    B: Where can you find sweet tofu? This do doesn't understand. Isn't Sangdo in the Chodang area?
    
    7
    00:00:35,318 --> 00:00:42,800
    A: No, Chodang corn meant super sweet. No one has understood right now.
    

    Request with sync and return in SMI

    The following is a sample response requested with sync and returned in SMI format.

    <SAMI>
    <Body>
      <SYNC Start=0>
        <P>A: Not long ago,
      <SYNC Start=2533>
        <P>A: I had some corn. It was really sweet and delicious, but I thought it was the name of a neighborhood.
      <SYNC Start=11550>
        <P>A: I didn't know it was "cho" from "Chosaier" and "dang" which meant sweet. I didn't know. I thought chodang was the same word used for Chodang tofu.
      <SYNC Start=19025>
        <P>C: You thought of saccharin, a bit. You had it super sweet.
      <SYNC Start=26317>
        <P>A: Is it corn?
      <SYNC Start=28240>
        <P>B: Where can you find sweet tofu? This do doesn't understand. Isn't Sangdo in the Chodang area?
      <SYNC Start=35318>
        <P>A: No, Chodang corn meant super sweet. No one has understood right now.
    </Body>
    </SAMI>
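
    The SMI response carries only a start offset per caption. The following Python sketch extracts (start, text) pairs from the layout shown above with a regular expression. It assumes every SYNC tag is followed by a single P line, as in the sample; parse_smi is a hypothetical helper and not a general SAMI parser.

```python
import re

# Matches a SYNC tag, the P tag that follows it, and the text up to the next tag.
_SYNC = re.compile(r"<SYNC\s+Start=(\d+)>\s*<P>([^<]*)", re.IGNORECASE)


def parse_smi(body: str) -> list[tuple[int, str]]:
    """Return (start offset, caption text) pairs from an SMI body."""
    return [(int(start), text.strip()) for start, text in _SYNC.findall(body)]
```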
    
