Local file recognition

Available in Classic and VPC

Recognize a locally stored audio/video file and convert it to text.

Request

This section describes the request format. The method and URI are as follows:

Method URI
POST /recognizer/upload

Request headers

For information about the headers common to all CLOVA Speech APIs, see Common CLOVA Speech headers.

Request body

You can include the following data in the body of your request:

Field Type Required Description
media File Required Local audio/video file
  • Supported file formats
    • Audio: MP3, AAC, AC3, OGG, FLAC, WAV, M4A
    • Video: AVI, MP4, MOV, WMV, FLV, MKV
params Object Required Parameter details
params.language String Required Text recognition language
  • ko-KR (default) | en-US | enko | ja | zh-cn | zh-tw
    • ko-KR: Korean
    • en-US: English
    • enko: Korean/English simultaneous recognition
    • ja: Japanese
    • zh-cn: Chinese (Simplified)
    • zh-tw: Chinese (Traditional)
params.completion String Optional Response method after recognition request
  • sync | async (default)
    • sync: Return results in JSON format.
    • async: Return results through the callback URL or Object Storage (resultToObs).
params.callback String Conditional Callback URL
  • If completion is async, either callback or resultToObs must be entered.
params.wordAlignment Boolean Optional Whether to output speech and text alignment of recognition results
  • true (default) | false
    • true: output
    • false: no output
params.fullText Boolean Optional Whether to output full recognition result text
  • true (default) | false
    • true: output
    • false: no output
params.resultToObs Boolean Conditional Whether to save results in Object Storage
  • Applies only if completion is async.
  • true | false (default)
    • true: results saved
    • false: results not saved
params.noiseFiltering Boolean Optional Noise filtering
  • true (default) | false
    • true: filtered
    • false: not filtered
params.boostings Array Optional Keyword boosting details
  • List of keywords to boost speech recognition for
  • Can't be used concurrently with useDomainBoostings.
  • Up to 1000 entries allowed
  • Only available in Korean and English
    • English: lowercase conversion by default, capitalize keywords requested for boosting
  • No boosting for single-syllable words due to risk of misidentification
    • Example: yes, yeah, no
  • Boosting is applied regardless of spacing.
    • Example: Request boosting for only one of "CLOVA Speech" and "CLOVASpeech"; the two are treated as the same keyword.
  • There is no restriction on keyword length, but if the phrase to be boosted is a combination of multiple words, it will not be affected by boosting unless it is that exact phrase.
    • Example: If you boost the keyword "CLOVA Speech," all sentences containing "CLOVA Speech" will be affected by boosting.
    • Example: If you boost a long keyword such as "CLOVA Speech's media speech recognition technology," sentences that contain only "CLOVA Speech" are unlikely to be affected by boosting.
params.useDomainBoostings Boolean Optional Whether to use domain boosting
  • true | false (default)
    • true: boosting used
    • false: boosting not used
  • Can't be used concurrently with boostings.
params.forbiddens String Optional Sensitive keywords
  • List of keywords whose recognition rate should be lowered (keywords you don't want to appear in the recognition results)
  • No limit on the number and length of keywords
  • Spacing and capitalization must match exactly.
params.diarization Object Optional Speaker recognition details
params.diarization.enable Boolean Optional Whether to recognize speaker
  • true (default) | false
    • true: speaker recognized
    • false: speaker not recognized
sed Object Optional Event detection result details
sed.enable Boolean Optional Whether to detect events
  • true | false (default)
    • true: event detected
    • false: event not detected
format String Optional Response result return format
  • JSON (default) | SRT | SMI
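
The constraints in the table above can be checked on the client before a request is sent. The following is a minimal sketch in Python, not part of the CLOVA Speech API: the helper name build_params and its argument names are hypothetical, and it covers only the language list, the boostings/useDomainBoostings exclusion, and the 1000-entry limit.

```python
SUPPORTED_LANGUAGES = {"ko-KR", "en-US", "enko", "ja", "zh-cn", "zh-tw"}


def build_params(language="ko-KR", completion="async", boostings=None,
                 use_domain_boostings=False, **extra):
    """Return the params object as a dict (hypothetical helper).

    Only constraints stated in the request body table are checked.
    """
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    if completion not in ("sync", "async"):
        raise ValueError("completion must be 'sync' or 'async'")
    if boostings and use_domain_boostings:
        raise ValueError("boostings can't be used with useDomainBoostings")
    if boostings and len(boostings) > 1000:
        raise ValueError("boostings allows up to 1000 entries")

    params = {"language": language, "completion": completion}
    if boostings:
        params["boostings"] = boostings
    if use_domain_boostings:
        params["useDomainBoostings"] = True
    # Pass other documented fields through unchanged,
    # for example wordAlignment, fullText, or noiseFiltering.
    params.update(extra)
    return params
```

For example, build_params("en-US", completion="sync", boostings=[{"words": "CLOVA Speech"}]) returns a dict that can be serialized with json.dumps and sent as the params form field.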

params.boostings

The following describes params.boostings.

Field Type Required Description
words String Optional List of words to keyword boost
Note

When completion is set to async, the recognition result is delivered as follows, depending on whether a callback URL is provided and whether resultToObs (Object Storage) is enabled.

Callback URL resultToObs(ObjectStorage) Result
URL address exists. True Return results to both callback URL and Object Storage.
URL address exists. False Return results only to the callback URL.
URL address doesn't exist. True Return results only to Object Storage.
URL address doesn't exist. False Return an error.
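
The table above can be expressed as a small client-side check. This is only a sketch: the function name delivery_targets and the labels "callback" and "object_storage" are hypothetical, and the service itself remains the authority on which combinations are rejected.

```python
def delivery_targets(callback_url, result_to_obs):
    """Return where an async result is delivered, per the table above."""
    targets = []
    if callback_url:
        targets.append("callback")
    if result_to_obs:
        targets.append("object_storage")
    if not targets:
        # Mirrors the last row of the table: neither destination is set.
        raise ValueError("async needs a callback URL or resultToObs set to true")
    return targets
```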

Request example

The request example is as follows:

curl --location --request POST 'https://clovaspeech-gw.ncloud.com/external/v1/8881/5f7e1b4c866f1c605946c9236f9***********/recognizer/upload' \
--header 'Content-Type: multipart/form-data' \
--header 'X-CLOVASPEECH-API-KEY: {Secret key issued when registering the app}' \
--form 'media=@"{media}"' \
--form 'params="{\"language\":\"ko-KR\",\"completion\":\"sync\",\"callback\":\"\",\"fullText\":true}"' \
--form 'type="application/json"'
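
The same request can be sketched in Python using only the standard library. The names encode_multipart and recognize_upload are hypothetical, and invoke_url stands for the part of the URL before /recognizer/upload. Attaching Content-Type: application/json to the params part, instead of sending the separate type field shown in the curl example, is an assumption about what the service accepts.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(params, media_name, media_bytes):
    """Encode the params and media fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="params"\r\n'
        "Content-Type: application/json\r\n\r\n"
        f"{json.dumps(params)}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="media"; filename="{media_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    content_type = f"multipart/form-data; boundary={boundary}"
    return head + media_bytes + tail, content_type


def recognize_upload(invoke_url, secret_key, media_path, params):
    """POST a local file to /recognizer/upload and return the raw response text."""
    with open(media_path, "rb") as media_file:
        body, content_type = encode_multipart(
            params, os.path.basename(media_path), media_file.read()
        )
    request = urllib.request.Request(
        invoke_url + "/recognizer/upload",
        data=body,
        method="POST",
        headers={
            "Content-Type": content_type,
            "X-CLOVASPEECH-API-KEY": secret_key,
        },
    )
    with urllib.request.urlopen(request) as response:
        # Returned as text because the body may be JSON, SRT, or SMI.
        return response.read().decode("utf-8")
```

The response is returned as text rather than parsed, since the format field determines whether the body is JSON, SRT, or SMI.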

Response

This section describes the response format.

Response body

The response body includes the following data:

Field Type Required Description
result String - Response code
message String - Response message
token String - Result token
version String - Engine version
params Object - Parameter details
params.service String - Service code
params.domain String - Domain type
  • Use when calling the engine.
  • general
params.lang String - Recognition language
  • ko | en | enko | ja | zh-cn | zh-tw
    • ko: Korean
    • en: English
    • enko: Korean/English simultaneous recognition
    • ja: Japanese
    • zh-cn: Chinese (Simplified)
    • zh-tw: Chinese (Traditional)
params.completion String - Response method after recognition request
  • sync: Return results in JSON format
  • async: Return results through the callback URL or Object Storage (resultToObs).
params.callback String - Callback URL
params.diarization Object - Speaker recognition (separation) details
params.diarization.enable Boolean - Whether to recognize (separate) speaker
  • true | false
    • true: speaker recognized
    • false: speaker not recognized
params.diarization.speakerCountMin Integer - Minimum number of speakers
params.diarization.speakerCountMax Integer - Maximum number of speakers
params.sed Object - Event detection result
params.sed.enable Boolean - Whether to detect events
  • true | false (default)
    • true: event detected
    • false: event not detected
params.boostings Array - Keyword boosting details
params.forbiddens String - Sensitive keywords
params.wordAlignment Boolean - Whether to output speech and text alignment of recognition results
  • true (default) | false
    • true: output
    • false: no output
params.fullText Boolean - Whether to output full recognition result text
  • true (default) | false
    • true: output
    • false: no output
params.noiseFiltering Boolean - Noise filtering
  • true (default) | false
    • true: filtered
    • false: not filtered
params.resultToObs Boolean - Whether to save results in Object Storage
  • Applies only if completion is async.
  • true | false (default)
    • true: results saved
    • false: results not saved
params.priority Integer - Priority
  • 0-4
  • The lower the number, the higher the priority
params.userdata Object - User data details
params.userdata._ncp_DomainCode String - Domain code
  • long-speech | short-speech
    • long-speech: long sentence recognition
    • short-speech: short sentence recognition
params.userdata._ncp_DomainId Integer - Domain ID
params.userdata._ncp_TaskId Integer - Task ID
  • Use to track specific recognition tasks.
params.userdata._ncp_TraceId String - Trace ID
  • Use to track logs.
progress Integer - Recognition progress
segments Array - Segment details
text String - Overall text
confidence Double - Overall accuracy
speakers Array - All speaker details
events Array - Event details
eventTypes Array - Details of all recognized events

params.boostings

The following describes params.boostings.

Field Type Required Description
words String - List of words to keyword boost

segments

The following describes segments.

Field Type Required Description
start Long - Analysis start time (ms)
end Long - Analysis end time (ms)
text String - Analyzed text
confidence Double - Analysis accuracy
  • 0.0-1.0
diarization Object - Recognized speaker details
diarization.label String - Recognized speaker's number
speaker Object - Changed speaker's details
speaker.label String - Changed speaker's number
speaker.name String - Changed speaker's name
speaker.edited Boolean - Whether speaker is changed
  • true | false (default)
    • true: speaker changed
    • false: speaker same
words Array<Long, Long, String> - List of recognized words
words.[0] Long - Segment start time (ms)
words.[1] Long - Segment end time (ms)
words.[2] String - Segment text
textEdited String - Modification details
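
Each entry in words is a three-element array rather than an object, so it can help to give the positions names on the client. The following is a minimal sketch; the Word type and the segment_words function are hypothetical names, not part of the API.

```python
from typing import NamedTuple


class Word(NamedTuple):
    start_ms: int  # words.[0]
    end_ms: int    # words.[1]
    text: str      # words.[2]


def segment_words(segment):
    """Convert a segment's words arrays into named tuples."""
    # words may be missing, for example when wordAlignment is false.
    return [Word(*entry) for entry in segment.get("words", [])]
```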

speakers

The following describes speakers.

Field Type Required Description
label String - Numbers of all speakers
name String - Names of all speakers
edited Boolean - Whether speaker is changed
  • true | false (default)
    • true: speaker changed
    • false: speaker same

events

The following describes events.

Field Type Required Description
type String - Event type
label String - Event name
labelEdited String - Event change name
start Long - Event start time
end Long - Event end time

eventTypes

The following describes eventTypes.

Field Type Required Description
label String - Recognized event

Response status codes

For information about the HTTP status codes common to all CLOVA Speech APIs, see Common CLOVA Speech response status codes.

Response example

The response example is as follows:

Request with async and return in JSON

The following is a sample response requested with async and returned in JSON format.

{
    "token": "*****f6a1015466bae2c926177f26310",
    "result": "SUCCEEDED",
    "message": "Succeeded"
}

Request with sync and return in JSON

The following is a sample response requested with sync and returned in JSON format.

{
    "result": "COMPLETED",
    "message": "Succeeded",
    "token": "*****166039e486abbb90e4a84c3b3a5",
    "version": "ncp_v2_v2.3.0-aa6cd8d-20231205_231211-3cf30bfc_v0.0.0_",
    "params": {
        "service": "ncp",
        "domain": "general",
        "lang": "enko",
        "completion": "sync",
        "callback": "",
        "diarization": {
            "enable": true,
            "speakerCountMin": -1,
            "speakerCountMax": -1
        },
        "sed": {
            "enable": true
        },
        "boostings": [
            {
                "words": "Hello, test",
                "weight": 1
            }
        ],
        "forbiddens": "",
        "wordAlignment": true,
        "fullText": true,
        "noiseFiltering": true,
        "resultToObs": false,
        "priority": 0,
        "userdata": {
            "_ncp_DomainCode": "NEST",
            "_ncp_DomainId": 1,
            "_ncp_TaskId": **442,
            "_ncp_TraceId": "*****ce98ec342d8a8c8fe9191cec343",
            "id": 1
        }
    },
    "progress": 100,
    "keywords": {},
    "segments": [
        {
            "start": 5870,
            "end": 8160,
            "text": "This is the Seoul swimming pool.",
            "confidence": 0.9626975,
            "diarization": {
                "label": "2"
            },
            "speaker": {
                "label": "2",
                "name": "B",
                "edited": false
            },
            "words": [
                [
                    5871,
                    6730,
                    "This is the Seoul"
                ],
                [
                    6860,
                    7530,
                    "swimming pool."
                ]
            ],
            "textEdited": "This is the Seoul swimming pool."
        },
        {
            "start": 8160,
            "end": 12950,
            "text": "How much is the entry fee? It's 5000 KRW. Thank you.",
            "confidence": 0.8835926,
            "diarization": {
                "label": "1"
            },
            "speaker": {
                "label": "1",
                "name": "A",
                "edited": false
            },
            "words": [
                [
                    8161,
                    9220,
                    "How much is"
                ],
                [
                    9390,
                    10020,
                    "the entry fee?"
                ],
                [
                    10410,
                    10640,
                    "It's 5000"
                ],
                [
                    10710,
                    11140,
                    "KRW."
                ],
                [
                    11910,
                    12500,
                    "Thank you."
                ]
            ],
            "textEdited": "How much is the entry fee? It's 5000 KRW. Thank you."
        }
    ],
    "text": "This is the Seoul swimming pool. How much is the entry fee? It's 5000 KRW. Thank you.",
    "confidence": 0.9071357,
    "speakers": [
        {
            "label": "1",
            "name": "A",
            "edited": false
        },
        {
            "label": "2",
            "name": "B",
            "edited": false
        }
    ],
    "events": [
        {
            "type": "music",
            "label": "music",
            "labelEdited": "music",
            "start": 1400,
            "end": 5000
        }
    ],
    "eventTypes": [
        "music"
    ]
}
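
A response like the one above can be reduced to a speaker-labeled transcript. This sketch assumes the response has already been parsed into a dict; the function name transcript_lines and the "?" fallback for segments without speaker data are hypothetical choices.

```python
def transcript_lines(response):
    """Return one "name: text" line per segment of a parsed sync response."""
    lines = []
    for segment in response.get("segments", []):
        # speaker may be absent if diarization is disabled.
        speaker = segment.get("speaker") or {}
        name = speaker.get("name") or speaker.get("label") or "?"
        lines.append(f"{name}: {segment['text']}")
    return lines
```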

Request with sync and return in SRT

The following is a sample response requested with sync and returned in SRT format.

1
00:00:00,000 --> 00:00:01,425
A: Not long ago,

2
00:00:02,533 --> 00:00:11,550
A: I had some corn. It was really sweet and delicious, but I thought it was the name of a neighborhood.

3
00:00:11,550 --> 00:00:19,025
A: I didn't know it was "cho" from "Chosaier" and "dang" which meant sweet. I didn't know. I thought chodang was the same word used for Chodang tofu.

4
00:00:19,025 --> 00:00:26,317
C: You thought of saccharin, a bit. You had it super sweet.

5
00:00:26,317 --> 00:00:28,240
A: Is it corn?

6
00:00:28,240 --> 00:00:35,318
B: Where can you find sweet tofu? This do doesn't understand. Isn't Sangdo in the Chodang area?

7
00:00:35,318 --> 00:00:42,800
A: No, Chodang corn meant super sweet. No one has understood right now.
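
If only the JSON result is available, cues in a similar layout can be derived from segments on the client. The following sketch approximates the layout shown above and is not the service's own converter; the function names format_srt_time and segments_to_srt are hypothetical.

```python
def format_srt_time(ms):
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT text with one numbered cue per segment."""
    cues = []
    for index, segment in enumerate(segments, start=1):
        speaker = segment.get("speaker") or {}
        name = speaker.get("name")
        text = f"{name}: {segment['text']}" if name else segment["text"]
        start = format_srt_time(segment["start"])
        end = format_srt_time(segment["end"])
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)
```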

Request with sync and return in SMI

The following is a sample response requested with sync and returned in SMI format.

<SAMI>
<Body>
  <SYNC Start=0>
    <P>A: Not long ago,
  <SYNC Start=2533>
    <P>A: I had some corn. It was really sweet and delicious, but I thought it was the name of a neighborhood.
  <SYNC Start=11550>
    <P>A: I didn't know it was "cho" from "Chosaier" and "dang" which meant sweet. I didn't know. I thought chodang was the same word used for Chodang tofu.
  <SYNC Start=19025>
    <P>C: You thought of saccharin, a bit. You had it super sweet.
  <SYNC Start=26317>
    <P>A: Is it corn?
  <SYNC Start=28240>
    <P>B: Where can you find sweet tofu? This do doesn't understand. Isn't Sangdo in the Chodang area?
  <SYNC Start=35318>
    <P>A: No, Chodang corn meant super sweet. No one has understood right now.
</Body>
</SAMI>