Local file recognition

Available in Classic and VPC

Recognize a locally stored audio/video file and convert it to text.

Request

The request format for the endpoint is as follows.

Method	URI
POST	/recognizer/upload

Request headers

For headers common to all CLOVA Speech APIs, see Common CLOVA Speech headers.

Request body

The following describes the request body.

Field	Type	Required	Description
`media`	File	Required	Local audio/video file Supported file formats Audio: `MP3`, `AAC`, `AC3`, `OGG`, `FLAC`, `WAV`, `M4A` Video: `AVI`, `MP4`, `MOV`, `WMV`, `FLV`, `MKV`
`params`	Object	Required	Parameter details
`params.language`	String	Required	Recognition language `ko-KR` (default) \| `en-US` \| `enko` \| `ja` \| `zh-cn` \| `zh-tw` `ko-KR`: Korean `en-US`: English `enko`: Korean/English simultaneous recognition `ja`: Japanese `zh-cn`: Chinese (Simplified) `zh-tw`: Chinese (Traditional)
`params.completion`	String	Optional	Response method after recognition request `sync` \| `async` (default) `sync`: Return results in JSON format `async`: Return in the form of callback URL or `resultToObs` (ObjectStorage)
`params.callback`	String	Conditional	Callback URL If `completion` is `async`, either `callback` or `resultToObs` must be entered
`params.wordAlignment`	Boolean	Optional	Whether to output speech and text alignment of recognition results `true` (default) \| `false` `true`: output `false`: no output
`params.fullText`	Boolean	Optional	Whether to output full recognition result text `true` (default) \| `false` `true`: output `false`: no output
`params.resultToObs`	Boolean	Conditional	Whether to save results in Object Storage Applies only if `completion` is `async` `true` \| `false` (default) `true`: results saved `false`: results not saved
`params.noiseFiltering`	Boolean	Optional	Noise filtering `true` (default) \| `false` `true`: filtered `false`: not filtered
`params.boostings`	Array	Optional	Keyword boosting details: list of keywords whose recognition should be boosted. Can't be used together with `useDomainBoostings`. Up to 1000 entries allowed. Only available for Korean and English. English: lowercase conversion by default, capitalize keywords requested for boosting. Single-syllable words are not boosted because of the risk of misrecognition (e.g., `yes`, `yeah`, `no`). Boosting is applied regardless of spacing (e.g., request boosting for only one of "CLOVA Speech" and "CLOVASpeech"). Keyword length is not restricted, but a multi-word phrase takes effect only when that exact phrase occurs: boosting the keyword "CLOVA Speech" affects every sentence containing "CLOVA Speech", whereas boosting a long keyword such as "CLOVA Speech's media speech recognition technology" is unlikely to affect sentences that contain only "CLOVA Speech".
`params.useDomainBoostings`	Boolean	Optional	Whether to use domain boosting `true` \| `false` (default) `true`: boosting used `false`: boosting not used Can't be used concurrently with `boostings`
`params.forbiddens`	String	Optional	Sensitive keywords List of keywords whose recognition rate should be reduced (use for words you don't want to appear in the recognition results) No limit on the number or length of keywords Spaces and capitalization must match exactly
`params.diarization`	Object	Optional	Speaker recognition details
`params.diarization.enable`	Boolean	Optional	Whether to recognize speaker `true` (default) \| `false` `true`: speaker recognized `false`: speaker not recognized
`sed`	Object	Optional	Event detection result details
`sed.enable`	Boolean	Optional	Whether to detect events `true` \| `false` (default) `true`: event detected `false`: event not detected
`format`	String	Optional	Response result return format `JSON` (default) \| `SRT` \| `SMI`

params.boostings

The following describes params.boostings.

Field	Type	Required	Description
`words`	String	Optional	List of words to keyword boost
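
As a rough illustration of how the request body fields fit together, the following Python sketch assembles the JSON string for the `params` form field. The helper name `build_params` and its arguments are hypothetical. Joining the keywords into one comma-separated `words` string is an assumption based on the response sample (`"words": "Hello, test"`), not a documented rule.

```python
import json

# Language codes listed for params.language in the request body table.
SUPPORTED_LANGUAGES = {"ko-KR", "en-US", "enko", "ja", "zh-cn", "zh-tw"}


def build_params(language="ko-KR", completion="sync",
                 boosting_words=None, use_domain_boostings=False):
    """Return a JSON string for the `params` form field (hypothetical helper)."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    if completion not in ("sync", "async"):
        raise ValueError("completion must be 'sync' or 'async'")
    # The table states that boostings and useDomainBoostings
    # can't be used concurrently.
    if boosting_words and use_domain_boostings:
        raise ValueError("boostings can't be combined with useDomainBoostings")

    params = {"language": language, "completion": completion}
    if boosting_words:
        # Assumption: keywords are sent as one comma-separated string.
        params["boostings"] = [{"words": ", ".join(boosting_words)}]
    if use_domain_boostings:
        params["useDomainBoostings"] = True
    return json.dumps(params, ensure_ascii=False)
```

For example, `build_params("en-US", boosting_words=["CLOVA Speech", "test"])` yields a string that can be passed as the `params` form field of the upload request.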

Note

When `completion` is set to `async`, the recognition result is delivered as follows, depending on whether a callback URL is entered and whether `resultToObs` (Object Storage) is enabled.

Callback URL	resultToObs(ObjectStorage)	Result
URL address exists	True	Return results to both callback URL and Object Storage
URL address exists	False	Return results only to the callback URL
URL address doesn't exist	True	Return results only to Object Storage
URL address doesn't exist	False	Return an error
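
The delivery rules in the table can be checked on the client before sending an `async` request. The sketch below is a minimal illustration; the function name `result_destinations` and the destination labels are hypothetical, and the server remains the authority on what is accepted.

```python
def result_destinations(callback, result_to_obs):
    """Return where an async result would be delivered, per the table above.

    Raises ValueError for the combination the table marks as an error.
    """
    has_callback = bool(callback and callback.strip())
    if has_callback and result_to_obs:
        return ["callback", "object_storage"]
    if has_callback:
        return ["callback"]
    if result_to_obs:
        return ["object_storage"]
    raise ValueError(
        "async requests need a callback URL or resultToObs set to true"
    )
```

Calling this before building the request turns the fourth row of the table into a local error instead of a failed API call.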

Request example

The following is a sample request.

curl --location --request POST 'https://clovaspeech-gw.ncloud.com/external/v1/8881/5f7e1b4c866f1c605946c9236f9***********/recognizer/upload' \
--header 'Content-Type: multipart/form-data' \
--header 'X-CLOVASPEECH-API-KEY: {Secret key issued when registering the app}' \
--form 'media=@"{media}"' \
--form 'params="{\"language\":\"ko-KR\",\"completion\":\"sync\",\"callback\":\"\",\"fullText\":true}"' \
--form 'type="application/json"'
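
For readers who prefer Python to curl, the sketch below builds an equivalent multipart request using only the standard library. It mirrors the three form fields of the curl sample (`media`, `params`, `type`). The function name `build_upload_request` is hypothetical, and the invoke URL and secret key must be replaced with the values issued for your own app. The request is only constructed here, not sent.

```python
import json
import urllib.request
import uuid


def build_upload_request(invoke_url, secret_key, media_name, media_bytes, params):
    """Build (but do not send) a multipart POST to /recognizer/upload."""
    boundary = uuid.uuid4().hex
    parts = []

    def add_field(name, value):
        # Plain text form field.
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )

    def add_file(name, filename, data):
        # Binary file field; the content type is a generic assumption.
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + data + b"\r\n")

    add_file("media", media_name, media_bytes)
    add_field("params", json.dumps(params))
    add_field("type", "application/json")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        url=invoke_url.rstrip("/") + "/recognizer/upload",
        data=b"".join(parts),
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "X-CLOVASPEECH-API-KEY": secret_key,
        },
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and decode the response body according to the requested `format`.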

Response

The following describes the response format.

Response body

The following describes the response body.

Field	Type	Required	Description
`result`	String	-	Response code
`message`	String	-	Response message
`token`	String	-	Result token
`version`	String	-	Engine version
`params`	Object	-	Parameter details
`params.service`	String	-	Service code
`params.domain`	String	-	Domain type used when calling the engine `general`
`params.lang`	String	-	Recognition language `ko` \| `en` \| `enko` \| `ja` \| `zh-cn` \| `zh-tw` `ko`: Korean `en`: English `enko`: Korean/English simultaneous recognition `ja`: Japanese `zh-cn`: Chinese (Simplified) `zh-tw`: Chinese (Traditional)
`params.completion`	String	-	Response method after recognition request `sync`: Return results in JSON format `async`: Return in the form of callback URL or `resultToObs` (ObjectStorage)
`params.callback`	String	-	Callback URL
`params.diarization`	Object	-	Speaker recognition (separation) details
`params.diarization.enable`	Boolean	-	Whether to recognize (separate) speaker `true` \| `false` `true`: speaker recognized `false`: speaker not recognized
`params.diarization.speakerCountMin`	Integer	-	Minimum number of speakers
`params.diarization.speakerCountMax`	Integer	-	Maximum number of speakers
`params.sed`	Object	-	Event detection result
`params.sed.enable`	Boolean	-	Whether to detect events `true` \| `false` (default) `true`: event detected `false`: event not detected
`params.boostings`	Array	-	Keyword boosting details For more information, see `boostings` of Request body
`params.forbiddens`	String	-	Sensitive keywords For more information, see `forbiddens` of Request body
`params.wordAlignment`	Boolean	-	Whether to output speech and text alignment of recognition results `true` (default) \| `false` `true`: output `false`: no output
`params.fullText`	Boolean	-	Whether to output full recognition result text `true` (default) \| `false` `true`: output `false`: no output
`params.noiseFiltering`	Boolean	-	Noise filtering `true` (default) \| `false` `true`: filtered `false`: not filtered
`params.resultToObs`	Boolean	-	Whether to save results in Object Storage Applies only if `completion` is `async` `true` \| `false` (default) `true`: results saved `false`: results not saved
`params.priority`	Integer	-	Priority 0 - 4 The lower the number, the higher the priority
`params.userdata`	Object	-	User data details
`params.userdata._ncp_DomainCode`	String	-	Domain code `long-speech` \| `short-speech` `long-speech`: long sentence recognition `short-speech`: short sentence recognition
`params.userdata._ncp_DomainId`	Integer	-	Domain ID
`params.userdata._ncp_TaskId`	Integer	-	Task ID Use to track specific recognition tasks
`params.userdata._ncp_TraceId`	String	-	Trace ID Use to track logs
`progress`	Integer	-	Recognition progress
`segments`	Array	-	Segment details
`text`	String	-	Overall text
`confidence`	Double	-	Overall accuracy
`speakers`	Array	-	All speaker details
`events`	Array	-	Event details
`eventTypes`	Array	-	Details of all recognized events

params.boostings

The following describes params.boostings.

Field	Type	Required	Description
`words`	String	-	List of words to keyword boost

segments

The following describes segments.

Field	Type	Required	Description
`start`	Long	-	Analysis start time (ms)
`end`	Long	-	Analysis end time (ms)
`text`	String	-	Analyzed text
`confidence`	Double	-	Analysis accuracy 0.0 - 1.0
`diarization`	Object	-	Recognized speaker details
`diarization.label`	String	-	Recognized speaker's number
`speaker`	Object	-	Changed speaker's details
`speaker.label`	String	-	Changed speaker's number
`speaker.name`	String	-	Changed speaker's name
`speaker.edited`	Boolean	-	Whether speaker is changed `true` \| `false` (default) `true`: speaker changed `false`: speaker same
`words`	Array<Long, Long, String>	-	List of recognized words
`words.[0]`	Long	-	Segment start time (ms)
`words.[1]`	Long	-	Segment end time (ms)
`words.[2]`	String	-	Segment text
`textEdited`	String	-	Modification details
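
To show how these fields can be consumed, the sketch below walks the `segments` array of a `sync` JSON response and unpacks each `words` triple into start time, end time, and text. The helper names `speaker_lines` and `word_timings` are hypothetical. Falling back from `speaker.name` to `diarization.label` is a design choice of this example, not documented behavior.

```python
def speaker_lines(response):
    """Return one 'time range + speaker + text' line per segment."""
    lines = []
    for seg in response.get("segments", []):
        # Prefer the speaker name; fall back to the diarization label.
        speaker = (seg.get("speaker", {}).get("name")
                   or seg.get("diarization", {}).get("label", "?"))
        lines.append(f"[{seg['start']}-{seg['end']} ms] {speaker}: {seg['text']}")
    return lines


def word_timings(segment):
    """Convert each [start, end, text] triple in `words` into a dict."""
    return [
        {"start": start, "end": end, "text": text}
        for start, end, text in segment.get("words", [])
    ]
```

With the first segment of the `sync` JSON sample, `speaker_lines` produces `[5870-8160 ms] B: This is the Seoul swimming pool.`.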

speakers

The following describes speakers.

Field	Type	Required	Description
`label`	String	-	Numbers of all speakers
`name`	String	-	Names of all speakers
`edited`	Boolean	-	Whether speaker is changed `true` \| `false` (default) `true`: speaker changed `false`: speaker same

events

The following describes events.

Field	Type	Required	Description
`type`	String	-	Event type
`label`	String	-	Event name
`labelEdited`	String	-	Event change name
`start`	Long	-	Event start time
`end`	Long	-	Event end time

eventTypes

The following describes eventTypes.

Field	Type	Required	Description
`label`	String	-	Recognized event

Response status codes

For response status codes common to all CLOVA Speech APIs, see Common CLOVA Speech response status codes.

Response example

The following are sample responses.

Request with `async` and return in JSON

The following is a sample response requested with async and returned in JSON format.

{
    "token": "*****f6a1015466bae2c926177f26310",
    "result": "SUCCEEDED",
    "message": "Succeeded"
}

Request with `sync` and return in JSON

The following is a sample response requested with sync and returned in JSON format.

{
    "result": "COMPLETED",
    "message": "Succeeded",
    "token": "*****166039e486abbb90e4a84c3b3a5",
    "version": "ncp_v2_v2.3.0-aa6cd8d-20231205_231211-3cf30bfc_v0.0.0_",
    "params": {
        "service": "ncp",
        "domain": "general",
        "lang": "enko",
        "completion": "sync",
        "callback": "",
        "diarization": {
            "enable": true,
            "speakerCountMin": -1,
            "speakerCountMax": -1
        },
        "sed": {
            "enable": true
        },
        "boostings": [
            {
                "words": "Hello, test"
            }
        ],
        "forbiddens": "",
        "wordAlignment": true,
        "fullText": true,
        "noiseFiltering": true,
        "resultToObs": false,
        "priority": 0,
        "userdata": {
            "_ncp_DomainCode": "NEST",
            "_ncp_DomainId": 1,
            "_ncp_TaskId": **442,
            "_ncp_TraceId": "*****ce98ec342d8a8c8fe9191cec343",
            "id": 1
        }
    },
    "progress": 100,
    "keywords": {},
    "segments": [
        {
            "start": 5870,
            "end": 8160,
            "text": "This is the Seoul swimming pool.",
            "confidence": 0.9626975,
            "diarization": {
                "label": "2"
            },
            "speaker": {
                "label": "2",
                "name": "B",
                "edited": false
            },
            "words": [
                [
                    5871,
                    6730,
                    "This is the Seoul"
                ],
                [
                    6860,
                    7530,
                    "swimming pool."
                ]
            ],
            "textEdited": "This is the Seoul swimming pool."
        },
        {
            "start": 8160,
            "end": 12950,
            "text": "How much is the entry fee? It's 5000 KRW. Thank you.",
            "confidence": 0.8835926,
            "diarization": {
                "label": "1"
            },
            "speaker": {
                "label": "1",
                "name": "A",
                "edited": false
            },
            "words": [
                [
                    8161,
                    9220,
                    "How much is"
                ],
                [
                    9390,
                    10020,
                    "the entry fee?"
                ],
                [
                    10410,
                    10640,
                    "It's 5000"
                ],
                [
                    10710,
                    11140,
                    "KRW."
                ],
                [
                    11910,
                    12500,
                    "Thank you."
                ]
            ],
            "textEdited": "How much is the entry fee? It's 5000 KRW. Thank you."
        }
    ],
    "text": "This is the Seoul swimming pool. How much is the entry fee? It's 5000 KRW. Thank you.",
    "confidence": 0.9071357,
    "speakers": [
        {
            "label": "1",
            "name": "A",
            "edited": false
        },
        {
            "label": "2",
            "name": "B",
            "edited": false
        }
    ],
    "events": [
        {
            "type": "music",
            "label": "music",
            "labelEdited": "music",
            "start": 1400,
            "end": 5000
        }
    ],
    "eventTypes": [
        "music"
    ]
}

Request with `sync` and return in SRT

The following is a sample response requested with sync and returned in SRT format.

1
00:00:00,000 --> 00:00:01,425
A: Not long ago,

2
00:00:02,533 --> 00:00:11,550
A: I had some corn. It was really sweet and delicious, but I thought it was the name of a neighborhood.

3
00:00:11,550 --> 00:00:19,025
A: I didn't know it was "cho" from "Chosaier" and "dang" which meant sweet. I didn't know. I thought chodang was the same word used for Chodang tofu.

4
00:00:19,025 --> 00:00:26,317
C: You thought of saccharin, a bit. You had it super sweet.

5
00:00:26,317 --> 00:00:28,240
A: Is it corn?

6
00:00:28,240 --> 00:00:35,318
B: Where can you find sweet tofu? This do doesn't understand. Isn't Sangdo in the Chodang area?

7
00:00:35,318 --> 00:00:42,800
A: No, Chodang corn meant super sweet. No one has understood right now.
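
If you already hold a JSON response and need SRT-style output locally, the millisecond `start` and `end` values of each segment map onto SRT timestamps as sketched below. The helper names are hypothetical, and whether the service builds its own SRT output from segments in exactly this way is an assumption.

```python
def srt_timestamp(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render segments as numbered SRT blocks with a speaker-name prefix."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        name = seg.get("speaker", {}).get("name", "")
        prefix = f"{name}: " if name else ""
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{prefix}{seg['text']}"
        )
    return "\n\n".join(blocks) + "\n"
```

For instance, `srt_timestamp(1425)` returns `00:00:01,425`, which matches the end time of the first subtitle in the sample above.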

Request with `sync` and return in SMI

The following is a sample response requested with sync and returned in SMI format.

<SAMI>
<Body>
  <SYNC Start=0>
    <P>A: Not long ago,
  <SYNC Start=2533>
    <P>A: I had some corn. It was really sweet and delicious, but I thought it was the name of a neighborhood.
  <SYNC Start=11550>
    <P>A: I didn't know it was "cho" from "Chosaier" and "dang" which meant sweet. I didn't know. I thought chodang was the same word used for Chodang tofu.
  <SYNC Start=19025>
    <P>C: You thought of saccharin, a bit. You had it super sweet.
  <SYNC Start=26317>
    <P>A: Is it corn?
  <SYNC Start=28240>
    <P>B: Where can you find sweet tofu? This do doesn't understand. Isn't Sangdo in the Chodang area?
  <SYNC Start=35318>
    <P>A: No, Chodang corn meant super sweet. No one has understood right now.
</Body>
</SAMI>
