Live streaming recognition

Available in Classic and VPC

Recognize and convert real-time speech data in PCM format (headerless WAV files) at 16 kHz, 1 channel, 16 bits per sample into text. Access is available only via the gRPC protocol.
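
As a rough pre-flight check, the sketch below uses Python's standard wave module to confirm that a WAV file matches the required format (16 kHz, 1 channel, 16 bits per sample) and to strip its header, leaving raw PCM. The helper name load_pcm and its error messages are illustrative assumptions, not part of the API.

```python
import io
import wave

# Format required by the service: 16 kHz, mono, 16-bit samples
SAMPLE_RATE = 16000
CHANNELS = 1
SAMPLE_WIDTH_BYTES = 2


def load_pcm(wav_file):
    """Return headerless PCM bytes from a WAV path or file object.

    Hypothetical helper: raises ValueError if the format does not match.
    """
    with wave.open(wav_file, "rb") as wav:
        if wav.getframerate() != SAMPLE_RATE:
            raise ValueError(f"expected {SAMPLE_RATE} Hz, got {wav.getframerate()} Hz")
        if wav.getnchannels() != CHANNELS:
            raise ValueError(f"expected {CHANNELS} channel, got {wav.getnchannels()}")
        if wav.getsampwidth() != SAMPLE_WIDTH_BYTES:
            raise ValueError("expected 16 bits per sample")
        return wav.readframes(wav.getnframes())


# Build a tiny in-memory WAV (160 frames = 10 ms of silence) to exercise the helper
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(CHANNELS)
    out.setsampwidth(SAMPLE_WIDTH_BYTES)
    out.setframerate(SAMPLE_RATE)
    out.writeframes(b"\x00\x00" * 160)
buffer.seek(0)
pcm = load_pcm(buffer)
```

The returned bytes contain only sample data, which is the headerless form described above.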

Request

This section describes the request format. The host and port are as follows:

Host Port
clovaspeech-gw.ncloud.com 50051

Request order

To request recognition via gRPC:

  1. Preparation (install the protoc compiler and generate the gRPC code)
  2. Authorization
  3. Config JSON
  4. Recognize
Note

This guide is based on the Rocky Linux environment.

1. Install and prepare protoc compiler

In preparation for using the API, install the protoc compiler by referring to the gRPC site. After installation, select your preferred language (Python or Java), then compile the 'nest.proto' file, which defines the API interface.
To install the protoc compiler and generate gRPC code:

  1. Connect remotely to the server where you want to install the protoc compiler.

  2. Install the packages and plugins for using gRPC.

    • Rocky Linux: Python

      # Check the latest status
      sudo dnf update
      
      # Install Python: Install Python on the Linux server.
      sudo dnf install python3
      
      # Install and upgrade pip: pip is a package installer for Python.
      sudo dnf install python3-pip
      pip3 install --upgrade pip
      
      # Install grpcio-tools: Install "grpcio-tools" using pip.
      pip3 install grpcio-tools
      
      # Create nest.proto file, then enter the contents shown in step 3 before compiling
      touch nest.proto
      
      # Compile nest.proto file with protoc compiler
      python3 -m grpc_tools.protoc -I=. --python_out=. --grpc_python_out=. nest.proto
      
    • Rocky Linux: Java

      # Download protoc-gen-grpc-java plugin (check https://github.com/grpc/grpc-java/releases for version number)
      curl -OL https://repo1.maven.org/maven2/io/grpc/protoc-gen-grpc-java/1.66.0/protoc-gen-grpc-java-1.66.0-linux-x86_64.exe
      
      # Move the plugin to a directory on the PATH (requires root privileges)
      sudo mv protoc-gen-grpc-java-1.66.0-linux-x86_64.exe /usr/local/bin/protoc-gen-grpc-java
      
      # Grant execute permission
      sudo chmod +x /usr/local/bin/protoc-gen-grpc-java
      
      # Confirm installation
      protoc-gen-grpc-java --version
      
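      # Create nest.proto file, then enter the contents shown in step 3 before compiling
      touch nest.proto
      
      # Compile nest.proto file with protoc compiler (the output directory must already exist)
      mkdir -p output/directory
      protoc --proto_path=. --java_out=output/directory --grpc-java_out=output/directory nest.proto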
      
  3. Open the 'nest.proto' file, enter the following code, and generate the gRPC code.

    syntax = "proto3";
    option java_multiple_files = true;
    package com.nbp.cdncp.nest.grpc.proto.v1;
    
    enum RequestType {
      CONFIG = 0;
      DATA = 1;
    }
    
    message NestConfig {
      string config = 1;
    }
    
    message NestData {
      bytes chunk = 1;
      string extra_contents = 2;
    }
    message NestRequest {
      RequestType type = 1;
      oneof part {
        NestConfig config = 2;
        NestData data = 3;
      }
    }
    
    message NestResponse {
      string contents = 1;
    }
    service NestService {
      rpc recognize(stream NestRequest) returns (stream NestResponse){};
    }
    

2. Authorization

After completing gRPC code generation via the protoc compiler, proceed with authorization. Authorization involves including a bearer token in the Authorization header during API calls to verify the client's integrity with the server. Note the following when performing authorization.

  • Set up the gRPC channel and generate the stub, which is the client-side proxy defined in nest_pb2_grpc.
  • After generating the stub, call the recognize method with metadata that contains the authentication key.
    • The real-time streaming recognition API is not supported on the Free plan; it is only available on the Basic long sentence recognition plan.

Authentication header

The authorization header is as follows:

Header name Description
Authorization Bearer ${secretKey}

Authorization order

The authorization method is as follows:

Python

To authorize using Python in a Rocky Linux environment:

  1. Create a Python file. Here, the file name is specified as main.py.
    touch main.py
    
  2. Add the following content to main.py.
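    import grpc
    import json

    import nest_pb2
    import nest_pb2_grpc

    # Placeholder: the secret key can be checked in the long sentence recognition domain
    secretKey = "YOUR_SECRET_KEY"

    def generate_requests():
        # The first message of the stream must carry the config JSON (see 3. Config JSON)
        yield nest_pb2.NestRequest(
            type=nest_pb2.RequestType.CONFIG,
            config=nest_pb2.NestConfig(config=json.dumps({"transcription": {"language": "ko"}}))
        )

    channel = grpc.secure_channel(
            'clovaspeech-gw.ncloud.com:50051',
            grpc.ssl_channel_credentials()
    )
    client = nest_pb2_grpc.NestServiceStub(channel)  # Stub class from the generated module
    metadata = (("authorization", f"Bearer {secretKey}"),)  # The key must be lowercase "authorization"
    # recognize is a bidirectional streaming RPC: pass an iterator of NestRequest messages
    responses = client.recognize(generate_requests(), metadata=metadata)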
    

Java

To authorize using Java in a Rocky Linux environment:

  1. Create a Java file. Here, the file name is specified as main.java.
    touch main.java
    
  2. Add the following content to main.java.
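    // Imports assume the grpc-netty and grpc-stub artifacts are on the classpath
    import io.grpc.ManagedChannel;
    import io.grpc.Metadata;
    import io.grpc.netty.NettyChannelBuilder;
    import io.grpc.stub.MetadataUtils;

    ManagedChannel channel = NettyChannelBuilder
                .forTarget("clovaspeech-gw.ncloud.com:50051")
                .useTransportSecurity()
                .build();
    NestServiceGrpc.NestServiceStub client = NestServiceGrpc.newStub(channel);
    Metadata metadata = new Metadata();
    // secretKey is a String variable holding your secret key
    metadata.put(Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER),
                 "Bearer " + secretKey);
    // attachHeaders is deprecated in recent grpc-java releases, so attach the headers with an interceptor
    client = client.withInterceptors(MetadataUtils.newAttachHeadersInterceptor(metadata));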
    

3. Config JSON

This section describes the config JSON, which is sent to the streaming endpoint in a NestRequest object (the class is defined in the protoc-generated nest_pb2 module). The config JSON must be sent in the first call to the real-time streaming recognition API.
The config JSON provides the following fields.

  • transcription: Set speech recognition language
  • keywordBoosting: Set to boost the recognition rate for entered words
  • forbidden: Set banned words
  • semanticEpd: Set criteria for generating speech recognition results
  • translation: Set target language, response reception method, etc. during translation

Request body

The following describes the request body for the config JSON.

Transcription

The following describes Transcription.

Field Type Required Description
language String Required Language code for speech recognition target
  • ko | en | ja
    • ko: Korean
    • en: English
    • ja: Japanese
Note

Transcription is not a required input field, but we recommend setting it up for clear speech recognition.

Keyword Boosting

The following describes the Keyword Boosting field.

Field Type Required Description
keywordBoosting Object Optional Keyword boosting information
  • Increase the recognition rate for pre-registered keywords.
keywordBoosting.boostings Array Optional Keyword boosting word details: boostings

boostings

The following describes boostings.

Field Type Required Description
words String Optional Keyword boosting word list
  • When multiple entries are provided, separate them with commas (,).
    • e.g., "words": "test,test1,test2"
  • Include spaces before and after the word.
weight Float Optional Keyword boosting weight
  • 0-5.0
    • When the weight is 0, boosting is not applied.
  • All keywords have the same weight.
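
The rules above (comma-separated words, one shared weight between 0 and 5.0) can be sketched as a small builder. The function name make_boosting and its validation behavior are assumptions for illustration only; the service performs its own validation.

```python
def make_boosting(words, weight):
    """Build one entry of keywordBoosting.boostings (hypothetical helper)."""
    if not 0 <= weight <= 5.0:
        raise ValueError("weight must be between 0 and 5.0")
    for word in words:
        # A comma inside a word would be read as a separator
        if "," in word:
            raise ValueError(f"word must not contain a comma: {word!r}")
    # All words in one entry share the same weight
    return {"words": ",".join(words), "weight": weight}


boosting = make_boosting(["test", "test1", "test2"], 1)
```

Words that need a different weight would go into a separate entry of the boostings array.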

Forbidden

The following describes the Forbidden field.

Field Type Required Description
forbidden Object Optional Banned word information
  • Decrease the recognition rate for pre-registered keywords.
forbidden.forbiddens String Optional Banned word list
  • When multiple entries are provided, separate them with commas (,).
    • e.g., "forbiddens": "Banned word 1, banned word 2"
  • Include spaces before and after the word.
  • Banned word tag: <forbidden>Banned word</forbidden>
    • Add only to the value of the text key in the recognition result.
    • Added tags have no effect on the recognition results' position, periodPosition, and alignInfo.
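
Because the banned word tag is added only to the text value, a client may want to separate the tagged words from the plain text. The following is a minimal sketch using a regular expression; the function name split_forbidden is hypothetical.

```python
import re

# Matches the banned word tag described above and captures the word inside it
FORBIDDEN_TAG = re.compile(r"<forbidden>(.*?)</forbidden>")


def split_forbidden(text):
    """Return (text without tags, list of banned words found)."""
    banned = FORBIDDEN_TAG.findall(text)
    plain = FORBIDDEN_TAG.sub(r"\1", text)
    return plain, banned


plain, banned = split_forbidden("This is a <forbidden>bad</forbidden> word.")
```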

SemanticEPD

The following describes the Semantic EPD field.

Field Type Required Description
semanticEpd Object Optional Generation criteria settings information for the speech recognition results
semanticEpd.skipEmptyText Boolean Optional Whether to transmit results with no recognition output. If this setting is set to true, results with no recognized syllables will not be transmitted.
  • true | false (default)
    • true: Do not transmit
    • false: Transmit
semanticEpd.useWordEpd Boolean Optional Whether to generate recognition results ending with a word. Setting this to true generates recognition results that end with a word.
  • true | false (default)
    • true: Generate
    • false: Do not generate
semanticEpd.usePeriodEpd Boolean Optional Whether to generate recognition results ending with punctuation. Setting this to true generates recognition results ending with punctuation.
  • true | false (default)
    • true: Generate
    • false: Do not generate
  • To improve punctuation recognition accuracy, when usePeriodEpd is true, also set useWordEpd to true.
semanticEpd.gapThreshold Integer Optional Silence duration threshold (ms, milliseconds) for generating recognition results. Recognition results are generated when silence exceeding gapThreshold occurs.
  • The default value is 0. It is unused if the user does not set it or sets a value less than or equal to 0. It can be set in milliseconds.
semanticEpd.durationThreshold Integer Optional Duration threshold (ms, milliseconds) for generating recognition results. Generate recognition results so that the duration is less than the durationThreshold value.
  • The default value is 0. If the user does not set it separately or sets a value less than or equal to 0, the default value is used. We recommend setting it directly in milliseconds to generate recognition results of an appropriate length.
semanticEpd.syllableThreshold Integer Optional Number of syllables used to generate recognition results. Generate recognition results such that the number of syllables composing them is less than the syllableThreshold value.
  • Spaces (" ") and periods (".") are also treated as one syllable.
  • The default value is 0. This setting is unused if the user does not set it or sets a value of 0 or lower.
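
One way to keep these settings consistent is to build the semanticEpd object in code. The sketch below follows the recommendation to enable useWordEpd whenever usePeriodEpd is true and leaves out thresholds that are 0 or lower. The function name and parameter names are assumptions, and omitting a threshold is assumed to be equivalent to sending 0.

```python
def make_semantic_epd(use_period_epd=False, use_word_epd=False,
                      skip_empty_text=False, gap_ms=0,
                      duration_ms=0, syllables=0):
    """Build the semanticEpd object of the config JSON (hypothetical helper)."""
    epd = {
        "skipEmptyText": skip_empty_text,
        # Recommended above: when usePeriodEpd is true, also set useWordEpd to true
        "useWordEpd": use_word_epd or use_period_epd,
        "usePeriodEpd": use_period_epd,
    }
    thresholds = {
        "gapThreshold": gap_ms,
        "durationThreshold": duration_ms,
        "syllableThreshold": syllables,
    }
    for key, value in thresholds.items():
        # Only positive thresholds are included in the object
        if value > 0:
            epd[key] = int(value)
    return epd


epd = make_semantic_epd(use_period_epd=True, gap_ms=2000)
```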

Translation

The following describes the Translation field.

Field Type Required Description
translation.targets string Required Enter the language code for the language you want to translate.
translation.mergedResult Boolean Optional Setting to receive recognition results and translation results as a single response
  • true | false (default)
  • When set to true, translation results are generated using the EPD settings in semanticEpd that produce the recognition results.
    • In this case, the EPD settings under translation (usePeriodEpd, gapThreshold, durationThreshold, syllableThreshold) are ignored and do not affect the generation of translation results.
translation.gapThreshold Integer Optional Silence duration threshold (ms, milliseconds) for generating recognition results. Recognition results are generated when silence exceeding gapThreshold occurs.
  • The default value is 2000. It is unused if the user does not set it or sets a value less than or equal to 0. It can be set in milliseconds.
translation.durationThreshold Integer Optional Duration threshold (ms, milliseconds) for generating recognition results. Generate recognition results so that the duration is less than the durationThreshold value.
  • The default value is 20000. If the user sets a value less than or equal to 0, the default value is used. We recommend setting it directly in milliseconds to generate recognition results of an appropriate length.
translation.honorific Boolean Optional Whether to apply honorifics
  • true | false (default)
    • true: Apply honorifics
    • false: No honorifics
  • English ⇒ Korean, Japanese ⇒ Korean, Chinese (Simplified/Traditional) ⇒ Korean, Korean ⇒ Japanese, English ⇒ Japanese, Chinese (Simplified/Traditional) ⇒ Japanese
translation.glossaryKey String Optional Glossary ID
  • Apply substitution translation based on glossary data.
  • Korean ⇔ English, Japanese, Chinese (Simplified/Traditional), French | English ⇔ Japanese, Chinese (Simplified/Traditional), Vietnamese, Thai, Indonesian, French | Japanese ⇔ Chinese (Simplified/Traditional)
  • honorific is not applied to terms within the glossary.

Request example

The following is a sample request for the config JSON.

# Semantic EPD
{
  "semanticEpd": {
    "skipEmptyText": false,
    "useWordEpd": false,
    "usePeriodEpd": true,
    "gapThreshold": 2000,
    "durationThreshold": 20000,
    "syllableThreshold": 0
  }
}
# Translation info / EPD
{
  "translation": {
    "targets": ["en"],
    "mergedResult": false,
    "gapThreshold": 2000,
    "durationThreshold": 20000,
    "honorific": false,
    "glossaryKey": "{glossaryKey}"
  }
}
# Keyword boosting and forbidden
{
  "keywordBoosting": {
    "boostings": [
      {
        "words": "test,test1,test2",
        "weight": 1
      },
      {
        "words": "Test, test 1, test 2",
        "weight": 0.5
      }
    ]
  },
  "forbidden": {
    "forbiddens": "Banned word 1, banned word 2"
  }
}
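
In Python, the config JSON can be assembled as a dictionary and serialized with the standard json module, which avoids hand-written JSON mistakes such as trailing commas. This is a minimal sketch: the function name build_config_json is hypothetical, and the nesting of language under a "transcription" key is an assumption based on the field tables above.

```python
import json


def build_config_json(language="ko", semantic_epd=None,
                      keyword_boosting=None, forbidden=None):
    """Serialize the config JSON string (hypothetical helper)."""
    # Assumption: the language code is nested under "transcription"
    config = {"transcription": {"language": language}}
    if semantic_epd is not None:
        config["semanticEpd"] = semantic_epd
    if keyword_boosting is not None:
        config["keywordBoosting"] = keyword_boosting
    if forbidden is not None:
        config["forbidden"] = forbidden
    # ensure_ascii=False keeps Korean or Japanese keywords readable
    return json.dumps(config, ensure_ascii=False)


config_json = build_config_json(
    language="en",
    semantic_epd={"usePeriodEpd": True, "useWordEpd": True},
    forbidden={"forbiddens": "Banned word 1, banned word 2"},
)
```

The resulting string would be placed in the config field of NestConfig, as defined in nest.proto.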

Response body

The following describes the response body for the config JSON.

Field Type Required Description
uid String - UID
responseType Array<String> - Response type
  • transcription | keywordBoosting | Forbidden | semanticEpd
config Object - Config JSON information
config.status String - Config JSON request status
  • Success | Failure | ${message}
    • Success: Request successful (gRPC configuration saved successfully)
    • Failure: Request failure (See Error message.)
    • ${message}: top_level_key can be omitted.
config.keywordBoosting Object - Keyword boosting information
config.keywordBoosting.status String - Keyword boosting request status
  • Success | Failure | ${message}
    • Success: Request successful (gRPC configuration saved successfully)
    • Failure: Request failure (See Error message.)
    • ${message}: top_level_key can be omitted.
config.forbidden Object - Banned word information
config.forbidden.status String - Banned word request status
  • Success | Failure | ${message}
    • Success: Request successful (gRPC configuration saved successfully)
    • Failure: Request failure (See Error message.)
    • ${message}: top_level_key can be omitted.
config.semanticEpd Object - Semantic EPD information
config.semanticEpd.status String - Semantic EPD request status
  • Success | Failure | ${message}
    • Success: Request successful (gRPC configuration saved successfully)
    • Failure: Request failure (See Error message.)
    • ${message}: top_level_key can be omitted.

Error message

The following describes the error messages displayed when a request fails.

Error message Related field Description
Unknown key: ${top_level_key}-${unknown_key} Common Unsupported sub-level key
Invalid type: ${top_level_key}-${invalid_type_key} Common Unsupported sub-level value type
Invalid language code: ${invalid_language_code} transcription language not predefined
Not Authorized transcription language not authorized
Targets are empty translation When targets are not set in the config request JSON
Invalid language code: ${source}:${targets} translation When the language code is unsupported or the source and target are identical
Internal system error keywordBoosting Internal server system error
Invalid request json format - Abnormal JSON format
Required key is not provided - Mandatory key value defined by the server missing
No more slot - No available resources on the current server
ConfigRequest did not complete - Config JSON request processing incomplete when server recognition request was made
Lifespan expired - gRPC service usage time expired
  • Threshold: 100 hours
Failed to received request msg - Server failed to properly receive the request message
Model server is not working - Internal server error
Internal server error - Internal server error
RESOURCE_EXHAUSTED - No available gRPC connection resources
  • Threshold: Exceeding 15 per domain

Response example

The following is a sample response for the config JSON.

Succeeded

The following is a sample response upon a successful call.

{
  "uid": "{uid}",
  "responseType": [ "config" ],
  "config": {
    "status": "Success",
    "keywordBoosting": {
      "status": "Success"
    },
    "forbidden" : {
      "status": "Success"
    }
  }
}

Failure

The following is a sample response when the call fails because the request contains the misspelled key hobidden instead of forbidden.

{
  "uid": "{uid}",
  "responseType": [ "config" ],
  "config": {
    "status": "Unknown key: hobidden"
  }
}
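
A client can check the config response before it starts sending audio. The sketch below collects every status that is not "Success", at the top level and in each sub-object. The function name config_failures is hypothetical, and the sample strings are modeled on the examples above.

```python
import json


def config_failures(response_text):
    """Return {field: status} for every status other than "Success"."""
    config = json.loads(response_text).get("config", {})
    failures = {}
    if config.get("status") != "Success":
        failures["config"] = config.get("status")
    for key, value in config.items():
        # Sub-objects such as keywordBoosting or forbidden carry their own status
        if isinstance(value, dict) and value.get("status") != "Success":
            failures[key] = value.get("status")
    return failures


ok = '{"uid": "u1", "responseType": ["config"], "config": {"status": "Success", "forbidden": {"status": "Success"}}}'
bad = '{"uid": "u1", "responseType": ["config"], "config": {"status": "Unknown key: hobidden"}}'
ok_failures = config_failures(ok)
bad_failures = config_failures(bad)
```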

4. Recognize

After configuring the desired settings via the config JSON, call the recognize method of the stub to process and recognize speech data in real time. The call takes NestRequest messages, built with the code generated by protoc, together with the authorization metadata.
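
Before streaming, the PCM data has to be split into chunks. The following sketch derives the chunk size from the audio format stated at the top of this guide (16 kHz, 16 bits, 1 channel, so 32,000 bytes per second). The 100 ms chunk length and the function name iter_pcm_chunks are assumptions; this guide does not specify a chunk size.

```python
# 16,000 samples per second x 2 bytes per sample x 1 channel
BYTES_PER_SECOND = 16000 * 2 * 1


def iter_pcm_chunks(pcm, chunk_ms=100):
    """Yield consecutive slices of raw PCM, each covering chunk_ms of audio."""
    chunk_bytes = BYTES_PER_SECOND * chunk_ms // 1000
    # Keep every chunk aligned to whole 16-bit samples
    chunk_bytes -= chunk_bytes % 2
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


one_second_of_silence = bytes(BYTES_PER_SECOND)
chunks = list(iter_pcm_chunks(one_second_of_silence))
```

Each chunk would then be sent in the chunk field of a NestData message inside a NestRequest of type DATA, as defined in nest.proto.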

Request body

The following describes the request body for Recognize.

Field Type Required Description
epFlag Boolean Optional Buffer and result return timing upon pause or last recognition request
  • true | false (default)
    • true: Immediately return buffer after recognition request, then return result.
    • false: Automatically return buffer after 10 seconds with no additional requests, then return result.
seqId Integer Conditional Recognition request ID
  • It is used to check results when epFlag is set to true.
  • If seqId is not set and transmitted, the result value is 0.
    • We recommend setting and transmitting it with a value other than 0.
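
The epFlag and seqId values can be serialized into a JSON string as in the sketch below. It assumes that this string travels in the extra_contents field of NestData (the proto defines that field, and the error messages in this section refer to extraContents). The function name build_extra_contents is hypothetical.

```python
import json


def build_extra_contents(ep_flag=False, seq_id=0):
    """Serialize epFlag and seqId for a recognize request (hypothetical helper)."""
    if not isinstance(ep_flag, bool):
        raise TypeError("epFlag must be a Boolean")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(seq_id, bool) or not isinstance(seq_id, int):
        raise TypeError("seqId must be an Integer")
    return json.dumps({"epFlag": ep_flag, "seqId": seq_id})


# Final chunk of an utterance: request an immediate result and tag it with seqId 1
extra_contents = build_extra_contents(ep_flag=True, seq_id=1)
```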

Response body

The following describes the response body for Recognize.

Field Type Required Description
uid String - UID
responseType Array - Response type
  • transcription | keywordBoosting | Forbidden | semanticEpd | recognize
config Object - Config JSON field information
config.text String - Recognition result text
config.position Integer - Position of this result's text within the entire text
config.periodPositions Array<Integer> - Positions of . (periods) within the entire text
  • Empty if there is no . in text
config.periodAlignIndices Array<Integer> - Indices of . (periods) within alignInfos
  • Empty if there is no . in text
config.epFlag Boolean Optional Whether to include recognition results for the audio sent with epFlag set to true in the request
  • true | false
    • true: Include
    • false: Do not include
config.seqId Integer - seqId of the last recognition request
  • When epFlag is true: Return the seqId of the last recognition request.
  • When epFlag is false: Return 0.
config.epdType String - EPD criteria for generating recognition results
  • gap | endPoint | durationThreshold | period | syllableThreshold | unvoice
    • gap: Silence
    • endPoint: Last speech data segment
    • durationThreshold: Playback duration
    • period: Punctuation
    • syllableThreshold: Number of syllables
    • unvoice: Run unvoiceTime (server setting).
config.startTimestamp Integer - Recognition result start time (ms)
config.endTimestamp Integer - Recognition result end time (ms)
config.confidence Float - Recognition result confidence
  • Geometric mean of all syllable confidence values (alignInfos.confidence) in the recognition result
config.alignInfos Array - Align information for syllables in the recognition result: alignInfos
recognize Object - Recognize information
recognize.status String - Recognize status
recognize.epFlag Object - epFlag information
recognize.epFlag.status String - epFlag status
  • If the extraContents in the recognize request JSON is an invalid format, the failure details are displayed in epFlag.status or seqId.status.
  • See Error message for response failures.
recognize.seqId Object - seqId information
recognize.seqId.status String - seqId status
  • If the extraContents in the recognize request JSON is an invalid format, the failure details are displayed in epFlag.status or seqId.status.
  • See Error message for response failures.
Note

An example of constructing full text using text and position is as follows:

Received order Recognition result full text
1 {text: "ABC", position: 0, ...} "ABC"
2 {text: "DEFG", position: 3, ...} "ABCDEFG"
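
The table above can be reproduced with a few lines of Python. This sketch assumes that results arrive in order and that each position marks where the result's text starts in the full text. The function name assemble_full_text is hypothetical.

```python
def assemble_full_text(results):
    """Rebuild the full text from results that carry text and position."""
    full_text = ""
    for result in results:
        position = result["position"]
        # Keep everything before this result's position, then append its text
        full_text = full_text[:position] + result["text"]
    return full_text


received = [
    {"text": "ABC", "position": 0},
    {"text": "DEFG", "position": 3},
]
full_text = assemble_full_text(received)  # "ABCDEFG"
```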

alignInfos

The following describes alignInfos.

Field Type Required Description
word String - Composition syllables
start Integer - Composition syllable start time (ms)
end Integer - Composition syllable end time (ms)
confidence Float - Composition syllable confidence
  • 0-1.0

Error message

The following describes the error messages displayed when a Recognize request fails.

Error message Related field Description
Invalid Type recognize.status epFlag or seqId type does not match predefined type.
Required key is not provided recognize.status epFlag value in extraContents not provided
Invalid request json format recognize.status extraContents is not in JSON format.
Unknown key recognize.status A key not defined in the protocol written in extraContents
ConfigRequest is already called recognize.status Duplicate config request to server
Lifespan expired recognize.status gRPC service usage time expired
  • Threshold: 100 hours
Failed to received request msg recognize.status Server failed to properly receive the request message
Model server is not working recognize.status Internal server system error
Internal server error recognize.status Internal server system error
Failed to translation: ${message} - Translation feature-related error
Invalid format recognize.status The transmitted audio format is invalid.
Not found epFlag.status epFlag value not entered
Invalid type epFlag.status, seqId.status Predefined type mismatch
Invalid format audio.status The audio field does not match the predefined type.

Response example

The response example is as follows:

Succeeded

The following is a sample response upon a successful call.

  • When the response is successful and "responseType": [ "transcription" ]
{
  "uid": "{uid}",
  "responseType": [ "transcription" ],
  "transcription": {
    "text": "This is text.",
    "position": 0,
    "periodPositions": [3],
    "periodAlignIndices": [3],
    "epFlag": false,
    "seqId": 0,
    "epdType": "durationThreshold",
    "startTimestamp": 190,
    "endTimestamp": 840,
    "confidence": 0.997389124199423,
    "alignInfos": [
      {"word":"This","start":190,"end":340,"confidence":0.9988637124943075},
      {"word":"is","start":341,"end":447,"confidence":0.9990018488549978},
      {"word":"text","start":448,"end":580,"confidence":0.9912501264550316},
      {"word":".","start":581,"end":700,"confidence":0.9994397226648595},
      {"word":" ","start":701,"end":840,"confidence":0.9984142043105126}
    ]
  }
}

Failure

The following shows the response format for a failed call.

{
  "uid": string,                     # required
  "responseType": [ "recognize" ],    # required
  "recognize": {                     # required
    "status": string,                # required
    "epFlag": {                      # optional
      "status": string
    },
    "seqId": {                       # optional
      "status": string
    },
    "audio": {                       # optional
      "status": string
    }
  }
}

5. Other API

Get the number of currently active stub calls and the maximum allowed stub count for the domain.

curl --location 'https://clovaspeech-gw.ncloud.com:50051/api/v1/${domainId}/active-calls' \
--header 'Authorization: Bearer ${API_KEY}'

Response

This section describes the response format.

Response body

The response body includes the following data:

Field Type Required Description
activeCalls Integer - Number of currently active stub calls
  • 0-999
maxCalls Integer - Maximum number of stubs allowed within the domain
  • 0-999
timestamp String - Data creation time
  • ISO 8601 format

Response example

{
    "activeCalls": 3,
    "maxCalls": 15,
    "timestamp": "2025-04-25T12:09:27.382+09:00"
}
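
A client could use this response to decide whether another stream can be opened. The following is a minimal sketch that parses the response body and computes the remaining capacity. The function name remaining_calls is hypothetical.

```python
import json
from datetime import datetime


def remaining_calls(body_text):
    """Return (free stub slots, creation time) from an active-calls response."""
    body = json.loads(body_text)
    # The timestamp is ISO 8601 with a UTC offset, which fromisoformat can parse
    created = datetime.fromisoformat(body["timestamp"])
    return body["maxCalls"] - body["activeCalls"], created


sample = '{"activeCalls": 3, "maxCalls": 15, "timestamp": "2025-04-25T12:09:27.382+09:00"}'
free_slots, created = remaining_calls(sample)
```

With the sample response above, 12 slots remain before the per-domain limit is reached.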