Python Client for Cloud Speech API#

The Cloud Speech API enables developers to convert audio to text by applying powerful neural network models. The API recognizes over 80 languages and variants, to support your global user base.

Quick Start#

In order to use this library, you first need to go through the following steps:

  1. Select or create a Cloud Platform project.
  2. Enable billing for your project.
  3. Enable the Cloud Speech API.
  4. Set up authentication.

Installation#

Install this library in a virtualenv using pip. virtualenv is a tool to create isolated Python environments. The basic problem it addresses is one of dependencies and versions, and indirectly permissions.

With virtualenv, it’s possible to install this library without needing system install permissions, and without clashing with the installed system dependencies.
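
To confirm that the interpreter you are running is the isolated one, a quick check like the following may help. It is only a sketch: the function name in_virtual_environment is made up for illustration, and the check relies on sys.real_prefix (set by older virtualenv releases) and sys.base_prefix (set by venv and newer virtualenv releases).

```python
import sys


def in_virtual_environment():
    """Return True if this interpreter appears to run inside a virtualenv/venv."""
    # Older virtualenv releases set sys.real_prefix inside an environment.
    if hasattr(sys, 'real_prefix'):
        return True
    # venv and newer virtualenv releases make sys.prefix differ from sys.base_prefix.
    return getattr(sys, 'base_prefix', sys.prefix) != sys.prefix


print('virtual environment active:', in_virtual_environment())
```

If this prints False after activation, the pip on your PATH probably belongs to a different interpreter than the one in your environment.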

Supported Python Versions#

Python >= 3.4

Deprecated Python Versions#

Python == 2.7. Python 2.7 support will be removed on January 1, 2020.

Mac/Linux#

pip install virtualenv
virtualenv <your-env>
source <your-env>/bin/activate
<your-env>/bin/pip install google-cloud-speech

Windows#

pip install virtualenv
virtualenv <your-env>
<your-env>\Scripts\activate
<your-env>\Scripts\pip.exe install google-cloud-speech

Example Usage#

from google.cloud import speech_v1
from google.cloud.speech_v1 import enums

client = speech_v1.SpeechClient()

encoding = enums.RecognitionConfig.AudioEncoding.FLAC
sample_rate_hertz = 44100
language_code = 'en-US'
config = {'encoding': encoding, 'sample_rate_hertz': sample_rate_hertz, 'language_code': language_code}
uri = 'gs://bucket_name/file_name.flac'
audio = {'uri': uri}

response = client.recognize(config, audio)

Next Steps#

Using the Library#

Asynchronous Recognition#

The long_running_recognize() method sends audio data to the Speech API and initiates a Long Running Operation.

Using this operation, you can periodically poll for recognition results. Use asynchronous requests for audio data of any duration up to 80 minutes.

See: Speech Asynchronous Recognize

>>> from google.cloud import speech
>>> client = speech.SpeechClient()
>>> audio = speech.types.RecognitionAudio(
...     uri='gs://my-bucket/recording.flac')
>>> config = speech.types.RecognitionConfig(
...     encoding=speech.enums.RecognitionConfig.AudioEncoding.FLAC,
...     language_code='en-US',
...     sample_rate_hertz=44100)
>>> operation = client.long_running_recognize(config=config, audio=audio)
>>> op_result = operation.result()
>>> for result in op_result.results:
...     for alternative in result.alternatives:
...         print('=' * 20)
...         print(alternative.transcript)
...         print(alternative.confidence)
====================
how old is the Brooklyn Bridge
0.98267895
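
The example above blocks inside operation.result() until recognition finishes. If you would rather poll on your own schedule, a loop along these lines is one option. It is a sketch, not part of the library: wait_for_operation and FakeOperation are hypothetical names, the interval and timeout defaults are arbitrary, and the only assumption about the operation object is that it exposes done() and result().

```python
import time


def wait_for_operation(operation, poll_interval=5.0, timeout=600.0):
    """Poll `operation` until it reports completion, then return its result.

    Assumes `operation` exposes done() and result().
    """
    deadline = time.monotonic() + timeout
    while not operation.done():
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'operation did not finish within %s seconds' % timeout)
        time.sleep(poll_interval)
    return operation.result()


class FakeOperation(object):
    """Hypothetical stand-in so the sketch runs without calling the API."""

    def __init__(self, polls_until_done, value):
        self._remaining = polls_until_done
        self._value = value

    def done(self):
        # Report completion only after the configured number of polls.
        self._remaining -= 1
        return self._remaining <= 0

    def result(self):
        return self._value


demo_result = wait_for_operation(FakeOperation(3, 'finished'), poll_interval=0.01)
print(demo_result)
```

With a real client you would pass the object returned by long_running_recognize() in place of FakeOperation.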

Synchronous Recognition#

The recognize() method converts speech data to text and returns alternative text transcriptions.

This example uses language_code='en-GB' to better recognize a dialect from Great Britain.

>>> from google.cloud import speech
>>> client = speech.SpeechClient()
>>> audio = speech.types.RecognitionAudio(
...     uri='gs://my-bucket/recording.flac')
>>> config = speech.types.RecognitionConfig(
...     encoding=speech.enums.RecognitionConfig.AudioEncoding.FLAC,
...     language_code='en-GB',
...     sample_rate_hertz=44100)
>>> response = client.recognize(config=config, audio=audio)
>>> for result in response.results:
...     for alternative in result.alternatives:
...         print('=' * 20)
...         print('transcript: ' + alternative.transcript)
...         print('confidence: ' + str(alternative.confidence))
====================
transcript: Hello, this is a test
confidence: 0.81
====================
transcript: Hello, this is one test
confidence: 0
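
When only the most likely transcription matters, you can keep just the first alternative of each result. The helper below is a sketch that assumes a response object with a results attribute and that the service lists the most likely alternative first, as the output above suggests. The name top_transcripts and the SimpleNamespace stand-ins are illustrative only.

```python
from types import SimpleNamespace


def top_transcripts(response):
    """Return the first alternative's transcript for each result that has one."""
    return [
        result.alternatives[0].transcript
        for result in response.results
        if result.alternatives
    ]


# Hypothetical stand-in that mirrors the shape of the output above.
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript='Hello, this is a test', confidence=0.81),
        SimpleNamespace(transcript='Hello, this is one test', confidence=0.0),
    ]),
])

print(top_transcripts(fake_response))
```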

The following example uses the profanity filter.

>>> from google.cloud import speech
>>> client = speech.SpeechClient()
>>> audio = speech.types.RecognitionAudio(
...     uri='gs://my-bucket/recording.flac')
>>> config = speech.types.RecognitionConfig(
...     encoding=speech.enums.RecognitionConfig.AudioEncoding.FLAC,
...     language_code='en-US',
...     sample_rate_hertz=44100,
...     profanity_filter=True)
>>> response = client.recognize(config=config, audio=audio)
>>> for result in response.results:
...     for alternative in result.alternatives:
...         print('=' * 20)
...         print('transcript: ' + alternative.transcript)
...         print('confidence: ' + str(alternative.confidence))
====================
transcript: Hello, this is a f****** test
confidence: 0.81

Use speech context hints to get better results. Hints can improve accuracy for specific words and phrases, and can also add new words to the recognizer's vocabulary.

>>> from google.cloud import speech
>>> client = speech.SpeechClient()
>>> audio = speech.types.RecognitionAudio(
...     uri='gs://my-bucket/recording.flac')
>>> config = speech.types.RecognitionConfig(
...     encoding=speech.enums.RecognitionConfig.AudioEncoding.FLAC,
...     language_code='en-US',
...     sample_rate_hertz=44100,
...     speech_contexts=[speech.types.SpeechContext(
...         phrases=['hi', 'good afternoon'],
...     )])
>>> response = client.recognize(config=config, audio=audio)
>>> for result in response.results:
...     for alternative in result.alternatives:
...         print('=' * 20)
...         print('transcript: ' + alternative.transcript)
...         print('confidence: ' + str(alternative.confidence))
====================
transcript: Hello, this is a test
confidence: 0.81

Streaming Recognition#

The streaming_recognize() method converts speech data to possible text alternatives on the fly.

Note

Streaming recognition requests are limited to 1 minute of audio.

See: https://cloud.google.com/speech/limits#content

>>> import io
>>> from google.cloud import speech
>>> client = speech.SpeechClient()
>>> config = speech.types.RecognitionConfig(
...     encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
...     language_code='en-US',
...     sample_rate_hertz=44100,
... )
>>> with io.open('./hello.wav', 'rb') as stream:
...     requests = [speech.types.StreamingRecognizeRequest(
...         audio_content=stream.read(),
...     )]
>>> responses = client.streaming_recognize(
...     config=speech.types.StreamingRecognitionConfig(config=config),
...     requests=requests,
... )
>>> for response in responses:
...     for result in response.results:
...         for alternative in result.alternatives:
...             print('=' * 20)
...             print('transcript: ' + alternative.transcript)
...             print('confidence: ' + str(alternative.confidence))
====================
transcript: hello thank you for using Google Cloud platform
confidence: 0.927983105183
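
The example above reads the whole file into a single request. For larger files or live sources you may prefer to send the audio in smaller pieces. The generator below is a sketch of that idea: iter_audio_chunks is a hypothetical helper, and the 32 KiB chunk size is an arbitrary choice rather than a documented recommendation.

```python
import io


def iter_audio_chunks(stream, chunk_size=32 * 1024):
    """Yield successive byte chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Demonstration with in-memory bytes instead of a real audio file.
fake_audio = io.BytesIO(b'\x00' * 100000)
chunk_sizes = [len(chunk) for chunk in iter_audio_chunks(fake_audio)]
print(chunk_sizes)
```

Each chunk could then be wrapped in speech.types.StreamingRecognizeRequest(audio_content=chunk) to build the requests iterable.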

By default the API will perform continuous recognition (continuing to process audio even if the speaker in the audio pauses speaking) until the client closes the output stream or until the maximum time limit has been reached.

If you only want to recognize a single utterance, set single_utterance to True; only one result will be returned.

See: Single Utterance

>>> import io
>>> from google.cloud import speech
>>> client = speech.SpeechClient()
>>> config = speech.types.RecognitionConfig(
...     encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
...     language_code='en-US',
...     sample_rate_hertz=44100,
... )
>>> with io.open('./hello-pause-goodbye.wav', 'rb') as stream:
...     requests = [speech.types.StreamingRecognizeRequest(
...         audio_content=stream.read(),
...     )]
>>> responses = client.streaming_recognize(
...     config=speech.types.StreamingRecognitionConfig(
...         config=config,
...         single_utterance=True,
...     ),
...     requests=requests,
... )
>>> for response in responses:
...     for result in response.results:
...         for alternative in result.alternatives:
...             print('=' * 20)
...             print('transcript: ' + alternative.transcript)
...             print('confidence: ' + str(alternative.confidence))
====================
transcript: testing a pause
confidence: 0.933770477772

If interim_results is set to True, interim results (tentative hypotheses) may be returned as they become available.

>>> import io
>>> from google.cloud import speech
>>> client = speech.SpeechClient()
>>> config = speech.types.RecognitionConfig(
...     encoding=speech.enums.RecognitionConfig.AudioEncoding.LINEAR16,
...     language_code='en-US',
...     sample_rate_hertz=44100,
... )
>>> with io.open('./hello.wav', 'rb') as stream:
...     requests = [speech.types.StreamingRecognizeRequest(
...         audio_content=stream.read(),
...     )]
>>> streaming_config = speech.types.StreamingRecognitionConfig(
...     config=config,
...     interim_results=True,
... )
>>> responses = client.streaming_recognize(
...     config=streaming_config, requests=requests)
>>> for response in responses:
...     for result in response.results:
...         for alternative in result.alternatives:
...             print('=' * 20)
...             print('transcript: ' + alternative.transcript)
...             print('confidence: ' + str(alternative.confidence))
...             print('is_final: ' + str(result.is_final))
====================
transcript: he
confidence: 0.0
is_final: False
====================
transcript: hell
confidence: 0.0
is_final: False
====================
transcript: hello
confidence: 0.973458576
is_final: True

API Reference#

A beta release, named v1p1beta1, is available to preview upcoming features. To use it, import from google.cloud.speech_v1p1beta1 instead of google.cloud.speech.
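
If the same code has to run where the beta module may or may not be present, an import with a fallback is one way to handle it. This is a sketch under that assumption; load_speech_module is a hypothetical helper, not something the library provides.

```python
import importlib


def load_speech_module(prefer_beta=True):
    """Import the beta module if requested and available, else the stable one.

    Returns None when neither module can be imported.
    """
    candidates = ['google.cloud.speech']
    if prefer_beta:
        candidates.insert(0, 'google.cloud.speech_v1p1beta1')
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


speech = load_speech_module()
print('loaded:', getattr(speech, '__name__', None))
```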

An API and type reference is also provided for this beta:

Changelog#

For a list of all google-cloud-speech releases: