Microsoft’s Project Oxford APIs are a bit more complicated to use than many people expect, especially from Node.js! The Project Oxford SDK does not provide Node code for every API, and Speech to Text is one of the missing ones. The online documentation for the REST APIs can be a bit confusing, but I think I can help out a little. Note that this API takes in a pre-recorded WAV file and listens for speech in it.
So your first step is getting your Access Token. You need to make a POST request to https://oxford-speech.cloudapp.net/token/issueToken . The body of your POST request must include the following as x-www-form-urlencoded data:
- grant_type: ‘client_credentials’
- client_id: <whatever you would like to call it>
- client_secret: <your api key from the project oxford site>
- scope: ‘https://speech.platform.bing.com’
You should receive a JSON object with an access token, a token type, an expiration time (in seconds) and the scope.
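The token request can be sketched in Node roughly as follows. This is a minimal sketch using only the built-in https module, not code from the Project Oxford SDK. The function names (buildTokenForm, requestAccessToken) are made up for illustration, and treating any non-200 status as a failure is an assumption.

```javascript
const https = require('https');

// Build the x-www-form-urlencoded body for the token request.
// Note the underscore in the grant_type value.
function buildTokenForm(clientId, clientSecret) {
  return new URLSearchParams({
    grant_type: 'client_credentials',
    client_id: clientId,
    client_secret: clientSecret,
    scope: 'https://speech.platform.bing.com',
  }).toString();
}

// POST the form to the token endpoint and resolve with the parsed JSON reply.
function requestAccessToken(clientId, clientSecret) {
  const body = buildTokenForm(clientId, clientSecret);
  const options = {
    method: 'POST',
    hostname: 'oxford-speech.cloudapp.net',
    path: '/token/issueToken',
    headers: {
      'Content-Type': 'application/x-www-form-urlencoded',
      'Content-Length': Buffer.byteLength(body),
    },
  };
  return new Promise((resolve, reject) => {
    const req = https.request(options, (res) => {
      let data = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { data += chunk; });
      res.on('end', () => {
        // Assumption: anything other than 200 is treated as an error.
        if (res.statusCode !== 200) {
          reject(new Error('Token request failed with status ' + res.statusCode + ': ' + data));
          return;
        }
        try { resolve(JSON.parse(data)); } catch (err) { reject(err); }
      });
    });
    req.on('error', reject);
    req.end(body);
  });
}
```

Calling requestAccessToken('my-app', apiKey) should then resolve with the JSON object described above. The token itself is assumed to sit in a field named access_token, so check the reply you actually get back.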
From here your next step is to make another POST request to https://speech.platform.bing.com/recognize/query with the following parameters:
- version: 3.0
- requestid: <this can be any unique GUID>
- appID: D4D52672-91D7-4C74-8AD8-42B1D98141A5 (this is the magic value for this to work)
- format: json
- locale: en-US (or whichever language you prefer)
- device.os: <whichever operating system your device is running>
- scenarios: ulm
- instanceid: <this can be any unique GUID>
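The parameters above can be assembled into the request URL in code. Here is a minimal sketch, assuming Node 14.17 or later so that the built-in crypto.randomUUID() can supply the two GUIDs. The function name buildRecognizeUrl and its two arguments are hypothetical.

```javascript
const crypto = require('crypto');

// Assemble the recognize URL with the query parameters listed above.
// locale and deviceOs are caller-supplied; everything else is fixed.
function buildRecognizeUrl(locale, deviceOs) {
  const params = new URLSearchParams({
    version: '3.0',
    requestid: crypto.randomUUID(), // fresh GUID for every request
    appID: 'D4D52672-91D7-4C74-8AD8-42B1D98141A5', // the magic value
    format: 'json',
    locale: locale,
    'device.os': deviceOs,
    scenarios: 'ulm',
    instanceid: crypto.randomUUID(), // fresh GUID here; any unique GUID works
  });
  return 'https://speech.platform.bing.com/recognize/query?' + params.toString();
}
```

For example, buildRecognizeUrl('en-US', 'Linux') gives a URL that is ready to POST to.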
You can get freshly generated GUIDs for your instanceid and requestid from various online sites, or from an npm module named guid that creates them randomly. The body of your POST must be the WAV audio data itself. Your headers are as follows:
- Authorization: Bearer <access token we received in the first POST>
- Content-Type: audio/wav; samplerate=8000 (be sure the sample rate matches the wav file you are using)
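Since the sample rate in the Content-Type header has to match the file, one option is to read it from the WAV file rather than hard-coding it. The sketch below assumes the common "canonical" WAV layout, in which the fmt chunk comes first and the sample rate is a little-endian 32-bit integer at byte offset 24. Files with extra chunks before fmt will be rejected by it. Both function names are hypothetical.

```javascript
// Read the sample rate from a canonical RIFF/WAVE header.
// Assumes the "fmt " chunk immediately follows the RIFF header.
function readWavSampleRate(wav) {
  if (wav.length < 28 ||
      wav.toString('ascii', 0, 4) !== 'RIFF' ||
      wav.toString('ascii', 8, 12) !== 'WAVE' ||
      wav.toString('ascii', 12, 16) !== 'fmt ') {
    throw new Error('Not a canonical WAV header');
  }
  return wav.readUInt32LE(24);
}

// Build the Content-Type header value for a WAV buffer.
function wavContentType(wav) {
  return 'audio/wav; samplerate=' + readWavSampleRate(wav);
}
```

With a buffer loaded from disk, for instance via fs.readFileSync, wavContentType(buffer) gives the value for the Content-Type header.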
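Testing the API with an app like Postman means putting the parameters above in the query string, selecting the WAV file as the binary body, and then adding the headers, of course. If you are using a tool like request in your app, your POST request takes the same shape: the URL with its query string, the two headers, and the WAV data as the body.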
Note that it helps to keep the access token in a variable such as accessToken rather than pasting in the long token string given to you earlier. This should help make the code a lot cleaner.
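As a rough sketch, the recognition request could look like this with Node’s built-in https module standing in for request, so that it runs without extra dependencies. The helper names (buildRecognizeOptions, recognize) and the parameters recognizeUrl, sampleRate and wavData are made up for illustration. The reply is assumed to be JSON because of format=json, and its exact shape is not covered here.

```javascript
const https = require('https');

// Turn the full recognize URL (query string included) into https.request options.
function buildRecognizeOptions(recognizeUrl, accessToken, sampleRate, wavData) {
  const url = new URL(recognizeUrl);
  return {
    method: 'POST',
    hostname: url.hostname,
    path: url.pathname + url.search,
    headers: {
      Authorization: 'Bearer ' + accessToken, // note the space after "Bearer"
      'Content-Type': 'audio/wav; samplerate=' + sampleRate,
      'Content-Length': wavData.length,
    },
  };
}

// POST the raw WAV bytes and resolve with the parsed JSON reply.
function recognize(recognizeUrl, accessToken, sampleRate, wavData) {
  const options = buildRecognizeOptions(recognizeUrl, accessToken, sampleRate, wavData);
  return new Promise((resolve, reject) => {
    const req = https.request(options, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => {
        const text = Buffer.concat(chunks).toString('utf8');
        // Assumption: anything other than 200 is treated as an error.
        if (res.statusCode !== 200) {
          reject(new Error('Recognition failed with status ' + res.statusCode + ': ' + text));
          return;
        }
        try { resolve(JSON.parse(text)); } catch (err) { reject(err); }
      });
    });
    req.on('error', reject);
    req.end(wavData); // the body is the WAV file's bytes, sent as-is
  });
}
```

Here wavData would be a Buffer, for example the result of fs.readFileSync on your recording, and accessToken the token string from the first POST.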
For a good example of how to use Project Oxford Speech to Text, check out this GitHub gist by Luke Hoban. He does a great job of combining all of this into one seamless function.
Good luck 🙂