Yes, speech to text is a great potential feature of Dynamics 365 CE. I just did a PoC, and it was amazing to see voice memos converted to text in real time using Azure Cognitive Services. Of course, it has a learning curve before it becomes one of the cool IPs, but it works fine and meets the requirement. It also lets users choose capabilities like capitalization, punctuation, and normalization, and it supports multiple languages.
Here is what you need to do to try it out.
Step 1: Spin up a Dynamics 365 CE environment
Log in to trials.dynamics.com to create a new environment that you can use for testing the speech to text feature.
Once done, you will see a login screen like the one below. Don’t worry if your screen shows the UCI (Unified Client Interface). I am used to the forced classic interface, so that is what I am using here.
Step 2: Install speech to text solution from AppSource
Once you are logged in to your Dynamics 365 CE instance, navigate to the following URL in the same browser and click Get it now.
Fill in the form and click Continue
Select the organization and click Okay
You will now be able to see the solution in your admin centre under manage your solution. To get to the admin centre, follow the steps below:
- Within the same browser where you are logged in to Dynamics 365 CE, open a new tab and go to https://admin.microsoft.com/AdminPortal/Home#/homepage
- Click Show All on the left pane
- Click All admin centers
- Click Dynamics 365 and click Solutions
- You will see the following window with the status of your solution
Step 3: Cognitive services creation under Azure subscription
Within the same browser where you are logged in to Dynamics 365 CE, open a new tab and go to portal.azure.com
Once logged in, search for Cognitive Services in the search bar
Click +Add and search for Speech to text in the search bar
Click Create and fill in the form. You can create your own resource group. Select the location that matches your geography. I have used South Central US for this example. Click Create once done. This will deploy your service
Once deployed, copy your keys
Step 4: Speech to text configuration in Dynamics 365 CE
Now, come back to the Dynamics 365 CE instance and click Advanced Find (the funnel icon) in the top right corner. Look for ‘Speech to Text Global Configurations’ and click Results
Click New Speech to Text Global Configuration and switch to the second form, as shown in the screenshot below. Paste the key you copied from Azure Cognitive Services. Define the location too. Because I selected South Central US in Azure, I entered southcentralus in the Speech services region field. Whichever region you select in Azure, you have to enter the same region in the configuration. You can find the list of regions and endpoints here. Save the record.
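To illustrate how the region value relates to a service endpoint, here is a minimal Python sketch. It assumes the public Speech service REST endpoint pattern for short audio (`https://<region>.stt.speech.microsoft.com/...`) and the `Ocp-Apim-Subscription-Key` header. How the AppSource solution calls the service internally is not shown in this post, and `build_speech_request` is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlencode


def build_speech_request(region: str, key: str, language: str = "en-US"):
    """Return (url, headers) for a short-audio recognition request.

    Assumption: the region is written the way the configuration expects it,
    e.g. "South Central US" becomes "southcentralus".
    """
    # Normalise a display name such as "South Central US" to "southcentralus".
    region_id = region.strip().lower().replace(" ", "")
    base = (
        f"https://{region_id}.stt.speech.microsoft.com"
        "/speech/recognition/conversation/cognitiveservices/v1"
    )
    # "detailed" asks the service to return alternatives with confidence scores.
    url = f"{base}?{urlencode({'language': language, 'format': 'detailed'})}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # the key copied from the Azure portal
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return url, headers


if __name__ == "__main__":
    url, _ = build_speech_request("South Central US", "<your-key>")
    print(url)
```

If the region in the configuration does not match the region of the Azure resource, the host name points at a different regional endpoint and the key is rejected, which is why the two values have to agree.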
Now go back to Advanced Find, search for Speech to Text Entity Configuration, and click Results. This is important because this is where I define the entity and field that will be used for speech to text.
Click New Speech to Text Entity Configuration and switch to the second form, as shown in the screenshot below. Define the schema names of the entity and field that will be used for speech to text. I have used the email entity and the description field. Define the language code for your region. I have used en-us
Step 5: Testing the functionality
Now that we are all set, let’s test the speech to text functionality. Because I configured the email entity and its description field, let’s navigate to Email under Activities and create a new email activity.
Navigate to Sales/Activities as shown below
Create email activity as shown below
Key in the ‘To’ and ‘Subject’ details. Put your cursor in the description field, click SPEECH TO TEXT START, and start speaking. Your voice is converted to text in real time.
Once done, click SPEECH TO TEXT STOP as shown below
It just works great. It is amazing to see voice converted into text in real time.
What if salespeople are on the move and don’t have internet access? Can they record voice messages offline and convert them into text later? Let’s see how this is possible.
If we create a sound note on our mobile device, it will be attached as a .wav file. So we can use some kind of speech recognition system to convert this audio file into text. There are several ways to do it, for example Microsoft Cognitive Services and Google Cloud services.
The Microsoft service has a 30-day free trial and a REST API that can convert audio files up to 15 seconds long (only .wav files, unfortunately). Google Speech API looks much more mature:
A really large number of available languages, no constraints on audio file length, and many more supported formats look really impressive and set the bar high for Microsoft to catch up. There are also many other speech recognition systems, so simply choose what’s best for you. I’ve chosen Azure Cognitive Services for the sake of this PoC.
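Because the REST API mentioned above only accepts short .wav recordings, it can help to check a clip's length before sending it. The following Python sketch does this with the standard `wave` module. The 15-second limit is the figure quoted in this post, and `fits_rest_limit` and `make_silent_wav` are hypothetical helper names used only for illustration.

```python
import io
import wave

MAX_SECONDS = 15.0  # limit quoted in this post; verify it against your service tier


def wav_duration_seconds(data: bytes) -> float:
    """Return the playing time of an in-memory WAV file in seconds."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_rest_limit(data: bytes, limit: float = MAX_SECONDS) -> bool:
    """True if the clip is short enough to send in a single request."""
    return wav_duration_seconds(data) <= limit


def make_silent_wav(seconds: float, rate: int = 16000) -> bytes:
    """Build a silent 16-bit mono WAV clip, used here only as test input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


if __name__ == "__main__":
    clip = make_silent_wav(2)
    print(wav_duration_seconds(clip), fits_rest_limit(clip))
```

A clip that fails this check would have to be split or sent to a service without the length limit.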
Having said all that, we now only have to write a plugin that calls our service and converts the audio file when an annotation is created:
The ServiceResponse class looks like this:
FromJSON is just an extension method that uses the JSON.NET library to deserialize a JSON string into my object. No, I’m not merging the JSON.NET library using ILMerge or any other merging technique, which I try to avoid in most cases. I always include the full source code of this great library in my plugins. Recently, almost every project I work on requires some kind of serialization or deserialization inside plugins, mostly for HTTP API calls.
This plugin is registered on the Create message of the Annotation entity. We check whether the annotation contains an attachment, and if the attachment type is “audio/wav”, we call the Bing Speech API to convert the audio file to text. The response looks like this:
Believe it or not, I actually said “Hello world” when I was making the sound note. So basically the result says that it’s 92.5% sure that I said “Hello world” and 79% sure that I said “Hello”. There can be more matches, of course, and how we handle them is a different story. We could prepare some sort of machine learning algorithm, use something that’s already available on Azure, or try a neural network, whatever you like. I’m simply taking the match with the highest Confidence value. I guess that in many cases it will be wrong, but for a PoC it’s good enough for me.
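The two decisions described above, filtering notes down to WAV attachments and taking the match with the highest confidence, can be sketched in a few lines. The original plugin is written in C# and its code is not reproduced here, so this is only an illustrative Python version of the same logic. The note attributes `isdocument`, `mimetype`, and `documentbody` follow the standard annotation entity, while the `text` and `confidence` keys of a match are assumed names and may differ from the real response.

```python
import base64


def extract_wav(annotation: dict):
    """Return the decoded audio bytes if the note carries a WAV attachment, else None."""
    if not annotation.get("isdocument"):
        return None  # a plain text note without an attachment
    if (annotation.get("mimetype") or "").lower() != "audio/wav":
        return None  # some other attachment type
    body = annotation.get("documentbody")  # attachments are stored as base64 text
    return base64.b64decode(body) if body else None


def best_match(matches: list):
    """Pick the match with the highest confidence, or None if there are no matches.

    Assumption: each match is a dict with hypothetical "text" and "confidence" keys.
    """
    if not matches:
        return None
    return max(matches, key=lambda m: m["confidence"])


if __name__ == "__main__":
    note = {
        "isdocument": True,
        "mimetype": "audio/wav",
        "documentbody": base64.b64encode(b"RIFF").decode("ascii"),
    }
    matches = [
        {"text": "Hello world", "confidence": 0.925},
        {"text": "Hello", "confidence": 0.79},
    ]
    if extract_wav(note) is not None:
        print(best_match(matches)["text"])
```

Taking the single highest-confidence match is the simplest possible policy. A threshold below which the note text is left empty would be one cheap refinement.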
As a result, we get our note with the audio file attached and the recognized text in the note text: