Speech and Natural Language Input for Your Mobile App Using LLMs
by Hans van Dam, July 2023


It shows a stripped-down version of the function templates as added to the prompt for the LLM. To see the full-length prompt for the user message 'What things can I do in Amsterdam?', click here (GitHub Gist). It contains a full curl request that you can use from the command line or import into Postman. You need to put your own OpenAI key in the placeholder to run it.

Some screens in your app don’t have any parameters, or at least none that the LLM needs to be aware of. To reduce token usage and clutter, we can combine a number of these screen triggers into a single function with one parameter: the screen to open.

{
    "name": "show_screen",
    "description": "Determine which screen the user wants to see",
    "parameters": {
        "type": "object",
        "properties": {
            "screen_to_show": {
                "description": "type of screen to show. Either
                    'account': 'all personal data of the user',
                    'settings': 'if the user wants to change the settings of
                    the app'",
                "enum": [
                    "account",
                    "settings"
                ],
                "type": "string"
            }
        },
        "required": [
            "screen_to_show"
        ]
    }
},

The criterion for whether a triggering function needs parameters is whether the user has a choice: is there some form of search or navigation going on on the screen, i.e. are there any search(-like) fields or tabs to choose from?

If not, then the LLM does not need to know about it, and the screen trigger can be added to the generic screen-triggering function of your app. Getting the descriptions of the screen purposes right is mostly a matter of experimentation. If you need a longer description, consider giving the screen its own function definition, which puts more emphasis on its description than an enum value of the generic parameter does, as sketched below.
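
For illustration, such a dedicated, parameterless definition might look like this (a sketch only; the function name and description are made up for this example):

{
    "name": "show_settings",
    "description": "Open the settings screen, where the user can change preferences of the app such as language, notifications and appearance",
    "parameters": {
        "type": "object",
        "properties": {},
        "required": []
    }
},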

In the system message of your prompt you give generic steering information. In our example it can be important for the LLM to know what date and time it is now, for instance if you want to plan a trip for tomorrow. Another important thing is to steer its presumptiveness. Often we would rather have the LLM be overconfident than bother the user with its uncertainty. A good system message for our example app is:

"messages": [
{
"role": "system",
"content": "The current date and time is 2023-07-13T08:21:16+02:00.
Be very presumptive when guessing the values of
function parameters."
},

Function parameter descriptions can require quite a bit of tuning. An example is the trip_date_time parameter when planning a train trip. A reasonable parameter description is:

"trip_date_time": {
"description": "Requested DateTime for the departure or arrival of the
trip in 'YYYY-MM-DDTHH:MM:SS+02:00' format.
The user will use a time in a 12 hour system, make an
intelligent guess about what the user is most likely to
mean in terms of a 24 hour system, e.g. not planning
for the past.",
"type": "string"
},

So if it is now 15:00 and users say they want to leave at 8, they mean 20:00, unless they mention the time of day specifically. The above instruction works reasonably well for GPT-4, but in some edge cases it still fails. We can then add extra parameters to the function template that we use to make further repairs in our own code. For instance, we can add:

"explicit_day_part_reference": {
"description": "Always prefer None! None if the request refers to
the current day, otherwise the part of the day the
request refers to."
"enum": ["none", "morning", "afternoon", "evening", "night"],
}

In your app you are likely going to find parameters that require post-processing to enhance their success ratio.
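
As a sketch of what such a repair could look like in code (the heuristic itself, shifting an accidental past time by 12 hours, is just an illustrative assumption, not a prescribed rule):

import java.time.OffsetDateTime

// Illustrative repair step for trip_date_time: if the LLM planned the trip in the
// past although the user did not explicitly refer to a part of the day, assume the
// 12 hour time was misread and shift it forward by 12 hours.
fun repairTripDateTime(
    tripDateTime: OffsetDateTime,
    explicitDayPartReference: String,
    now: OffsetDateTime
): OffsetDateTime =
    if (explicitDayPartReference == "none" && tripDateTime.isBefore(now))
        tripDateTime.plusHours(12)
    else
        tripDateTime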

Sometimes the user’s request lacks the information needed to proceed, or there may not be a function suitable to handle it. In that case the LLM will respond in natural language, which you can show to the user, e.g. by means of a Toast.

It may also be the case that the LLM does recognize a potential function to call, but information is lacking to fill all required function parameters. In that case, consider making parameters optional if possible. If that is not possible, the LLM may send a natural-language request for the missing parameters, in the language of the user. You should show this text to the user, e.g. through a Toast or text-to-speech, so they can give the missing information (in speech). For instance, when the user says ‘I want to go to Amsterdam’ (and your app has not provided a default or current location through the system message), the LLM might respond with ‘I understand you want to make a train trip, from where do you want to depart?’.
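
A minimal sketch of handling both cases on the client, assuming your server passes the relevant parts of the LLM response through (the types and helper functions here are hypothetical placeholders, not an existing API):

// Hypothetical response shape; adapt to whatever your server actually returns.
data class FunctionCall(val name: String, val arguments: String)
data class AssistantMessage(val content: String?, val functionCall: FunctionCall?)

fun handleLlmResponse(message: AssistantMessage) {
    val call = message.functionCall
    if (call != null) {
        navigateForFunctionCall(call)       // deep link into the app, see further down
    } else {
        showToUser(message.content ?: "")   // e.g. a request for missing information
    }
}

// Placeholder stubs for the app-specific parts.
fun navigateForFunctionCall(call: FunctionCall) { /* translate to a deep link */ }
fun showToUser(text: String) { /* Toast.makeText(...).show() or text-to-speech */ }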

This brings up the issue of conversational history. I recommend you always include the last four messages from the user in the prompt, so a request for information can be spread over multiple turns. To keep things simple, omit the LLM’s own responses from the history, because in this use case they tend to do more harm than good.
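
Concretely, the messages array of such a follow-up prompt could look like this (the user texts are just the example from above; note that only the system message and the user turns are kept):

"messages": [
    {
        "role": "system",
        "content": "The current date and time is 2023-07-13T08:21:16+02:00.
            Be very presumptive when guessing the values of
            function parameters."
    },
    {
        "role": "user",
        "content": "I want to go to Amsterdam"
    },
    {
        "role": "user",
        "content": "from Utrecht Central Station"
    }
],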

Speech recognition is a crucial part of the transformation from speech to a parametrized navigation action in the app. When the quality of interpretation is high, bad speech recognition may very well be the weakest link. Mobile phones have on-board speech recognition of reasonable quality, but LLM-based speech recognition like Whisper, Google Chirp/USM, Meta MMS or DeepGram tends to lead to better results.

It is probably best to store the function definitions on the server, but they can also be managed by the app and sent with every request. Both have their pros and cons. Having them sent with every request is more flexible, and the alignment of functions and screens may be easier to maintain. However, the function templates not only contain the function names and parameters, but also their descriptions, which we might want to update faster than the update flow of the app stores allows. These descriptions are more or less LLM-dependent and crafted for what works. It is not unlikely that you will want to swap out the LLM for a better or cheaper one, or even swap dynamically at some point. Having the function templates on the server also has the advantage of maintaining them in one place if your app is native on both iOS and Android. If you use OpenAI services for both speech recognition and natural language processing, the technical big picture of the flow looks as follows:

Figure: architecture for speech-enabling your mobile app using Whisper and OpenAI function calling

The user speaks their request; it is recorded into an m4a buffer/file (or mp3 if you like) and sent to your server, which relays it to Whisper. Whisper responds with the transcription, and your server combines it with your system message and function templates into a prompt for the LLM. Your server receives back the raw function call JSON, which it then processes into a function call JSON object for your app.
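
A sketch of those two server-side calls, assuming OkHttp on a Kotlin backend, the whisper-1 and gpt-4-0613 models, and a functionTemplates JSON string that you maintain yourself (error handling and proper JSON escaping are omitted):

import okhttp3.MediaType.Companion.toMediaType
import okhttp3.MultipartBody
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.asRequestBody
import okhttp3.RequestBody.Companion.toRequestBody
import java.io.File

val client = OkHttpClient()
const val OPENAI_KEY = "YOUR_OPENAI_KEY"  // placeholder, as in the Gist

// Call 1: relay the recorded audio to Whisper and get the transcription back.
fun transcribe(audio: File): String {
    val body = MultipartBody.Builder().setType(MultipartBody.FORM)
        .addFormDataPart("model", "whisper-1")
        .addFormDataPart("file", audio.name, audio.asRequestBody("audio/m4a".toMediaType()))
        .build()
    val request = Request.Builder()
        .url("https://api.openai.com/v1/audio/transcriptions")
        .addHeader("Authorization", "Bearer $OPENAI_KEY")
        .post(body)
        .build()
    client.newCall(request).execute().use { response ->
        return response.body!!.string()  // JSON with a "text" field holding the transcription
    }
}

// Call 2: combine transcription, system message and function templates into a chat completion request.
fun complete(transcription: String, systemMessage: String, functionTemplates: String): String {
    // In real code, build this body with a JSON library so the strings are escaped properly.
    val json = """
        {
          "model": "gpt-4-0613",
          "messages": [
            {"role": "system", "content": "$systemMessage"},
            {"role": "user", "content": "$transcription"}
          ],
          "functions": $functionTemplates
        }
    """.trimIndent()
    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .addHeader("Authorization", "Bearer $OPENAI_KEY")
        .post(json.toRequestBody("application/json".toMediaType()))
        .build()
    client.newCall(request).execute().use { response ->
        return response.body!!.string()  // contains the function_call, as illustrated below
    }
}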

To illustrate how a function call translates into a deep link we take the function call response from the initial example:

"function_call": {
"name": "outings",
"arguments": "{\n \"area\": \"Amsterdam\"\n}"
}
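
Generically, a sketch of turning such a response into a route string could look like this (parsing with org.json; the route shape simply anticipates the Android example below):

import org.json.JSONObject
import java.net.URLEncoder

// Build a deep-link route such as "outings/?area=Amsterdam" from a function_call object.
fun routeFor(functionCall: JSONObject): String {
    val name = functionCall.getString("name")
    val arguments = JSONObject(functionCall.getString("arguments"))
    val query = arguments.keys().asSequence().joinToString("&") { key ->
        "$key=" + URLEncoder.encode(arguments.get(key).toString(), "UTF-8")
    }
    return if (query.isEmpty()) name else "$name/?$query"
}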

Different platforms handle this quite differently, and over time many navigation mechanisms have been used, many of which are still in use. It is beyond the scope of this article to go into implementation details, but roughly speaking, the platforms in their most recent incarnations can employ deep linking as follows:

On Android:

navController.navigate("outings/?area=Amsterdam")

On Flutter:

Navigator.pushNamed(
    context,
    '/outings',
    arguments: ScreenArguments(
        area: 'Amsterdam',
    ),
);

On iOS things are a little less standardized, but using NavigationStack:

NavigationStack(path: $router.path) {
    ...
}

And then issuing:

router.path.append("outing?area=Amsterdam")

More on deep linking can be found here: for Android, for Flutter, for iOS

There are two modes of free text input: voice and typing. We’ve mainly talked about speech, but a text field for typing input is also an option. Natural language is usually quite lengthy, so it may be difficult to compete with GUI interaction. However, GPT-4 tends to be quite good at guessing parameters from abbreviations, so even very short abbreviated typing can often be interpreted correctly.

The use of functions with parameters in the prompt often dramatically narrows the interpretation context for an LLM. It therefore needs very little input, and even less if you instruct it to be presumptive. This is a new phenomenon that holds promise for mobile interaction. In the case of the station-to-station train planner, the LLM made the following interpretations when used with the example prompt structure in this article. You can try it out for yourself using the prompt gist mentioned above.

Examples:

‘ams utr’: show me a list of train itineraries from Amsterdam Central Station to Utrecht Central Station departing now

‘utr ams arr 9’ (given that it is 13:00 at the moment): show me a list of train itineraries from Utrecht Central Station to Amsterdam Central Station arriving before 21:00

Follow-up interaction

Just like in ChatGPT, you can refine your query if you send a short piece of the interaction history along.

Using this history feature, the following also works very well (presume it is 9:00 in the morning now):

Type ‘ams utr’ and get the answer as above. Then type ‘arr 7’ in the next turn. And yes, it can actually translate that into a trip being planned from Amsterdam Central to Utrecht Central arriving before 19:00.
I made an example web app about this; you can find a video about it here. The link to the actual app is in the video description.

You can expect this deep-link structure for handling functions within your app to become an integral part of your phone’s OS (Android or iOS). A global assistant on the phone will handle speech requests, and apps can expose their functions to the OS so they can be triggered in a deep-linking fashion. This parallels how plugins are made available for ChatGPT. Obviously, a coarse form of this is already available through intents in the AndroidManifest and App Actions on Android, and through SiriKit intents on iOS. The amount of control you have over these is limited, and the user has to speak like a robot to activate them reliably. Undoubtedly this will improve over time.

VR and AR (XR) offer great opportunities for speech recognition, because the user’s hands are often engaged in other activities.

It will probably not take long before anyone can run their own high-quality LLM. Costs will decrease and speed will increase rapidly over the next year. Soon LoRA LLMs will become available on smartphones, so inference can take place on your phone, reducing cost and latency. More and more competition will also come, both open source, like Llama 2, and closed source, like PaLM.

Finally, the synergy of modalities can be driven further than providing random access to the GUI of your entire app. It is the power of LLMs to combine multiple sources that holds the promise of better assistance. Some interesting articles: multimodal dialog, the Google blog on GUIs and LLMs, and interpreting GUI interaction as language.

In this article you learned how to apply function calling to speech-enable your app. Using the provided Gist as a point of departure, you can experiment in Postman or from the command line to get an idea of how powerful function calling is. If you want to run a POC on speech-enabling your app, I recommend putting the server bit from the architecture section directly into your app. It all boils down to two HTTP calls, some prompt construction and implementing microphone recording. Depending on your skill and codebase, you will have your POC up and running in a few days.

Happy coding!

Follow me on LinkedIn

All images in this article, unless otherwise noted, are by the author


