Next Generation Audio for smart speakers

“Hey Alexa/Siri/Google, change the program language to English and increase the dialogue level.”

The widespread use of smart speakers offers great potential for many Next Generation Audio (NGA)-related applications. Smart speakers by themselves already offer a broad set of possibilities: users can request information, interact with the voice assistant, and listen to audio. Many NGA applications can be derived from these capabilities. A voice command could, for instance, control the dialogue volume or the dialogue language of an audio production, or even change the narrative perspective of an audio drama during playback, for example by saying:

“Hey Alexa/Siri/Google, change to the police officer’s perspective.”

The possibilities are certainly diverse.

However, smart speakers do not natively support NGA yet. An exception is the Amazon Echo Studio, which can play back music produced in MPEG-H 3D Audio through Amazon Music. This support does not include NGA features such as personalization or interactivity, though, and MPEG-H is not yet supported in third-party skills.

Unfortunately, the development options for custom media skills are generally very limited as well. In most cases, a skill can only create a playlist that is played back using simple navigation commands. NGA, however, depends on more complex navigation between audio tracks, for example to switch between languages during playback. Our attempt to implement NGA for smart speakers therefore failed on all current devices. Even an attempt to achieve a comparable result through a workaround failed: although it was possible to switch between different audio tracks, the systems did not offer sufficient means to evaluate the current playback position. The reason is the closed development system of these devices, which only allows skills to be built from pre-defined functionality, e.g. for media playback. This makes the implementation of NGA completely dependent on the manufacturer.
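To make the track-switching idea more concrete, here is a minimal sketch in Python. It is an illustration only, not the code used in our attempt: the track URLs, the `TRACKS` table and the function `build_switch_directive` are hypothetical, and the returned dictionary is merely modelled on the shape of Alexa's `AudioPlayer.Play` directive. The sketch also assumes that the skill server already knows the current playback position (`offset_ms`), which is exactly the information the devices did not expose well enough.

```python
# Hypothetical catalogue: one pre-mixed stream per dialogue language.
TRACKS = {
    "en": "https://example.com/drama_en.mp3",
    "de": "https://example.com/drama_de.mp3",
}


def build_switch_directive(language: str, offset_ms: int) -> dict:
    """Build a play directive that replaces the current stream with the
    requested language version and resumes at the given offset."""
    if language not in TRACKS:
        raise ValueError(f"no track for language {language!r}")
    if offset_ms < 0:
        raise ValueError("offset_ms must be non-negative")
    return {
        # Field names follow the general shape of Alexa's AudioPlayer.Play
        # directive; treat the exact structure as an assumption.
        "type": "AudioPlayer.Play",
        "playBehavior": "REPLACE_ALL",
        "audioItem": {
            "stream": {
                "url": TRACKS[language],
                "token": f"drama-{language}",
                # Assumed to be known by the server; in practice this was
                # the missing piece.
                "offsetInMilliseconds": offset_ms,
            }
        },
    }


# Example: switch to the English track 95 seconds into the programme.
directive = build_switch_directive("en", 95_000)
```

Without a trustworthy value for `offset_ms`, the new track cannot resume at the position where the old one stopped, which is why switching tracks alone did not give a result comparable to NGA.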

The following figure shows the current process of launching a skill with media playback. Most of the processing takes place on the server. The smart speaker itself serves only as an interface and playback system.

Flowchart of a skill with media playback

Nevertheless, we are sure that NGA on smart speakers could provide a multitude of interesting features, and we look forward to seeing further development of NGA for smart speakers from the manufacturers, or enhanced toolkits for third parties, soon.
