A prompt command to combine the output and input into a single dialplan
application invocation (MRCPRecog() for native file playback,
SynthAndRecog() for TTS). This avoids the problem of multiple dialplan
applications blocking one another, but introduces a fresh one: these
applications terminate output as soon as recognition completes (or earlier
if barge-in is enabled). There is no opportunity to inject logic to filter
the recognition result prior to terminating the output, nor do I think this
would make sense.
The Asterisk Speech API (SpeechLoadGrammar(), SpeechActivateGrammar(),
SpeechStart(), SpeechBackground(), etc.). If SpeechBackground() really ran
in the background, this would be the obvious solution, but unfortunately it
does not. SpeechBackground()
actually sits in a loop, directing audio frames to the recognizer while
simultaneously rendering frames of audio (the first option is a file path).
The app does not return until recognition has completed, so it cannot be
combined with Playback(). Upon recognition completion, the output will be
terminated, regardless of the recognition result, so this suffers the same
problem as Rayo Prompt. It is also not possible to use any other output
renderer, such as a TTS engine via MRCP.
Can we implement Asterisk/LumenVox CPA in a way that is compatible with the
adhearsion-cpa controller methods API?
The problems stated above leave us with only one option: extra capability
must be introduced to Asterisk in order to handle simultaneous dialplan
applications, or to introduce a true async version of SpeechBackground().
The viability of this is something that must be discussed with the Asterisk
project / Digium. Note that FreeSWITCH already has this capability, but
would also need less invasive changes to cope with LumenVox CPA as stated
above; a far more approachable task.
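To make the missing capability concrete, here is a rough sketch of the behaviour being asked for: output and recognition run as independent operations, and the application decides which recognition result, if any, should stop the output. This is a plain asyncio simulation, not Asterisk or Adhearsion code. Every name in it (play_prompt, recognize, is_acceptable, prompt_with_filter) is hypothetical, and the "frames" and "utterances" are stand-ins for real media and recognizer results.

```python
import asyncio


async def play_prompt(stop: asyncio.Event, log: list) -> None:
    """Hypothetical output renderer: emits frames until told to stop."""
    frame = 0
    while not stop.is_set():
        log.append(f"frame {frame}")
        frame += 1
        await asyncio.sleep(0.01)


async def recognize(results: asyncio.Queue, utterances: list) -> None:
    """Hypothetical recognizer: delivers results asynchronously."""
    for text in utterances:
        await asyncio.sleep(0.03)
        await results.put(text)
    await results.put(None)  # sentinel: no more results


def is_acceptable(result: str) -> bool:
    """Application-level filter (an assumption: only yes/no counts)."""
    return result in {"yes", "no"}


async def prompt_with_filter(utterances: list):
    stop = asyncio.Event()
    results: asyncio.Queue = asyncio.Queue()
    log: list = []

    # Output and recognition run side by side; neither blocks the other.
    player = asyncio.create_task(play_prompt(stop, log))
    recognizer = asyncio.create_task(recognize(results, utterances))

    accepted = None
    while True:
        result = await results.get()
        if result is None:
            break
        if is_acceptable(result):
            accepted = result
            break
        # Rejected result: the output keeps playing.
        log.append(f"ignored {result!r}")

    # Only now is the output terminated.
    stop.set()
    await player
    recognizer.cancel()
    await asyncio.gather(recognizer, return_exceptions=True)
    return accepted, log


if __name__ == "__main__":
    accepted, log = asyncio.run(prompt_with_filter(["uh", "yes"]))
    print(accepted, len(log))
```

The point of the sketch is only the ordering: the rejected result does not stop playback, whereas MRCPRecog(), SynthAndRecog() and SpeechBackground() as described above all terminate output on the first result.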
A few thoughts here:
(1) I'm not sure that introducing a dialplan variant of
SpeechBackground that had some asynchronous capabilities will buy
much. At the end of the day, you're still stuck in the dialplan -
which has a synchronous model of operation. To do everything that you
need, you need:
(a) Asynchronous results from the speech engine
(b) Asynchronous capabilities to control media operations
(c) Asynchronous capabilities to control the speech recognition
While (b) does exist in the previously mentioned AMI action, we're now
once again requiring a combination of AGI/dialplan + AMI - which is
clunky. It's the reason why we wrote ARI in the first place!
(2) The good news is, the speech API in Asterisk is not synchronous.
The current APIs that expose it certainly are, but there is no
inherent long-running blocking operation involved with
ast_speech_write (or any of the other C API functions involved in
res_speech). Building an asynchronous function that emits events
(similar to TALK_DETECT) or adding this as an explicit operation to an
ARI resource is not a very hard task. In fact, using audiohooks is a
fairly painless way of passing audio frames from a channel (regardless
of where they are) into ast_speech_write, and would be a simple way of
passing media into the speech engine in an asynchronous fashion.
(3) I think it'd be nice if this was a native operation in ARI. Unlike
TALK_DETECT - which is a relatively simple on/off use case - there's a
lot of subtlety to speech recognition. Some of the existing operations
(such as engine creation/enabling) could probably be hidden under an
operation on a channel resource, but the ability to activate certain
grammars while speech recognition is enabled on a channel would
certainly be nice. I'd imagine this would be somewhat similar to the
/play operation, where what you are handed back is a resource that has
some additional properties that can be manipulated independently.
Something like:
POST /channels/{id}/recognizeSpeech?speechId=12345&default_grammar=yes_no
A speech resource (maybe a different name? We typically use a plural
form for this - speechInstances?) could be used to manipulate an
active speech recognition process on a channel:
DELETE /speech/12345/ (stop speech recognition)
POST /speech/12345/grammar?name=moar_grammars
POST /speech/12345/parameter?name=engine_specific_property&value=foobar
Or other things along those lines.
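For discussion purposes, the operations above could be driven from a client roughly like the sketch below. To be clear, none of these endpoints exist in ARI today. The paths and query parameters are copied from the proposal above, the helper names are made up for illustration, and the functions only build (method, path) pairs rather than talking to a server.

```python
from urllib.parse import quote, urlencode


def start_recognition(channel_id: str, speech_id: str, default_grammar: str):
    """Proposed: POST /channels/{id}/recognizeSpeech (hypothetical endpoint)."""
    query = urlencode({"speechId": speech_id, "default_grammar": default_grammar})
    return ("POST", f"/channels/{quote(channel_id, safe='')}/recognizeSpeech?{query}")


def stop_recognition(speech_id: str):
    """Proposed: DELETE /speech/{speechId} (hypothetical endpoint)."""
    return ("DELETE", f"/speech/{quote(speech_id, safe='')}")


def activate_grammar(speech_id: str, name: str):
    """Proposed: POST /speech/{speechId}/grammar (hypothetical endpoint)."""
    query = urlencode({"name": name})
    return ("POST", f"/speech/{quote(speech_id, safe='')}/grammar?{query}")


def set_parameter(speech_id: str, name: str, value: str):
    """Proposed: POST /speech/{speechId}/parameter (hypothetical endpoint)."""
    query = urlencode({"name": name, "value": value})
    return ("POST", f"/speech/{quote(speech_id, safe='')}/parameter?{query}")


if __name__ == "__main__":
    # Walk through the lifecycle sketched in the proposal.
    for method, path in (
        start_recognition("1234.5", "12345", "yes_no"),
        activate_grammar("12345", "moar_grammars"),
        set_parameter("12345", "engine_specific_property", "foobar"),
        stop_recognition("12345"),
    ):
        print(method, path)
```

As with /play, the client picks the speechId up front, so it can address the speech resource independently of the channel for as long as recognition is active.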