New Research by Amazon to Improve Artificial Intelligence
In April, Amazon published two papers describing two different systems. One presents a novel approach to automatic speech recognition; the other improves dialogue state tracking using machine reading comprehension, achieving state-of-the-art performance compared with existing models. The papers claim to improve the accuracy of artificial intelligence in both business and consumer settings while reducing the amount of data needed for training.
Paper 1: Automatic Speech Recognition
Speech recognition models generally require large volumes of transcribed audio data to perform well. Semi-supervised learning methods ease this requirement: a smaller labeled set is used to train an initial seed model, which is then applied to a much larger amount of unlabeled data to generate hypotheses. The unlabeled examples with the most reliable hypotheses are then added back to the training set, and the system is retrained.
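The generic self-training loop described above can be sketched as follows. A toy majority-vote "classifier" stands in for a real ASR model here; all names and the confidence threshold are illustrative, not from the Amazon paper.

```python
# Self-training sketch: train a seed model on labeled data, pseudo-label
# unlabeled data, keep only confident hypotheses, and retrain.

def train(labeled):
    """Toy seed model: remember the most common label in the data."""
    labels = [y for _, y in labeled]
    majority = max(set(labels), key=labels.count)
    return lambda x: (majority, 0.9)  # returns (hypothesis, confidence)

def self_train(labeled, unlabeled, threshold=0.8, rounds=2):
    for _ in range(rounds):
        model = train(labeled)
        # Keep only the most reliable hypotheses as pseudo-labels.
        pseudo = [(x, model(x)[0]) for x in unlabeled
                  if model(x)[1] >= threshold]
        labeled = labeled + pseudo
    return train(labeled)

model = self_train([("a", 0), ("b", 0), ("c", 1)], ["d", "e"])
print(model("f")[0])  # prints the majority label of the augmented set: 0
```

In a real system the seed model would be a full acoustic model and the confidence score would come from the decoder, but the loop structure is the same.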
A team of Amazon researchers ventured further into this last step with a framework they refer to as deep contextualized acoustic representations (numerical sequences). The framework learns effective, context-aware acoustic representations from a large amount of unlabeled data and then applies them to speech recognition tasks with only a limited amount of labeled data. The model learns these representations using both past and future information, predicting slices of acoustic features during speech processing.
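The core idea, predicting a slice of acoustic frames from its past and future context, can be illustrated with a minimal linear stand-in for the paper's neural architecture. The frame counts, feature dimension, and least-squares predictor below are all illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Predict a slice of K frames from the frame just before it (past context)
# and the frame just after it (future context), via least squares.
rng = np.random.default_rng(0)
T, D, K = 200, 8, 2            # frames, feature dim, slice length to predict

frames = rng.normal(size=(T, D))

# Build (context, target) pairs:
#   context = [frame t-1 ; frame t+K], target = frames t .. t+K-1 flattened.
X = np.stack([np.concatenate([frames[t - 1], frames[t + K]])
              for t in range(1, T - K)])
Y = np.stack([frames[t:t + K].ravel() for t in range(1, T - K)])

# Fit W minimizing ||XW - Y||^2; a neural encoder plays this role in practice.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
pred = X @ W
print(pred.shape)  # one K*D-dimensional prediction per context window
```

A trained encoder's internal activations, rather than the predictions themselves, would then serve as the context-aware representations fed to the downstream recognizer.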
The Amazon team ran a series of experiments on the open-source LibriSpeech data set and the popular Wall Street Journal corpus. They used between 100 and 960 hours of LibriSpeech and 81 hours of labeled speech from the Wall Street Journal for model training. The models surpassed all baselines on LibriSpeech, while on the Wall Street Journal corpus they achieved a relative improvement of up to 42%.
Paper 2: Dialogue Tracking and Machine Reading Comprehension
When developing an artificial intelligence assistant like Alexa, so that it understands requests and completes tasks, developers need a robust dialogue state tracking system that tracks the state of the dialogue across back-and-forth conversations. The state is described as a pair of variables: a slot and a slot value. This defines how speech and entity data are identified and managed.
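Concretely, a dialogue state is just a collection of (slot, slot value) pairs that is revised turn by turn. In this sketch the slot names loosely follow MultiWOZ-style `domain-slot` naming, but the dialogue itself is made up.

```python
# Illustrative dialogue state for a hotel-booking conversation: the state
# maps each slot to the latest value the user has provided for it.

def update_state(state, slot, value):
    """Return a new state with this slot set to the latest value."""
    new_state = dict(state)
    new_state[slot] = value
    return new_state

state = {}
# User: "I need a cheap hotel in the centre."
state = update_state(state, "hotel-pricerange", "cheap")
state = update_state(state, "hotel-area", "centre")
# User: "Actually, make that moderate."
state = update_state(state, "hotel-pricerange", "moderate")

print(state)  # {'hotel-pricerange': 'moderate', 'hotel-area': 'centre'}
```

The tracker's job is to produce exactly this mapping after every user turn, which is what the accuracy metrics later in the article measure.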
Since only a few training data sets are available for dialogue state tracking, a team of Alexa researchers turned to machine reading comprehension to counter this issue. Reading comprehension is concerned with the general understanding of text regardless of its format. Framed this way, dialogue state tracking can draw on the ample reading comprehension data already available, which was not possible earlier, when dialogue trackers relied solely on contextual understanding of requests and derived the state from the conversation itself.
The Amazon team constructed a question for each slot in the dialogue state and divided the slots into two types, depending on the number of slot values in the ontology: categorical and extractive. They then built two machine reading comprehension models for dialogue state tracking: one using multiple-choice reading comprehension, where an answer is chosen from a limited number of options (for categorical slots), and another using span-based reading comprehension, where the answer is found as a span in the conversation (for extractive slots).
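The two strategies can be contrasted in a toy sketch: a multiple-choice reader picks a value from the ontology's candidate list, while a span-based reader locates the answer inside the conversation text. The keyword-matching "models" below are placeholders for the paper's trained neural readers; every function and cue word here is an illustrative assumption.

```python
# Categorical slot: choose among a fixed set of options (multiple choice).
def answer_categorical(question, context, options):
    """Pick the candidate value that appears in the conversation."""
    for opt in options:
        if opt in context:
            return opt
    return "none"

# Extractive slot: return a span taken directly from the conversation.
def answer_extractive(question, context, cue):
    """Return the word following a cue word, as a stand-in for span prediction."""
    words = context.split()
    if cue in words:
        return words[words.index(cue) + 1]
    return "none"

context = "user: book a cheap hotel near Cambridge station"
print(answer_categorical("What is the hotel price range?", context,
                         ["cheap", "moderate", "expensive"]))  # cheap
print(answer_extractive("Where is the hotel?", context, "near"))  # Cambridge
```

The design choice mirrors the paper's split: categorical slots have small, closed value sets, so classification over options suffices, whereas extractive slots (names, times, places) take open-ended values that must be copied out of the dialogue.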
The Amazon team used MultiWOZ (an open-source dialogue corpus containing 10,000 dialogues with annotated states across 7 distinct domains) for evaluation and fine-tuning of the slots. The categorical dialogue state tracking model was trained on question-answering data sets (DREAM and RACE), and the extractive dialogue state tracking model was trained on the broader MRQA data set. The researchers used 5 domains in total: "attraction," "restaurant," "taxi," "train," and "hotel." The team attained 45.91% joint goal accuracy using only around 1% (20-30 dialogues) of the "hotel" domain data, and 90% average slot accuracy in 12 out of 30 slots in MultiWOZ in a zero-shot scenario.