- credit requirements
- communication domains
- single/closed-domain
- multi-domain
- open-domain
- application areas
- phone
- apps
- smart speakers
- appliances
- cars
- web
- embodied (robots)
- virtual characters
- modes of communication
- text
- voice
- multimodal – video, facial expressions, touch, …
- dialogue initiative
- system-initiative
- user-initiative
- mixed-initiative
- traditional architecture
- main loop
- voice → text → meaning → reaction → text → voice
- components
- speech recognition
- language understanding
- dialogue management
- has access to backend (in order to perform tasks)
- language generation
- speech synthesis
- multimodal system would have additional components
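
The main loop above can be sketched as a chain of functions, each consuming the output of the previous one. This is only a minimal illustration: the five component functions, their return formats, and the example address are hypothetical stand-ins, not real ASR/NLU/DM/NLG/TTS implementations.

```python
from typing import Any, Callable, Sequence

Component = Callable[[Any], Any]


def run_turn(user_audio: Any, components: Sequence[Component]) -> Any:
    """Pass one user turn through the components in order."""
    data = user_audio
    for component in components:
        data = component(data)
    return data


# Hypothetical stand-ins; real components would wrap trained models.
def asr(audio):
    # voice -> text (here the "audio" already carries its transcript)
    return audio["transcript"]


def nlu(text):
    # text -> meaning
    intent = "request_address" if "address" in text.lower() else "unknown"
    return {"intent": intent}


def dm(meaning):
    # meaning -> reaction (a real DM would query the backend here)
    if meaning["intent"] == "request_address":
        return {"act": "inform", "address": "1 Main St"}
    return {"act": "ask_repeat"}


def nlg(action):
    # reaction -> text
    if action["act"] == "inform":
        return f"The address is {action['address']}."
    return "Sorry, could you repeat that?"


def tts(text):
    # text -> voice (placeholder for a waveform)
    return {"waveform_for": text}


PIPELINE = [asr, nlu, dm, nlg, tts]
```

Calling `run_turn({"transcript": "What is the address?"}, PIPELINE)` walks through all five stages; a multimodal system would simply add more stages or parallel inputs.
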
- automatic speech recognition (ASR)
- converting speech signal into text
- typically produces several possible hypotheses with confidence scores
- n-best list
- lattice
- confusion network
- very good in ideal conditions
- problems: noise, accents, distance, channel (phone), …
- voice activity detection
- is the user talking to the system?
- wake words ("OK, Google")
- ASR is usually implemented using neural networks
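
As an illustration of an n-best list with confidence scores, the sketch below turns raw log-scores into normalized confidences and rejects the top hypothesis if it is not confident enough. The log-score input format, the softmax normalization, and the `min_confidence` threshold are assumptions made for this example; real recognizers expose confidences in their own ways.

```python
import math


def normalize_nbest(hyps):
    """Convert (text, log_score) pairs into (text, confidence) pairs.

    Confidences are a softmax over the list, sorted from best to worst.
    """
    best = max(score for _, score in hyps)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [(text, math.exp(score - best)) for text, score in hyps]
    total = sum(weight for _, weight in weights)
    ranked = [(text, weight / total) for text, weight in weights]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def top_hypothesis(hyps, min_confidence=0.6):
    """Return the best hypothesis text, or None if it is too uncertain."""
    text, confidence = normalize_nbest(hyps)[0]
    return text if confidence >= min_confidence else None
```

When `top_hypothesis` returns `None`, a system could ask the user to repeat the utterance, or pass the whole n-best list on to language understanding instead of committing to one transcript.
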
- natural/spoken language understanding (NLU/SLU)
- extracting the meaning from the user utterance
- converting into a structured semantic representation
- dialogue acts
- act type/intent (inform, request, confirm)
- slot/attribute
- value
- examples
- inform(food=Chinese, price=cheap)
- request(address)
- can be more complex (using syntax trees, predicate logic)
- specific steps
- named entity recognition
- coreference resolution
- implementation varies
- handcrafting often works for limited domains
- keyword spotting, regular expressions, handcrafted grammars
- machine learning approaches
- can also provide n-best outputs
- problems
- recovering from bad ASR
- ambiguities – e.g., "next Friday" said on a Tuesday (this week's Friday or the one after?)
- variation – there are many ways to express the same thing
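
A handcrafted NLU based on keyword spotting and regular expressions can be sketched as follows. The slot names, value lists, and patterns form a hypothetical toy ontology for a restaurant domain; values are lowercased as a simple normalization step.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DialogueAct:
    intent: str
    slots: dict = field(default_factory=dict)

    def __str__(self):
        # Slots without a value (as in request acts) print as the bare name.
        args = ", ".join(
            name if value is None else f"{name}={value}"
            for name, value in self.slots.items()
        )
        return f"{self.intent}({args})"


# Hypothetical toy ontology: slot -> pattern matching its known values.
INFORM_PATTERNS = {
    "food": re.compile(r"\b(chinese|italian|indian)\b", re.IGNORECASE),
    "price": re.compile(r"\b(cheap|moderate|expensive)\b", re.IGNORECASE),
}
# Keywords signalling that the user asks for a slot's value.
REQUEST_PATTERNS = {
    "address": re.compile(r"\baddress\b", re.IGNORECASE),
    "phone": re.compile(r"\bphone\b", re.IGNORECASE),
}


def parse(utterance):
    """Return the list of dialogue acts spotted in the utterance."""
    acts = []
    informed = {}
    for slot, pattern in INFORM_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            informed[slot] = match.group(1).lower()
    if informed:
        acts.append(DialogueAct("inform", informed))
    for slot, pattern in REQUEST_PATTERNS.items():
        if pattern.search(utterance):
            acts.append(DialogueAct("request", {slot: None}))
    return acts
```

For instance, `parse("I want a cheap Chinese place")` yields a single act printed as `inform(food=chinese, price=cheap)`. Such rules are easy to write for a limited domain, but they do not cope with the variation and ambiguity problems listed above.
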
- dialogue manager (DM)
- keeps track of the dialogue history, modeled as a dialogue state
- handcrafted vs. probabilistic
- handcrafted … just replace the value in the slot with the last-mentioned one
- probabilistic … keep a probability estimate over possible slot values (belief state)
- system actions described by dialogue policy
- decision on next system action, given dialogue state
- involves backend queries
- result represented as system dialogue act
- handcrafted
- if-then-else clauses
- flowcharts
- machine learning
- often trained with reinforcement learning
- POMDP (partially observable Markov decision process)
- recurrent neural networks
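
The handcrafted variants of both DM parts can be sketched together: a state tracker that overwrites each slot with the last-mentioned value, and an if-then-else policy that queries a backend. The restaurant data, the act format `(act_type, slots)`, and the slot names are hypothetical and chosen only for this example.

```python
# Hypothetical toy backend.
RESTAURANTS = [
    {"name": "Golden Dragon", "food": "chinese", "price": "cheap",
     "address": "1 Main St"},
    {"name": "Luigi's", "food": "italian", "price": "expensive",
     "address": "5 Park Ave"},
]


def update_state(state, user_act):
    """Handcrafted tracker: the last-mentioned value overwrites the slot."""
    act_type, slots = user_act
    # Requests only apply to the current turn, so they are reset each time.
    new_state = {"slots": dict(state["slots"]), "requested": []}
    if act_type == "inform":
        new_state["slots"].update(slots)
    elif act_type == "request":
        new_state["requested"] = list(slots)
    return new_state


def query_backend(constraints):
    """Return all restaurants matching every constraint."""
    return [
        r for r in RESTAURANTS
        if all(r.get(slot) == value for slot, value in constraints.items())
    ]


def policy(state):
    """Handcrafted policy: choose the next system dialogue act."""
    slots = state["slots"]
    # Ask for missing constraints first.
    for slot in ("food", "price"):
        if slot not in slots:
            return ("request", {slot: None})
    matches = query_backend(slots)
    if not matches:
        return ("nomatch", dict(slots))
    best = matches[0]
    answer = {"name": best["name"]}
    for slot in state["requested"]:
        answer[slot] = best.get(slot)
    return ("inform", answer)
```

Starting from `{"slots": {}, "requested": []}`, each user act is passed through `update_state` and the resulting state through `policy`. A machine-learned policy would replace the fixed rules in `policy` with a function learned from data, for example via reinforcement learning.
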
- natural language generation (NLG)
- how to express things might depend on context
- goals: fluency, naturalness, avoid repetition, …
- traditional approach: templates
- fill in values into predefined templates (sentence skeletons)
- works well for limited domains
- grammar-based approaches
- grammar/semantic structures
- syntactic transformation rules are applied
- statistical approaches
- most prominent: transformer neural networks
- generating word-by-word
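
Template-based generation can be sketched as a lookup keyed by the act type and the set of slots, followed by filling in the values. The templates and the act format below are hypothetical examples for a restaurant domain.

```python
# Hypothetical templates: (act type, sorted slot names) -> sentence skeleton.
TEMPLATES = {
    ("inform", ("name",)): "{name} matches your request.",
    ("inform", ("address", "name")): "{name} is located at {address}.",
    ("request", ("food",)): "What kind of food would you like?",
    ("request", ("price",)): "What price range are you looking for?",
}


def generate(act_type, slots):
    """Fill the slot values into the template selected by the dialogue act."""
    key = (act_type, tuple(sorted(slots)))
    template = TEMPLATES.get(key)
    if template is None:
        raise KeyError(f"no template for {key}")
    return template.format(**slots)
```

For example, `generate("inform", {"name": "Golden Dragon", "address": "1 Main St"})` fills the second template. Every combination of act type and slots needs its own template, which is why this approach suits limited domains.
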
- speech synthesis
- standard pipeline: text normalization, pronunciation analysis, intonation/stress generation, waveform synthesis
- TTS methods
- formant-based – phoneme-specific frequencies, rules
- concatenative – record a single person, cut into phoneme transitions
- hidden Markov models
- neural networks
- no need for phoneme conversion, can go directly from text
- text to spectrograms → vocoder (spectrogram to audio)
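
The first step of the standard pipeline, text normalization, can be illustrated with a toy normalizer that expands a few abbreviations and spells out small numbers. The abbreviation table and the 0 to 99 limit are assumptions made for this sketch; real normalizers also handle dates, currencies, ordinals, and context-dependent abbreviations.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical abbreviation table (context is ignored in this toy version).
ABBREVIATIONS = {"St.": "Street", "Ave.": "Avenue"}


def number_to_words(n):
    """Spell out an integer between 0 and 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand known abbreviations and spell out standalone numbers below 100."""
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)

    def spell(match):
        value = int(match.group())
        # Larger numbers are left untouched in this sketch.
        return number_to_words(value) if value < 100 else match.group()

    return re.sub(r"\b\d+\b", spell, text)
```

The normalized text would then go on to pronunciation analysis; neural TTS systems may learn part of this step implicitly when going directly from text.
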
- organizing the components
- basic – pipeline
- components oblivious of each other
- interconnected
- read/write changes to dialogue state
- more reactive but more complex
- joining the modules
- NLU + state tracking
- NLU & DM & NLG – using LLMs, may be end-to-end (without module separation)
- audio-based end-to-end (audio-to-audio)
- research areas
- LLM-based systems
- dialogue flows from data – finding patterns in human dialogue recordings/transcripts
- multimodality – adding video (input/output)
- context dependency – understand/reply in context (grounding, speaker adaptation)
- incrementality – don't wait for the whole sentence to start processing