A Unified Acoustic-to-Speech-to-Language Embedding Space Captures the Neural Basis of Natural Language Processing in Everyday Conversations

Language processing in the brain presents a challenge due to its inherently complex, multidimensional, and context-dependent nature. Psycholinguists have attempted to construct well-defined symbolic features and processes for individual domains, such as phonemes for speech analysis and part-of-speech units for syntactic structure. Despite acknowledging some cross-domain interactions, research has focused on modeling each linguistic subfield in isolation through controlled experimental manipulations. This divide-and-conquer strategy is showing its limitations, as a significant gap has emerged between natural, real-world language processing and formal psycholinguistic theories: these models and theories struggle to capture the subtle, non-linear, context-dependent interactions that occur within and across levels of linguistic analysis.
Recent advances in LLMs have dramatically improved conversational language processing, summarization, and generation. These models excel at handling the syntactic, semantic, and pragmatic properties of written text and at recognizing speech from acoustic recordings. Multimodal, end-to-end models represent a significant theoretical advance over text-only models by providing a unified framework for transforming continuous auditory input into speech- and word-level linguistic dimensions during natural conversations. Unlike traditional approaches, these deep acoustic-to-speech-to-language models represent every element of speech and language as continuous vectors distributed across a population of simple computing units, with the resulting multidimensional embedding space learned by optimizing straightforward objectives.
Researchers from Hebrew University, Google Research, Princeton University, Maastricht University, Massachusetts General Hospital and Harvard Medical School, New York University School of Medicine, and Harvard University have presented a unified computational framework that connects acoustic, speech, and word-level linguistic structures to investigate the neural basis of everyday conversations in the human brain. They used electrocorticography to record neural signals across 100 hours of natural speech production and comprehension as participants engaged in open-ended, real-life conversations. The team extracted three types of embeddings, low-level acoustic, mid-level speech, and contextual word embeddings, from a multimodal speech-to-text model called Whisper. Their model predicts neural activity at each level of the language processing hierarchy across hours of previously unseen conversations.
The internal workings of the Whisper acoustic-to-speech-to-language model are examined to model and predict neural activity during daily conversations. Three types of embeddings are extracted from the model for every word patients speak or hear: acoustic embeddings from the auditory input layer, speech embeddings from the final speech encoder layer, and language embeddings from the decoder’s final layers. For each embedding type, electrode-wise encoding models are constructed to map the embeddings to neural activity during speech production and comprehension. The encoding models show a remarkable alignment between human brain activity and the model’s internal population code, accurately predicting neural responses across hundreds of thousands of words in conversational data.
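As a concrete illustration, the snippet below sketches how these three embedding types could be read out of an openly available Whisper checkpoint with the Hugging Face transformers library. The checkpoint (openai/whisper-tiny.en), the pooling over time frames, and the use of the log-mel input features as the acoustic representation are assumptions made for the sketch, not the authors' exact configuration.

```python
# Sketch: pulling acoustic, speech, and language embeddings from Whisper for
# one audio segment. Requires `transformers` and `librosa`; the checkpoint,
# pooling, and layer choices are illustrative, not the paper's exact setup.
import torch
import librosa
from transformers import WhisperProcessor, WhisperModel

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperModel.from_pretrained("openai/whisper-tiny.en")
model.eval()

# Load a short audio snippet around a single word (16 kHz mono).
audio, sr = librosa.load("word_segment.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# The transcribed text of the segment supplies the decoder input tokens.
token_ids = processor.tokenizer("example word", return_tensors="pt").input_ids

with torch.no_grad():
    out = model(
        input_features=inputs.input_features,
        decoder_input_ids=token_ids,
        output_hidden_states=True,
    )

# 1) Acoustic embedding: the model's auditory input representation
#    (log-mel spectrogram features), averaged over time frames.
acoustic_emb = inputs.input_features.mean(dim=-1).squeeze(0)

# 2) Speech embedding: final layer of the speech encoder, pooled over frames.
speech_emb = out.encoder_last_hidden_state.mean(dim=1).squeeze(0)

# 3) Language embedding: final decoder layer state at the word's last token.
language_emb = out.decoder_hidden_states[-1][0, -1]

print(acoustic_emb.shape, speech_emb.shape, language_emb.shape)
```

Per-word vectors extracted this way can then serve as regressors for the electrode-wise encoding models; a sketch of that step follows the next paragraph.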
The Whisper model’s acoustic, speech, and language embeddings show exceptional predictive accuracy for neural activity across hundreds of thousands of words during speech production and comprehension throughout the cortical language network. During speech production, hierarchical processing is observed: articulatory areas (preCG, postCG, STG) are better predicted by speech embeddings, while higher-level language areas (IFG, pMTG, AG) align with language embeddings. The encoding models also show temporal specificity, with performance peaking more than 300 ms before word onset during production and 300 ms after onset during comprehension; speech embeddings better predict activity in perceptual and articulatory areas, whereas language embeddings excel in high-order language areas.
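The electrode-wise encoding analysis behind these results can be approximated with ordinary ridge regression. The sketch below maps word-aligned embeddings to neural activity at a grid of lags around word onset and scores each electrode by the correlation between predicted and held-out responses; the lag grid, windowing, variable names, and cross-validation scheme are illustrative assumptions rather than the authors' exact pipeline.

```python
# Sketch: electrode-wise encoding models mapping word-level embeddings to
# neural activity at lags around word onset. Uses scikit-learn; the lag grid
# and cross-validation scheme are assumptions for illustration.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def encoding_performance(embeddings, neural, lags_ms, sfreq, onsets, n_folds=5):
    """embeddings: (n_words, n_dims) word-aligned acoustic/speech/language vectors.
    neural: (n_samples, n_electrodes) neural activity for one conversation.
    onsets: (n_words,) word-onset sample indices.
    Returns (n_lags, n_electrodes) correlations between predicted and observed activity."""
    n_electrodes = neural.shape[1]
    scores = np.zeros((len(lags_ms), n_electrodes))

    for li, lag in enumerate(lags_ms):
        shift = int(round(lag / 1000.0 * sfreq))
        idx = np.clip(onsets + shift, 0, neural.shape[0] - 1)
        y = neural[idx]                      # activity at this lag, per word

        preds = np.zeros_like(y)
        for train, test in KFold(n_folds, shuffle=False).split(embeddings):
            model = RidgeCV(alphas=np.logspace(0, 6, 10))
            model.fit(embeddings[train], y[train])
            preds[test] = model.predict(embeddings[test])

        # Correlate predicted and observed activity per electrode.
        for e in range(n_electrodes):
            scores[li, e] = np.corrcoef(preds[:, e], y[:, e])[0, 1]
    return scores

# Example usage: lags from -2000 ms to +2000 ms around word onset.
# lags = np.arange(-2000, 2001, 200)
# scores = encoding_performance(speech_embs, ecog, lags, sfreq=512, onsets=word_onsets)
```

Comparing the resulting lag profiles across embedding types and electrodes is what yields the production and comprehension timing differences described above.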
In summary, the acoustic-to-speech-to-language model offers a unified computational framework for investigating the neural basis of natural language processing. This integrated approach marks a paradigm shift toward non-symbolic models based on statistical learning and high-dimensional embedding spaces. As these models evolve to process natural speech better, their alignment with cognitive processes may improve in step. Some advanced models, such as GPT-4o, incorporate a visual modality alongside speech and text, while others integrate embodied articulation systems that mimic human speech production. The rapid improvement of these models supports a shift to a unified linguistic paradigm that emphasizes the role of usage-based statistical learning in language acquisition as it unfolds in real-life contexts.
Check out the Paper and the Google Blog. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.