Audio Neuro Processing Session

Organizer:

Daniel Ben Dayan Rubin

Session Recording

Part I

Part II

Part I - Keyword Spotting, ASR, and Cochlea Models

Presentations 8:00-8:45

 

Intel Wake on Voice – methods and challenges

Kuba Łopatka – CIG, Intel Corp. Poland (IGK)

Intel Wake on Voice is a low-power keyword spotting IP optimized for the Intel DSP environment. The IP has been successfully shipped to more than 400 client platforms. The main remaining challenges are acoustic robustness, adaptability to new keywords and accents, and compute and memory constraints. In this talk we introduce the methods employed in Wake on Voice and discuss future development directions.

Keyword Spotting with Neuromorphic Computing in the Automotive Context

Gerrit Ecke, Mercedes-Benz AG, Group Research, Future Technologies, Böblingen, Germany

At Mercedes-Benz, we evaluate how neuromorphic computing enables future AI applications in the car. We see the prospect of edge applications with reduced energy consumption as an important aspect of sustainable AI. One of our projects focuses on keyword spotting, a use case already available in our cars. The goal of the project is to assess the implementation of inference tasks with respect to energy consumption, accuracy, latency, and network scale.

Speech-to-Spikes and Deep SNN Architectures for Speech Command Recognition

Timothy Shea – Accenture Research Lab, San Francisco

Improved voice controls for smart products are among the most promising applications for neuromorphic technology. Neuromorphic solutions for speech command recognition offer significant user benefits over the typical approach today, which relies on uploading segmented audio recordings to cloud-based AI services. As such, researchers have proposed a variety of methods and models to achieve robust command recognition within the constraints of neuromorphic hardware. In a recent project, Accenture Labs applied and evaluated several of these approaches in the process of building an automotive speech command prototype. In this brief talk, I'll discuss two facets of that project: a comparison of audio spike conversion methods, and an exploration of deep spiking neural network architectures. In both of these areas, I'll describe some of the tradeoffs in terms of development complexity and performance on Intel Loihi.
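
The talk compares several audio spike conversion methods; as a rough illustration of what one such pipeline can look like (the filterbank, thresholds, and parameter values below are placeholders, not the methods evaluated in the project), here is a minimal Python sketch:

```python
import numpy as np

def band_energies(audio, n_fft=512, hop=160, n_bands=32):
    """Short-time band energies via a crude uniform filterbank (illustrative only)."""
    frames = [audio[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(audio) - n_fft, hop)]
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    # Group FFT bins into n_bands by simple averaging (stand-in for a real mel filterbank)
    edges = np.linspace(0, spectra.shape[1], n_bands + 1, dtype=int)
    return np.stack([spectra[:, a:b].mean(axis=1)
                     for a, b in zip(edges[:-1], edges[1:])], axis=1)

def delta_encode(energies, threshold=0.05):
    """Emit a +1/-1 spike whenever a channel's log-energy changes by more than `threshold`."""
    log_e = np.log(energies + 1e-8)
    ref = log_e[0].copy()
    spikes = np.zeros_like(log_e, dtype=np.int8)
    for t in range(1, len(log_e)):
        diff = log_e[t] - ref
        up, down = diff > threshold, diff < -threshold
        spikes[t, up], spikes[t, down] = 1, -1
        ref[up | down] = log_e[t, up | down]   # reference moves only on spiking channels
    return spikes

# Toy usage on one second of synthetic audio (values are placeholders).
audio = np.random.default_rng(0).standard_normal(16000)
spikes = delta_encode(band_energies(audio))
print(spikes.shape, np.count_nonzero(spikes))
```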

Keyword Spotting on Loihi using the Time Difference Encoder

Lyes Khacef, Elisabetta Chicca, Groningen Cognitive Systems and Materials Center, U. of Groningen, The Netherlands

We are able to capture sound and spot a given word in it thanks to a very important organ in our ear: the cochlea. It naturally decomposes sound into frequencies (like a Fourier transform!) and transmits their amplitudes to the auditory nerve. It is hypothesized that the delays between different frequencies encode the temporal pattern of the sound. In order to replicate this mechanism, we will use the recently proposed Time Difference Encoder (TDE). We first quantify the impact it has on class separability using information theory, then we use a Spiking Neural Network to assess the classification accuracy in keyword spotting. The system will first be simulated using Nengo and then implemented on Loihi to measure the gain in latency and energy efficiency at the edge compared to standard CPU/GPU approaches.
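
For readers unfamiliar with the TDE, below is a minimal functional sketch of the idea (an exponentially decaying facilitatory trace read out by trigger spikes); the actual neuron model and its Loihi mapping are described in the talk, and the time constant used here is only an illustrative assumption:

```python
import numpy as np

def tde_response(fac_times, trig_times, tau=0.005):
    """Toy Time Difference Encoder: a facilitatory spike charges a gain trace that
    decays with time constant `tau`; each trigger spike reads out the trace, so short
    facilitatory-to-trigger delays give large responses and long delays give small ones.
    """
    responses = []
    for t_trig in trig_times:
        earlier = [t for t in fac_times if t <= t_trig]
        if not earlier:
            responses.append(0.0)          # no facilitatory spike yet: no response
            continue
        dt = t_trig - max(earlier)         # delay since the last facilitatory spike
        responses.append(np.exp(-dt / tau))
    return np.array(responses)

# The same trigger spike evokes a larger response when it closely follows the
# facilitatory spike, i.e. when the inter-channel delay is small.
print(tde_response(fac_times=[0.010], trig_times=[0.011]))  # short delay -> ~0.82
print(tde_response(fac_times=[0.010], trig_times=[0.020]))  # long delay  -> ~0.14
```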

Spiking cochlea for edge audio applications

Shih-Chii Liu, INI, UNIZH|ETHZ, Switzerland

Spiking silicon cochleas, or Dynamic Audio Sensors (DAS), produce asynchronous frequency-specific channel spikes. This asynchronous sampling of active frequency channels leads to sparser sampling of frequency information compared to using a maximal sampling rate on the incoming sound input. In this talk, I will present the development of event-driven machine learning and spiking network algorithms that can be applied to the cochlea output for audio tasks such as sound source localization, voice activity detection, and keyword spotting. I will also describe a hardware system that combines the DAS with our recently proposed low-compute spike-inspired delta recurrent neural network, which is implemented on an FPGA platform and deployed for a real-time keyword spotting task. This system can be mapped to a low-power (µW) ASIC.
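
The delta recurrent network mentioned above saves compute by updating only on changes; as a hedged sketch of that core idea (not the exact delta-RNN formulation or the FPGA design from the talk, and with hypothetical layer sizes):

```python
import numpy as np

def delta_step(x_t, x_ref, m_prev, W, theta=0.1):
    """Keep a running pre-activation `m` and add W @ dx only for input channels whose
    change since their last transmitted value exceeds `theta`; compute then scales with
    the number of changed channels rather than the full input dimension.
    """
    dx = x_t - x_ref
    active = np.abs(dx) > theta
    x_ref = x_ref.copy()
    x_ref[active] = x_t[active]               # update stored reference only where a change was sent
    m = m_prev + W[:, active] @ dx[active]    # sparse incremental matrix-vector update
    return np.tanh(m), m, x_ref

# Toy usage: 64 cochlea-like channels feeding 32 hidden units.
rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64)) * 0.1
x_ref, m = np.zeros(64), np.zeros(32)
for _ in range(5):
    x = rng.standard_normal(64) * 0.05        # mostly sub-threshold changes -> few active channels
    h, m, x_ref = delta_step(x, x_ref, m, W)
```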

Speech recognition with sparse spike code

Dezhe Jin, Penn State U 

On-device speech recognition requires energy efficiency and noise robustness. We explore a speech recognition approach based on how the brain transforms auditory signals: we decompose auditory inputs into sparse spikes of feature detectors and decode speech through pattern recognition on the spike sequences. We are working on realizing the system on Loihi.
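
The abstract does not specify the decoder; purely as a toy illustration of pattern recognition on spike sequences (a nearest-template scheme with a coincidence window, not the authors' method, and with hypothetical templates):

```python
def spike_similarity(seq_a, seq_b, window=0.01):
    """Fraction of spikes in seq_a (list of (feature_id, time) tuples) that have a
    coincident spike of the same feature in seq_b within `window` seconds."""
    hits = sum(any(f_b == f_a and abs(t_b - t_a) <= window for f_b, t_b in seq_b)
               for f_a, t_a in seq_a)
    return hits / max(len(seq_a), 1)

def classify(test_seq, templates):
    """Pick the word whose stored spike template best matches the test sequence."""
    return max(templates, key=lambda word: spike_similarity(test_seq, templates[word]))

# Hypothetical toy templates: (feature detector id, spike time in seconds).
templates = {"yes": [(3, 0.01), (7, 0.05), (2, 0.12)],
             "no":  [(5, 0.02), (1, 0.08)]}
print(classify([(3, 0.012), (7, 0.048), (2, 0.118)], templates))  # -> yes
```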

Audio-Visual Scene Analysis with spike-based representations for SNNs: implementation for potential safety and monitoring in real-life situations

Soufiyan Bahadi, Eric Plourde and Jean Rouat, U. of Sherbrooke, Canada

We focus on a scene understanding scenario that needs audio inputs to remove ambiguity in the visual scenes. Audio and visual inputs need to be converted into spikes for use in SNNs. Our lab has developed automatic audio-visual scene analysis systems based on deep learning architectures, and we are porting them to neuromorphic architectures. We first study two strategies for spike-based audio representations, spikegrams and sparse representations (via LCA for audio), to be used with the original SECL-UMONS database. Then, we plan to evaluate the impact of using DAVIS cameras and DAS microphones on our SNN implementations for multimodal scene analysis. We also plan to modify our architectures to make them compatible with real-time stream processing.
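
As a brief reminder of how LCA produces a sparse audio code, a minimal sketch under generic assumptions follows; the dictionary, sparsity penalty, and step count are placeholders, not the settings used in this work:

```python
import numpy as np

def lca_sparse_code(x, Phi, lam=0.1, tau=10.0, n_steps=200):
    """Sparse code of a signal frame `x` over dictionary `Phi` via the Locally
    Competitive Algorithm: membrane potentials `u` integrate the feed-forward drive
    and lateral inhibition; soft-thresholding gives the sparse activations.
    """
    b = Phi.T @ x                               # feed-forward drive
    G = Phi.T @ Phi - np.eye(Phi.shape[1])      # lateral inhibition between similar atoms
    u = np.zeros(Phi.shape[1])
    soft = lambda v: np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
    for _ in range(n_steps):
        u += (b - u - G @ soft(u)) / tau
    return soft(u)

# Toy usage: a random, normalized dictionary of 64 atoms for 32-sample frames (hypothetical sizes).
rng = np.random.default_rng(0)
Phi = rng.standard_normal((32, 64))
Phi /= np.linalg.norm(Phi, axis=0)
a = lca_sparse_code(rng.standard_normal(32), Phi)
print(f"{np.count_nonzero(a)} of {a.size} coefficients active")
```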

Part II
Invited Talk: 8:45-9:30

Shihab Shamma, U. of Maryland

Learning Perception and Action through Sensorimotor Interactions

This talk will review the auditory cortical processing of complex sounds such as speech and music. We shall illustrate how neuromorphic representations facilitate performance of difficult tasks such as audio source segregation from mixtures, and how these are related to recent DNN algorithms that perform the same tasks. We then discuss how the brain learns to combine action and perception, as in learning how to speak or play a musical instrument. These processes suggest new DNN architectures for unsupervised learning modeled after the sensorimotor interactions in the brain.

Round Table 9:30-10:00

On the pros and cons of audio neuromorphic applications

This round table was moved to the following week.

– Participants are invited to post questions they would like to see discussed (to be updated):

  1. On the conversion of analog audio into spiking population signals: what are the best methods, limits, and promising directions?

  2. Which application domains profit most from SNN-based audio processing?

Please use the comment section on this page to ask questions or comment on the presentations in this session.