Alexa Junior: Real-Time Keyword Spotting on FPGA
Abstract
Summary
A 31-class CNN-based voice keyword classifier implemented entirely on the Xilinx Artix-7 FPGA using 8-bit fixed-point arithmetic — no cloud, no FPU, no compromise on latency.
Aim
To design and deploy a real-time, edge-native keyword detection system on an FPGA that classifies audio keywords locally within a 1-second recording window — eliminating cloud dependency and achieving sub-millisecond inference latency.
Introduction
Commercial voice assistants like Alexa and Siri offload all speech inference to the cloud, creating hard dependencies on network stability and introducing round-trip latency that is unacceptable for embedded, real-time applications. Alexa Junior tackles this problem head-on by implementing a Convolutional Neural Network (CNN)-based keyword spotter directly on an FPGA — specifically the Xilinx Artix-7 (Nexys 4 DDR) — with no floating-point unit, no external processor, and no internet connection.
The system classifies 31 keyword classes from the Google Speech Commands dataset and outputs results instantly via onboard LEDs, demonstrating that powerful AI inference is achievable at the extreme edge.
Literature Survey & Technologies Used
Key Challenges in Edge Audio Inference
Deploying neural networks on FPGAs involves navigating several fundamental constraints:
- Memory Constraints: The Nexys 4 DDR board provides only ~600 KB of BRAM. Standard CNN models can exceed 100 MB, demanding aggressive compression strategies.
- No Floating-Point Support: FPGAs lack native float/double arithmetic units, requiring all computations to be reformulated in fixed-point representation.
- Real-Time Requirements: Audio must be classified within the 1-second recording window, placing strict latency bounds on every pipeline stage.
- Compute Budget: DSP48 slices — the hardware multipliers on the FPGA — are finite and must be carefully allocated to avoid timing violations at 100 MHz.
Technologies Used
- Xilinx Artix-7 FPGA (XC7A100T-1CSG324C) on the Nexys 4 DDR board
- Verilog HDL for hardware description and pipeline design
- PyTorch / Keras for CNN training and weight export
- 8-bit Fixed-Point Arithmetic for hardware-compatible inference
- UART Protocol (115,200 baud) for host-to-FPGA data transfer
- Mel-Spectrogram Feature Extraction for audio pre-processing
Methodology
Software Pipeline
Raw audio is first transformed into a Mel-Spectrogram — a 2D frequency-time representation scaled to match human auditory perception. This surfaces speech-critical features while suppressing irrelevant noise. The CNN weights, originally trained in 32-bit floating point using PyTorch/Keras, are then quantized to 8-bit signed fixed-point format using a custom fixedpoint8.v module. This quantization achieves a 4× memory reduction with negligible accuracy loss. The quantized weights are staged as .COE and Hex array files for BRAM initialization.
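The quantization step above can be sketched in a few lines of Python. This is a minimal illustration of symmetric 8-bit signed quantization, not the project's actual export script: the largest-magnitude weight is scaled to near the int8 limit, and every weight is rounded and clipped to [-128, 127]. The function names are hypothetical.

```python
# Sketch of symmetric 8-bit signed quantization (hypothetical helper;
# the project's actual conversion lives in its export scripts and the
# fixedpoint8.v module). One scale factor per tensor: the largest
# magnitude maps near the int8 limit, everything else follows.

def quantize_int8(weights):
    """Map float weights to (int8_values, scale) with symmetric scaling."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-128, min(127, round(w / scale))) for w in weights], scale

def dequantize(q, scale):
    """Approximate recovery of the float weights for accuracy checks."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.01, 1.27]
q, s = quantize_int8(w)
# q == [50, -127, 1, 127]; s is ~0.01
```

Storing each weight as one signed byte instead of a 32-bit float is exactly the 4x memory reduction the report cites.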
Hardware Architecture
The FPGA inference pipeline is composed of the following Verilog modules:
convunit.v — High-speed multiply-accumulate (MAC) unit leveraging FPGA DSP48 slices for 8-bit fixed-point convolution.
rfselector.v — Sliding window logic that manages the receptive field across the input spectrogram.
relu_maxpool.v — Applies ReLU activation (max(0, x)) and spatial downsampling to reduce data volume between layers.
feature_bram.v — Dual-port BRAM enabling seamless inter-layer data streaming with parallel kernel computation.
argmax.v — Decodes the highest-scoring class index into a one-hot vector for LED output.
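A software reference model is useful for checking the Verilog simulation output of stages like relu_maxpool.v. The sketch below assumes 2x2 max-pooling with stride 2; the actual window size is not stated in the report, so treat it as an assumption.

```python
# Pure-Python reference model for the relu_maxpool.v stage (2x2
# max-pooling with stride 2 is assumed here, not confirmed by the
# report). Handy for diffing expected feature maps against Verilog
# simulation dumps.

def relu_maxpool(fmap):
    """Apply ReLU, then 2x2/stride-2 max-pooling, to a 2D list of ints."""
    relu = [[max(0, v) for v in row] for row in fmap]
    h, w = len(relu), len(relu[0])
    return [
        [max(relu[i][j], relu[i][j + 1], relu[i + 1][j], relu[i + 1][j + 1])
         for j in range(0, w - 1, 2)]
        for i in range(0, h - 1, 2)
    ]

fmap = [[-3, 1, 4, -1],
        [ 5, 9, -2, 6],
        [-5, 3, 5, 8],
        [ 9, -7, 9, 3]]
# relu_maxpool(fmap) -> [[9, 6], [9, 9]]
```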
Data Transfer & FSM Control
Mel-spectrogram data is transmitted from the host PC to the FPGA over UART at 115,200 baud via the onboard FTDI USB-to-UART bridge. Each pixel is sent as an 8-bit frame (start bit → 8 data bits → stop bit). A Finite State Machine (FSM) governs the inference pipeline through four states:
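The 8N1 frame format described above can be modeled directly. The sketch below is a line-level illustration only; on the real board the FTDI bridge and the FPGA's UART receiver handle the serialization.

```python
# Sketch of the 8N1 UART framing described above: each byte travels
# as a start bit (0), 8 data bits LSB-first, and a stop bit (1).
# Illustrative model only; the FTDI bridge does this in hardware.

def uart_frame(byte):
    """Return the 10-bit 8N1 line sequence for one byte."""
    bits = [0]                                   # start bit
    bits += [(byte >> i) & 1 for i in range(8)]  # data bits, LSB first
    bits.append(1)                               # stop bit
    return bits

# At 115,200 baud each bit lasts ~8.68 us, so one 10-bit frame
# (one spectrogram pixel) takes ~86.8 us on the wire.
frame = uart_frame(0xA5)
# frame == [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
```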
| State | Function |
|---|---|
| IDLE | Awaiting the first spectrogram byte after reset |
| LOAD | Buffering incoming pixels into feature_bram.v |
| COMPUTE | Firing the full CNN inference pipeline |
| DONE | Latching argmax result to LED outputs |
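The four states in the table can be modeled as a small software FSM. The transition triggers below are simplified for illustration; the Verilog FSM also tracks pixel counts and layer sequencing, which this sketch omits.

```python
# Minimal software model of the four-state control FSM above. State
# names match the table; the trigger signals are simplified
# assumptions, not the actual Verilog conditions.

IDLE, LOAD, COMPUTE, DONE = "IDLE", "LOAD", "COMPUTE", "DONE"

def next_state(state, byte_received=False, buffer_full=False,
               inference_done=False, reset=False):
    if reset:
        return IDLE
    if state == IDLE and byte_received:
        return LOAD          # first spectrogram byte arrived
    if state == LOAD and buffer_full:
        return COMPUTE       # all pixels buffered in feature_bram.v
    if state == COMPUTE and inference_done:
        return DONE          # argmax result ready to latch
    return state             # otherwise hold (DONE holds until reset)

s = next_state(IDLE, byte_received=True)   # -> LOAD
s = next_state(s, buffer_full=True)        # -> COMPUTE
s = next_state(s, inference_done=True)     # -> DONE
```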
Hardware Configuration (Nexys 4 DDR)
| Resource | Detail |
|---|---|
| Target Device | Xilinx Artix-7 XC7A100T-1CSG324C |
| Clock | 100 MHz on Pin E3 |
| UART RX | Pin C4, 115,200 baud |
| Output | 16 LEDs (H17→P14), LVCMOS33 |
Results
Resource Utilization
| Resource | Used | Available | Utilization |
|---|---|---|---|
| BRAM Tiles | 87.5 | 135 | 65% |
| DSP Slices | 120 | 240 | 50% |
| LUTs | 18,500 | 63,400 | ~29% |
Sufficient BRAM headroom remains to expand the keyword vocabulary without redesigning the memory architecture. Dedicated DSP48 slices accelerate all 8-bit fixed-point convolutions, freeing LUT fabric for control logic. Low overall utilization translates to a minimal thermal load and high power efficiency — a significant advantage over CPU-based inference.
Classification Performance
- 31 keyword classes from the Google Speech Commands dataset
- <1 ms real-time latency — well within the 1-second recording window
- Binary LED mapping — each classified keyword instantly lights a unique LED with no screen driver or multiplexer overhead
- Entirely edge-native — zero cloud dependency, zero network latency
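The LED output stage can be sketched in software. Note the report describes both a one-hot vector (argmax.v) and a "binary LED mapping"; since 31 classes exceed the 16 onboard LEDs for pure one-hot display, a plain binary encoding of the winning class index is assumed here for illustration.

```python
# Sketch of mapping the argmax class index onto the 16 onboard LEDs.
# A binary encoding of the index is assumed (31 classes do not fit a
# pure one-hot pattern on 16 LEDs); this is an illustration, not the
# confirmed encoding.

def argmax_to_leds(scores, n_leds=16):
    """Return the LED bit pattern (LED0 first) for the winning class."""
    idx = max(range(len(scores)), key=lambda i: scores[i])
    return [(idx >> i) & 1 for i in range(n_leds)]

scores = [0] * 31
scores[21] = 127                 # pretend class 21 scored highest
leds = argmax_to_leds(scores)
# 21 == 0b10101, so LEDs 0, 2, and 4 light up
```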
Engineering Challenges Solved
| Challenge | Solution |
|---|---|
| Simulation Bottleneck: Simulating 500 ms of UART audio took hours | Built a Fast-Path Testbench that bypasses UART and feeds pixels directly into BRAM at 100 MHz, reducing simulation time from hours to seconds |
| Synchronous BRAM Latency: 1-clock-cycle read delay caused MAC data mismatches | Introduced a 1-cycle Control Pipeline stage to re-align the CNN state machine with memory output timing |
| Resource Overflow: 32-bit weights exceeded BRAM capacity | Developed 8-bit Signed Quantization (fixedpoint8.v), compressing the model 4× with no measurable accuracy degradation |
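The synchronous-BRAM fix in the table above can be illustrated with a toy cycle model: a read issued on cycle N returns data on cycle N+1, so the control path (here, the address presented to the MAC) is delayed one register stage to stay aligned with the data. This is a hypothetical sketch, not the project's Verilog.

```python
# Toy cycle-by-cycle model of the 1-cycle control pipeline fix.
# Without the addr_pipe register, each address would be paired with
# the previous cycle's (stale) BRAM output.

def simulate(addresses, bram):
    """Yield (address, data) pairs correctly aligned despite the
    1-cycle synchronous read latency."""
    read_data = None            # BRAM output register (valid next cycle)
    addr_pipe = None            # 1-cycle control pipeline stage
    aligned = []
    for addr in addresses + [None]:            # extra cycle to flush
        if addr_pipe is not None:
            aligned.append((addr_pipe, read_data))  # data now valid
        read_data = bram[addr] if addr is not None else None
        addr_pipe = addr
    return aligned

bram = {0: 10, 1: 11, 2: 12}
# simulate([0, 1, 2], bram) -> [(0, 10), (1, 11), (2, 12)]
```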
Conclusions & Future Scope
Alexa Junior successfully demonstrates that a CNN-based keyword spotter with 31 classes can be fully deployed on a resource-constrained FPGA using 8-bit fixed-point arithmetic, achieving real-time, sub-millisecond latency with a low power footprint — entirely at the edge.
Future directions include:
- Direct Audio Input: Replacing the UART pipeline with an onboard MEMS microphone via a PDM-to-PCM module for truly standalone operation.
- External Memory: Leveraging the 128 MiB DDR2 SDRAM to support larger models like ResNet or Transformers.
- Always-On Mode: Implementing a circular buffer for continuous, live keyword detection without host intervention.
- Expanded Vocabulary: Growing beyond 31 classes using the remaining BRAM and DSP headroom.
References & Links
Google Speech Commands Dataset
Mentors & Mentees
Mentors: Shruthi Hegde, Sathvika Bandi
Mentees: Malepati Yashaswi, Arya Shedbal, Prajna H R
Report Information
Team Members
Report Details
Created: April 8, 2026, 12:38 a.m.
Approved by: None
Approval date: None