Alexa Junior: Real-Time Keyword Spotting on FPGA
Abstract
Summary
A 31-class CNN-based voice keyword classifier implemented entirely on the Xilinx Artix-7 FPGA using 8-bit fixed-point arithmetic — no cloud, no FPU, no compromise on latency.
Aim
To design and deploy a real-time, edge-native keyword detection system on an FPGA that classifies audio keywords locally within a 1-second recording window — eliminating cloud dependency and achieving sub-millisecond inference latency.
Introduction
Commercial voice assistants like Alexa and Siri offload all speech inference to the cloud, creating hard dependencies on network stability and introducing round-trip latency that is unacceptable for embedded, real-time applications. Alexa Junior tackles this problem head-on by implementing a Convolutional Neural Network (CNN)-based keyword spotter directly on an FPGA — specifically the Xilinx Artix-7 (Nexys 4 DDR) — with no floating-point unit, no external processor, and no internet connection.
The system classifies 31 keyword classes from the Google Speech Commands dataset and outputs results instantly via onboard LEDs, demonstrating that powerful AI inference is achievable at the extreme edge.
Literature Survey & Technologies Used
Key Challenges in Edge Audio Inference
Deploying neural networks on FPGAs involves navigating several fundamental constraints:
- Memory Constraints: The Nexys 4 DDR board provides only ~600 KB of BRAM. Standard CNN models can exceed 100 MB, demanding aggressive compression strategies.
- No Floating-Point Support: FPGAs lack native float/double arithmetic units, requiring all computations to be reformulated in fixed-point representation.
- Real-Time Requirements: Audio must be classified within the 1-second recording window, placing strict latency bounds on every pipeline stage.
- Compute Budget: DSP48 slices — the hardware multipliers on the FPGA — are finite and must be carefully allocated to avoid timing violations at 100 MHz.
Technologies Used
- Xilinx Artix-7 FPGA (XC7A100T-1CSG324C) on the Nexys 4 DDR board
- Verilog HDL for hardware description and pipeline design
- PyTorch / Keras for CNN training and weight export
- 8-bit Fixed-Point Arithmetic for hardware-compatible inference
- UART Protocol (115,200 baud) for host-to-FPGA data transfer
- Mel-Spectrogram Feature Extraction for audio pre-processing
Methodology
Software Pipeline
Raw audio is first transformed into a Mel-Spectrogram — a 2D frequency-time representation scaled to match human auditory perception. This surfaces speech-critical features while suppressing irrelevant noise. The CNN weights, originally trained in 32-bit floating point using PyTorch/Keras, are then quantized to 8-bit signed fixed-point format using a custom fixedpoint8.v module. This quantization achieves a 4× memory reduction with negligible accuracy loss. The quantized weights are staged as .COE and Hex array files for BRAM initialization.
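The quantization step above can be sketched in a few lines of Python. This is a minimal illustration of symmetric 8-bit signed quantization, not the project's actual export script: the largest-magnitude weight is scaled to near the int8 limit, and every weight is rounded and clipped to [-128, 127]. The function names are hypothetical.

```python
# Sketch of symmetric 8-bit signed quantization (hypothetical helper;
# the project's actual conversion lives in its export scripts and the
# fixedpoint8.v module). One scale factor per tensor: the largest
# magnitude maps near the int8 limit, everything else follows.

def quantize_int8(weights):
    """Map float weights to (int8_values, scale) with symmetric scaling."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-128, min(127, round(w / scale))) for w in weights], scale

def dequantize(q, scale):
    """Approximate recovery of the float weights for accuracy checks."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.01, 1.27]
q, s = quantize_int8(w)
# q == [50, -127, 1, 127]; s is ~0.01
```

Storing each weight as one signed byte instead of a 32-bit float is exactly the 4x memory reduction the report cites.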
Hardware Architecture
The FPGA inference pipeline is composed of the following Verilog modules:
convunit.v — High-speed multiply-accumulate (MAC) unit leveraging FPGA DSP48 slices for 8-bit fixed-point convolution.
rfselector.v — Sliding window logic that manages the receptive field across the input spectrogram.
relu_maxpool.v — Applies ReLU activation (max(0, x)) and spatial downsampling to reduce data volume between layers.
feature_bram.v — Dual-port BRAM enabling seamless inter-layer data streaming with parallel kernel computation.
argmax.v — Decodes the highest-scoring class index into a one-hot vector for LED output.
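A software reference model is useful for checking the Verilog simulation output of stages like relu_maxpool.v. The sketch below assumes 2x2 max-pooling with stride 2; the actual window size is not stated in the report, so treat it as an assumption.

```python
# Pure-Python reference model for the relu_maxpool.v stage (2x2
# max-pooling with stride 2 is assumed here, not confirmed by the
# report). Handy for diffing expected feature maps against Verilog
# simulation dumps.

def relu_maxpool(fmap):
    """Apply ReLU, then 2x2/stride-2 max-pooling, to a 2D list of ints."""
    relu = [[max(0, v) for v in row] for row in fmap]
    h, w = len(relu), len(relu[0])
    return [
        [max(relu[i][j], relu[i][j + 1], relu[i + 1][j], relu[i + 1][j + 1])
         for j in range(0, w - 1, 2)]
        for i in range(0, h - 1, 2)
    ]

fmap = [[-3, 1, 4, -1],
        [ 5, 9, -2, 6],
        [-5, 3, 5, 8],
        [ 9, -7, 9, 3]]
# relu_maxpool(fmap) -> [[9, 6], [9, 9]]
```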
Data Transfer & FSM Control
Mel-spectrogram data is transmitted from the host PC to the FPGA over UART at 115,200 baud via the onboard FTDI USB-to-UART bridge. Each pixel is sent as an 8-bit frame (start bit → 8 data bits → stop bit). A Finite State Machine (FSM) governs the inference pipeline through four states:
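The 8N1 frame format described above can be modeled directly. The sketch below is a line-level illustration only; on the real board the FTDI bridge and the FPGA's UART receiver handle the serialization.

```python
# Sketch of the 8N1 UART framing described above: each byte travels
# as a start bit (0), 8 data bits LSB-first, and a stop bit (1).
# Illustrative model only; the FTDI bridge does this in hardware.

def uart_frame(byte):
    """Return the 10-bit 8N1 line sequence for one byte."""
    bits = [0]                                   # start bit
    bits += [(byte >> i) & 1 for i in range(8)]  # data bits, LSB first
    bits.append(1)                               # stop bit
    return bits

# At 115,200 baud each bit lasts ~8.68 us, so one 10-bit frame
# (one spectrogram pixel) takes ~86.8 us on the wire.
frame = uart_frame(0xA5)
# frame == [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
```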
| State | Function |
|---|---|
| IDLE | Awaiting the first spectrogram byte after reset |
| LOAD | Buffering incoming pixels into feature_bram.v |
| COMPUTE | Firing the full CNN inference pipeline |
| DONE | Latching argmax result to LED outputs |
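The four states in the table can be modeled as a small software FSM. The transition triggers below are simplified for illustration; the Verilog FSM also tracks pixel counts and layer sequencing, which this sketch omits.

```python
# Minimal software model of the four-state control FSM above. State
# names match the table; the trigger signals are simplified
# assumptions, not the actual Verilog conditions.

IDLE, LOAD, COMPUTE, DONE = "IDLE", "LOAD", "COMPUTE", "DONE"

def next_state(state, byte_received=False, buffer_full=False,
               inference_done=False, reset=False):
    if reset:
        return IDLE
    if state == IDLE and byte_received:
        return LOAD          # first spectrogram byte arrived
    if state == LOAD and buffer_full:
        return COMPUTE       # all pixels buffered in feature_bram.v
    if state == COMPUTE and inference_done:
        return DONE          # argmax result ready to latch
    return state             # otherwise hold (DONE holds until reset)

s = next_state(IDLE, byte_received=True)   # -> LOAD
s = next_state(s, buffer_full=True)        # -> COMPUTE
s = next_state(s, inference_done=True)     # -> DONE
```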
Hardware Configuration (Nexys 4 DDR)
| Resource | Detail |
|---|---|
| Target Device | Xilinx Artix-7 XC7A100T-1CSG324C |
| Clock | 100 MHz on Pin E3 |
| UART RX | Pin C4, 115,200 baud |
| Output | 16 LEDs (H17→P14), LVCMOS33 |
Results
Resource Utilization
| Resource | Used | Available | Utilization |
|---|---|---|---|
| BRAM Tiles | 87.5 | 135 | 65% |
| DSP Slices | 120 | 240 | 50% |
| LUTs | 18,500 | 63,400 | ~29% |
Sufficient BRAM headroom remains to expand the keyword vocabulary without redesigning the memory architecture. Dedicated DSP48 slices accelerate all 8-bit fixed-point convolutions, freeing LUT fabric for control logic. Low overall utilization translates to a minimal thermal load and high power efficiency — a significant advantage over CPU-based inference.
Classification Performance
- 31 keyword classes from the Google Speech Commands dataset
- <1 ms real-time latency — well within the 1-second recording window
- Binary LED mapping — each classified keyword instantly lights a unique LED with no screen driver or multiplexer overhead
- Entirely edge-native — zero cloud dependency, zero network latency
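The LED output stage can be sketched in software. Note the report describes both a one-hot vector (argmax.v) and a "binary LED mapping"; since 31 classes exceed the 16 onboard LEDs for pure one-hot display, a plain binary encoding of the winning class index is assumed here for illustration.

```python
# Sketch of mapping the argmax class index onto the 16 onboard LEDs.
# A binary encoding of the index is assumed (31 classes do not fit a
# pure one-hot pattern on 16 LEDs); this is an illustration, not the
# confirmed encoding.

def argmax_to_leds(scores, n_leds=16):
    """Return the LED bit pattern (LED0 first) for the winning class."""
    idx = max(range(len(scores)), key=lambda i: scores[i])
    return [(idx >> i) & 1 for i in range(n_leds)]

scores = [0] * 31
scores[21] = 127                 # pretend class 21 scored highest
leds = argmax_to_leds(scores)
# 21 == 0b10101, so LEDs 0, 2, and 4 light up
```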
Engineering Challenges Solved
| Challenge | Solution |
|---|---|
| Simulation Bottleneck: Simulating 500 ms of UART audio took hours | Built a Fast-Path Testbench that bypasses UART and feeds pixels directly into BRAM at 100 MHz, reducing simulation time from hours to seconds |
| Synchronous BRAM Latency: 1-clock-cycle read delay caused MAC data mismatches | Introduced a 1-cycle Control Pipeline stage to re-align the CNN state machine with memory output timing |
| Resource Overflow: 32-bit weights exceeded BRAM capacity | Developed 8-bit Signed Quantization (fixedpoint8.v), compressing the model 4× with no measurable accuracy degradation |
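The synchronous-BRAM fix in the table above can be illustrated with a toy cycle model: a read issued on cycle N returns data on cycle N+1, so the control path (here, the address presented to the MAC) is delayed one register stage to stay aligned with the data. This is a hypothetical sketch, not the project's Verilog.

```python
# Toy cycle-by-cycle model of the 1-cycle control pipeline fix.
# Without the addr_pipe register, each address would be paired with
# the previous cycle's (stale) BRAM output.

def simulate(addresses, bram):
    """Yield (address, data) pairs correctly aligned despite the
    1-cycle synchronous read latency."""
    read_data = None            # BRAM output register (valid next cycle)
    addr_pipe = None            # 1-cycle control pipeline stage
    aligned = []
    for addr in addresses + [None]:            # extra cycle to flush
        if addr_pipe is not None:
            aligned.append((addr_pipe, read_data))  # data now valid
        read_data = bram[addr] if addr is not None else None
        addr_pipe = addr
    return aligned

bram = {0: 10, 1: 11, 2: 12}
# simulate([0, 1, 2], bram) -> [(0, 10), (1, 11), (2, 12)]
```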
Conclusions & Future Scope
Alexa Junior successfully demonstrates that a CNN-based keyword spotter with 31 classes can be fully deployed on a resource-constrained FPGA using 8-bit fixed-point arithmetic, achieving real-time, sub-millisecond latency with a low power footprint — entirely at the edge.
Future directions include:
- Direct Audio Input: Replacing the UART pipeline with an onboard MEMS microphone via a PDM-to-PCM module for truly standalone operation.
- External Memory: Leveraging the 128 MiB DDR2 SDRAM to support larger models like ResNet or Transformers.
- Always-On Mode: Implementing a circular buffer for continuous, live keyword detection without host intervention.
- Expanded Vocabulary: Growing beyond 31 classes using the remaining BRAM and DSP headroom.
References & Links
Google Speech Commands Dataset
Mentors & Mentees
Mentors: Shruthi Hegde, Sathvika Bandi
Mentees: Malepati Yashaswi, Arya Shedbal, Prajna H R
Report Information
Team Members
Report Details
Created: April 8, 2026, 12:38 a.m.
Approved by: None
Approval date: None