Virtual Expo 2026

Alexa Junior: Real-Time Keyword Spotting on FPGA

Year Long Project Diode

Summary

A 31-class CNN-based voice keyword classifier implemented entirely on the Xilinx Artix-7 FPGA using 8-bit fixed-point arithmetic — no cloud, no FPU, no compromise on latency.

Aim

To design and deploy a real-time, edge-native keyword detection system on an FPGA that classifies audio keywords locally within a 1-second recording window — eliminating cloud dependency and achieving sub-millisecond inference latency.

Introduction

Commercial voice assistants like Alexa and Siri offload all speech inference to the cloud, creating hard dependencies on network stability and introducing round-trip latency that is unacceptable for embedded, real-time applications. Alexa Junior tackles this problem head-on by implementing a Convolutional Neural Network (CNN)-based keyword spotter directly on an FPGA — specifically the Xilinx Artix-7 (Nexys 4 DDR) — with no floating-point unit, no external processor, and no internet connection.

The system classifies 31 keyword classes from the Google Speech Commands dataset and outputs results instantly via onboard LEDs, demonstrating that powerful AI inference is achievable at the extreme edge.

Literature Survey & Technologies Used

Key Challenges in Edge Audio Inference

Deploying neural networks on FPGAs involves navigating several fundamental constraints:

  • Memory Constraints: The Nexys 4 DDR board provides only ~600 KB of BRAM. Standard CNN models can exceed 100 MB, demanding aggressive compression strategies.
  • No Floating-Point Support: FPGAs lack native float/double arithmetic units, requiring all computations to be reformulated in fixed-point representation.
  • Real-Time Requirements: Audio must be classified within the 1-second recording window, placing strict latency bounds on every pipeline stage.
  • Compute Budget: DSP48 slices — the hardware multipliers on the FPGA — are finite and must be carefully allocated to avoid timing violations at 100 MHz.
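To make the memory constraint concrete, a quick back-of-the-envelope check shows why weight bit-width dominates the budget. This is an illustrative Python sketch; the parameter count below is a hypothetical example, not the project's actual model size.

```python
# Rough BRAM budget check for an on-chip CNN (illustrative numbers).
BRAM_BYTES = 600 * 1024          # ~600 KB of BRAM on the Nexys 4 DDR

def model_bytes(n_params, bits_per_weight):
    """Storage needed for the weights alone, in bytes."""
    return n_params * bits_per_weight // 8

n_params = 100_000               # hypothetical small CNN

print(model_bytes(n_params, 32)) # 400000 bytes: float32 alone eats ~65% of BRAM
print(model_bytes(n_params, 8))  # 100000 bytes: int8 leaves room for
                                 # activations and inter-layer feature maps
```

Even a modest 100k-parameter network in float32 consumes most of the on-chip memory before any feature maps are stored, which is what drives the 8-bit quantization described below.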

Technologies Used

  • Xilinx Artix-7 FPGA (XC7A100T-1CSG324C) on the Nexys 4 DDR board
  • Verilog HDL for hardware description and pipeline design
  • PyTorch / Keras for CNN training and weight export
  • 8-bit Fixed-Point Arithmetic for hardware-compatible inference
  • UART Protocol (115,200 baud) for host-to-FPGA data transfer
  • Mel-Spectrogram Feature Extraction for audio pre-processing

Methodology

Software Pipeline

Raw audio is first transformed into a Mel-Spectrogram — a 2D frequency-time representation scaled to match human auditory perception. This surfaces speech-critical features while suppressing irrelevant noise. The CNN weights, originally trained in 32-bit floating point using PyTorch/Keras, are then quantized to 8-bit signed fixed-point format using a custom fixedpoint8.v module. This quantization achieves a 4× memory reduction with negligible accuracy loss. The quantized weights are staged as .COE and Hex array files for BRAM initialization.
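As a rough sketch of this quantization step (illustrative Python, not the fixedpoint8.v implementation; the Q2.5 fractional split and function names here are assumptions):

```python
# Symmetric signed 8-bit fixed-point quantization: value ~= q / 2^frac_bits,
# with q saturated to the int8 range [-128, 127].

def quantize8(weights, frac_bits=5):
    """Quantize float weights to signed 8-bit fixed point (Q2.5 here)."""
    scale = 1 << frac_bits                    # 2^frac_bits
    q = []
    for w in weights:
        v = round(w * scale)                  # scale and round to integer
        q.append(max(-128, min(127, v)))      # saturate to int8
    return q

def dequantize8(q, frac_bits=5):
    """Recover approximate float values for accuracy checks."""
    scale = 1 << frac_bits
    return [v / scale for v in q]

weights = [0.73, -1.25, 0.031, 2.9]
print(quantize8(weights))   # [23, -40, 1, 93]
```

Note that Python's round() uses banker's rounding; a hardware rounding mode may differ in the last bit, which is part of why the quantized model is re-validated for accuracy before export.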

Hardware Architecture

The FPGA inference pipeline is composed of the following Verilog modules:

convunit.v — High-speed multiply-accumulate (MAC) unit leveraging FPGA DSP48 slices for 8-bit fixed-point convolution.

rfselector.v — Sliding window logic that manages the receptive field across the input spectrogram.

relu_maxpool.v — Applies ReLU activation (max(0, x)) and spatial downsampling to reduce data volume between layers.

feature_bram.v — Dual-port BRAM enabling seamless inter-layer data streaming with parallel kernel computation.

argmax.v — Decodes the highest-scoring class index into a one-hot vector for LED output.
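The behaviour of these stages can be mirrored in a few lines of integer-only Python. This is a functional model for intuition only; function names are illustrative, and the real datapath is the Verilog listed above.

```python
# Integer-only models of the CNN pipeline stages.

def mac(window, kernel):
    """Multiply-accumulate over one receptive field (cf. convunit.v)."""
    return sum(a * b for a, b in zip(window, kernel))

def relu(x):
    """ReLU activation: max(0, x) (cf. relu_maxpool.v)."""
    return max(0, x)

def maxpool2x2(fmap):
    """2x2 spatial downsampling over a list-of-lists feature map."""
    return [[max(fmap[r][c], fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

def argmax_onehot(scores):
    """Encode the winning class index as a one-hot vector (cf. argmax.v)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return 1 << best    # bit position = class index, as driven onto the LEDs

acc = relu(mac([3, -2, 5], [1, 4, -1]))   # 3 - 8 - 5 = -10, ReLU clamps to 0
```

In hardware these stages run concurrently, with the dual-port BRAM letting one layer's writes overlap the next layer's reads.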

Data Transfer & FSM Control

Mel-spectrogram data is transmitted from the host PC to the FPGA over UART at 115,200 baud via the onboard FTDI USB-to-UART bridge. Each pixel is sent as an 8-bit frame (start bit → 8 data bits → stop bit). A Finite State Machine (FSM) governs the inference pipeline through four states:

State     Function
IDLE      Awaiting the first spectrogram byte after reset
LOAD      Buffering incoming pixels into feature_bram.v
COMPUTE   Firing the full CNN inference pipeline
DONE      Latching the argmax result to the LED outputs
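A minimal Python model of this four-state controller. State names match the table; the transition conditions are illustrative, not taken from the RTL.

```python
# One-step model of the inference control FSM.
IDLE, LOAD, COMPUTE, DONE = range(4)

def fsm_step(state, byte_ready, buffer_full, inference_done):
    """Return the next state for one clock tick."""
    if state == IDLE:
        return LOAD if byte_ready else IDLE        # first spectrogram byte seen
    if state == LOAD:
        return COMPUTE if buffer_full else LOAD    # all pixels buffered in BRAM
    if state == COMPUTE:
        return DONE if inference_done else COMPUTE # CNN pipeline finished
    return DONE                                    # result held until reset

state = IDLE
state = fsm_step(state, byte_ready=True, buffer_full=False, inference_done=False)
# state is now LOAD
```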

Hardware Configuration (Nexys 4 DDR)

Resource        Detail
Target Device   Xilinx Artix-7 XC7A100T-1CSG324C
Clock           100 MHz on Pin E3
UART RX         Pin C4, 115,200 baud
Output          16 LEDs (H17→P14), LVCMOS33
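The 8-N-1 framing carried on the UART RX pin can be sketched as follows (helper names are hypothetical; only the frame layout and baud rate come from the design above):

```python
# Serialize one spectrogram pixel byte into a 10-bit 8-N-1 UART frame.
BAUD = 115_200
BIT_TIME_NS = round(1e9 / BAUD)    # ~8681 ns per bit; at the 100 MHz system
                                   # clock that is roughly 868 cycles per bit

def frame_bits(byte):
    """Start bit (low), 8 data bits LSB-first, stop bit (high)."""
    data = [(byte >> i) & 1 for i in range(8)]
    return [0] + data + [1]

print(frame_bits(0xA5))   # [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
```

The receiver on the FPGA oversamples each bit period and reassembles the byte before handing it to the LOAD stage of the FSM.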

Results

Resource Utilization

Resource     Used     Available   Utilization
BRAM Tiles   87.5     135         65%
DSP Slices   120      240         50%
LUTs         18,500   63,400      ~29%

Sufficient BRAM headroom remains to expand the keyword vocabulary without redesigning the memory architecture. Dedicated DSP48 slices accelerate all 8-bit fixed-point convolutions, freeing LUT fabric for control logic. Low overall utilization translates to a minimal thermal load and high power efficiency — a significant advantage over CPU-based inference.

Classification Performance

  1. 31 keyword classes from the Google Speech Commands dataset
  2. <1 ms real-time latency — well within the 1-second recording window
  3. Binary LED mapping — each classified keyword instantly lights a unique LED with no screen driver or multiplexer overhead
  4. Entirely edge-native — zero cloud dependency, zero network latency

Engineering Challenges Solved

Challenge: Simulation Bottleneck (simulating 500 ms of UART audio took hours)
Solution: Built a Fast-Path Testbench that bypasses UART and feeds pixels directly into BRAM at 100 MHz — reducing simulation time from hours to seconds

Challenge: Synchronous BRAM Latency (the 1-clock-cycle read delay caused MAC data mismatches)
Solution: Introduced a 1-cycle Control Pipeline stage to re-align the CNN state machine with memory output timing

Challenge: Resource Overflow (32-bit weights exceeded BRAM capacity)
Solution: Developed 8-bit Signed Quantization (fixedpoint8.v), compressing the model 4× with no measurable accuracy degradation
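The BRAM-latency issue in particular can be demonstrated with a toy model (illustrative Python, not the RTL; class and signal names are assumptions):

```python
# Synchronous BRAM: a read issued at cycle t is only visible at cycle t+1,
# so a consumer that uses the output in the same cycle sees stale data.

class SyncBram:
    """BRAM read port with a 1-cycle registered output."""
    def __init__(self, contents):
        self.mem = list(contents)
        self.dout = 0                  # output register

    def clock(self, addr):
        out = self.dout                # value latched on the previous cycle
        self.dout = self.mem[addr]     # this read completes next cycle
        return out

bram = SyncBram([10, 20, 30])
naive = [bram.clock(a) for a in (0, 1, 2)]
# naive == [0, 10, 20]: every sample is stale by one cycle

bram = SyncBram([10, 20, 30])
aligned = [bram.clock(a) for a in (0, 1, 2, 2)][1:]
# aligned == [10, 20, 30]: delaying the consumer by one cycle re-aligns it
```

This mirrors the fix in the design: the control FSM's data-consuming states were pushed one cycle later so the MAC unit always sees the value that matches its current address.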

Conclusions & Future Scope

Alexa Junior successfully demonstrates that a CNN-based keyword spotter with 31 classes can be fully deployed on a resource-constrained FPGA using 8-bit fixed-point arithmetic, achieving real-time, sub-millisecond latency with a low power footprint — entirely at the edge.

Future directions include:

  • Direct Audio Input: Replacing the UART pipeline with an onboard MEMS microphone via a PDM-to-PCM module for truly standalone operation.
  • External Memory: Leveraging the 128 MiB DDR2 SDRAM to support larger models like ResNet or Transformers.
  • Always-On Mode: Implementing a circular buffer for continuous, live keyword detection without host intervention.
  • Expanded Vocabulary: Growing beyond 31 classes using the remaining BRAM and DSP headroom.

References & Links

Google Speech Commands Dataset

Xilinx Artix-7 Documentation

Nexys 4 DDR Reference Manual

GitHub Repository

Mentors & Mentees

Mentors: Shruthi Hegde, Sathvika Bandi

Mentees: Malepati Yashaswi, Arya Shedbal, Prajna H R
