Virtual Expo 2025

Intrusion Detection System

Year Long Project CompSoc

Aim

To develop an intelligent, real-time intrusion detection system that leverages deep learning and synthetic data generation to accurately detect and classify sophisticated network threats, while offering an interactive and scalable interface for live network monitoring and analysis.

This system integrates advanced neural architectures, synthetic data augmentation using CTGAN, and real-time threat visualization with technologies like FastAPI and React.js to enhance cybersecurity infrastructure.

 

Introduction

In today's rapidly evolving digital ecosystem, the threat landscape in cybersecurity has become increasingly complex and dynamic. Traditional defense mechanisms often fall short in the face of sophisticated attacks that continuously adapt and evolve. To address these challenges, this project presents an Intelligent Intrusion Detection System (IDS) that leverages cutting-edge machine learning and real-time network analysis to safeguard digital infrastructures.

At its core, the system integrates a deep learning architecture trained on the widely recognized NSL-KDD and UNSW-NB15 datasets. The model features stacked encoders for feature extraction and gated convolutional layers that dynamically weight the extracted features, resulting in high-precision multi-class attack classification. To overcome limited real-world data availability and class imbalance, CTGAN-based synthetic data generation is employed, improving both model generalization and resilience.

The system also incorporates real-time threat monitoring through seamless integration with tools like Wireshark, enabling millisecond-level response times and comprehensive traffic analysis. Designed with scalability and usability in mind, the platform includes a React.js-based frontend for interactive dashboards and live threat visualizations, and a FastAPI-based backend for high-performance microservices.

This project not only demonstrates the power of machine learning in cybersecurity but also provides a modular, scalable, and effective framework for real-world network defense.

 

Technologies Used

Frontend - React.js

Backend - Python 3.8+, FastAPI

ML Framework - PyTorch, scikit-learn

Data Processing - pandas, NumPy

Network Tools - Wireshark, Scapy

 

Literature Survey

High-dimensional network traffic data poses challenges like increased complexity and reduced detection accuracy. To tackle this, researchers have explored dimensionality reduction through feature selection and extraction. Deep learning approaches, such as combining LSTMs with fully connected networks, have shown strong performance in identifying diverse attack patterns.

Models using Deep Autoencoders (DAE) and DNNs demonstrated good accuracy but lacked robustness in noisy conditions. To improve this, Stacked Contractive Autoencoders (SCAE) were introduced for better noise resistance, often paired with SVMs for classification. While effective, feature extraction remains a challenge. Recent studies propose combining SCAE with gated convolution techniques to improve stability and detection performance.

 

Methodology

  • Intrusion Detection System 

  • Dataset Preparation

Datasets: UNSW-NB15 and NSL-KDD (contain labeled normal and malicious network traffic)

Preprocessing:

  • Drop unnecessary columns
  • One-hot encode categorical features (e.g., protocol, service)
  • Apply MinMax normalization on numerical values
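
The three preprocessing steps above can be sketched with pandas and scikit-learn. This is a minimal illustration, not the project's actual pipeline: the function name `preprocess`, the toy rows, and the choice of `difficulty` as the dropped column are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess(df, drop_cols, categorical_cols, label_col):
    """Drop unused columns, one-hot encode categoricals, min-max scale numerics."""
    features = df.drop(columns=list(drop_cols) + [label_col])
    labels = df[label_col]
    numeric_cols = [c for c in features.columns if c not in categorical_cols]

    # Scale numeric columns into [0, 1].
    scaler = MinMaxScaler()
    numeric = pd.DataFrame(
        scaler.fit_transform(features[numeric_cols]),
        columns=numeric_cols,
        index=features.index,
    )

    # One-hot encode categorical features such as protocol and service.
    encoded = pd.get_dummies(features[categorical_cols], dtype=float)
    return pd.concat([numeric, encoded], axis=1), labels, scaler


# Toy rows with NSL-KDD-style column names (values are made up for illustration).
toy = pd.DataFrame({
    "duration": [0, 2, 10],
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "domain_u", "ftp"],
    "src_bytes": [181, 45, 1000],
    "difficulty": [20, 15, 21],
    "label": ["normal", "normal", "neptune"],
})
X, y, scaler = preprocess(toy, ["difficulty"], ["protocol_type", "service"], "label")
```

Returning the fitted scaler lets the same scaling be reused on live traffic, so that captured connections are normalized with the training-set minimum and maximum.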

  • Model Architecture

Core Model: Stacked Contractive Autoencoder with Gated Convolution (SCAE-GC)

Layers: Input → 128 → 72 → 30 with Jacobian-based regularization

Gated Convolutions: Select important features via sigmoid gating

Classification: Fully connected layers for multi-class output
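
One possible PyTorch layout of this architecture is sketched below. Only the encoder sizes (128, 72, 30) and the sigmoid gating come from the description above; the class names, the number of convolution channels, the kernel size, the classifier width, and the input and class counts (122 features, 5 classes) are assumptions chosen for illustration.

```python
import torch
from torch import nn


class GatedConv1d(nn.Module):
    """Gated convolution: a feature branch multiplied by a sigmoid gate branch."""

    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.feature = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad)
        self.gate = nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad)

    def forward(self, x):
        # The gate lies in (0, 1) and scales each feature response.
        return self.feature(x) * torch.sigmoid(self.gate(x))


class SCAEGC(nn.Module):
    def __init__(self, n_features, n_classes, channels=8):
        super().__init__()
        # Stacked encoder with the layer sizes given above: input -> 128 -> 72 -> 30.
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.Sigmoid(),
            nn.Linear(128, 72), nn.Sigmoid(),
            nn.Linear(72, 30), nn.Sigmoid(),
        )
        self.gated = GatedConv1d(1, channels)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * 30, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        z = self.encoder(x)              # (batch, 30)
        g = self.gated(z.unsqueeze(1))   # (batch, channels, 30)
        return self.classifier(g)        # (batch, n_classes) logits


# Assumed sizes: 122 input features and 5 output classes.
model = SCAEGC(n_features=122, n_classes=5)
logits = model(torch.rand(4, 122))
```

The 30-dimensional code is treated here as a one-channel sequence so that the convolution can slide across neighbouring code units; other arrangements are equally plausible.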

  • Training

Hierarchical pre-training of each CAE layer

Loss: Reconstruction error + Jacobian penalty

Integration of pre-trained encoders with gated convolution and classifier
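
The layer-wise pre-training and the loss above can be sketched as follows, assuming sigmoid encoders, for which the Jacobian penalty has a closed form. The names `CAELayer`, `contractive_loss`, and `pretrain_stack`, the penalty weight `lam`, the optimizer, and the epoch count are all illustrative assumptions rather than the project's settings.

```python
import torch
from torch import nn


class CAELayer(nn.Module):
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hidden)
        self.dec = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))
        return h, self.dec(h)


def contractive_loss(layer, x, lam=1e-3):
    """Reconstruction error plus the squared Frobenius norm of dh/dx."""
    h, x_hat = layer(x)
    recon = nn.functional.mse_loss(x_hat, x)
    # For h = sigmoid(Wx + b): ||dh/dx||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2
    dh = (h * (1 - h)) ** 2                     # (batch, hidden)
    w_sq = (layer.enc.weight ** 2).sum(dim=1)   # (hidden,)
    jacobian = (dh * w_sq).sum(dim=1).mean()
    return recon + lam * jacobian


def pretrain_stack(x, sizes=(128, 72, 30), epochs=5, lr=1e-3):
    """Train one contractive autoencoder per layer, feeding codes upward."""
    layers, inputs = [], x
    for n_hidden in sizes:
        layer = CAELayer(inputs.shape[1], n_hidden)
        opt = torch.optim.Adam(layer.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = contractive_loss(layer, inputs)
            loss.backward()
            opt.step()
        # The trained layer's codes become the next layer's training input.
        with torch.no_grad():
            inputs, _ = layer(inputs)
        layers.append(layer)
    return layers


torch.manual_seed(0)
layers = pretrain_stack(torch.rand(32, 122))  # random stand-in for real features
```

After pre-training, the `enc` weights of each layer would be copied into the full model's encoder before the gated convolution and classifier are trained end to end.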

  • Results

Achieved 97.31% accuracy on the test set

  • Capturing Packets

To facilitate the detection of network intrusions in real-time environments, this study implements a methodology to capture and process live network traffic, transforming it into structured features inspired by the NSL-KDD dataset. This approach ensures compatibility with traditional machine learning-based intrusion detection systems while operating on real-world data.

  • Live Traffic Acquisition

Network packets are captured in real-time from the active wireless interface using a command-line network protocol analyzer. The capture process runs for a pre-defined duration, allowing for the collection of a representative segment of network activity. The captured traffic is stored in a standardized packet capture (.pcap) format for subsequent analysis.
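
A fixed-duration capture of this kind could be driven from Python as sketched below. The text does not name the analyzer, so the use of `tshark` (Wireshark's command-line tool) is an assumption, as are the interface name `wlan0`, the default duration, and the output path.

```python
import subprocess


def build_capture_command(interface, seconds, out_path):
    """Build a tshark command that captures for a fixed duration into a pcap file."""
    return [
        "tshark",
        "-i", interface,               # capture interface
        "-a", f"duration:{seconds}",   # stop automatically after N seconds
        "-F", "pcap",                  # write classic pcap rather than pcapng
        "-w", out_path,                # output file
    ]


def capture(interface="wlan0", seconds=60, out_path="capture.pcap"):
    # Requires tshark to be installed and permission to capture on the interface.
    subprocess.run(build_capture_command(interface, seconds, out_path), check=True)
    return out_path
```

Keeping the command construction separate from its execution makes the capture settings easy to inspect and test without needing capture privileges.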

  • Connection-Based Feature Extraction

The captured packets are parsed and organized into individual network connections based on source and destination IP addresses, ports, and protocol. For each connection, a comprehensive set of features is derived, encompassing multiple dimensions of traffic behavior:

  • Basic Traffic Features: Capture essential flow characteristics such as connection duration, byte counts in each direction, protocol type, and anomalies in packet structure or flag settings.

  • Content Features: Focus on the payload content of the connection to identify patterns indicative of intrusion attempts, such as repeated login failures, privilege escalation commands, or unauthorized shell access.

  • Temporal Features: Analyze the frequency and timing of connections originating from the same source IP to detect suspicious bursts of activity or scanning behavior.

  • Host-Based Features: Examine traffic targeting the same destination host, assessing service diversity, source variation, and error patterns to flag potential coordinated attacks.
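
The grouping and a few of the basic, temporal, and host-based features can be sketched in plain Python. The `Packet` record stands in for whatever the pcap parser yields, and the feature names `same_src_count` and `same_dst_count`, along with the 2-second window, are simplified assumptions rather than the exact NSL-KDD definitions. Content features are omitted because they require payload inspection.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ts: float      # capture timestamp in seconds
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    length: int    # packet size in bytes


def build_connections(packets):
    """Group packets into connections keyed by a direction-independent 5-tuple."""
    conns = {}
    for p in sorted(packets, key=lambda p: p.ts):
        ends = tuple(sorted([(p.src, p.sport), (p.dst, p.dport)]))
        c = conns.setdefault((ends, p.proto), {
            "src": p.src, "dst": p.dst, "protocol_type": p.proto,
            "start": p.ts, "end": p.ts, "src_bytes": 0, "dst_bytes": 0,
        })
        c["end"] = p.ts
        # The sender of the first packet is treated as the connection originator.
        c["src_bytes" if p.src == c["src"] else "dst_bytes"] += p.length
    return list(conns.values())  # ordered by first-packet time


def add_window_features(conns, window=2.0):
    """Add duration plus counts of recent connections sharing a source or destination."""
    for i, c in enumerate(conns):
        c["duration"] = c["end"] - c["start"]
        recent = [o for o in conns[:i + 1] if c["start"] - o["start"] <= window]
        c["same_src_count"] = sum(o["src"] == c["src"] for o in recent)
        c["same_dst_count"] = sum(o["dst"] == c["dst"] for o in recent)
    return conns


packets = [
    Packet(0.0, "10.0.0.5", "10.0.0.1", 40000, 80, "tcp", 60),
    Packet(0.1, "10.0.0.1", "10.0.0.5", 80, 40000, "tcp", 1500),
    Packet(0.5, "10.0.0.5", "10.0.0.1", 40001, 22, "tcp", 60),
    Packet(5.0, "10.0.0.5", "10.0.0.2", 40002, 53, "udp", 70),
]
conns = add_window_features(build_connections(packets))
```

In this toy trace the first two packets form one connection, and the second connection sees two recent connections from the same source, which is the kind of burst the temporal features are meant to expose.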

  • Feature Formatting

All extracted features are compiled and stored in a machine learning-friendly format consistent with the ARFF specification. This format includes all 41 features defined in the NSL-KDD schema, ensuring the dataset is readily usable for training and evaluating intrusion detection models.
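
Writing ARFF needs no special library, since the format is a plain-text header followed by comma-separated rows. The helper below is a minimal sketch with a hypothetical name (`to_arff`) and only three of the 41 attributes, shown to illustrate the file layout.

```python
def to_arff(relation, attributes, rows):
    """Serialize rows to ARFF text.

    attributes: list of (name, kind), where kind is "numeric" or a list of
    allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        spec = "{" + ",".join(kind) + "}" if isinstance(kind, list) else kind
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(row[name]) for name, _ in attributes))
    return "\n".join(lines) + "\n"


attributes = [
    ("duration", "numeric"),
    ("protocol_type", ["tcp", "udp", "icmp"]),
    ("src_bytes", "numeric"),
]
rows = [{"duration": 0.1, "protocol_type": "tcp", "src_bytes": 60}]
arff_text = to_arff("live_capture", attributes, rows)
```

Declaring nominal attributes with their full value set in the header keeps files from different capture sessions compatible even when a session happens not to contain every protocol or service.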

 

  • Generating Synthetic Data using CTGAN

To address class imbalance, data scarcity, and privacy concerns in network intrusion detection—especially within benchmark datasets like NSL-KDD—this methodology leverages a Conditional Tabular Generative Adversarial Network (CTGAN) to generate realistic, class-balanced, and privacy-preserving synthetic data. The approach consists of three major components:

  • Real-Time Data Capture and Feature Extraction

A real-world network traffic capture system is deployed to collect live traffic via a packet sniffer from an active network interface. Captured packets are stored as .pcap files and then processed to extract NSL-KDD-style features, which include:

  • Basic traffic features (e.g., duration, protocol, byte count, flags),
  • Content-based features (e.g., login attempts, shell access),
  • Time-based metrics (e.g., connection rates from the same source),
  • Host-based features (e.g., connections to the same destination).

The extracted features are structured into a tabular format for consistency and compatibility with downstream generative modeling.

  • Data Transformation for Mixed-Type Features

To prepare this heterogeneous dataset for generative modeling, a specialized bidirectional transformation pipeline is applied:

  • Continuous features: Modeled using Gaussian Mixture Models (GMMs), capturing complex distributions with normalized values and soft assignments to Gaussian components.
  • Categorical features: Encoded using one-hot encoding to retain distinctiveness without ordinal bias.
  • The transformation supports parallel execution, enabling efficient handling of large-scale datasets.
  • It supports inverse transformations, allowing model outputs to be reconstructed back into human-interpretable formats for downstream tasks.

This ensures all 41 NSL-KDD features are transformed into a machine-learning-compatible format while preserving their semantic meaning.
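
The forward direction of this transformation can be sketched with scikit-learn's `GaussianMixture`. This is an illustrative approximation: the function names, the number of modes, and the synthetic two-mode column are assumptions, and the most likely mode is picked with `argmax` for determinism where a CTGAN-style transformer would typically sample from the soft assignments.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_continuous(values, n_modes=2, seed=0):
    """Fit a Gaussian mixture to one continuous column."""
    gmm = GaussianMixture(n_components=n_modes, random_state=seed)
    gmm.fit(values.reshape(-1, 1))
    return gmm


def transform_continuous(gmm, values):
    """Return (alpha, beta): the value normalized within its mode, and the one-hot mode."""
    probs = gmm.predict_proba(values.reshape(-1, 1))   # soft assignment per component
    modes = probs.argmax(axis=1)
    means = gmm.means_.ravel()[modes]
    stds = np.sqrt(gmm.covariances_.ravel())[modes]
    alpha = (values - means) / (4 * stds)
    beta = np.eye(gmm.n_components)[modes]
    return alpha, beta


def one_hot(labels, categories):
    """Encode categorical labels as one-hot rows in a fixed category order."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(labels), len(categories)))
    out[np.arange(len(labels)), [index[l] for l in labels]] = 1.0
    return out


# Synthetic two-mode column standing in for a feature such as src_bytes.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0, 1, 200), rng.normal(100, 5, 200)])
gmm = fit_continuous(values)
alpha, beta = transform_continuous(gmm, values)
```

Dividing by four standard deviations keeps almost all normalized values inside [-1, 1], which suits a generator with a bounded output activation.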

 

  • Synthetic Data Generation Using CTGAN

After preprocessing, a CTGAN model is trained to synthesize high-quality tabular data conditioned on specific feature values (e.g., attack types).

  • Conditional Sampling Strategy

To emphasize minority class representation, conditional sampling is performed:

  • Discrete columns are identified using metadata from preprocessing.
  • Category-wise distribution statistics are computed.
  • Weighted and smoothed sampling mechanisms ensure underrepresented categories are appropriately prioritized.
  • Conditional vectors (one-hot representations of classes) are included during both training and generation phases to guide class-specific sample synthesis.
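
One common way to realize this kind of smoothed weighting is log-frequency sampling, sketched below with the standard library. The exact weighting used in the project is not specified here, so the log-frequency rule, the function names, and the toy class counts are assumptions.

```python
import math
import random
from collections import Counter


def sampling_weights(labels):
    """Log-frequency weights: rare classes get a larger share than their raw frequency."""
    counts = Counter(labels)
    logs = {c: math.log(n + 1) for c, n in counts.items()}
    total = sum(logs.values())
    return {c: v / total for c, v in logs.items()}


def sample_condition(weights, categories, rng):
    """Pick a class by weight and return it with its one-hot conditional vector."""
    chosen = rng.choices(categories, weights=[weights[c] for c in categories])[0]
    return chosen, [1.0 if c == chosen else 0.0 for c in categories]


# Toy label column: "u2r" makes up 1% of the rows.
labels = ["normal"] * 900 + ["dos"] * 90 + ["u2r"] * 10
categories = ["normal", "dos", "u2r"]
weights = sampling_weights(labels)
chosen, cond_vec = sample_condition(weights, categories, random.Random(0))
```

With these counts the rarest class is drawn far more often than its 1% share of the data, while the majority class still remains the most likely draw.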

  • CTGAN Architecture and Training

The CTGAN consists of two core adversarial components:

  • Generator: Converts latent noise vectors—optionally conditioned on categorical values—into synthetic tabular samples that mimic the real dataset.
  • Discriminator: Learns to differentiate between real and synthetic samples and enforces data realism and diversity.

Key architectural and training enhancements include:

  • Residual blocks in the generator for stable gradient flow,
  • Sample packing in the discriminator for robustness,
  • Gradient Penalty (WGAN-GP) to maintain Lipschitz continuity,
  • Dropout to limit overfitting, with LeakyReLU activations to avoid inactive units.

The discriminator is updated multiple times per generator step to maintain adversarial balance. Training continues over several epochs, monitored via generator/discriminator loss metrics. The final generator is saved for reusable on-demand synthetic data generation.
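
The pieces listed above fit together roughly as in the PyTorch sketch below. Layer widths, the packing size `pac`, the penalty weight, the optimizer settings, and the five critic updates per generator step are assumptions; per-column output activations used by full CTGAN implementations are omitted for brevity.

```python
import torch
from torch import nn


class Residual(nn.Module):
    """Generator block that concatenates its output with its input (skip path)."""

    def __init__(self, n_in, n_out):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_in, n_out), nn.BatchNorm1d(n_out), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.net(x), x], dim=1)


def make_generator(noise_dim, cond_dim, data_dim, hidden=64):
    n_in = noise_dim + cond_dim
    return nn.Sequential(
        Residual(n_in, hidden),
        Residual(n_in + hidden, hidden),
        nn.Linear(n_in + 2 * hidden, data_dim),
    )


def make_critic(data_dim, cond_dim, pac=4, hidden=64):
    # Packing: the critic scores `pac` rows at once, concatenated side by side.
    return nn.Sequential(
        nn.Linear((data_dim + cond_dim) * pac, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.5),
        nn.Linear(hidden, hidden), nn.LeakyReLU(0.2), nn.Dropout(0.5),
        nn.Linear(hidden, 1),
    )


def gradient_penalty(critic, real, fake, pac=4, lam=10.0):
    # One interpolation coefficient per pack, shared by the rows inside it.
    eps = torch.rand(real.size(0) // pac, 1, 1).repeat(1, pac, real.size(1))
    eps = eps.view(-1, real.size(1))
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(mixed.view(-1, pac * real.size(1)))
    grads = torch.autograd.grad(score.sum(), mixed, create_graph=True)[0]
    norms = grads.view(-1, pac * real.size(1)).norm(2, dim=1)
    return lam * ((norms - 1) ** 2).mean()


def train_round(gen, critic, opt_g, opt_c, real, cond, noise_dim, pac=4, n_critic=5):
    width = pac * (real.size(1) + cond.size(1))
    real_in = torch.cat([real, cond], dim=1)
    for _ in range(n_critic):  # several critic updates per generator update
        noise = torch.randn(real.size(0), noise_dim)
        fake = gen(torch.cat([noise, cond], dim=1)).detach()
        fake_in = torch.cat([fake, cond], dim=1)
        loss_c = critic(fake_in.view(-1, width)).mean() - critic(real_in.view(-1, width)).mean()
        loss_c = loss_c + gradient_penalty(critic, real_in, fake_in, pac)
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
    noise = torch.randn(real.size(0), noise_dim)
    fake_in = torch.cat([gen(torch.cat([noise, cond], dim=1)), cond], dim=1)
    loss_g = -critic(fake_in.view(-1, width)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_c.item(), loss_g.item()


torch.manual_seed(0)
gen, critic = make_generator(16, 3, 6), make_critic(6, 3)
opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4, betas=(0.5, 0.9))
opt_c = torch.optim.Adam(critic.parameters(), lr=2e-4, betas=(0.5, 0.9))
real = torch.rand(8, 6)                          # random stand-in for transformed rows
cond = torch.eye(3)[torch.randint(0, 3, (8,))]   # one-hot class conditions
loss_c, loss_g = train_round(gen, critic, opt_g, opt_c, real, cond, noise_dim=16)
```

The batch size must be a multiple of `pac`, since the critic reshapes each group of `pac` rows into a single input.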

  • Inverse Transformation and Output Assembly

After synthetic samples are generated:

  • They undergo inverse transformation to revert encodings and scaling.
  • Continuous attributes are de-normalized, and categorical features are decoded from one-hot vectors.

The output is a realistic, interpretable, and tabular synthetic dataset, matching the structure of the original, and usable for model training or evaluation.
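
The decoding step can be illustrated with NumPy, assuming continuous columns were normalized per Gaussian mode as value = mean + 4 * std * alpha and categorical columns were one-hot encoded. The function names and the mode parameters in the example are hypothetical.

```python
import numpy as np


def inverse_continuous(alpha, beta, means, stds):
    """Undo mode-specific normalization: value = alpha * 4 * std_k + mean_k."""
    modes = np.argmax(beta, axis=1)            # most likely mode per row
    alpha = np.clip(alpha, -1, 1)              # guard against out-of-range outputs
    return alpha * 4 * stds[modes] + means[modes]


def inverse_categorical(scores, categories):
    """Decode each row of (soft) one-hot scores to its most likely category."""
    return [categories[i] for i in np.argmax(scores, axis=1)]


# Hypothetical fitted mode parameters for one continuous column.
means, stds = np.array([0.0, 100.0]), np.array([1.0, 5.0])
alpha = np.array([0.25, -0.5])
beta = np.array([[0.9, 0.1], [0.2, 0.8]])
values = inverse_continuous(alpha, beta, means, stds)

scores = np.array([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]])
protocols = inverse_categorical(scores, ["tcp", "udp", "icmp"])
```

Here the first row decodes to 1.0 in the first mode and the second row to 90.0 in the second mode, while the protocol scores decode to "udp" and "tcp".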

 

Conclusion

This study proposes an integrated approach to network intrusion detection, combining real-time traffic capture, deep learning classification, and synthetic data generation. Using benchmark datasets like UNSW-NB15 and NSL-KDD, the model achieves 97.31% accuracy with a Stacked Contractive Autoencoder and Gated Convolution (SCAE-GC), enhancing robustness through hierarchical training and dynamic feature selection.

To adapt to real-world scenarios, live packet capture and NSL-KDD-style feature extraction are implemented. Additionally, a CTGAN-based generator addresses data imbalance and privacy concerns by creating realistic, interpretable synthetic samples.

Overall, the methodology offers a scalable and effective framework for modern intrusion detection in dynamic and data-constrained environments.

 

GitHub Link:

https://github.com/vishruth2005/IntrusionDetectionSystem/

 

Project Mentors:

Gouri M R

Smruthi Bhat

Upasana Nayak

 

Project Mentees:

Vishruth 

Vaibhavi

Prahas

Sashank
