Virtual Expo 2026

MLForge: A Unified Framework for Cloud-Agnostic MLOps

Year-Long Project – CompSoc

Aim

The primary objective of this project is to design and develop an open-source, cloud-agnostic MLOps platform that delivers functionality comparable to—and extensible beyond—modern platforms such as AWS SageMaker, while remaining fully independent of proprietary cloud ecosystems.

The system is intended to serve as an end-to-end lifecycle management platform for machine learning, with the following specific goals:

  • Develop a unified framework integrating data ingestion, preprocessing, training, evaluation, deployment, and monitoring.
  • Ensure infrastructure independence, enabling deployment across on-premises systems, private clouds, and public cloud environments without vendor lock-in.
  • Implement native version control for datasets, models, and pipelines to guarantee bit-perfect reproducibility and traceability.
  • Design a robust experiment tracking system capable of logging hyperparameters, metrics, artifacts, and runtime configurations.
  • Enable automated ML workflows with intelligent dependency resolution and parallel execution.
  • Integrate ML-native CI/CD pipelines with automated validation, regression testing, and controlled deployment.
  • Provide a centralized, high-availability model registry with pluggable backend support.
  • Maintain a fully open-source, modular, and extensible architecture.


INTRODUCTION

Modern machine learning systems suffer from what can be described as a “Crisis of Infrastructure Fragility.” Practitioners frequently rely on fragmented tools—such as DVC for data versioning, MLflow for experiment tracking, and ad hoc scripts for deployment—leading to inefficiencies and hidden technical debt.

This fragmentation introduces severe challenges, including:

  • Difficulty in reproducing experiments
  • Lack of traceability across pipelines
  • Increased operational complexity

MLForge addresses these challenges by introducing a unified, modular architecture that consolidates the entire machine learning lifecycle into a single, authoritative system.

The platform is structured into five core subsystems:

  1. Data Versioning
  2. Environment Provisioning
  3. Metric Tracking
  4. Model Registry
  5. CI/CD Automation

Built using a high-performance FastAPI backend and asynchronous event-driven design, MLForge ensures seamless transitions from raw data ingestion to production deployment, independent of the underlying infrastructure.


LITERATURE SURVEY

Existing MLOps solutions can be broadly categorized into three classes:

Experiment Tracking Systems

Tools such as MLflow and Weights & Biases provide strong visualization capabilities but lack:

  • Integrated dataset versioning
  • Automated deployment workflows

Model Serving Frameworks

Systems like TorchServe and TensorFlow Serving specialize in inference but offer:

  • Limited lifecycle management
  • Minimal integration with training pipelines

End-to-End Platforms

Platforms such as AWS SageMaker (proprietary) and Kubeflow (open source, but tightly coupled to Kubernetes) provide comprehensive solutions but suffer from:

  • High operational complexity
  • Vendor lock-in
  • Infrastructure rigidity

Research Contributions Incorporated

MLForge builds upon established research principles:

  • Data lineage tracking for reproducibility (Henderson et al., 2018)
  • Automated validation gates (Breck et al., 2019)
  • Event-driven ML system monitoring (Sculley et al., 2015)

By combining these principles, MLForge delivers a lightweight yet scalable alternative with a pluggable backend architecture.


TECHNOLOGIES USED

Backend and Core Logic

  • Python 3.8+ – Core programming language
  • FastAPI – High-performance asynchronous API framework
  • Uvicorn – Production-grade ASGI server

Data Management and Storage

  • SQLAlchemy ORM – Unified database abstraction
  • PostgreSQL – High-concurrency relational storage
  • MongoDB / DynamoDB / Firestore – Pluggable NoSQL backends
  • SHA-256 Hashing – Content-addressed versioning for reproducibility

Machine Learning and Experimentation

  • NumPy & scikit-learn – Numerical computation and classical ML algorithms
  • Pydantic – Runtime validation and type safety
  • Server-Sent Events (SSE) – Real-time event streaming
  • PyTest – Unit and integration testing

Deployment and CI/CD

  • Docker – Containerized reproducible environments
  • Hugging Face Hub SDK – Model hosting and deployment
  • ParallelExecutor – Concurrent execution engine

METHODOLOGY

The architecture of MLForge is based on specialized functional units (“Managers”), each responsible for a distinct stage of the ML lifecycle.

Dataset Versioning System

This phase establishes an immutable data foundation:

  • Uses content-addressable storage with SHA-256 hashing
  • Ensures exact reproducibility of datasets
  • Implements deduplication, storing identical files only once
  • Maintains a persistent staging area for tracking changes

Environment Provisioning

To eliminate environment inconsistencies:

  • Supports execution across:
    • Local systems
    • Docker containers
    • Remote servers (SSH)
    • Cloud VMs
  • Automatically detects hardware (e.g., CUDA/MPS)
  • Synchronizes Git commits and dependencies
  • Creates a clean, reproducible runtime environment
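A best-effort version of the detection and fingerprinting step might look like the following. This is a sketch under the assumption that PyTorch is the optional GPU stack; MLForge's actual probe and the `environment_fingerprint` name may differ.

```python
import platform
import sys

def detect_accelerator() -> str:
    """Best-effort hardware probe; falls back to CPU when no GPU
    stack is importable (illustrative, not MLForge's actual code)."""
    try:
        import torch  # optional dependency, assumed here
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"

def environment_fingerprint() -> dict:
    """Capture the runtime facts needed to reproduce an execution."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "accelerator": detect_accelerator(),
    }
```

Storing this fingerprint alongside each run (together with the synchronized Git commit and pinned dependencies) is what lets the same experiment be replayed on a local machine, a Docker container, or a cloud VM.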

Experiment & Metric Tracking

During training:

  • Logs metrics such as loss, accuracy, and artifacts
  • Uses batch-flushing strategies for efficient storage
  • Streams real-time updates via the SSE Event Bus
  • Eliminates the need for inefficient database polling
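Batch flushing cuts storage round-trips from one per metric point to one per batch. A minimal sketch of the idea follows; the `MetricLogger` name and the `sink` callback are illustrative assumptions, not MLForge's actual interface.

```python
class MetricLogger:
    """Buffer metric points and flush them to storage in batches."""

    def __init__(self, sink, batch_size: int = 100):
        self.sink = sink            # callable receiving a list of points
        self.batch_size = batch_size
        self.buffer = []

    def log(self, step: int, name: str, value: float) -> None:
        """Record one metric point; flush when the buffer fills up."""
        self.buffer.append({"step": step, "metric": name, "value": value})
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Perform one write for the whole batch, then reset the buffer."""
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []
```

In the full system the same flush event would also be published on the SSE Event Bus, so dashboards receive updates as they happen instead of polling the database.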

Model Registry

After training:

  • Stores models along with their complete metadata, including:
    • Hyperparameters
    • Metrics
    • Dataset hashes
  • Supports backend-agnostic storage
  • Enables seamless transition from experimentation to deployment
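Backend-agnostic storage can be achieved by programming against a dict-like interface, so a PostgreSQL, MongoDB, or in-memory backend can be swapped in behind the same calls. The simplified sketch below uses illustrative names, not MLForge's actual registry API.

```python
class ModelRegistry:
    """Minimal registry sketch: `backend` is any dict-like store,
    which is what makes the storage layer pluggable."""

    def __init__(self, backend=None):
        self.backend = backend if backend is not None else {}

    def register(self, name, version, hyperparams, metrics, dataset_hash):
        """Store a model's complete metadata under a name:version key."""
        key = f"{name}:{version}"
        self.backend[key] = {
            "hyperparameters": hyperparams,
            "metrics": metrics,
            "dataset_hash": dataset_hash,  # ties the model to its exact data
        }
        return key

    def get(self, name, version):
        """Look up the metadata recorded for a given model version."""
        return self.backend[f"{name}:{version}"]
```

Recording the dataset hash alongside hyperparameters and metrics is what closes the reproducibility loop: any registered model can be traced back to the exact data version it was trained on.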

CI/CD Pipeline

Before deployment:

  • Executes automated validation pipelines
  • Performs:
    • Artifact integrity checks
    • Shape validation
    • Regression testing
  • Uses a Deployment Gate mechanism:
    • Only deploys models meeting strict performance thresholds
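The Deployment Gate reduces to a predicate over the candidate's metrics, the configured minimum thresholds, and an optional production baseline. The sketch below covers only the threshold and regression checks; in the pipeline described above it would sit alongside the artifact-integrity and shape validations, and all names are illustrative.

```python
def deployment_gate(candidate_metrics, thresholds, baseline_metrics=None):
    """Return True only if every metric meets its threshold and the
    candidate does not regress against the current baseline."""
    # Absolute check: each metric must clear its configured minimum.
    for metric, minimum in thresholds.items():
        if candidate_metrics.get(metric, float("-inf")) < minimum:
            return False
    # Regression check: never deploy a model worse than production.
    if baseline_metrics:
        for metric, value in baseline_metrics.items():
            if candidate_metrics.get(metric, float("-inf")) < value:
                return False
    return True
```

A missing metric evaluates as negative infinity, so an incomplete evaluation report fails closed rather than slipping through the gate.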

RESULTS

The platform was validated across four areas:

  • Endpoint and integration testing
  • CI/CD pipeline validation
  • Dataset versioning integrity
  • Real-time streaming performance

CONCLUSION

MLForge successfully demonstrates a production-ready, cloud-agnostic MLOps platform that unifies the fragmented machine learning lifecycle into a single, authoritative system.

Key achievements include:

  • Complete reproducibility of experiments
  • Integrated lifecycle management
  • Automated, validated deployment pipelines
  • Scalable and modular architecture

As a transparent and cost-effective alternative to proprietary systems, MLForge enables both research and production teams to operate with scientific rigor and engineering efficiency.


Future Work

Future enhancements will focus on:

  • Multi-tenant enterprise workflows
  • JWT-based role-based access control (RBAC)
  • Integration with Prometheus and Grafana for production monitoring

REFERENCES

  1. Breck, E., et al. (2019). Data Validation for Machine Learning. SysML.
  2. Sculley, D., et al. (2015). Hidden Technical Debt in Machine Learning Systems. NIPS.
  3. FastAPI Documentation
  4. MLForge Repository
  5. Henderson, et al. (2018). MLOps Best Practices.

PROJECT TEAM

Mentors

  • Vishruth V Srivatsa
  • Akhil Sakthieswaran

Mentees

  • Shriya Bharadwaj
  • Ajitesh Kallepalli
  • Abhishek Sulakhe
  • Siddhanth Saha
