MLForge: A Unified Framework for Cloud-Agnostic MLOps
Abstract
Aim
The primary objective of this project is to design and develop an open-source, cloud-agnostic MLOps platform that delivers functionality comparable to—and extensible beyond—modern platforms such as AWS SageMaker, while remaining fully independent of proprietary cloud ecosystems.
The system is intended to serve as an end-to-end lifecycle management platform for machine learning, with the following specific goals:
- Develop a unified framework integrating data ingestion, preprocessing, training, evaluation, deployment, and monitoring.
- Ensure infrastructure independence, enabling deployment across on-premises systems, private clouds, and public cloud environments without vendor lock-in.
- Implement native version control for datasets, models, and pipelines to guarantee bit-perfect reproducibility and traceability.
- Design a robust experiment tracking system capable of logging hyperparameters, metrics, artifacts, and runtime configurations.
- Enable automated ML workflows with intelligent dependency resolution and parallel execution.
- Integrate ML-native CI/CD pipelines with automated validation, regression testing, and controlled deployment.
- Provide a centralized, high-availability model registry with pluggable backend support.
- Maintain a fully open-source, modular, and extensible architecture.
INTRODUCTION
Modern machine learning systems suffer from what can be described as a “Crisis of Infrastructure Fragility.” Practitioners frequently rely on fragmented tools—such as DVC for data versioning, MLflow for experiment tracking, and ad hoc scripts for deployment—leading to inefficiencies and hidden technical debt.
This fragmentation introduces severe challenges, including:
- Difficulty in reproducing experiments
- Lack of traceability across pipelines
- Increased operational complexity
MLForge addresses these challenges by introducing a unified, modular architecture that consolidates the entire machine learning lifecycle into a single authoritative system.
The platform is structured into five core subsystems:
- Data Versioning
- Environment Provisioning
- Metric Tracking
- Model Registry
- CI/CD Automation
Built using a high-performance FastAPI backend and asynchronous event-driven design, MLForge ensures seamless transitions from raw data ingestion to production deployment, independent of the underlying infrastructure.
LITERATURE SURVEY
Existing MLOps solutions can be broadly categorized into three classes:
Experiment Tracking Systems
Tools such as MLflow and Weights & Biases provide strong visualization capabilities but lack:
- Integrated dataset versioning
- Automated deployment workflows
Model Serving Frameworks
Systems like TorchServe and TensorFlow Serving specialize in inference but offer:
- Limited lifecycle management
- Minimal integration with training pipelines
Proprietary End-to-End Platforms
Platforms such as AWS SageMaker and Kubeflow provide comprehensive solutions but suffer from:
- High operational complexity
- Vendor lock-in
- Infrastructure rigidity
Research Contributions Incorporated
MLForge builds upon established research principles:
- Data lineage tracking for reproducibility (Henderson et al., 2018)
- Automated validation gates (Breck et al., 2019)
- Event-driven ML system monitoring (Sculley et al., 2015)
By combining these principles, MLForge delivers a lightweight yet scalable alternative with a pluggable backend architecture.
TECHNOLOGIES USED
Backend and Core Logic
- Python 3.8+ – Core programming language
- FastAPI – High-performance asynchronous API framework
- Uvicorn – Production-grade ASGI server
Data Management and Storage
- SQLAlchemy ORM – Unified database abstraction
- PostgreSQL – High-concurrency relational storage
- MongoDB / DynamoDB / Firestore – Pluggable NoSQL backends
- SHA-256 Hashing – Content-addressed versioning for reproducibility
Machine Learning and Experimentation
- NumPy & Scikit-Learn – Core numerical computation
- Pydantic – Runtime validation and type safety
- Server-Sent Events (SSE) – Real-time event streaming
- PyTest – Integration and unit testing
Deployment and CI/CD
- Docker – Containerized reproducible environments
- Hugging Face Hub SDK – Model hosting and deployment
- ParallelExecutor – Concurrent execution engine
METHODOLOGY
The architecture of MLForge is based on specialized functional units (“Managers”), each responsible for a distinct stage of the ML lifecycle.
Dataset Versioning System
This phase establishes an immutable data foundation:
- Uses content-addressable storage with SHA-256 hashing
- Ensures exact reproducibility of datasets
- Implements deduplication, storing identical files only once
- Maintains a persistent staging area for tracking changes
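The content-addressed storage described above can be sketched as follows. This is a minimal illustration, not MLForge's actual implementation; the function names (`hash_file`, `store`) and the two-level directory layout are assumptions chosen for the example.

```python
import hashlib
import shutil
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, streaming in 1 MiB chunks
    so arbitrarily large datasets never need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def store(path: Path, cas_root: Path) -> str:
    """Copy a file into content-addressed storage keyed by its digest.
    Identical content maps to the same path, so duplicates are stored once."""
    digest = hash_file(path)
    dest = cas_root / digest[:2] / digest  # shard by prefix to keep dirs small
    if not dest.exists():                  # deduplication: skip known content
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dest)
    return digest
```

Because the digest is derived purely from the bytes, re-running `store` on an unchanged dataset returns the same identifier, which is what makes version references bit-perfectly reproducible.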
Environment Provisioning
To eliminate environment inconsistencies:
- Supports execution across:
- Local systems
- Docker containers
- Remote servers (SSH)
- Cloud VMs
- Automatically detects hardware (e.g., CUDA/MPS)
- Synchronizes Git commits and dependencies
- Creates a clean, reproducible runtime environment
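The hardware-detection step above can be approximated without importing a deep learning framework. This is a best-effort sketch under the assumption that an NVIDIA driver exposes `nvidia-smi` and that Apple Silicon identifies as `arm64` on Darwin; MLForge's actual detection logic may differ.

```python
import platform
import shutil


def detect_accelerator() -> str:
    """Best-effort accelerator detection using only the standard library.

    Checks for an NVIDIA driver (CUDA) via the nvidia-smi binary, then
    for Apple Silicon (MPS); falls back to CPU.
    """
    if shutil.which("nvidia-smi"):
        return "cuda"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mps"
    return "cpu"
```

A provisioning layer would use this result to select the right container image or framework device string before launching a run.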
Experiment & Metric Tracking
During training:
- Logs metrics such as loss, accuracy, and artifacts
- Uses batch-flushing strategies for efficient storage
- Streams real-time updates via the SSE Event Bus
- Eliminates the need for inefficient database polling
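The batch-flushing strategy can be sketched as a buffered logger that writes many records per I/O operation. The class name, JSON-lines format, and batch size are illustrative assumptions, not MLForge's concrete API.

```python
import json
import time
from pathlib import Path
from typing import List


class MetricLogger:
    """Buffer metric records in memory and flush them in batches,
    reducing write amplification versus one write per metric."""

    def __init__(self, log_path: Path, batch_size: int = 100):
        self.log_path = log_path
        self.batch_size = batch_size
        self._buffer: List[dict] = []

    def log(self, step: int, **metrics: float) -> None:
        self._buffer.append({"step": step, "time": time.time(), **metrics})
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._buffer:
            return
        with self.log_path.open("a") as f:  # one write for the whole batch
            f.write("\n".join(json.dumps(r) for r in self._buffer) + "\n")
        self._buffer.clear()
```

In a live system the same flush point is a natural place to publish the batch onto the SSE event bus, so dashboards receive pushed updates instead of polling the database.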
Model Registry
After training:
- Stores models along with their complete metadata, including:
- Hyperparameters
- Metrics
- Dataset hashes
- Supports backend-agnostic storage
- Enables seamless transition from experimentation to deployment
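A registry record of the kind described above can be sketched with a frozen dataclass tying the artifact to its hyperparameters, metrics, and dataset hash. The class and method names are hypothetical; a pluggable backend would persist `asdict(entry)` instead of the in-memory dict used here.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class RegisteredModel:
    """A registry record linking a model artifact to everything needed
    to reproduce it: hyperparameters, metrics, and the dataset hash."""
    name: str
    version: int
    artifact_uri: str
    dataset_sha256: str
    hyperparameters: Dict[str, float] = field(default_factory=dict)
    metrics: Dict[str, float] = field(default_factory=dict)


class ModelRegistry:
    """Minimal in-memory registry keyed by (name, version)."""

    def __init__(self) -> None:
        self._models: Dict[Tuple[str, int], RegisteredModel] = {}

    def register(self, entry: RegisteredModel) -> None:
        key = (entry.name, entry.version)
        if key in self._models:  # versions are immutable once registered
            raise ValueError(f"{entry.name} v{entry.version} already exists")
        self._models[key] = entry

    def latest(self, name: str) -> RegisteredModel:
        candidates = [m for (n, _), m in self._models.items() if n == name]
        return max(candidates, key=lambda m: m.version)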
CI/CD Pipeline
Before deployment:
- Executes automated validation pipelines
- Performs:
- Artifact integrity checks
- Shape validation
- Regression testing
- Uses a Deployment Gate mechanism:
- Only deploys models meeting strict performance thresholds
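The Deployment Gate can be sketched as a pure predicate over the candidate's metrics: every metric must clear its configured threshold, and optionally must not regress against the current production baseline. The function name and dict-based interface are assumptions for illustration.

```python
from typing import Dict, Optional


def deployment_gate(metrics: Dict[str, float],
                    thresholds: Dict[str, float],
                    baseline: Optional[Dict[str, float]] = None) -> bool:
    """Return True only if every metric meets its minimum threshold and,
    when a baseline is given, does not regress against production."""
    for name, minimum in thresholds.items():
        if metrics.get(name, float("-inf")) < minimum:
            return False  # absolute quality bar not met
    if baseline:
        for name, current in baseline.items():
            if metrics.get(name, float("-inf")) < current:
                return False  # regression versus the deployed model
    return True
```

Gating on a pure function keeps the decision auditable: the CI/CD pipeline can log the exact metrics, thresholds, and verdict alongside the model version that was promoted or rejected.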
RESULTS
System validation covered the following areas:
- Endpoint and Integration Testing
- CI/CD Pipeline Validation
- Dataset Versioning Integrity
- Real-Time Streaming Performance
CONCLUSION
MLForge successfully demonstrates a production-ready, cloud-agnostic MLOps platform that unifies the fragmented machine learning lifecycle into a single, authoritative system.
Key achievements include:
- Complete reproducibility of experiments
- Integrated lifecycle management
- Automated, validated deployment pipelines
- Scalable and modular architecture
As a transparent and cost-effective alternative to proprietary systems, MLForge enables both research and production teams to operate with scientific rigor and engineering efficiency.
Future Work
Future enhancements will focus on:
- Multi-tenant enterprise workflows
- JWT-based role-based access control (RBAC)
- Integration with Prometheus and Grafana for production monitoring
REFERENCES
- Breck, E., et al. (2019). Data Validation for Machine Learning. SysML.
- Sculley, D., et al. (2015). Hidden Technical Debt in Machine Learning Systems. NIPS.
- FastAPI Documentation
- MLForge Repository
- Henderson et al. (2018). MLOps Best Practices.
PROJECT TEAM
Mentors
- Vishruth V Srivatsa
- Akhil Sakthieswaran
Mentees
- Shriya Bharadwaj
- Ajitesh Kallepalli
- Abhishek Sulakhe
- Siddhanth Saha
Report Information
Report Details
Created: April 7, 2026, 4:14 p.m.
Approved by: None
Approval date: None