I am wondering how to build an MVP MLOps platform.
A lot of MLOps tools have hidden costs. You need entire teams to maintain them.
That’s not what I want.
I want an MVP platform.
For me, an MVP is something that is simple but complete enough to be useful. Something that lets you focus on the essentials of building, even if it is not perfect, costs a bit more, or adds a bit of extra latency.
I want something fast to set up, easy to use, and easy to understand.
So let me first define the thing:
An MLOps platform is a set of tools, services, practices and processes that help you build, deploy, and manage your ML models.
I am not interested in the practices and processes for now. This isn’t really something I can MVP.
I want to focus on the tools and services that are needed to build, deploy, and manage ML models.
An MLOps platform is very hard to build. It is a complex system that requires a lot of different components to work together:
- Pipeline (Usually separate ones for (a) data (b) training (c) inference)
- Model registry & versioning
- Experiment tracking
- Feature store
- Compute (AI & non-AI)
- Storage (Artifacts, Features, Inputs / Outputs, etc.)
- Monitoring (of the system and the model performance)
- Secrets (to keep credentials and API keys secure)
- Infrastructure Management (to get all the things up and running)
- CI/CD (to automate the deployment of the things)
That’s a shit ton of stuff I need to build just to get something up and running.
To date, I have only seen two extremes. A massive MLOps platform that requires a bunch of engineers to build and maintain. Or a bunch of small tools that are not integrated and require a lot of manual work to get something new up and running.
I want something in the middle. Something that is simple, but complete enough to be useful.
I am dreaming up an MVP MLOps platform made of managed services that are owned by the user. Every service should have a generous free tier.
I first thought about using pipeline tools like Prefect or Mage. But I think that’s too much. They come with so much overhead. I want everything to be serverless, so this is the stack I have in mind:
- Pipeline: Orchestration with pub/sub or a message queue (e.g. Google Cloud Pub/Sub, AWS SQS)
- Model registry & Experiment tracking: Weights & Biases
- Compute: Modal or BentoML for AI, Lambda / Cloud Run for non-AI
- Storage: S3 / GCS
- Secrets: AWS Secrets Manager, Google Secret Manager
- Logging: AWS CloudWatch, Google Cloud Logging, Datadog
- CI / CD: GitHub Actions
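To make the pub/sub orchestration idea a bit more concrete, here is a minimal in-memory sketch. It is not tied to any real service: `InMemoryBroker`, the topic names, and the handler functions are all hypothetical stand-ins I made up for illustration. In the real thing, the broker would be a managed service like Pub/Sub or SQS, and each handler would be its own serverless function.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy stand-in for a managed message service (hypothetical, not a real API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of handlers
        self._queue = deque()                  # pending (topic, message) pairs

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        self._queue.append((topic, message))

    def run(self):
        # Deliver messages in FIFO order until nothing is left to process.
        while self._queue:
            topic, message = self._queue.popleft()
            for handler in self._subscribers[topic]:
                handler(message)


broker = InMemoryBroker()
results = []


def prepare_data(msg):
    # Data step: transform the raw input, then announce that data is ready.
    rows = [x * 2 for x in msg["raw"]]
    broker.publish("data.ready", {"rows": rows})


def train(msg):
    # "Training" step: here just a mean, standing in for a real model fit.
    mean = sum(msg["rows"]) / len(msg["rows"])
    broker.publish("model.trained", {"mean": mean})


def collect(msg):
    # Final step: record the outcome (a real system might write to a registry).
    results.append(msg)


# Wiring topics to handlers is the whole "orchestration".
broker.subscribe("data.raw", prepare_data)
broker.subscribe("data.ready", train)
broker.subscribe("model.trained", collect)

broker.publish("data.raw", {"raw": [1, 2, 3]})
broker.run()
print(results)  # [{'mean': 4.0}]
```

The point of the sketch is that no step knows about the next one. Each step only knows which topic it listens to and which topic it publishes to, which is what would make it possible to run every step as a separate function.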
How I imagine this system to work:
- We use a ZenML- or Metaflow-like syntax to define our pipeline.
- Each pipeline step has a Pydantic model as its input and as its output. The input of the first step is the input of the pipeline. The output of the last step is the output of the pipeline.
- Each pipeline step is deployed as a serverless function (Lambda, Cloud Run, Modal, BentoML, etc.).
- Everything is code.
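Here is a rough sketch of what that decorator-style syntax with typed inputs and outputs could look like. Everything in it is an assumption on my side: the `Pipeline` class, the `step` decorator, and the example models are hypothetical, not the API of ZenML, Metaflow, or any other tool. To keep it dependency-free, I use stdlib dataclasses as stand-ins for the Pydantic models, and the steps run in-process instead of as serverless functions.

```python
from dataclasses import dataclass
from typing import get_type_hints


# Stand-ins for Pydantic models (assumption: dataclasses keep this stdlib-only).
@dataclass
class RawText:
    text: str


@dataclass
class Tokens:
    tokens: list


@dataclass
class TokenCount:
    count: int


class Pipeline:
    """Hypothetical pipeline that chains steps by their type hints."""

    def __init__(self, name):
        self.name = name
        self.steps = []  # list of (function, input type, output type)

    def step(self, fn):
        hints = get_type_hints(fn)
        out_type = hints.pop("return")
        (in_type,) = hints.values()  # each step takes exactly one typed input
        # A step's input must match the output of the previous step.
        if self.steps and self.steps[-1][2] is not in_type:
            raise TypeError(
                f"{fn.__name__} expects {in_type.__name__}, "
                f"but the previous step returns {self.steps[-1][2].__name__}"
            )
        self.steps.append((fn, in_type, out_type))
        return fn

    def run(self, payload):
        # In-process execution; a deployed version would invoke each step remotely.
        for fn, in_type, out_type in self.steps:
            if not isinstance(payload, in_type):
                raise TypeError(f"{fn.__name__} received {type(payload).__name__}")
            payload = fn(payload)
        return payload


pipeline = Pipeline("token-count")


@pipeline.step
def tokenize(inp: RawText) -> Tokens:
    return Tokens(tokens=inp.text.split())


@pipeline.step
def count(inp: Tokens) -> TokenCount:
    return TokenCount(count=len(inp.tokens))


result = pipeline.run(RawText(text="simple but complete enough"))
print(result)  # TokenCount(count=4)
```

Because the types are checked when a step is registered, a mismatched step fails at definition time rather than halfway through a run. With real Pydantic models, the same types could also handle serializing the payload between functions.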