Deploy as HTTP API
MLOps Desktop includes a built-in HTTP server for model serving. Deploy any trained model and get predictions via REST API—no additional infrastructure needed.
Time to complete: ~10 minutes
Prerequisites
- A trained model (complete the Quickstart first)
- Python packages:

  ```bash
  pip install fastapi uvicorn slowapi
  ```

- Optional, for ONNX:

  ```bash
  pip install onnxruntime
  ```
Using the Serving Tab
1. Open the Serving tab

   In the Output Panel at the bottom, click the Serving tab.

2. Select a model

   Choose from:

   - Registered models from the Models tab
   - Latest trained model from the current session

   Select a specific version if multiple exist.

3. Configure the server

   Click the Configure button (gear icon):

   | Setting | Default | Description |
   |---|---|---|
   | Host | 0.0.0.0 | Listen address |
   | Port | 8000 | HTTP port |
   | Use ONNX Runtime | Off | Enable for faster inference |

4. Start the server

   Click Start Server. Status changes: Stopped → Starting → Running.

   You'll see the server URL: http://localhost:8000

5. Make predictions

   The API is now ready. Use curl, Python, or any HTTP client.
API Endpoints
Once running, the server provides these endpoints:
Health Check
```bash
curl http://localhost:8000/health
```

Response:

```json
{"status": "healthy", "model": "RandomForestClassifier"}
```

Predict

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [[5.1, 3.5, 1.4, 0.2]]}'
```

Response (classification):

```json
{
  "predictions": [0],
  "probabilities": [[0.98, 0.01, 0.01]]
}
```

Response (regression):

```json
{
  "predictions": [24.5]
}
```
Batch Predictions

Send multiple samples:

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "features": [
      [5.1, 3.5, 1.4, 0.2],
      [6.2, 3.4, 5.4, 2.3],
      [4.9, 2.5, 4.5, 1.7]
    ]
  }'
```

API Documentation
FastAPI provides automatic docs:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
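Both pages are generated from the same OpenAPI schema. If you want the schema itself (for example, to generate a client), here is a minimal sketch; it assumes the built-in server keeps FastAPI's default /openapi.json path:

```python
import requests

# FastAPI serves the machine-readable schema at /openapi.json by default;
# whether the built-in server changes this path is an assumption here.
schema = requests.get("http://localhost:8000/openapi.json").json()
print(schema["info"]["title"])
print(list(schema["paths"]))  # should include /health and /predict
```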
Request Metrics
The Serving tab shows real-time metrics:
| Metric | Description |
|---|---|
| Total Requests | Count since server start |
| Success Rate | Percentage of 2xx responses |
| Avg Latency | Mean response time |
| Requests/min | Current throughput |
Request Log
A table shows recent requests:
| Time | Method | Path | Status | Latency | Batch Size |
|---|---|---|---|---|---|
| 10:30:15 | POST | /predict | 200 | 12ms | 1 |
| 10:30:18 | POST | /predict | 200 | 45ms | 100 |
| 10:30:22 | GET | /health | 200 | 2ms | - |
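The Batch Size column is worth watching: one request carrying many rows is usually far cheaper than the same rows sent one at a time. A rough client-side timing sketch, assuming the server from this guide is reachable at localhost:8000 and the default rate limit described later is in effect:

```python
import time
import requests

URL = "http://localhost:8000/predict"
rows = [[5.1, 3.5, 1.4, 0.2]] * 20  # keep the total well under the rate limit

# 20 single-row requests
start = time.perf_counter()
for row in rows:
    requests.post(URL, json={"features": [row]})
single = time.perf_counter() - start

# One batched request carrying the same 20 rows
start = time.perf_counter()
requests.post(URL, json={"features": rows})
batched = time.perf_counter() - start

print(f"20 single requests: {single:.3f}s  |  one batch of 20: {batched:.3f}s")
```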
ONNX Runtime
Enable ONNX for faster inference:

- Install onnxruntime:

  ```bash
  pip install onnxruntime
  ```

- Export your model as ONNX (using the ModelExporter node); you can sanity-check the exported file with the sketch after this list
- In the Serving config, enable Use ONNX Runtime
- Select the `.onnx` model file
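A minimal local check with onnxruntime before pointing the Serving config at the file; the model.onnx file name and the four-feature input are assumptions:

```python
import numpy as np
import onnxruntime as ort

# Load the exported model and inspect its input signature.
session = ort.InferenceSession("model.onnx")
inp = session.get_inputs()[0]
print(inp.name, inp.shape, inp.type)

# Run one sample through the model; adjust the feature count to your data.
sample = np.array([[5.1, 3.5, 1.4, 0.2]], dtype=np.float32)
outputs = session.run(None, {inp.name: sample})
print(outputs[0])  # predicted labels (or values, for regression)
```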
Using with Python
```python
import requests

# Single prediction
response = requests.post(
    "http://localhost:8000/predict",
    json={"features": [[5.1, 3.5, 1.4, 0.2]]},
)
result = response.json()
print(f"Predicted class: {result['predictions'][0]}")
print(f"Confidence: {max(result['probabilities'][0]):.1%}")

# Batch prediction
import pandas as pd

df = pd.read_csv("new_data.csv")
features = df.drop(columns=["target"]).values.tolist()

response = requests.post(
    "http://localhost:8000/predict",
    json={"features": features},
)
predictions = response.json()["predictions"]
```

Using with JavaScript
```javascript
const response = await fetch('http://localhost:8000/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ features: [[5.1, 3.5, 1.4, 0.2]] })
});

const { predictions, probabilities } = await response.json();
console.log(`Predicted: ${predictions[0]}`);
```

Server Lifecycle
| State | Description |
|---|---|
| Stopped | Server not running |
| Starting | Loading model, initializing FastAPI |
| Running | Accepting requests |
| Stopping | Graceful shutdown |
Click Stop Server to shut down gracefully.
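If you script against the server, it helps to wait for the Running state instead of firing requests while it is still Starting. A minimal polling sketch using the /health endpoint shown earlier; the wait_until_running helper is illustrative:

```python
import time
import requests

def wait_until_running(url: str = "http://localhost:8000/health", timeout: float = 30.0) -> bool:
    """Poll /health until the server answers, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # Still starting; the port isn't accepting connections yet.
        time.sleep(1)
    return False

if wait_until_running():
    print("Server is Running")
else:
    print("Server did not reach Running state in time")
```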
Export for Production
For production deployment outside MLOps Desktop:
Use the ModelExporter node to save:
- `.joblib` — Python sklearn applications
- `.onnx` — Cross-platform, optimized inference
- `.pkl` — Python native (security risk)
Create your own FastAPI server:
```python
from fastapi import FastAPI
import joblib
import numpy as np

app = FastAPI()
model = joblib.load("model.joblib")

@app.post("/predict")
async def predict(data: dict):
    features = np.array(data["features"])
    predictions = model.predict(features)
    return {"predictions": predictions.tolist()}
```

Run with:

```bash
uvicorn server:app --host 0.0.0.0 --port 8000
```
Dependencies
The server requires these packages:
| Package | Purpose | Required |
|---|---|---|
| fastapi | Web framework | Yes |
| uvicorn | ASGI server | Yes |
| slowapi | Rate limiting | Yes |
| onnxruntime | ONNX inference | Optional |
Install all:
```bash
pip install fastapi uvicorn slowapi onnxruntime
```

The Serving tab checks for dependencies and shows warnings if any are missing.
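If you prefer to verify from a script, here is a rough sketch of the same kind of check; this is an illustration, not the actual code the Serving tab runs:

```python
import importlib.util

REQUIRED = ["fastapi", "uvicorn", "slowapi"]
OPTIONAL = ["onnxruntime"]

# A package is considered installed if Python can locate its import spec.
missing = [pkg for pkg in REQUIRED if importlib.util.find_spec(pkg) is None]
if missing:
    print(f"Missing required packages: {', '.join(missing)}")
else:
    print("All required serving packages are installed")

for pkg in OPTIONAL:
    if importlib.util.find_spec(pkg) is None:
        print(f"Optional package not installed: {pkg}")
```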
Common Issues
“FastAPI not installed”

```bash
pip install fastapi uvicorn slowapi
```

Port already in use

Change the port in the server configuration, or stop the existing process:

```bash
lsof -i :8000    # Find the process
kill -9 <PID>    # Stop it
```

Slow predictions
Section titled “Slow predictions”- Enable ONNX Runtime for faster inference
- Use batch predictions instead of single requests
- Check model complexity (a large Random Forest is slow to evaluate)
CORS errors from browser
The server allows all origins by default. If you still see CORS errors, check your request headers.
Rate Limiting
The server includes basic rate limiting via slowapi:
- Default: 100 requests/minute per IP
- Prevents abuse in shared environments
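When a client goes over the limit, slowapi rejects the request, typically with HTTP 429. A minimal retry sketch with exponential backoff; the predict_with_retry helper and the exact status handling are assumptions:

```python
import time
import requests

def predict_with_retry(features, url="http://localhost:8000/predict", retries=3):
    """Retry on HTTP 429 with a simple exponential backoff."""
    delay = 1.0
    for _ in range(retries):
        response = requests.post(url, json={"features": features})
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        time.sleep(delay)  # rate-limited; wait and try again
        delay *= 2
    raise RuntimeError("Still rate-limited after retries")

print(predict_with_retry([[5.1, 3.5, 1.4, 0.2]]))
```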