
Deploy as HTTP API

MLOps Desktop includes a built-in HTTP server for model serving. Deploy any trained model and get predictions via REST API—no additional infrastructure needed.

Time to complete: ~10 minutes

Prerequisites:

  • A trained model (complete the Quickstart first)
  • Python packages: pip install fastapi uvicorn slowapi
  • Optional for ONNX: pip install onnxruntime

To deploy a model:

  1. Open the Serving tab

    In the Output Panel at the bottom, click the Serving tab.

  2. Select a model

    Choose from:

    • Registered models from the Models tab
    • Latest trained model from the current session

    Select a specific version if multiple exist.

  3. Configure the server

    Click the Configure button (gear icon):

    Setting          | Default | Description
    Host             | 0.0.0.0 | Listen address
    Port             | 8000    | HTTP port
    Use ONNX Runtime | Off     | Enable for faster inference
  4. Start the server

    Click Start Server.

    Status changes: Stopped → Starting → Running

    You’ll see the server URL: http://localhost:8000

  5. Make predictions

    The API is now ready. Use curl, Python, or any HTTP client.

Once running, the server exposes the following endpoints.

Health check:

curl http://localhost:8000/health

Response:

{"status": "healthy", "model": "RandomForestClassifier"}
Prediction:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"features": [[5.1, 3.5, 1.4, 0.2]]}'

Response (classification):

{
  "predictions": [0],
  "probabilities": [[0.98, 0.01, 0.01]]
}

Response (regression):

{
  "predictions": [24.5]
}
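
For classification responses, the probabilities array is ordered by class index. If you know the label order used during training, you can map predictions back to names; the class names below are hypothetical placeholders for your own labels:

# Hypothetical class names, in the same order the model was trained with.
class_names = ["setosa", "versicolor", "virginica"]

result = {"predictions": [0], "probabilities": [[0.98, 0.01, 0.01]]}
for idx, probs in zip(result["predictions"], result["probabilities"]):
    print(f"{class_names[idx]} ({probs[idx]:.0%} confidence)")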

Send multiple samples:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "features": [
      [5.1, 3.5, 1.4, 0.2],
      [6.2, 3.4, 5.4, 2.3],
      [4.9, 2.5, 4.5, 1.7]
    ]
  }'

FastAPI provides automatic docs:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc
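
FastAPI apps also serve the raw OpenAPI schema, by default at /openapi.json. Assuming that default path is unchanged, you can fetch it to inspect the expected request format or to generate a client:

import requests

# Fetch the OpenAPI schema (FastAPI's default path; adjust if the server changes it).
schema = requests.get("http://localhost:8000/openapi.json").json()

# List the available paths and their HTTP methods.
for path, methods in schema["paths"].items():
    print(path, "->", ", ".join(m.upper() for m in methods))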

The Serving tab shows real-time metrics:

Metric         | Description
Total Requests | Count since server start
Success Rate   | Percentage of 2xx responses
Avg Latency    | Mean response time
Requests/min   | Current throughput

A table shows recent requests:

Time     | Method | Path     | Status | Latency | Batch Size
10:30:15 | POST   | /predict | 200    | 12ms    | 1
10:30:18 | POST   | /predict | 200    | 45ms    | 100
10:30:22 | GET    | /health  | 200    | 2ms     | -
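
To see these numbers move, you can generate a little traffic yourself; the loop below sends several single-row requests, one batch request, and a health check (feature values are arbitrary samples):

import requests

base = "http://localhost:8000"

# A few single-row requests (Batch Size 1 in the request table).
for _ in range(5):
    requests.post(f"{base}/predict", json={"features": [[5.1, 3.5, 1.4, 0.2]]})

# One larger batch request (Batch Size 100).
requests.post(f"{base}/predict", json={"features": [[5.1, 3.5, 1.4, 0.2]] * 100})

# A health check appears as a GET /health row.
requests.get(f"{base}/health")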

Enable ONNX for faster inference (a sketch of the underlying runtime call follows these steps):

  1. Install onnxruntime: pip install onnxruntime
  2. Export your model as ONNX (using the ModelExporter node)
  3. In the Serving config, enable Use ONNX Runtime
  4. Select the .onnx model file
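
For reference, ONNX Runtime inference outside the app looks roughly like this; the model path, input name handling, and float32 dtype depend on how the model was exported, so treat it as a sketch rather than the exact code the server runs:

import numpy as np
import onnxruntime as ort

# Load the exported model (path is illustrative).
session = ort.InferenceSession("model.onnx")

# ONNX models name their inputs; use the first (and usually only) one.
input_name = session.get_inputs()[0].name

features = np.array([[5.1, 3.5, 1.4, 0.2]], dtype=np.float32)
outputs = session.run(None, {input_name: features})

print(outputs[0])  # predicted labels; classifiers may also return probabilities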
Python client example:

import requests

# Single prediction
response = requests.post(
    "http://localhost:8000/predict",
    json={"features": [[5.1, 3.5, 1.4, 0.2]]},
)
result = response.json()
print(f"Predicted class: {result['predictions'][0]}")
print(f"Confidence: {max(result['probabilities'][0]):.1%}")

# Batch prediction
import pandas as pd

df = pd.read_csv("new_data.csv")
features = df.drop(columns=["target"]).values.tolist()
response = requests.post(
    "http://localhost:8000/predict",
    json={"features": features},
)
predictions = response.json()["predictions"]
JavaScript client example:

const response = await fetch('http://localhost:8000/predict', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    features: [[5.1, 3.5, 1.4, 0.2]]
  })
});
const { predictions, probabilities } = await response.json();
console.log(`Predicted: ${predictions[0]}`);

Server states:

State    | Description
Stopped  | Server not running
Starting | Loading model, initializing FastAPI
Running  | Accepting requests
Stopping | Graceful shutdown

Click Stop Server to shut down gracefully.

For production deployment outside MLOps Desktop, use the ModelExporter node to export the model in one of these formats (a standalone serving sketch follows the list):

  • .joblib — Python sklearn applications
  • .onnx — Cross-platform, optimized inference
  • .pkl — Python native (security risk)
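
As a starting point for serving an exported .joblib model yourself, a minimal standalone FastAPI app could look like the sketch below. The file name, request schema, and endpoints mirror the built-in server's API but are assumptions, not the app's actual implementation:

# Minimal standalone server for an exported .joblib model (a sketch, not the
# built-in server's code). Run with: uvicorn serve:app --port 8000
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # path is illustrative

class PredictRequest(BaseModel):
    features: list[list[float]]

@app.post("/predict")
def predict(req: PredictRequest):
    preds = model.predict(req.features)
    return {"predictions": preds.tolist()}

@app.get("/health")
def health():
    return {"status": "healthy", "model": type(model).__name__}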

The server requires these packages:

Package     | Purpose        | Required
fastapi     | Web framework  | Yes
uvicorn     | ASGI server    | Yes
slowapi     | Rate limiting  | Yes
onnxruntime | ONNX inference | Optional

Install all:

pip install fastapi uvicorn slowapi onnxruntime

The Serving tab checks for these dependencies and shows a warning if any are missing. If the server fails to start for that reason, install the required packages:

pip install fastapi uvicorn slowapi

If port 8000 is already in use, change the port in the server configuration, or stop the process that is holding it:

lsof -i :8000   # Find the process
kill -9 <PID>   # Stop it

If predictions are slow:

  • Enable ONNX Runtime for faster inference
  • Use batch predictions instead of single requests (see the chunking sketch after this list)
  • Check model complexity (a large Random Forest is slow to evaluate)
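
One simple way to batch is to send rows in fixed-size chunks instead of one request per row. The helper below is a sketch; the chunk size of 256 is an arbitrary starting point:

import requests

def predict_in_chunks(rows, url="http://localhost:8000/predict", chunk_size=256):
    """Send rows to /predict in fixed-size batches and collect all predictions."""
    predictions = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        resp = requests.post(url, json={"features": chunk})
        resp.raise_for_status()
        predictions.extend(resp.json()["predictions"])
    return predictions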

If a browser client reports CORS errors: the server allows all origins by default, so the problem usually lies elsewhere; double-check the request URL and headers.

The server includes basic rate limiting via slowapi:

  • Default: 100 requests/minute per IP
  • Prevents abuse in shared environments
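
When a client exceeds the limit, slowapi's default handler responds with HTTP 429 (Too Many Requests). A simple client-side backoff, with arbitrary retry and sleep values, might look like:

import time
import requests

def predict_with_backoff(features, url="http://localhost:8000/predict", retries=5):
    """Retry a prediction request when the server answers 429 Too Many Requests."""
    for attempt in range(retries):
        resp = requests.post(url, json={"features": features})
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    raise RuntimeError("Rate limit still in effect after retries")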