from fastapi import FastAPI, status
from pydantic import BaseModel
from typing import List
import tensorflow as tf
import numpy as np
def load_model(model_path='/home/hamel/hamel/notes/serving/tfserving/model/1'):
    "Load the SavedModel Object."
    sm = tf.saved_model.load(model_path)
    return sm.signatures["serving_default"]  # this is the default signature when you save a model
FastAPI
FastAPI is a web framework for Python. People like to use this framework for serving prototypes of ML models.
Impressions
- Model serving frameworks (TF Serving, TorchServe, etc.) are probably the way to go for production / enterprise deployments, especially for larger models. They offer more features, and latency will be more predictable (even if slower). I think that for smaller models (< 200MB) FastAPI is fine.
- It is super easy to get started with FastAPI.
- I was able to confirm Sayak's benchmark where FastAPI is faster than TF Serving, but also less consistent overall. FastAPI is also more likely to fail, although I haven't been able to cause that. In my experiments FastAPI was much faster for this small model, but this could change with larger models.
- Memory is consumed linearly as you increase the number of Uvicorn workers. Model serving frameworks like TF-Serving seem to work more efficiently. You should be careful to set the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true if you are running inference on GPUs. I think in many cases you would be doing inference on CPUs, so this might not be relevant most of the time.
- FastAPI seems like it could be really nice for smaller models and scoped hardware where there is only one worker per node and you load balance across nodes (because you aren't replicating the model with each worker).
- Debugging FastAPI is amazing, as it's pure Python and you get a nice docs page at http://<IP>/docs that lets you test out your endpoints right on the page! The documentation for FastAPI is also amazing.
- If you want the request parameters to be sent in the body (as you often do with ML, because you want to send data to be scored), you have to use Pydantic. This is very opinionated, but easy enough to use.
Load Model & Make Predictions
We are going to use the model trained in the TF Serving tutorial, and we will load it from the SavedModel format.
def pred(model: tf.saved_model, data: np.ndarray, pred_layer_nm='dense_3'):
    """
    Make a prediction from a SavedModel Object. `pred_layer_nm` is the last layer that emits logits.
    https://www.tensorflow.org/guide/saved_model
    """
    data = tf.convert_to_tensor(data, dtype='int32')
    preds = model(data)
    return preds[pred_layer_nm].numpy().tolist()
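If you aren't sure what to pass for pred_layer_nm, the serving signature itself can tell you. A quick sketch, assuming the same SavedModel path used in load_model above:

# Sketch: list the outputs of the serving signature to find the right `pred_layer_nm`.
import tensorflow as tf

sig = tf.saved_model.load('/home/hamel/hamel/notes/serving/tfserving/model/1').signatures["serving_default"]
print(sig.structured_outputs)          # output keys, e.g. a 'dense_3' entry for this model
print(sig.structured_input_signature)  # expected input names, shapes, and dtypes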
Test Data
_, (x_val, _) = tf.keras.datasets.imdb.load_data(num_words=20000)
x_val = tf.keras.preprocessing.sequence.pad_sequences(x_val, maxlen=200)[:2, :]
Make a prediction
model = load_model()
pred(model, x_val[:2, :])
[[0.8761785626411438, 0.12382148206233978],
[0.0009457750129513443, 0.9990542531013489]]
Build The FastAPI App
app = FastAPI()

items = {}

@app.on_event("startup")
async def startup_event():
    "Load the model on startup https://fastapi.tiangolo.com/advanced/events/"
    items['model'] = load_model()

@app.get("/")
def health(status_code=status.HTTP_200_OK):
    "A health-check endpoint"
    return 'Ok'
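Note that newer FastAPI versions recommend a lifespan handler instead of @app.on_event("startup"). A minimal sketch of the same model loading under that assumption (it would replace the app = FastAPI() and startup_event definitions above):

# Sketch: lifespan-based startup, assuming a recent FastAPI version.
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    items['model'] = load_model()  # runs once before the app starts serving requests
    yield
    items.clear()                  # runs on shutdown

app = FastAPI(lifespan=lifespan)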
We want to send the data for prediction in the Request Body (not with path parameters). According to the docs:
FastAPI will recognize that the function parameters that match path parameters should be taken from the path, and that function parameters that are declared to be Pydantic models should be taken from the request body.
class Sentence(BaseModel):
    tokens: List[List[int]]

@app.post("/predict")
def predict(data: Sentence, status_code=status.HTTP_200_OK):
    preds = pred(items['model'], data.tokens)
    return preds
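Optionally, you could also declare a response model so the /docs page documents the output schema. A minimal sketch; Prediction is a name I'm introducing here, and I'm assuming each output row is a pair of class probabilities:

class Prediction(BaseModel):
    probs: List[List[float]]  # one row of class probabilities per input sentence (assumption)

@app.post("/predict", response_model=Prediction)
def predict(data: Sentence):
    return Prediction(probs=pred(items['model'], data.tokens))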
Recap: the FastAPI App
Let's look at main.py with all the pieces combined:
Code to display source code in Quarto
#This is a hack for Quarto for generated scripts
from IPython.display import display, Markdown
code = !cat main.py
display(Markdown('```{.python filename="main.py"}\n' + '\n'.join(code) + '\n```'))
main.py
# AUTOGENERATED! DO NOT EDIT! File to edit: index.ipynb.

# %% auto 0
__all__ = ['app', 'items', 'load_model', 'pred', 'startup_event', 'health', 'Sentence', 'predict']

# %% index.ipynb 3
from fastapi import FastAPI, status
from pydantic import BaseModel
from typing import List
import tensorflow as tf
import numpy as np

def load_model(model_path='/home/hamel/hamel/notes/serving/tfserving/model/1'):
    "Load the SavedModel Object."
    sm = tf.saved_model.load(model_path)
    return sm.signatures["serving_default"]  # this is the default signature when you save a model

# %% index.ipynb 4
def pred(model: tf.saved_model, data: np.ndarray, pred_layer_nm='dense_3'):
    """
    Make a prediction from a SavedModel Object. `pred_layer_nm` is the last layer that emits logits.
    https://www.tensorflow.org/guide/saved_model
    """
    data = tf.convert_to_tensor(data, dtype='int32')
    preds = model(data)
    return preds[pred_layer_nm].numpy().tolist()

# %% index.ipynb 10
app = FastAPI()
items = {}

@app.on_event("startup")
async def startup_event():
    "Load the model on startup https://fastapi.tiangolo.com/advanced/events/"
    items['model'] = load_model()

@app.get("/")
def health(status_code=status.HTTP_200_OK):
    "A health-check endpoint"
    return 'Ok'

# %% index.ipynb 12
class Sentence(BaseModel):
    tokens: List[List[int]]

@app.post("/predict")
def predict(data: Sentence, status_code=status.HTTP_200_OK):
    preds = pred(items['model'], data.tokens)
    return preds
Run The App
We can run the app with the command:
uvicorn main:app --host 0.0.0.0 --port 5701
- main corresponds to the file main.py.
- app corresponds to the app object created inside main.py with app = FastAPI().
- --reload: makes the server restart if the code changes; for development only.
import requests, json
def predict_rest(json_data, url='http://localhost:5701/predict'):
    json_response = requests.post(url, json={'tokens': json_data})
    return json.loads(json_response.text)
predict_rest(x_val.tolist())
[[0.8761785626411438, 0.12382148206233978],
[0.0009457750129513443, 0.9990542531013489]]
Load Test FastAPI
It’s really fast
from fastcore.parallel import parallel
from functools import partial
parallel_pred = partial(parallel, threadpool=True, n_workers=500)

sample_data = [x_val.tolist()] * 1000

%%time
results = parallel_pred(predict_rest, sample_data)
CPU times: user 2.29 s, sys: 252 ms, total: 2.54 s
Wall time: 2.38 s
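Wall time for the whole batch hides tail latency, which is where the consistency differences with TF Serving tend to show up. A small sketch for per-request latency percentiles, reusing predict_rest, parallel_pred, and sample_data from above (timed_pred is a helper I'm introducing here):

import time
import numpy as np

def timed_pred(data):
    "Time a single REST prediction round-trip, in seconds."
    t0 = time.perf_counter()
    predict_rest(data)
    return time.perf_counter() - t0

latencies = list(parallel_pred(timed_pred, sample_data))
print(np.percentile(latencies, [50, 95, 99]))  # median, p95, and p99 latency in seconds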
Adding Uvicorn Workers
Uvicorn also has an option to start and run several worker processes. Nevertheless, as of now, Uvicorn’s capabilities for handling worker processes are more limited than Gunicorn’s. So, if you want to have a process manager at this level (at the Python level), then it might be better to try with Gunicorn as the process manager.
You can add Uvicorn workers with the --workers flag:
uvicorn main:app --host 0.0.0.0 --port 5701 --workers 8
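If you do want a process manager at the Python level, as the docs suggest, the equivalent with Gunicorn managing Uvicorn workers would look something like this (a sketch, assuming gunicorn is installed alongside uvicorn):

gunicorn main:app --workers 8 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:5701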
GPUs
When I scaled up to 8 workers on a GPU, I got OOM errors. To avoid this you want to limit GPU memory growth by setting the TF_FORCE_GPU_ALLOW_GROWTH environment variable to true:
TF_FORCE_GPU_ALLOW_GROWTH=true uvicorn main:app --host 0.0.0.0 --port 5701 --workers 8
From the docs:
By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process.
This means that if you are running on GPUs with more than one worker, you will get OOM errors without setting this environment variable!
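If you'd rather not rely on the environment variable, the same behavior can be requested in code at startup (a sketch; it must run before TensorFlow initializes the GPUs, e.g. before load_model() is called):

# Sketch: ask TensorFlow to grow GPU memory as needed instead of grabbing it all up front.
import tensorflow as tf

for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)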
%%time
results = parallel_pred(predict_rest, sample_data)
CPU times: user 2.26 s, sys: 294 ms, total: 2.55 s
Wall time: 2.34 s
Scaling up workers didn't have any effect in this particular instance. This could be because the low latency of the model I'm using doesn't challenge the throughput enough.
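One way to sanity-check that requests are actually being spread across the workers is to report the worker PID from an endpoint and watch it vary across requests. A hypothetical tweak to the health endpoint above:

import os

@app.get("/")
def health():
    "A health-check endpoint that also reports which worker process answered"
    return {'status': 'Ok', 'pid': os.getpid()}

If every response comes back with the same PID, the extra workers aren't sharing the load.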