According to the docs:

Model Server has the ability to batch requests in a variety of settings in order to realize better throughput. The scheduling for this batching is done globally for all models and model versions on the server to ensure the best possible utilization of the underlying resources no matter how many models or model versions are currently being served by the server. You can enable this by using the --enable_batching flag and control it with the --batching_parameters_file.

This is an example batching parameters file:

%%writefile batch-config.cfg
max_batch_size { value: 1000 }
batch_timeout_micros { value: 16 }
max_enqueued_batches { value: 16 }
num_batch_threads { value: 1000 }

Overwriting batch-config.cfg
The model we are going to serve is generated in this note.
I’m going to start two TF Serving instances: one that’s a regular CPU instance and one that does batching on the GPU. I’m running both commands from the /home/hamel/tf-serving/ directory.
docker run \
--mount type=bind,source=/home/hamel/hamel/notes/serving/tfserving/model/,target=/models/model \
--net=host -t tensorflow/serving --grpc_max_threads=1000
Test the CPU version:
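A quick way to check that the CPU instance is up is to query the model status endpoint over REST (a minimal sketch; the model name model and the default REST port 8501 follow from the command above):

import requests

# The CPU container uses TF Serving's default REST port, 8501
resp = requests.get("http://localhost:8501/v1/models/model")
print(resp.json())  # should show the model version with state "AVAILABLE"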
You must install nvidia-docker first.

You can pass additional arguments like --enable_batching to the docker run ... command just like you would if you were running tfserving locally.
Note that we need the --gpus all flag to enable GPUs with nvidia-docker, and the latest-gpu image tag to get GPU support. We also set --port and --rest_api_port so that this instance doesn’t conflict with the other TF Serving instance I already have running:
docker run --gpus all \
--mount type=bind,source=/home/hamel/hamel/notes/serving/tfserving,target=/models \
--net=host -t tensorflow/serving:latest-gpu --enable_batching \
--batching_parameters_file=/models/batch-config.cfg --port=8505 \
--rest_api_port=8506 --grpc_max_threads=1000
Test the TF-Serving GPU API:
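A minimal sketch of a REST prediction request against the GPU instance (port 8506 was set with --rest_api_port above; the single all-zeros review of length 200 is just a dummy input to confirm the server responds):

import requests

# One dummy review, padded to the model's expected length of 200 tokens
payload = {"signature_name": "serving_default", "instances": [[0] * 200]}
resp = requests.post("http://localhost:8506/v1/models/model:predict", json=payload)
print(resp.json())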
“All benchmarks are wrong, some are useful”
We are going to send 5 instances to score 10,000 times and measure the total inference time. We will parallelize the 10,000 requests (each with 5 instances to score) with threads. As a reminder, the model we are going to serve is generated in this note.
from tensorflow import keras
vocab_size = 20000 # Only consider the top 20k words
maxlen = 200 # Only consider the first 200 words of each movie review
_, (x_val, _) = keras.datasets.imdb.load_data(num_words=vocab_size)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)
sample_data = x_val[:5, :]
data = [sample_data] * 10000
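A quick sanity check on the shapes: 5 reviews, each padded to 200 tokens, repeated 10,000 times.

print(sample_data.shape, len(data))  # (5, 200) 10000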
import json, requests
import numpy as np
from fastcore.parallel import parallel
from functools import partial
parallel_pred = partial(parallel, threadpool=True, n_workers=500)
def predict_rest(data, port):
json_data = json.dumps(
{"signature_name": "serving_default", "instances": data.tolist()}
)
url = f"http://localhost:{port}/v1/models/model:predict"
json_response = requests.post(url, data=json_data)
response = json.loads(json_response.text)
rest_outputs = np.array(response["predictions"])
return rest_outputs
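The output below presumably comes from a call like the following against the CPU instance (port 8501 is an assumption; any of the running instances should return the same predictions):

rest_outputs = predict_rest(sample_data, "8501")
rest_outputs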
array([[0.89650154, 0.10349847],
[0.00330466, 0.9966954 ],
[0.13089457, 0.8691054 ],
[0.49083445, 0.50916553],
[0.0377177 , 0.96228224]])
This is the code that will be used to make gRPC prediction requests. For more discussion about gRPC, see this note.
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc
# Create a channel that will be connected to the gRPC port of the container
def predict_grpc(data, input_name='input_1', port='8505'):
options = [('grpc.max_receive_message_length', 100 * 1024 * 1024)]
channel = grpc.insecure_channel(f"localhost:{port}", options=options) # the gRPC port for the GPU server was set at 8505
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
# Create a gRPC request made for prediction
request = predict_pb2.PredictRequest()
# Set the name of the model, for this use case it is "model"
request.model_spec.name = "model"
# Set which signature is used to format the gRPC query
# here the default one "serving_default"
request.model_spec.signature_name = "serving_default"
# Set the input as the data
# tf.make_tensor_proto turns a TensorFlow tensor into a Protobuf tensor
request.inputs[input_name].CopyFrom(tf.make_tensor_proto(data))
# Send the gRPC request to the TF Server
result = stub.Predict(request)
return result
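As a quick smoke test of the gRPC helper (a sketch; by default it points at the GPU server's gRPC port, 8505):

result = predict_grpc(sample_data)

# The response maps signature output names to TensorProtos. The exact key depends on
# the model, so grab whichever key is present instead of hard-coding one.
output_key = list(result.outputs.keys())[0]
grpc_outputs = tf.make_ndarray(result.outputs[output_key])
print(grpc_outputs.shape)  # expect (5, 2): class probabilities for the 5 sample reviews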
The CPU server is running on port 8501.
The REST API endpoint on the CPU-bound server.
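A sketch of how this benchmark can be run with the helpers defined above (the %time magic and the exact call are assumptions; 8501 is the CPU instance's REST port):

%time preds = parallel_pred(partial(predict_rest, port="8501"), data)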
This is using the same CPU-bound TF Serving server, but is hitting the gRPC endpoint.
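A sketch of the same benchmark over gRPC; the CPU container was started without an explicit --port, so this assumes TF Serving's default gRPC port of 8500:

%time preds = parallel_pred(partial(predict_grpc, port="8500"), data)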
The GPU server (which we already started above) is serving its REST API on port 8506 and gRPC on port 8505.
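Sketches of the corresponding runs against the GPU instance, first over REST (port 8506) and then over gRPC (port 8505):

%time preds = parallel_pred(partial(predict_rest, port="8506"), data)
%time preds = parallel_pred(partial(predict_grpc, port="8505"), data)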
This is much faster than the REST endpoint! It is also much faster than the CPU version in this specific example. However, the batching part doesn’t appear to provide any speedup at all, because the non-batching gRPC version, set up below, is almost the same speed (if not a little bit faster).
docker run --gpus all \
--mount type=bind,source=/home/hamel/hamel/notes/serving/tfserving,target=/models \
--net=host -t tensorflow/serving:latest-gpu --port=8507 \
--rest_api_port=8508
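The same gRPC benchmark can then be pointed at this non-batching instance (gRPC port 8507, set with --port above); a sketch:

%time preds = parallel_pred(partial(predict_grpc, port="8507"), data)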
When I initially did this I got an error that said “Resources Exhausted”. I was able to solve this by increasing the threads with the flag --grpc_max_threads=1000 when running the Docker container.