In general, Clipper is similar to BigDAWG; however, instead of serving answers from many databases, it serves predictions from many machine learning models for the task of inference (which is analogous to query execution in databases). Both systems cater to users who care about the best performance, in terms of latency and throughput, and who have enough resources, in hardware and human expertise, to deploy many machine learning models in different frameworks, or to deploy and manage many database systems. The manageability cost in both cases is not negligible. The advantage of Clipper lies in its main goal and task: machine learning inference exposes a much simpler interface than a general query language, essentially a predict call over a list of input vectors (e.g., a List<List<...>> of feature values).
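A minimal sketch of that narrow interface (the class and method names here are illustrative, not Clipper's actual API): a model container only has to map a batch of input vectors to a batch of predictions, nothing like a general query language.

    from typing import List

    class ModelContainer:
        def predict(self, inputs: List[List[float]]) -> List[float]:
            """Return one prediction per input vector in the batch."""
            raise NotImplementedError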
Three properties are crucial for a prediction-serving system: low latency, high throughput, and improved accuracy. Clipper is a careful system design in which batching, caching, and other standard techniques are harnessed for faster and more accurate predictions. For instance, by maintaining a prediction cache, Clipper can serve frequent queries without evaluating the model, which reduces latency and system load by eliminating the additional cost of model evaluation. This is vital, since prediction/inference is bottlenecked on computation (CPU/GPU time), so caching past predictions gives the biggest bang for the buck. Start-up cost can be reduced by pre-warming the model.
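To illustrate, here is a minimal sketch of a prediction cache keyed on (model, version, input); the names are mine, not Clipper's internals. A cache hit serves the prediction without touching the model at all, taking model evaluation off the critical path.

    import hashlib

    class PredictionCache:
        def __init__(self):
            self._store = {}

        @staticmethod
        def _key(model: str, version: int, input_bytes: bytes) -> str:
            # Hash the raw input so arbitrarily large inputs make compact keys.
            digest = hashlib.sha256(input_bytes).hexdigest()
            return f"{model}:{version}:{digest}"

        def get(self, model, version, input_bytes):
            return self._store.get(self._key(model, version, input_bytes))

        def put(self, model, version, input_bytes, prediction):
            self._store[self._key(model, version, input_bytes)] = prediction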
It is claimed that batching amortizes the cost of RPC calls. From the experimental part, we learn that this cost is negligible; the real cost is the prediction/inference itself on a GPU. The batching mechanism can indeed increase throughput significantly, because this is how machine learning algorithms operate internally: on vectors and matrices (via BLAS libraries), so the input should be given as an array of examples, treated internally as a matrix or tensor. Batching therefore closely matches the workload assumptions made by machine learning frameworks. Another overhead mitigated by batching is the cost of copying inputs to GPU memory. An AIMD scheme is used to tune the batch size: additively increase the batch size by a fixed amount until the latency to process a batch exceeds the objective, then multiplicatively (by a small percentage) decrease it. The batch-delay policy helped only for Scikit-Learn, where a 2 ms batch delay provided a 3.3x improvement in throughput while latency remained within the 10-20 ms objective.
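The AIMD tuning loop is simple enough to sketch (the constants here are illustrative, not the paper's exact values): grow the batch additively while the measured batch latency stays under the objective, and back off multiplicatively once it is exceeded.

    class AIMDBatchSizer:
        def __init__(self, slo_ms: float, additive_step: int = 4, backoff: float = 0.9):
            self.slo_ms = slo_ms
            self.additive_step = additive_step
            self.backoff = backoff
            self.batch_size = 1

        def update(self, observed_latency_ms: float) -> int:
            if observed_latency_ms > self.slo_ms:
                # Multiplicative decrease: shrink by a small percentage.
                self.batch_size = max(1, int(self.batch_size * self.backoff))
            else:
                # Additive increase: keep probing for a larger batch.
                self.batch_size += self.additive_step
            return self.batch_size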
One difference between a model ensemble and Clipper is that ensemble methods focus only on improving accuracy, whereas Clipper can also boost the performance of the whole system by lowering latency. Moreover, it provides mechanisms to easily navigate the trade-off between accuracy and computation cost on a per-application basis.
I had the idea of extending a streaming system with certainty bounds on the results for processing windows (e.g., the sliding window); Clipper does something related by attaching confidence levels to its predictions, thanks to using many models.
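A minimal sketch, assuming confidence is simply the fraction of ensemble members that agree with the returned majority label (the helper name is mine):

    from collections import Counter

    def ensemble_predict(labels):
        """labels: one predicted label per model in the ensemble."""
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]
        confidence = votes / len(labels)
        return label, confidence

    # e.g. ensemble_predict(["cat", "cat", "dog", "cat"]) -> ("cat", 0.75)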
The accuracy of a deployed model can silently degrade over time. Clipper’s online selection policies can automatically detect these failures using feedback and compensate by switching to another model (Exp3) or down-weighting the failing model (Exp4).
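To make the mechanism concrete, here is a sketch of Exp3-style selection (the class is mine, but the update rule is the standard Exp3 algorithm): each model keeps a weight, one model is sampled per query, and its weight is updated from the observed reward, so a model whose accuracy silently degrades is sampled less and less.

    import math, random

    class Exp3Selector:
        def __init__(self, n_models: int, gamma: float = 0.1):
            self.n = n_models
            self.gamma = gamma
            self.weights = [1.0] * n_models

        def _probs(self):
            total = sum(self.weights)
            return [(1 - self.gamma) * w / total + self.gamma / self.n
                    for w in self.weights]

        def select(self) -> int:
            return random.choices(range(self.n), weights=self._probs())[0]

        def update(self, chosen: int, reward: float):
            # Importance-weighted reward estimate, as in standard Exp3.
            p = self._probs()[chosen]
            self.weights[chosen] *= math.exp(self.gamma * (reward / p) / self.n)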
Horizontal scaling is achieved without sacrificing latency or accuracy (in the parallel-systems sense: adding model replicas scales throughput).
A model in Clipper is treated as a black box; for instance, Clipper cannot optimize the execution of inference inside the model. TensorFlow Serving, on the other hand, is able to leverage GPU acceleration and compilation techniques to speed up inference, because it tightly couples the model and serving components in the same process.
An even more detrimental effect of the opaque, layered design is the inability to update the models: the feedback loop operates at the level of choosing among already-trained models, but the models themselves could be improved (re-trained) based on their predictions and the feedback received from the end application. Thus, Clipper does not close the loop of learn -> deploy -> explore -> log -> learn; the feedback (log) is not fed back to re-train the models. If all models are out of date, the accuracy of the predictions returned by Clipper will be poor.
Some figures are awry; for example, in Figure 4 the numeric labels are identical in the top and bottom parts of the figure.
Adaptive model selection occurs above the cache in Clipper, so changes in predictions due to model selection do not invalidate cache entries. The layering could have been slightly less strict, though: a downstream call could clear the cache of results produced by a model that is no longer selected.
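The kind of downstream invalidation I have in mind could be as small as this (assuming cache keys carry the model name, as in the cache sketch earlier): when the selection policy retires a model, its cached results are dropped.

    def invalidate_model(cache_store: dict, model: str) -> None:
        # Drop every cached prediction produced by the retired model.
        stale = [key for key in cache_store if key.startswith(f"{model}:")]
        for key in stale:
            del cache_store[key]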
Clipper studies multiple optimization techniques to improve throughput (via batch sizing) and reduce latency (via caching). It also proposes to do model selection for ensembles using multi-armed bandit algorithms. We could do better by providing training and inference services together: Clipper optimizes throughput, latency, and accuracy separately (in a stage-by-stage approach), whereas we could model them jointly to find the optimal model selection and batch size.
The paper does not present a clear real-world application experiment showing that Clipper indeed improves end-to-end inference time.
The techniques used at the technical level are rather standard. Slow inference cannot be solved at the system level alone; we need collaboration with the hardware community to provide specialized accelerators. Interestingly, TPUs take lower-precision inputs to accelerate the computation. How do we compress the computational footprint of deep networks and the storage footprint of input data?
It would be great to explore how difficult it would be to extend TensorFlow Serving: we could train many models with TensorFlow and add the feature of serving many models.