1 May 2018

Clipper

These are my notes on the paper "Clipper: A Low-Latency Online Prediction Serving System" (Crankshaw et al., NSDI 2017):

In general, Clipper is similar to BigDAWG. However, instead of serving answers from many databases, it serves predictions from many machine learning models for the task of inference (which is analogous to query execution in databases). Both systems cater to users who care about the best performance, in terms of latency and throughput, and who have enough resources - both hardware and human expertise - to deploy many machine learning models in different frameworks, or to deploy and manage many database systems. The manageability cost in both cases is not negligible.

The advantage of Clipper lies in its main goal and task: machine learning inference exposes a much simpler interface, e.g. List<List> predict(List input), whereas BigDAWG was designed to accept queries in different forms, spanning from typical SQL (in different dialects) to an array-oriented language. BigDAWG should probably expose a simpler, functional-like interface and then leverage connectors to the databases for language bindings (e.g. Scala); this would simplify the whole translation layer that routes a user query to the destination database engine. Each model in Clipper is wrapped in a model container that runs in its own Docker container - the same technique used in BigDAWG, where the individual database systems run in independent Docker containers.
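
As a side note, here is a minimal sketch of how narrow such a prediction interface is, written in Python with hypothetical names (this is not Clipper's actual container API, just an illustration of the shape of the contract):

    from typing import List

    class ModelContainer:
        """Hypothetical model wrapper: the only contract the serving layer
        needs is a batched predict call."""

        def predict(self, inputs: List[List[float]]) -> List[List[float]]:
            """Take a batch of feature vectors, return a batch of predictions."""
            raise NotImplementedError

    class LinearModel(ModelContainer):
        """Toy model: one dot product per input, to show that any framework
        can hide behind the same narrow interface."""

        def __init__(self, weights: List[float]):
            self.weights = weights

        def predict(self, inputs: List[List[float]]) -> List[List[float]]:
            return [[sum(w * x for w, x in zip(self.weights, row))] for row in inputs]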

Strong points:

  1. Three crucial properties of a prediction-serving system: low latency, high throughput, and improved accuracy. A good system design with batching, caching, and other standard techniques harnessed for much faster and more accurate predictions. For instance, by maintaining a prediction cache, Clipper can serve frequent queries without evaluating the model at all. This reduces latency and system load by eliminating the additional cost of model evaluation. This is vital, since prediction/inference is bottlenecked on computation (CPU/GPU time), so caching past predictions gives the biggest bang for the buck (see the cache sketch after this list). Start-up cost can be reduced by pre-warming the model.

  2. It is claimed that batching amortizes the cost of RPC calls. From the experimental part, we know that this cost is negligible; the real cost is the prediction/inference itself on a GPU. The batching mechanism can indeed increase throughput significantly, because this is how machine learning algorithms operate internally - on vectors and matrices (using BLAS libraries) - so the input should be given as an array of examples, treated internally as a matrix or tensor. Batching therefore closely matches the workload assumptions made by machine learning frameworks. Another overhead mitigated by batching is the cost of copying inputs to GPU memory. An AIMD scheme is used to tune the batch size: additively increase the batch size by a fixed amount until the latency to process the batch exceeds the objective, then multiplicatively (by a small percentage) decrease the batch size (see the AIMD sketch after this list). The batch-delay policy helped only in the case of Scikit-Learn: a 2 ms batch delay provided a 3.3x improvement in throughput while the latency stayed within the 10-20 ms objective.

  3. One difference between a model ensemble and Clipper is that the ensemble method focuses only on improving accuracy, whereas Clipper can also boost the performance of the whole system by lowering latencies. Moreover, it provides mechanisms to easily navigate the trade-offs between accuracy and computation cost on a per-application basis.

  4. I had the idea of extending a streaming system with certainty bounds on the results of processing windows (e.g. sliding windows); Clipper similarly attaches confidence levels to its predictions, thanks to using many models.

  5. The accuracy of a deployed model can silently degrade over time. Clipper's online selection policies can automatically detect these failures using feedback and compensate, either by switching to another model (Exp3) or by down-weighting the failing model (Exp4); see the Exp3 sketch after this list.

  6. Horizontal scaling is achieved without sacrificing latency or accuracy - the property you want from a parallel system.
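
A minimal sketch of the prediction cache from point 1, in Python with hypothetical names (Clipper's real cache is more sophisticated, e.g. it has an eviction policy; this only shows the serve-without-evaluating idea):

    import hashlib
    import json

    class PredictionCache:
        """Hypothetical cache in front of a model: repeated queries are
        answered without re-evaluating the model."""

        def __init__(self, model, model_version):
            self.model = model
            self.model_version = model_version
            self.cache = {}

        def _key(self, x):
            # Key on the serialized input plus the model version.
            payload = json.dumps({"v": self.model_version, "x": x}, sort_keys=True)
            return hashlib.sha256(payload.encode()).hexdigest()

        def predict(self, x):
            k = self._key(x)
            if k not in self.cache:                  # miss: pay the model-evaluation cost
                self.cache[k] = self.model.predict([x])[0]
            return self.cache[k]                     # hit: no model evaluation at all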
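
A sketch of the AIMD batch-size tuning from point 2; the step size and back-off factor below are assumptions for illustration, not the constants used by Clipper:

    def aimd_update(batch_size, observed_latency_ms, latency_objective_ms,
                    additive_step=4, backoff=0.9):
        """Additively grow the batch until processing it exceeds the latency
        objective, then multiplicatively back off by a small percentage."""
        if observed_latency_ms <= latency_objective_ms:
            return batch_size + additive_step         # additive increase
        return max(1, int(batch_size * backoff))      # multiplicative decrease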
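
A rough sketch of the Exp3-style selection loop from point 5, following the textbook Exp3 update rather than Clipper's exact implementation (reward_fn stands in for application feedback and is hypothetical):

    import math
    import random

    def exp3_round(weights, gamma, reward_fn):
        """One round of Exp3: sample a model, observe a reward in [0, 1] from
        feedback, and exponentially re-weight the chosen model."""
        k = len(weights)
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]
        chosen = random.choices(range(k), weights=probs)[0]   # model to serve
        reward = reward_fn(chosen)                            # e.g. 1.0 if feedback says "correct"
        estimate = reward / probs[chosen]                     # importance-weighted reward
        weights[chosen] *= math.exp(gamma * estimate / k)     # failing models decay relative to the rest
        return chosen, weights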

Weak points:

  1. An ML model in Clipper is treated as a black box; for instance, the serving layer cannot optimize the execution/inference of the model. TensorFlow Serving, on the other hand, is able to leverage GPU acceleration and compilation techniques to speed up inference, because it tightly couples the model and the serving components in the same process.

  2. An even more detrimental effect of the opaque, layered design is the inability to update the models: the feedback loop operates only at the level of choosing among already-trained models, yet the models could be improved (re-trained) based on their predictions and the feedback received from the end application. Thus, Clipper does not close the loop of learn -> deploy -> explore -> log -> learn; the feedback (log) is never used to re-train the models. If all models are out of date, the accuracy of the predictions returned by Clipper will be poor.

  3. Some figures are off - for example, in Figure 4 the value labels are identical in the top and bottom parts of the figure.

  4. Adaptive model selection occurs above the cache in Clipper, so changes in predictions due to model selection do not invalidate cache entries. The layering could have been a bit less strict: a down-stream call could clear the cache of results produced by an old model (see the sketch after this list).

  5. Clipper studies multiple optimization techniques to improve throughput (via batch size) and reduce latency (via caching). It also proposes to do model selection for ensembles using multi-armed bandit algorithms. We could do better by providing training and inference services together. Clipper optimizes throughput, latency, and accuracy separately (in a step/stage approach); we could model them jointly to find the optimal model selection and batch size.

  6. The paper does not include a clear real-world application experiment showing that Clipper does indeed improve end-to-end inference time.

  7. The techniques used at the technical level are rather standard. Slow inference cannot be solved at the system level alone; collaboration with the hardware community is needed to provide specialized accelerators. Interestingly, TPUs take lower-precision inputs to accelerate the computation. How can we compress the computational footprint of deep networks and the storage footprint of the input data?
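
A sketch of the down-stream invalidation call suggested in point 4, as a hypothetical extension of the cache sketch above: when the selection layer switches away from a model, it tells the cache so that stale entries are dropped:

    class VersionedPredictionCache:
        """Hypothetical cache that can be told when a model version is retired."""

        def __init__(self):
            self.cache = {}   # (model_name, version, input_key) -> prediction

        def put(self, model_name, version, input_key, prediction):
            self.cache[(model_name, version, input_key)] = prediction

        def get(self, model_name, version, input_key):
            return self.cache.get((model_name, version, input_key))

        def invalidate_model(self, model_name, version):
            # The down-stream call: the selection layer announces that results
            # from this model version should no longer be served.
            self.cache = {key: value for key, value in self.cache.items()
                          if key[:2] != (model_name, version)}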

Other notes:

It would be great to explore how difficult it is to extend the TensorFlow Serving tool - we could train many models with TensorFlow and add the feature of serving many models.

  • Model gets stale because things are changing in the outside environment.
  • Folklore - different groups handle different stages: data scientists build/train the model, devops deploy it and run it at scale; the gap from batch training to deployment at scale has to be bridged
  • stream vs. batch processing have different requirements
  • TensorFlow is a moving target - it changes all the time
  • modular design - does not introduce undue overheads
  • In TensorFlow, the queues are pushed into the TensorFlow framework
  • Clipper - scheduling decisions made twice
  • wrappers - Jim Gray’s paper - transparency in its place
  • flexibility of the ensembles - abstractions added in Clipper do not add that much overhead
  • answer queries in a live (online) fashion
  • context - AMP Lab - everyone rolling their own, by hand - everybody was building their own databases before
  • pull out / define new category of software - common software
  • memcached - a distributed store - a sort of intermediary caching system - wrappers - REST-based API
  • RDS - model selection / RSS - model abstraction
  • protocol conversion/ glue
  • applications/users on the very top
  • mediator/wrapper architecture
  • how to abstract a set of different systems or algorithms
  • how do we deal with machine learning models
  • how to get a feedback from the system
  • what’s new here?
  • call for papers - SIGMOD 2019 - database community - debate
  • questions - what constitutes new systems work? no innovation in the individual boxes - model selection (multi-armed bandit algorithms), adaptive batch size (additive increase, multiplicative decrease - AIMD)
  • systems work - defining abstractions - TensorFlow Serving
  • model servers did not exist
  • various external interfaces - how to think about the abstractions inside
  • serve predictions with low latency
  • provide accurate answer
  • model selection - comes from the machine learning literature
  • model abstraction - comes from the systems works + scheduling
  • a systems paper’s biggest advantage - your system makes something much easier than it was in the past
  • not enough innovation
  • Spark - fault tolerance - RDD is all about it
  • what is the emphasis of the paper?
  • VELOX - many specialized models sit on top of a single bottom (base) model
  • select-combine-observe loop - they don’t go into much detail about how things are personalized
  • 2 modes of model selection - pick the single best model for the most accurate answer, or go with the multi-armed bandit: a bunch of systems give you answers, and you have to figure out which one gives the best answer; you explore by trying different arms (slot machines) to find the one with the best pay-out - if you don’t try the other ones, you are stuck in a local optimum
  • Is there a new class of systems?
  • Scale from 4 models to 4000? Can we build it with 1000 models?
