The system later became the Apache Beam project: https://beam.apache.org/
This interface can sit on top of different data sources (an ambitious approach).
When you write a paper for a conference, you want to put in as many preemptive "no, idiots" statements as you can - addressing up front what the reviewers might be worried about. There is a bunch of statements like that in the paper. Non-technical stuff: lots of people work on the problem (streaming systems), so they go through the full lineage of systems - Niagara, Esper (the first open-source streaming system), Storm, Pulsar, Spark Streaming, Flink - bucket them, and try to say what they think is lacking in each system: one sentence on what each group did right and wrong.
Important excerpt: “None of these shortcomings are intractable, and systems in active development will likely overcome them in due time. But we believe a major shortcoming of all the models and systems mentioned above (with exception given to CEDR and Trill), is that they focus on input data (unbounded or otherwise) as something which will at some point become complete. We believe this approach is fundamentally flawed when the realities of today’s enormous, highly disordered datasets clash with the semantics and timeliness demanded by consumers. We also believe that any approach that is to have broad practical value across such a diverse and varied set of use cases as those that exist today (not to mention those lingering on the horizon) must provide simple, but powerful, tools for balancing the amount of correctness, latency, and cost appropriate for the specific use case at hand. Lastly, we believe it is time to move beyond the prevailing mindset of an execution engine dictating system semantics; properly designed and built batch, micro-batch, and streaming systems can all provide equal levels of correctness, and all three see widespread use in unbounded data processing today.”
Logical data independence - what they offer is mostly physical model independence. Query optimization. Abstraction: the semantics of what is going to happen.
Leaky abstraction. The SQL language abstracts away the procedural steps for querying a database, allowing one to merely define what one wants. But certain SQL queries are thousands of times slower than other logically equivalent queries. On an even higher level of abstraction, ORM systems, which isolate object-oriented code from the implementation of object persistence using a relational database, still force the programmer to think in terms of databases, tables, and native SQL queries as soon as performance of ORM-generated queries becomes a concern.
That’s not a really useful way to think about unbounded data.
The usual view is of a database at a given point in time; here, more data is continually coming in.
You can be more efficient. In reality, unbounded datasets have been processed using repeated runs of batch systems since their conception, and well-designed streaming systems are perfectly capable of processing bounded data. From the perspective of the model, the distinction between streaming and batch is largely irrelevant, and they thus reserve those terms exclusively for describing runtime execution engines.
Jujutsu (/dʒuːˈdʒuːtsuː/ joo-JOOT-soo; Japanese: 柔術, jūjutsu), also known in the West as Ju-Jitsu or Jiu-Jitsu, is a Japanese martial art and a method of close combat for defeating an armed and armored opponent in which one uses either a short weapon or none. In the paper: some jujutsu moves - they take the attack and claim this is exactly what we do.
Another interesting excerpt: “It is lastly worth noting that there is nothing magical about this model. Things which are computationally impractical in existing strongly-consistent batch, micro-batch, streaming, or Lambda Architecture systems remain so, with the inherent constraints of CPU, RAM, and disk left stead-fastly in place.” Of course, there is nothing magical here!!!
In terms of form: no evaluation (it is an industrial paper), but an incredibly well-set-up framework in the introduction - it is slightly cynical (in the dictionary sense: believing that people are motivated by self-interest; distrustful of human sincerity or integrity).
Defensive arguments in the introduction. It’s a very crowded space.
They try to come up with a comprehensive model: can we model the different kinds of streaming, windowing, and engine semantics, rather than picking one and claiming this is how it should be? However, you have to give up some performance.
Understand patterns/anti-patterns for your own research.
MapReduce was rejected by SIGMOD and VLDB. Finally, it was accepted at OSDI.
Meta-point about stream processing: we have several requirements - semantics and performance. Can we separate them and set them independently? The upper level is a semantic layer - a buffer for the tuples, whether or not it is a window; the lower part is incremental view maintenance. This paper does not include an engine; Apache Beam has no engine of its own (Google's engine is closed-source). Given an engine and a semantics, is it the engine's fault if a semantic concept is not supported? How can you build a better system? Paraphrased: can you really decouple the semantics from the performance of the engine? (The same question applies to non-stream queries.)
Beam can sit on top of Spark Streaming. Spark Streaming is built on top of Spark and RDDs.
If Beam becomes the most popular streaming interface, it will define the typical things that people do with streaming.
It’s also about missing data.
Incorporate the semantics into the underlying engine.
The narrow-waist architecture: an abstraction layer, with the semantic level on top of it.
User-facing semantics: usually SQL.
Underneath: relational algebra with a bunch of extensions.
Different notions in SQL - logical views - the notion of logical consequences.
Logical data independence comes through views.
Between the relational algebra and the engine implementation sits physical independence.
Views hide the underlying database schema.
Plug in a new engine - it should support relational algebra++ (the algebra plus the extensions).
When the relational model was proposed, the conventional wisdom was that it was a beautiful academic exercise - hence the great debate at a very early SIGMOD between the relational people and the people doing ad-hoc stuff. Ingres and System R - "we think it's the right model" - were the first implementations of the relational algebra. There are some parallels with the programming-language folks: in the initial days of high-level languages, the claim was that you need to write hand-tuned assembly; now almost nobody does. With the PL folks and the idea of a high-level language, the underlying architecture did not change that much (a problem: most architecture people would disagree with that). The idea of an RDD is very different from the traditional tables in databases, and a column store is an overhaul of the row store. Look at a query and figure out an optimal course of action. In the long run, abstraction wins out; the hard-coded solution breaks. The IRS crashed on tax day - they were running on a 30-year-old system. "Burning platform" - it is going to keep running while burning. Looking at these figures, if you are in favor of BigDAWG, there is some reality to it; Beam is a one-size-fits-all solution.
When multi-core came up: OpenMP is a leaky abstraction for parallelizing; with cache-conscious algorithms, ideally you should not have to worry about that stuff. You have to have a decent system and a big market share so that everybody will follow you.
Spark solves maybe 80% of the streaming problems that people have. It is one of the constant struggles in systems: where do you draw the abstraction? As with architecture, at some point the underlying system gets too complicated and the abstraction gets too big.
Programmers don’t know enough to provide low-level settings to the machine - instead they should specify their requirements to the system. There is still going to be some problem with cache-conscious code.
The fundamental thing their system is based on is the notion of windows: you have an unbounded stream of data that you need to process, so you have to delineate blocks of data. See Jennifer Widom's work - http://ilpubs.stanford.edu:8090/758/
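A minimal plain-Python sketch of what that delineation means for fixed event-time windows (the 60-second width and the helper name are made up for illustration): each element is assigned to a window purely by its event timestamp, which is how an unbounded stream gets cut into finite blocks.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; a hypothetical fixed-window width

def window_start(event_time: float) -> float:
    """Map an event timestamp to the start of the fixed window it belongs to."""
    return event_time - (event_time % WINDOW_SIZE)

# Group a stream of (event_time, value) pairs into fixed blocks.
windows = defaultdict(list)
for event_time, value in [(3.0, "a"), (61.5, "b"), (59.9, "c")]:
    windows[window_start(event_time)].append(value)

print(dict(windows))  # {0.0: ['a', 'c'], 60.0: ['b']}
```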
CQL (2003), a Continuous Query Language, is supported by the STREAM prototype Data Stream Management System at Stanford. CQL is an expressive SQL-based declarative language for registering continuous queries against streams and updatable relations. We begin by presenting an abstract semantics that relies only on "black box" mappings among streams and relations. From these mappings we define a precise and general interpretation for continuous queries. CQL is an instantiation of our abstract semantics using SQL to map from relations to relations, window specifications derived from SQL-99 to map from streams to relations, and three new operators to map from relations to streams. Most of the CQL language is operational in the STREAM system. We present the structure of CQL’s query execution plans as well as details of the most important components: operators, inter-operator queues, synopses, and sharing of components among multiple operators and queries. Examples throughout the paper are drawn from the Linear Road benchmark recently proposed for Data Stream Management Systems. We also curate a public repository of data stream applications that includes a wide variety of queries expressed in CQL.
Event time vs. processing time - both of these times are relevant, and the skew between them changes as the system runs.
Fundamental statement: you can never rely on a notion of completeness. Conceptually, you need to keep every window - every window anybody has asked for.
Retract - when you hit the end of the window, send an anti-result: if the past value was 7 and the new version is 8, first send -7 and then send 8.
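A tiny plain-Python sketch of that retract-and-replace idea (the function name is hypothetical, not from the paper): the consumer first sees an anti-result that cancels the stale 7, then the corrected 8, so anyone summing the stream still ends up with the right total.

```python
def retract_and_replace(old_result, new_result):
    """Emit an anti-result for the stale value, then the corrected one."""
    yield -old_result   # retraction: cancels the 7 that was already sent downstream
    yield new_result    # the revised aggregate

# A consumer that sums whatever it receives ends up with the right total.
print(list(retract_and_replace(7, 8)))  # [-7, 8]
```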
Accumulating takes some additional space. The whole thing comes back to the application question - and the clear answer is that there is no magic: if the data is not there, the data is not there. The semantics at least give some hope of understanding that at any time more stuff can show up.
Watermarks - everything up to this point in time has already appeared. You can define your own watermark. Accumulating vs. discarding vs. retract-and-accumulate. Element-based windows - a count window - every 3 records are windowed, in whatever order they come; just group them. The window can also be a predicate: you can watch some stock ticker, one quote and the next quote. Unclear whether they support that in their model.
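A hedged sketch with the Apache Beam Python SDK of the count-window-plus-accumulation-mode idea (the choice of a global window and the count of 3 are assumptions for illustration, not from the paper): a trigger fires every 3 records per key, and the accumulation mode decides whether each firing re-emits earlier elements or only the new ones.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, Repeatedly

# Fires a pane for every 3 records per key, in whatever order they arrive.
# DISCARDING emits only the new elements each time; ACCUMULATING would
# re-emit everything seen so far. (Count triggers give no guarantee that a
# final partial group is ever emitted, so recent Beam versions may ask for
# --allow_unsafe_triggers when this is applied before a GroupByKey.)
count_windowing = beam.WindowInto(
    window.GlobalWindows(),
    trigger=Repeatedly(AfterCount(3)),
    accumulation_mode=AccumulationMode.DISCARDING)
```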
Session semantics - a virtual stream over a specific key, in this case per user.
Sessions are windows that capture some period of activity over a subset of the data, in this case per key. Typically they are defined by a timeout gap: any events that occur within a span of time less than the timeout are grouped together as a session. Sessions are unaligned windows. For example, in the paper's figure, Window 2 applies to Key 1 only, Window 3 to Key 2 only, and Windows 1 and 4 to Key 3 only.
From the first time I see the user, with a default activity window of 30 minutes: if nothing happens within that gap, I close the session.
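A hedged Apache Beam Python SDK sketch of that session window (the user IDs, timestamps, and gap constant are made up for illustration): events for the same key that arrive within the 30-minute gap of one another are merged into a single session; a longer silence starts a new one.

```python
import apache_beam as beam
from apache_beam.transforms import window

GAP_SECONDS = 30 * 60  # the 30-minute default activity window from above

with beam.Pipeline() as p:
    sessions = (
        p
        | beam.Create([("user1", 0), ("user1", 600), ("user1", 4000)])
        # Attach event-time timestamps so Sessions has something to merge on.
        | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
        | beam.WindowInto(window.Sessions(gap_size=GAP_SECONDS))
        | beam.GroupByKey())
    # Events at t=0 and t=600 fall into one session; t=4000 starts a new one.
```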
Watermark - the idea of a repeat clause: you can still combine in late data that arrives after the watermark. The watermark is the point at which I am pretty sure/confident that I have seen close to 100% of the data. Problems: the watermark can be too fast - a lot of data shows up after it, and you need exception handling - or too slow. One solution to late data is that you simply wait, but their fundamental assumption is that you never know. The "too fast" argument is a stupid one - it just doesn't work very well.
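A hedged Beam Python SDK sketch of that "repeat after the watermark" idea (the window size, delays, and lateness bound are invented for illustration): the trigger fires a speculative pane early, fires at the watermark, and then re-fires once for each late element, while allowed_lateness keeps the window state around so late data can still be combined in.

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

late_tolerant_windowing = beam.WindowInto(
    window.FixedWindows(60),                 # one-minute event-time windows (assumed)
    trigger=AfterWatermark(
        early=AfterProcessingTime(30),       # speculative result before the watermark
        late=AfterCount(1)),                 # re-fire for every element arriving after it
    accumulation_mode=AccumulationMode.ACCUMULATING,
    allowed_lateness=24 * 60 * 60)           # keep window state for a day of stragglers
```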
The notion of the watermark can adapt over time, and the model includes percentile watermark triggers.
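A plain-Python sketch of what an adaptive, percentile-style watermark could look like (an illustration of the concept only, not how Google's pipelines actually compute it): trail the largest event time seen so far by the observed 95th-percentile out-of-orderness, so the lag grows or shrinks as the stream's disorder changes.

```python
import bisect

class HeuristicWatermark:
    """Sketch of an adaptive watermark: trail the largest event time seen so far
    by the observed 95th-percentile out-of-orderness. Illustration only."""

    def __init__(self, percentile=0.95):
        self.percentile = percentile
        self.max_event_time = float("-inf")
        self.lags = []  # sorted sample of how far behind the frontier each element arrived

    def observe(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        bisect.insort(self.lags, self.max_event_time - event_time)

    def current(self):
        if not self.lags:
            return float("-inf")
        lag = self.lags[int(self.percentile * (len(self.lags) - 1))]
        return self.max_event_time - lag

wm = HeuristicWatermark()
for t in [10.0, 12.0, 11.0, 15.0, 9.0]:
    wm.observe(t)
print(wm.current())  # 15.0 minus the 95th-percentile observed lag
```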
Streaming is used for real-time recommendation.
Example section: processing time (y axis) vs. event time (x axis) - event time is when things really happened, and processing time lags behind it. The figure shows the values rather than when each event happened. Ideal: processing time aligned with event time; the skew is the gap between that ideal line and the actual processing time. Watermarks are used to trigger the operations on the windows. For the system, what matters is how your actual watermark is generated - it has to be aggressive enough. There is a trade-off between how exact you are and how you generate the results; confidence is about the probability of more data showing up.
Six different ways streaming systems are used at Google: big joins of the log tables (the answers were so wrong we couldn't trust them), the sessionization case, billing (you have to deal with old data), statistics collection (does not need the exact answer), a recommendation engine (does not have to be perfect), and the stock-market example (based on the shape of the data).