What are stream processing engines?
Stream processing is the processing of data in motion―in other words, computing on data directly as it is produced or received (as opposed to map-reduce databases such as Hadoop, which process data at rest).
Before stream processing emerged as a standard for processing continuous datasets, these streams of data were often stored in a database, a file system, or some other form of mass storage. Applications would then query the stored data or compute over the data as needed. One notable downside of this approach―broadly referred to as batch processing―is the latency between the creation of data and the use of data for analysis or action.
In most stream processing engines users have to write code to create operators, wire them up in a graph and run them. Then the engine runs the graph in parallel.
What are some examples of stream processing engines used in the IoT domain?
Stream processing engines have a narrow usage in IoT – for runtime processing of IoT data streams. They are not designed as a generic rules engine and e.g. cannot actuate back on devices directly.
Some of the most common stream processing engines are Apache Storm, Flink, Samza etc.
Upon receiving an event from a data stream, a stream processing application reacts to the event immediately. The application might trigger an action, update an aggregate, or “remember” the event for future use. Stream processing computations can also handle multiple data streams jointly, and each computation over the event data stream may produce other event data streams.
Stream processing rules engines are typically used for applications such as algorithmic trading, market data analytics, network monitoring, surveillance, e-fraud detection and prevention, clickstream analytics and real-time compliance (anti-money laundering).
Can you model complex logic with stream processing engines?
No high order logic constructions (combining multiple non-binary outcomes, majority voting, conditional executions) are possible with stream rules engines. However, developers can run StreamSQL on top of the datastreams, where simple thresholds together with aggregation across all streams or certain stream subsets can bring great value for some use cases.
How well can stream processing engines deal with the time dimension?
Stream processing engines cannot cope with synchronous and asynchronous events in the same rule. This means that we can’t intercept the stream data and at the same moment call an external API service, while executing the rule. Stream processing engines are designed to focus on the high throughput stream execution, which would, for any API call that has a big round-trip delay for a given event, simply break the processing pipeline.
Still, stream processing engines have a very powerful query language – StreamSQL. StreamSQL queries over streams are generally “continuous”, executing for long periods of time and returning incremental results. These operations include: Selecting from a stream, Stream-Relation Joins, Union and Merge and Windowing and Aggregation operations.
Are stream processing engines explainable?
Unless you are a developer and familiar with Stream SQL, it is impossible as a user to understand the behaviour of any particular rule. We can argue the same for any typical SQL-based solution.
Are stream processing engines adaptable?
API extensions and overall flexibility are weak points of these rules engines. Stream processing engines are data processing pipelines, not meant to be directly integrated with third-party API systems.
How easy is it to operate stream processing engines?
In many IoT stream processing use cases, stream processing is used for global threshold crossing (e.g. send an alarm if temperature of any event is above a threshold) or aggregations (e.g. average temperature in a given region) but any more complicated calculation or per device threshold crossing is extremely hard to achieve. This is why templating, updating rules per device or version updates are very difficult.
Are stream processing engines scalable?
When it comes to real-time large-volume data processing capabilities, nothing can beat stream processing engines, they are the most scalable engines out there today.
This is an excerpt from our latest Benchmark Report on Rules Engines used in the IoT domain. You can have a look at a summary of the results or download the full report over here.