In this blog post, we will cover how you can combine and leverage the open-source streaming solution bytewax with ydata-profiling to improve the quality of your streaming flows. Buckle up!
Stream processing enables real-time analysis of data in-flight and before storage, and can be stateful or stateless.
Stateful stream processing is used for real-time recommendations, pattern detection, or complex event processing, where the history of what has happened is required for the processing (windows, joining by a key, etc.).
Stateless stream processing is used for inline transformation that doesn’t require knowledge of other data points in the stream like masking an email or converting a type.
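As a toy illustration of a stateless step, here is a minimal email-masking function of the kind that could be applied record by record; the masking format itself is just an example:

```python
def mask_email(email: str) -> str:
    """Mask the local part of an email, keeping only its first character.

    Each record is handled independently, with no knowledge of other
    data points in the stream -- the hallmark of a stateless transform.
    """
    local, _, domain = email.partition("@")
    if local and domain:
        return f"{local[0]}***@{domain}"
    return email  # leave malformed values untouched

print(mask_email("jane.doe@example.com"))  # j***@example.com
```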
Overall, data streams are widely used in industry, applied to use cases such as fraud detection, patient monitoring, or even predictive maintenance.
Unlike traditional models where data quality is usually assessed during the creation of the data warehouse or dashboard solution, streaming data requires continuous monitoring.
It is essential to maintain data quality throughout the entire process, from collection to feeding downstream applications. After all, the cost of bad data quality can be high for organizations:
“The cost of bad data is an astonishing 15% to 25% of revenue for most companies. (…) Two-thirds of these costs can be eliminated by getting in front on data quality.”
— Thomas C. Redman, author of “Getting in Front on Data Quality”
Throughout this article, we will show you how you can combine bytewax with ydata-profiling to profile and improve the quality of your streaming flows!
Bytewax is an OSS stream processing framework designed specifically for Python developers.
It allows users to build streaming data pipelines and real-time applications with capabilities similar to Flink, Spark, and Kafka Streams, while providing a friendly and familiar interface and 100% compatibility with the Python ecosystem.
Using built-in connectors or existing Python libraries, you can connect to real-time and streaming data sources (Kafka, RedPanda, WebSocket, etc.) and write transformed data out to various downstream systems (Kafka, parquet files, data lakes, etc.).
For the transformations, Bytewax facilitates stateful and stateless transformations with map, windowing, and aggregation methods and comes with familiar features such as recovery and scalability.
Bytewax facilitates a Python first and data-centric experience to data streams and is purposely built for data engineers and data scientists. It allows users to build streaming data pipelines and real-time applications and create customizations necessary to meet their needs without having to learn and maintain JVM-based streaming platforms like Spark or Flink.
Bytewax is well suited for many use cases, namely Embedding Pipelines For Generative AI, Handling Missing Values in Data Streams, Using Language Models in a Streaming Context to Understand Financial Markets, and more. For use case inspiration and more information like documentation, tutorials, and guides, feel free to check the bytewax website.
Data Profiling is key to a successful start of any machine learning task, and refers to the step of thoroughly understanding our data: its structure, behavior, and quality.
In a nutshell, data profiling involves analyzing aspects related to the data’s format and basic descriptors (e.g., number of samples, number/types of features, duplicate values), its intrinsic characteristics (such as the presence of missing data or imbalanced features), and other complicating factors that may arise during data collection or processing (e.g., erroneous values or inconsistent features).
Ensuring high data quality standards is crucial for all domains and organizations, but it is especially relevant for domains that output continuous data, where circumstances might change fast and may require immediate action (e.g., healthcare monitoring, stock values, air quality policies).
For many domains, data profiling is used through an exploratory data analysis lens, considering historical data stored in databases. For data streams, by contrast, profiling becomes essential for continuous validation and quality control along the stream, where data needs to be checked at different time frames or stages of the process.
By embedding an automated profiling into our data flows, we can immediately get feedback on the current state of our data and be alerted for any potentially critical issues — whether they are related to data consistency and integrity (e.g., corrupted values or changing formats), or to events happening in short periods of time (e.g., data drifts, deviation from business rules and outcomes).
In real-world domains — where you just know Murphy’s law is bound to strike and “everything can definitely go wrong” — automated profiling might save us from multiple brain puzzles and systems needing to be taken out of production!
When it comes to data profiling, ydata-profiling has consistently been a crowd favorite, for both tabular and time-series data. And no wonder why — it’s one line of code for an extensive set of analyses and insights.
Complex and time-draining operations are done under the hood: ydata-profiling automatically detects the feature types in the data and, depending on whether they are numeric or categorical, adjusts the summary statistics and visualizations shown in the profiling report.
Fostering a data-centric analysis, the package also highlights the existing relationships between features, focusing on their pairwise interactions and correlations, and provides a thorough evaluation of data quality alerts, from duplicate or constant values to skewed and imbalanced features.
It’s really a 360º view of the quality of our data — with minimal effort.
Before starting the project, we first need to set up our Python dependencies and configure our data source.
First, let’s install the bytewax and ydata-profiling packages. (You might want to use a virtual environment for this — check these instructions if you need some extra guidance!)
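Assuming a standard pip setup, the installation might look like this (package names as published on PyPI):

```shell
pip install bytewax ydata-profiling
```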
Then, we’ll upload the Environmental Sensor Telemetry Dataset (License — CC0: Public Domain), which contains several measurements of temperature, humidity, carbon monoxide, liquefied petroleum gas, smoke, light, and motion from different IoT devices:
In a production environment, these measurements would be continuously generated by each device, and the input would look like what we expect in a streaming platform such as Kafka. In this article, to simulate the context we would find with streaming data, we will read the data from the CSV file one line at a time and create a dataflow using bytewax.
(As a quick side note, a dataflow is essentially a data pipeline that can be described as a directed acyclic graph — DAG)
First, let’s make some necessary imports:
Then, we define our dataflow object. Afterwards, we will use a stateless map method where we pass in a function to convert the string to a datetime object and restructure the data to the format (device_id, data).
The map method will make the change to each data point in a stateless way. The reason we have modified the shape of our data is so that we can easily group the data in the next steps to profile data for each device separately rather than for all of the devices simultaneously.
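As a sketch, the function handed to the map step might look like the following; the field names (`ts` for the epoch timestamp, `device` for the device identifier) are assumptions about the dataset’s schema, and the exact bytewax call signature depends on your version:

```python
from datetime import datetime, timezone


def parse_and_key(record: dict) -> tuple:
    """Parse the timestamp string and re-key the record as (device_id, data).

    This is the kind of function that would be passed to a bytewax map
    step, e.g. flow.map(parse_and_key), so it runs once per data point.
    """
    data = dict(record)
    # The dataset stores epoch seconds as a string; field name assumed.
    data["ts"] = datetime.fromtimestamp(float(data["ts"]), tz=timezone.utc)
    # Pull the device id out so downstream steps can group by it.
    return (data.pop("device"), data)


key, payload = parse_and_key(
    {"ts": "1594512094.38", "device": "b8:27:eb:bf:9d:51", "temp": "22.7"}
)
```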
Now we will take advantage of the stateful capabilities of bytewax to gather data for each device over a window of time that we have defined.
ydata-profiling expects a snapshot of the data over time, which makes the window operator the perfect method to use to do this.
With ydata-profiling, we are able to produce summary statistics for a dataframe specified for a particular context. For instance, in our example, we can produce snapshots of data referring to each IoT device or to particular time frames:
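The accumulation logic itself is simple. Shown here as plain Python, these are the kind of builder and folder functions that a bytewax windowing operator (fold_window, in 0.17-era APIs) would call to build one snapshot per device per window:

```python
import pandas as pd


def new_snapshot() -> list:
    """Builder: start an empty accumulator for a fresh window."""
    return []


def accumulate(snapshot: list, event: dict) -> list:
    """Folder: append each incoming record to the window's accumulator."""
    snapshot.append(event)
    return snapshot


# Once the window closes, the accumulated records become one dataframe,
# which is the snapshot that ydata-profiling will summarize.
acc = new_snapshot()
for event in [{"temp": 22.7}, {"temp": 23.1}]:
    acc = accumulate(acc, event)
df = pd.DataFrame(acc)
```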
After the snapshots are defined, leveraging
ydata-profiling is as simple as calling the
ProfileReport for each of the dataframes we would like to analyze:
In this example, we write the reports out to local files as part of a function in a map step; they could instead be delivered via a messaging tool or saved to remote storage. Once the profile is complete, the dataflow expects some output, so we can use the built-in StdOutput to print the device that was profiled and the time of profiling, as passed out of the profile function in the map step:
There are multiple ways to execute Bytewax dataflows. In this example, we run everything on a single local machine, but Bytewax can also run across multiple Python processes, across multiple hosts, in a Docker container, on a Kubernetes cluster, and more.
In this article, we’ll continue with a local setup, but we encourage you to check our helper tool waxctl which manages Kubernetes dataflow deployments once your pipeline is ready to transition to production.
Assuming we are in the same directory as the file with the dataflow definition, we can run it using:
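Assuming the dataflow lives in a file named dataflow.py and exposes a variable named flow (both names are assumptions here), bytewax’s script runner would be invoked like this:

```shell
python -m bytewax.run dataflow:flow
```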
We can then use the profiling reports to validate the data quality, check for changes in schemas or data formats, and compare the data characteristics between different devices or time windows.
In fact, we can leverage the comparison report functionality that highlights the differences between two data profiles in a straightforward manner, making it easier for us to detect important patterns that need to be investigated or issues that have to be addressed:
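ydata-profiling exposes this through the compare method on a report object. A minimal sketch, assuming two snapshot dataframes are already in hand:

```python
import pandas as pd


def compare_snapshots(df_a: pd.DataFrame, df_b: pd.DataFrame, out_path: str) -> None:
    """Write a side-by-side comparison report for two data snapshots."""
    # Import kept local to this helper, since only it needs the package.
    from ydata_profiling import ProfileReport

    report_a = ProfileReport(df_a, title="Snapshot A")
    report_b = ProfileReport(df_b, title="Snapshot B")
    # compare() highlights the differences between the two profiles
    # in a single combined report.
    report_a.compare(report_b).to_file(out_path)
```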
Validating data streams is crucial to identify issues in data quality in a continuous manner and compare the state of data across distinct periods of time.
For organizations in healthcare, energy, manufacturing, and entertainment — all working with continuous streams of data — automated profiling is key to establishing data governance best practices, from quality assessment to data privacy.
This requires the analysis of snapshots of data which, as showcased in this article, can be achieved in a seamless way by combining bytewax and ydata-profiling.
Bytewax takes care of all the processes necessary to handle and structure data streams into snapshots, which can then be summarized and compared with ydata-profiling through a comprehensive report of data characteristics.
Being able to appropriately process and profile incoming data opens up a plethora of use cases across different domains, from the correction of errors in data schemas and formats to the highlighting and mitigation of additional issues that derive from real-world activities, such as anomaly detection (e.g., fraud or intrusion/threat detection), equipment malfunction, and other events that deviate from expectations (e.g., data drifts or misalignment with business rules).
Now you’re all set to start exploring your data streams! Let us know what other use cases you find and as always, feel free to drop us a line in the comments, or find us at the Data-Centric AI Community for further questions and suggestions! See you there!