Understanding Data Velocity for Data Processing

Jul 23, 2024
Data Analytics Fundamentals Big-Data

As I mentioned in my last post, Understanding the 5 Vs of Big Data, velocity has to do with the speed at which data is processed and analyzed.

Data processing has two parts: data collection and data transformation. Data collection is when we gather data from multiple sources into one single place for storage. Data transformation, on the other hand, is the formatting, organizing, and controlling of data. Now let’s dive into these.
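To make the two parts concrete, here is a minimal sketch in Python. The two order lists and the `collect` and `transform` helpers are made up for illustration; real sources would more likely be files, APIs, or databases.

```python
from typing import Iterable

# Hypothetical sources; the records and field names are invented for this example.
web_orders = [{"id": 1, "amount": "19.99", "country": "us"}]
store_orders = [{"id": 2, "amount": "5.00", "country": "DE"}]


def collect(*sources: Iterable[dict]) -> list[dict]:
    """Collection: gather records from several sources into one place."""
    combined = []
    for source in sources:
        combined.extend(source)
    return combined


def transform(records: list[dict]) -> list[dict]:
    """Transformation: give every record a consistent format and order."""
    cleaned = [
        {
            "id": r["id"],
            "amount": float(r["amount"]),     # text -> number
            "country": r["country"].upper(),  # consistent casing
        }
        for r in records
    ]
    return sorted(cleaned, key=lambda r: r["id"])


orders = transform(collect(web_orders, store_orders))
print(orders)
```

Collection only moves the data into one list; all of the cleanup happens in the transformation step, which keeps the two concerns easy to change independently.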

When it comes to processing data, there are two types of data processing: batch and stream.

Batch processing

Batch processing is the way to go when there is a lot of data to process and it needs to be processed at specific intervals, e.g. running a batch process on a schedule or whenever a certain volume of data is reached (an event). It is usually performed on datasets like server logs, financial data, clickstream summaries, and fraud reports.
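As a rough illustration of the batch idea, the sketch below takes a pile of server log lines that has already accumulated and summarizes it in one pass. The log format and the `summarize_batch` helper are assumptions made for this example, not a standard.

```python
from collections import Counter

# Hypothetical log lines accumulated since the last run.
log_lines = [
    "2024-07-22 10:00:01 GET /home 200",
    "2024-07-22 10:00:05 GET /cart 500",
    "2024-07-22 10:01:12 POST /checkout 200",
    "2024-07-22 10:02:40 GET /home 404",
]


def summarize_batch(lines: list[str]) -> Counter:
    """Process the whole batch in one pass and count responses by status code."""
    statuses = Counter()
    for line in lines:
        status = line.split()[-1]  # assumed format: the last field is the HTTP status
        statuses[status] += 1
    return statuses


summary = summarize_batch(log_lines)
print(summary)
```

The key point is that nothing is processed until the whole batch is available; the job sees all the lines at once.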

Stream processing

Stream processing is the way to go when your data is generated continuously (like a waterfall). You should use stream processing whenever you need real-time feedback or continuous insights. Stream processing is normally performed on datasets like e-commerce purchases, to handle prices or to identify currently trending products. It can also be applied to sensor data from Internet of Things (IoT) devices, or even to information from social networks.
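For contrast with batch, here is a small sketch of the trending-products idea using a Python generator. The `purchase_stream` function is a stand-in for a real, endless feed, and the product names are invented.

```python
from collections import Counter
from typing import Iterable, Iterator


def purchase_stream() -> Iterator[str]:
    """Stand-in for an endless feed; yields product names one at a time."""
    yield from ["mug", "lamp", "mug", "desk", "mug", "lamp"]


def trending(events: Iterable[str]) -> Iterator[tuple[str, int]]:
    """After every single event, yield the current top product and its count."""
    counts = Counter()
    for product in events:
        counts[product] += 1
        yield counts.most_common(1)[0]


updates = list(trending(purchase_stream()))
print(updates[-1])
```

Unlike the batch approach, the result is refreshed after each purchase, so an up-to-date answer is available at any moment without waiting for the data to finish arriving.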

Data processing velocities

[Image: Batch Processing]

Now let’s learn more about the four velocities for processing data.

Scheduled (batch)

Scheduled batch processing is when data is processed in very large volumes on a regular schedule. It could be once a week or once a day. It’s generally the same amount of data every time data is loaded, which makes these workloads more predictable to work with.
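The predictability of a scheduled job can be sketched in a few lines: given a first run time and a fixed interval, every future run is known in advance. The `scheduled_runs` helper and the 02:00 nightly start time are hypothetical; in practice a scheduler such as cron would usually own this.

```python
from datetime import datetime, timedelta


def scheduled_runs(first_run: datetime, every: timedelta, count: int) -> list[datetime]:
    """Return the next `count` run times for a fixed-interval batch job."""
    return [first_run + i * every for i in range(count)]


# Hypothetical nightly job starting at 02:00.
runs = scheduled_runs(datetime(2024, 7, 23, 2, 0), timedelta(days=1), 3)
for run in runs:
    print(run.isoformat())
```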

Periodic (batch)

Periodic batch processing is a batch of data that is processed at irregular times. These kinds of workloads are often run after a certain amount of data has been collected. This can make them unpredictable and harder to plan around.
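A volume trigger like this can be sketched as a buffer that processes itself once it is full. The `VolumeTriggeredBatcher` class, the threshold of 3, and the "sum the batch" work are all placeholders chosen for this example.

```python
class VolumeTriggeredBatcher:
    """Buffer records and process them whenever the buffer reaches `threshold`."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.buffer: list[int] = []
        self.results: list[int] = []  # one entry per processed batch

    def add(self, record: int) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.threshold:
            self._process()

    def _process(self) -> None:
        # Placeholder work: sum the batch. A real job would do far more.
        self.results.append(sum(self.buffer))
        self.buffer.clear()


batcher = VolumeTriggeredBatcher(threshold=3)
for value in [4, 1, 7, 2, 9, 3, 5]:
    batcher.add(value)

print(batcher.results)  # processed batches
print(batcher.buffer)   # records still waiting for the next trigger
```

Notice that when a batch runs depends entirely on how fast records arrive, which is exactly why these workloads are harder to plan around than scheduled ones.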

Near real-time (stream)

Near real-time processing is the streaming of data that is processed in small individual batches.

These batches are continuously collected and then processed within minutes of the data being generated.
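One common way to picture these small batches is as fixed time windows. The sketch below groups events into 60-second windows; the event data, the window size, and the `micro_batches` helper are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical (timestamp_in_seconds, sensor_reading) events.
events = [(3, 20.1), (42, 20.4), (61, 21.0), (119, 21.3), (125, 22.0)]


def micro_batches(events, window_seconds: int = 60) -> dict[int, list[float]]:
    """Group events into consecutive fixed-size time windows."""
    batches = defaultdict(list)
    for timestamp, reading in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        batches[window_start].append(reading)
    return dict(batches)


batches = micro_batches(events)
for start, readings in batches.items():
    print(start, readings)
```

Each window would then be handed to the processing step shortly after it closes, which is where the delay of "within minutes" comes from.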

Real-time (stream)

Real-time processing is the streaming of data that is processed in very small individual batches.

The batches are continually collected and processed within milliseconds of the data being generated.
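Taken to the extreme, the batch shrinks to a single event that is handled the moment it arrives. Here is a minimal sketch of that idea, assuming a hypothetical temperature feed and a made-up alert limit of 30.0; it shows the per-event shape of the code, not how to reach millisecond latency on a real system.

```python
from typing import Callable, Iterable


def handle_in_real_time(
    readings: Iterable[float],
    limit: float,
    on_alert: Callable[[float], None],
) -> None:
    """React to each reading immediately instead of waiting for a batch."""
    for reading in readings:
        if reading > limit:
            on_alert(reading)


alerts: list[float] = []
handle_in_real_time([20.5, 21.0, 35.2, 22.1, 40.0], limit=30.0, on_alert=alerts.append)
print(alerts)
```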

Conclusion

In summary, understanding the velocity of data processing is crucial for optimizing how data is handled and analyzed. There are two primary types of data processing: batch processing, which is suitable for handling large volumes of data at specific intervals, and stream processing, which is ideal for continuous, real-time data flow. Within these categories, there are four specific velocities:

Scheduled batch processing: Regular, predictable intervals.
Periodic batch processing: Irregular intervals, triggered by data volume.
Near real-time processing: Continuous small batches processed within minutes.
Real-time processing: Continuous very small batches processed within milliseconds.

Choosing the right data processing method and velocity is essential to meet the specific needs of various applications, from e-commerce to IoT, ensuring efficient and timely data analysis.
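As a rough way to think about that choice, the four velocities can be folded into a small helper. The `choose_velocity` function and its thresholds (under 1 second for real-time, up to 10 minutes for near real-time) are illustrative assumptions of mine, not fixed rules; adjust them to what your application actually needs.

```python
def choose_velocity(max_delay_seconds: float, regular_schedule: bool = True) -> str:
    """Suggest one of the four velocities. Thresholds are illustrative assumptions."""
    if max_delay_seconds < 1:
        return "real-time (stream)"
    if max_delay_seconds <= 10 * 60:
        return "near real-time (stream)"
    # Anything slower is a batch workload; the trigger decides which kind.
    if regular_schedule:
        return "scheduled (batch)"
    return "periodic (batch)"


print(choose_velocity(0.05))                           # sub-second feedback needed
print(choose_velocity(120))                            # a couple of minutes is fine
print(choose_velocity(86_400))                         # daily, on a fixed schedule
print(choose_velocity(86_400, regular_schedule=False)) # daily-ish, volume-triggered
```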
