Data Variety and Its Role in Modern Analytics

Jul 26, 2024
Data Analytics Fundamentals Databases

The foundation of analytics lies in data sources. These sources can range from a server folder, database, webpage, to even a wearable device. They may originate from within or outside an organization. The diversity of data lies in how a source categorizes various data types. Structured, semi-structured, and unstructured data possess unique characteristics that form recognizable patterns.

Let’s look at each one more carefully:

Structured data comprises:
- Customer Relationship Management (CRM) system
- Sales transactions database
- Online forms
- Network logs
- Event reservation system
- Inventory management system
Semi-structured data includes:
- CSV
- YAML
- JSON
- XML
- HTML
Unstructured data comprises:
- Documents
- Social media posts
- Videos
- PDFs
- Audio recordings
- Emails
- Clickstream data
- Photos

Now that we understand the various data origins, let’s examine the diverse methods of storing data.

Data storage methods

Data origin types can be categorized by the storage methods employed. Organized and partially structured data find their place in database management systems (DBMS). On the other hand, disorganized data resides in data lakes or object storage systems. The intricacies of unstructured object storage systems will be revealed in a subsequent lesson, while data lakes will be explored in the upcoming course.

Structured Data Storage

Structured data finds its home in a relational database management system (RDBMS). A database, essentially a structured collection of data housed within a computer, receives this type of data. The RDBMS not only imparts organization to the data but also oversees its lifecycle maintenance and facilitates interactions with various processes and systems.

Structured data storage methods

Relational databases house data in interconnected tables. These tables consist of columns and rows, shaped by a schema that dictates the data’s structure and relationships. The primary aim with relational databases is efficient storage.

When it comes to data retrieval, these databases rely on structured query language (SQL). With a multitude of database options available, selecting the right one necessitates a clear grasp of your data type and its intended usage.

Let’s learn about the pros and cons of relational databases:

Pros of Relational Databases
- Proven technology with widespread adoption and easy to work with
- Lower risk factor
- Excellent transactional latency
- Ideal for handling relatively modest data sets
Cons of Relational Databases
- Scalability issues
- Requirement to scale hardware as data volume expands
- Rigidity of fixed schema hindering seamless modifications
- Difficulties in accommodating unstructured data

Semi-structured Data Storage

Semi-structured data typically finds its place in non-relational database systems, occasionally referred to as NoSQL databases. The term “NoSQL” can be misleading; it does not imply that the data stored cannot be queried using SQL. A more accurate interpretation is “not only SQL.”

NoSQL Database Storage Methods

Non-relational databases are designed to efficiently store semi-structured data for swift retrieval and storage. NoSQL databases store data in the form of documents or key-value pairs.

These databases fall into various categories, each tailored to fulfill distinct needs. NoSQL databases operate in a de-normalized fashion, allowing for duplicated data and offering flexibility in terms of internal document structure or lack thereof.

Okay, now let’s learn more about the 2 semi-structured storage methods, key-value and document store

Key-Value Databases

Key-value databases, a type of non-relational database, store unstructured data in key-value pairs.

Conceptually, data resides in a solitary table where values correspond to specific keys. These values exist as blob objects, eliminating the necessity for a predetermined schema. They can encompass a wide range of data types.

Strengths:

Remarkable flexibility
Ability to accommodate diverse data types
Direct linkage of keys to values, minimizing indexing and complex joins
Facilitates easy duplication of key content to other systems without necessitating data reprogramming

Weaknesses:

Challenging to query values due to their single blob storage
Editing or updating value content can be harder than you think
Not all objects align conveniently with the key-value pair model

Document Stores

Document stores, a type of non-relational database, house semi-structured and unstructured data in file format. These files can take the shape of JSON, BSON, or XML, and are accessible through various programming languages like Python and Node.js, Deno, Rust, Go even old school PHP.

Conceptually, files consist of data arranged as a series of elements representing individuals, locations, objects, or occurrences. For instance, a document store could store log files from multiple servers, each file containing system-specific details without consideration for the contents of other system log files.

Strengths:

Exceptional flexibility
Absence of the need to predetermine data type during creation
Easy scalability

Weaknesses:

Inability to query across multiple files

NoSQL Databases Overview

NoSQL databases support multiple querying languages, with a focus on object querying and leveraging compute power. These databases excel in scalability by distributing data across numerous servers or instances. They are frequently employed in OLTP applications for web and mobile platforms. Let’s learn about the strengths and weaknesses of non-relational databases.

NoSQL pros:

Updates schemas easily
Faster development
Minimal downtime
Scales well for big data

NoSQL cons:

Delayed data updates
Poor for low latency apps
Less mature than relational DBs

OLTP and OLAP systems

In databases, two main ways to organize info: OLTP and OLAP. Systems depend on data use: store and retrieve. Adding data = write; Querying data = read. Both use same resources differently, hard to optimize. Small DBs handle write and read together; Large DBs trade off read for write or vice versa.

The solution OLTP optimized for write, OLAP optimized for read. Okay, so let us take a deeper look into OLTP and OLAP. Here is the information in a markdown table

Type	Description
OLTP	OLTP databases, short for Online Transaction Processing databases, predominantly handle lookup queries. These queries typically retrieve entire rows of data and commonly involve filtering based on key columns within the table. For instance, in an OLTP system, you would likely run a query to retrieve specific details related to a particular order.
OLAP	Aggregate queries are commonly used in OLAP systems to combine multiple rows into a single row by summarizing values from one or more columns. OLAP systems are mainly employed for queries involving a limited number of columns. For instance, in this type of system, you might execute a query to find out the total amount of items sold on a specific date.

Storage of data can be either row-based or columnar

Indexing data in a database is crucial for speeding up query performance by enabling quick access to the required data. Indexes determine the physical storage of data on disk by organizing records in a predictable order based on key values in the table. This significantly influences the speed and efficiency of queries.

Both OLTP and OLAP systems can implement either row-based or columnar indexing methods. However, selecting the most suitable method is essential based on the predominant types of queries that will be executed for optimal performance.

Characteristic	Row-based	Columnar
Storage on disk	Row by row	Column by column
Read/write	Best at random reads and writes	Best at sequential reads and writes
Best for	Returning full rows of data based on a key	Returning aggregations of column values
Implementation	Transactional systems	Analytical processing
Data compression	Low to medium compression can be achieved	High compression is the norm

In reality, data is frequently written to the OLTP database. The information is then transferred to an OLAP system at regular intervals. In various scenarios, the data sent to the OLAP database undergoes transformations to offer more concise responses for analytical inquiries or is pre-aggregated to facilitate quick query outcomes.

These processes are typically carried out through extract, transform, and load ETL operations.

Comparing data storage methods

Several factors can assist in selecting the most suitable database type for a new project or program under development. Below are the main characteristics of the two database types that have been discussed.

Characteristic	Row-based	Columnar
Representation	Multiple tables, each containing columns and rows	Single table with keys and values
Data design	Normalized relational or dimensional data warehouse	Denormalized document, wide column, or key-value
Optimized	Optimized for storage	Optimized for compute
Query style	Language: SQL	Language: Many, uses object querying
Scaling	Scale vertically	Scale horizontally
Implementation	OLTP business systems	Analytical systems

Conclusion

Structured data works well with relational databases (RDBMS), offering reliability and efficient transactions but facing challenges in scalability. Semi-structured data is linked to non-relational or NoSQL databases, providing flexibility and scalability with schema modifications but struggling with complex queries.

OLTP and OLAP systems for transactional and analytical processing, highlighting the importance of selecting between row-based and columnar storage for managing queries and workloads effectively. We also learned that to successfully work within data storage you must understanding the application you are building and the data needed. Whether you prioritize relational database reliability or NoSQL versatility for various data and analytics. As you navigate data storage, stay informed about advancements and best practices to ensure your data infrastructure remains resilient, scalable, and meets your evolving needs.

If you enjoyed this article, you might also like THIS Post .