Alex Williams Alex Williams is a seasoned full-stack developer and the owner of Hosting Data UK. After graduating from the University of London, majoring in IT, Alex worked as a developer leading various projects for clients from all over the world for almost 10 years. Recently, Alex switched to being an independent IT consultant and started his own blog. There, he explores web development, data management, digital marketing, and solutions for online business owners just starting out.

NoSQL wide-column stores demystified

6 min read 1787

NoSQL Wide-column Stores Demystified

Many people believe NoSQL to be ancient technology. In the world of databases, however, NoSQL is considered a baby — even though it’s been around since the early ’70s. How’s that possible?

Well, NoSQL wasn’t really popular until the late 2000s, when both Google and Amazon put a lot of research and resources into it. Since then, its popularity and usefulness have grown exponentially, to the point where almost every big website and company utilizes NoSQL in some way.

Another common misconception is that NoSQL can be better or worse than its semantic counterpart, SQL. On the contrary, both of these database types are suited for different types of data and thus will never replace or outshine each other.

Without going into too much detail, SQL databases have a predefined schema, while NoSQL databases are dynamic and perfect for unstructured data. NoSQL databases can use a schema, although it’s not mandatory.

With that in mind, today, we’ll have a look at one of the less complex NoSQL database management systems: wide-column stores, also known as column families. This NoSQL model stores that in columns rather than rows. Thus, it’s perfect for queries and less than optimal for large sets of data.

Then, we’ll go through the following explanations, which are crucial to using wide-column stores properly:

Different NoSQL database management systems

First, let’s take a look at the four main NoSQL database management systems. It will help us get a better idea of why column families are so popular:

1. keyvaluestores

The simplest type are key-value stores. Redis is one example; every single item is given an attribute name/key and value.

2. documentdatabases

Document databases, such as MongoDB, associate keys with a complex data schema known as a document. Nested documents and key-array/value pairs are containable inside each document.

3. graphdatabases

Graph databases like Neo4j sort network information such as social connections. The collection of nodes (or vertices, i.e., a thing, place, person, category, and so on), each reflecting data (properties), are given labels (edges) establishing the relationship between different nodes.

We made a custom demo for .
No really. Click here to check it out.

4. widecolumnstores

Wide-column stores structure data around columns rather than rows; HBase and Apache Cassandra are two examples. Typically, column families are supported — multiple columns used in unison in a similar way to relational database tables.

What are wide-column stores?

Wide-column stores use the typical tables, columns, and rows, but unlike relational databases (RDBs), columnal formatting and names can vary from row to row inside the same table. And each column is stored separately on disk.

Columnar databases store each column in a separate file. One file stores only the key column, the other only the first name, the other the ZIP, and so on. Each column in a row is governed by auto-indexing — each functions almost as an index — which means that a scanned/queried columns offset corresponds to the other columnal offsets in that row in their respective files.

Illustrating the Concept of a Columnar Database

Traditional row-oriented storage gives you the best performance when querying multiple columns of a single row. Of course, relational databases are structured around columns that hold very specific information, upholding that specificity for each entry. For instance, let’s take a Customer table. Column values contain Customer names, addresses, and contact info. Every Customer has the same format.

Columnar families are different. They give you automatic vertical partitioning; storage is both column-based and organized by less restrictive attributes. RDB tables are also restricted to row-based storage and deal with tuple storage in rows, accounting for all attributes before moving forward; e.g., tuple 1 attribute 1, tuple 1 attribute 2, and so on — then tuple 2 attribute 1, tuple 2 attribute 2, and so on — in that order. The opposite is columnar storage, which is why we use the term column families.

Note: some columnar systems also have the option for horizontal partitions at default of, say, 6 million rows. When it’s time to run a scan, this eliminates the need to partition during the actual query. Set up your system to sort its horizontal partitions at default based on the most commonly used columns. This minimizes the number of extents containing the values you are looking for.

One useful option if offered — InfiniDB is one example that does — is to automatically create horizontal partitions based on the most recent queries. This eliminates the impact of much older queries that are no longer crucial.

Column family database objects

Families (a database object) contain columns of related information. The object is a tuple made up of a key-value pair where the key is linked to a value, and the value is a set of columns. A family can be one attribute or a set of related attributes.

Illustrating Column Family Database Objects

We can call the first column model an entity/attribute/value table. Inside of an entity (column), there is a value/attribute table. For customer data, you might have the following for the first column option:

Customer ID Attribute Value
0001 name Jane Doe
0001 phone number 1 100200300
0001 email janedoe@jd.com

Compared to RDBs, attribute/value tables shine when entering the more unique attributes.

Customer ID —————– 0001
0001 pet peeve —————–
hobby pop music
Attribute sewing
—————– Value

Super columns hold the same information but formatted differently.

Customer ID: 0001
Attribute Value
—————– —————–
pet peeve pop music
hobby sewing

A super column family and super column merely add a row ID for the first two models so the data can be obtained faster. Use as many super column models as entities. Have them in individual NoSQL tables or compiled as a super column family.

Two main columnar family types

1. columnarrelationalmodels

Columnar-type storage can integrate columnar relational models even though they are also considered a part of NoSQL.

2. keyvaluestores

Key-value stores and/or Bigtables.

Columnar relational models: Advantages and disadvantages

Advantages

Columnar relational models allow for improved compression of attributes when stored in an attribute-wise manner. All of the data in each file is of the same data file.

Let’s say you have a few dozen entries that share the same attribute. You could select all tuples through that attribute, then filter it further using an ID range (for example, only tuples with IDs 230 to 910). This compression requires less storage and — more impressively — quicker querying.

As an example, say you were looking for a collection of tuples with a value greater than x. Rather than running the search through all tuples and gathering tuples with a value over x, you simply target the value and skip over any tuples that do not qualify; as such, fewer disk blocks/bytes are checked. Generally, querying is faster if only one attribute is queried.

Illustrating the Advantage of Querying Columnar Relational Databases

Each attribute is stored separately into blocks, resulting in a much greater ratio of tuples and attributes that can be searched per disk block search. The decision-making process is quicker. Another related advantage of the columnar relational model is faster joins.

It’s also much easier to add new columns each time you derive new attributes to add to your database. Rather than needing to rebuild enormous tables, columnar databases simply create another file for the new column.

Disadvantages

As far as disadvantages go, updates can be inefficient. For example, say you want to update a specific tuple for multiple attributes. RDB models can do this faster. The fact that columnar families group attributes, as opposed to rows of tuples, works against it; it takes more blocks to update multiple attributes than RDBs would need in this case.

If multiple attributes are touched by a join or query, this may also lead to column storage experiencing slower performance (but other factors come into play as well). It’s also slower when deleting rows from columnar systems, as a record needs to be deleted from each of the record files.

Overall, columnar families work well for OLAP (Online Analytical Processing) but not well for OLTP (Online Transactional Processing). Let’s explore OLTP vs. OLAP scenarios in a bit more detail below.

Queries in OLTP applications

Typically, in this instance, single updates are being done on a very small part of the database, such as one or a few account tuples. Nevertheless, they will need to deal with multiple attributes, which will give RDBs an advantage in speed.

John Smith calls customer services, and you can pinpoint his information through his customer ID or phone number. While the phone number might not be unique, it will narrow down which accounts to select from. This is a transactional scenario rather than an analytical one.

So columnar databases are preferable for OLTP systems? Wrong — you should not attempt to do OLTP-type (single-row operation) transactions on columnar databases. When this process is carried out via a row-oriented system, it simply adds a new entry (row) to the end of your table (the last page).

In contrast, columnar systems need to add/append new values to each respective file. The greater the number of rows you have in your database, the more of a killer this will be on performance (don’t do this: batch inserts are a possible fix for inserting lots of data quickly).

Queries in OLAP applications

Typically, if you’re doing queries that look for metadata insights, such as averages of all account values (sum) across the table, columnar databases can access the specific column much faster, and do aggregations and summaries much faster, than our RDB models.

Perhaps you want to know the average age of your male customers. This will typically result in a sequential scan, which is a performance killer. Let’s say you have 100 million rows of entries with 100 columns each. Either you will need to create composite indexes on sex or read all entries to filter for the target data, which could be gigabytes’ or terabytes’ worth of work.

Rather than reading countless rows/columns of tuples — containing tons of data — columnar systems let you narrow down the tuples that you need to investigate by scanning only the two or three columns actually relevant to your query.

Key takeaways and how to adapt your approach

Columnar databases give you improved automation with regards to vertical partitioning (filter out irrelevant columns in your queries — ideal for analytical queries), horizontal partitioning (improve efficiency by eliminating irrelevant extents), better compression, and auto-indexing of columns.

In systems similar to InfiniDB, you will be able to use standard MySQL syntax for most commands. For instance: create table, select, insert, and so on. You will find some exceptions, such as the lack of cartesian products and trigger support.

And finally, integrate your knowledge of standard SQL/MySQL with frontend.

: Full visibility into your web apps

LogRocket is a frontend application monitoring solution that lets you replay problems as if they happened in your own browser. Instead of guessing why errors happen, or asking users for screenshots and log dumps, LogRocket lets you replay the session to quickly understand what went wrong. It works perfectly with any app, regardless of framework, and has plugins to log additional context from Redux, Vuex, and @ngrx/store.

In addition to logging Redux actions and state, LogRocket records console logs, JavaScript errors, stacktraces, network requests/responses with headers + bodies, browser metadata, and custom logs. It also instruments the DOM to record the HTML and CSS on the page, recreating pixel-perfect videos of even the most complex single-page apps.

.
Alex Williams Alex Williams is a seasoned full-stack developer and the owner of Hosting Data UK. After graduating from the University of London, majoring in IT, Alex worked as a developer leading various projects for clients from all over the world for almost 10 years. Recently, Alex switched to being an independent IT consultant and started his own blog. There, he explores web development, data management, digital marketing, and solutions for online business owners just starting out.

Leave a Reply