Many people believe NoSQL to be ancient technology. In the world of databases, however, NoSQL is considered a baby — even though non-relational databases have been around since well before the term "NoSQL" existed. How's that possible?
Well, NoSQL wasn’t really popular until the late 2000s, when both Google and Amazon put a lot of research and resources into it. Since then, its popularity and usefulness have grown exponentially, to the point where almost every big website and company utilizes NoSQL in some way.
Another common misconception is that NoSQL is simply better or worse than its relational counterpart, SQL. In reality, these two database types are suited to different kinds of data, so neither will ever replace or outshine the other.
Without going into too much detail, SQL databases have a predefined schema, while NoSQL databases are dynamic and perfect for unstructured data. NoSQL databases can use a schema, although it’s not mandatory.
With that in mind, today, we’ll have a look at one of the less complex NoSQL database management systems: wide-column stores, also known as column families. This NoSQL model stores data in columns rather than rows. Thus, it’s perfect for analytical queries and less than optimal for frequent single-row transactions.
Then, we’ll go through the following explanations, which are crucial to using wide-column stores properly:
- Different NoSQL database management systems
- What are wide-column stores?
- Column family database objects
- Columnar relational models: Advantages and disadvantages
- Queries in OLTP applications
- Queries in OLAP applications
- Key takeaways and how to adapt your approach
Different NoSQL database management systems
First, let’s take a look at the four main NoSQL database management systems. It will help us get a better idea of why column families are so popular:
The simplest type is the key-value store. Redis is one example; every single item is stored as an attribute name (key) paired with a value.
Document databases, such as MongoDB, associate each key with a complex data structure known as a document. Each document can contain nested documents and key-array/value pairs.
Graph databases like Neo4j store network information, such as social connections. A collection of nodes (or vertices — a thing, place, person, category, and so on) each holds data (properties), while edges establish the relationships between different nodes.
Wide-column stores structure data around columns rather than rows; HBase and Apache Cassandra are two examples. Typically, column families are supported — multiple columns used in unison in a similar way to relational database tables.
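As a rough mental model, the four types can be sketched with plain Python structures. These are illustrative shapes only, not real client APIs, and all the keys and values are made up:

```python
# Key-value store: one opaque value per key (Redis-style).
kv_store = {"session:42": "user=jsmith;ttl=3600"}

# Document store: a key maps to a nested document (MongoDB-style).
doc_store = {
    "customer:0001": {
        "name": "John Smith",
        "contacts": {"phone": "100200300"},
        "orders": [{"id": 7, "total": 19.99}],
    }
}

# Graph: nodes carry properties; edges establish relationships.
nodes = {1: {"label": "Person", "name": "John"},
         2: {"label": "Person", "name": "Ann"}}
edges = [(1, "FRIENDS_WITH", 2)]

# Wide-column store: a row key maps to a column family whose
# columns can vary from row to row (Cassandra/HBase-style).
wide_column = {
    "0001": {"name": "John Smith", "phone": "100200300"},
    "0002": {"name": "Ann Lee", "zip": "90210"},  # different columns, same table
}
```

Note how the two wide-column rows carry different column sets — exactly the flexibility that relational tables disallow.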
What are wide-column stores?
Wide-column stores use the typical tables, columns, and rows, but unlike relational databases (RDBs), column formatting and names can vary from row to row within the same table. And each column is stored separately on disk.
Columnar databases store each column in a separate file. One file stores only the key column, another only the first name, another only the ZIP, and so on. Each column in a row is governed by auto-indexing — each functions almost as an index — which means that a scanned/queried column’s offset corresponds to the offsets of the other columns of that row in their respective files.
Traditional row-oriented storage gives you the best performance when querying multiple columns of a single row. Of course, relational databases are structured around columns that hold very specific information, upholding that specificity for each entry. For instance, let’s take a Customer table. Column values contain Customer names, addresses, and contact info. Every Customer has the same format.
Columnar families are different. They give you automatic vertical partitioning; storage is both column-based and organized by less restrictive attributes. RDB tables, in contrast, are restricted to row-based storage and deal with tuple storage in rows, accounting for all attributes before moving forward; e.g., tuple 1 attribute 1, tuple 1 attribute 2, and so on — then tuple 2 attribute 1, tuple 2 attribute 2, and so on — in that order. Columnar storage is the opposite, which is why we use the term column families.
Note: some columnar systems also offer horizontal partitioning at a default size of, say, six million rows. When it’s time to run a scan, this eliminates the need to partition during the actual query. Set up your system to sort its horizontal partitions by the most commonly used columns; this minimizes the number of extents containing the values you are looking for.
One useful option, where offered (InfiniDB is one example), is to automatically create horizontal partitions based on the most recent queries. This eliminates the impact of much older queries that are no longer crucial.
Column family database objects
Families (a database object) contain columns of related information. The object is a tuple made up of a key-value pair where the key is linked to a value, and the value is a set of columns. A family can be one attribute or a set of related attributes.
We can call the first column model an entity/attribute/value table. Inside of an entity (column), there is an attribute/value table. For customer data, the first column option might look like this:

| Entity | Attribute | Value |
| --- | --- | --- |
| 0001 | phone number 1 | 100200300 |
Compared to RDBs, attribute/value tables shine when storing the more unique attributes — ones that only some entities have.
Super columns hold the same information but formatted differently.
| Customer ID: 0001 | |
| --- | --- |
| pet peeve | pop music |
A super column family and super column merely add a row ID to the first two models so the data can be retrieved faster. Use as many super column models as you have entities, keeping them in individual NoSQL tables or compiled as a super column family.
Two main columnar family types
Even though wide-column stores are considered a part of NoSQL, columnar-type storage can also integrate columnar relational models. The two main columnar family types are key-value stores and Bigtable-style stores.
Columnar relational models: Advantages and disadvantages
Columnar relational models allow for improved compression of attributes when stored in an attribute-wise manner: all of the data in each file is of the same data type.
Let’s say you have a few dozen entries that share the same attribute. You could select all tuples through that attribute, then filter it further using an ID range (for example, only tuples with IDs 230 to 910). This compression requires less storage and — more impressively — quicker querying.
As an example, say you were looking for a collection of tuples with a value greater than x. Rather than running the search through all tuples and gathering tuples with a value over x, you simply target the value and skip over any tuples that do not qualify; as such, fewer disk blocks/bytes are checked. Generally, querying is faster if only one attribute is queried.
Each attribute is stored separately in blocks, so far more values of the queried attribute fit into each disk block read. The decision-making process is quicker. Another related advantage of the columnar relational model is faster joins.
It’s also much easier to add new columns each time you derive new attributes to add to your database. Rather than needing to rebuild enormous tables, columnar databases simply create another file for the new column.
As far as disadvantages go, updates can be inefficient. For example, say you want to update a specific tuple for multiple attributes. RDB models can do this faster. The fact that columnar families group attributes, as opposed to rows of tuples, works against it; it takes more blocks to update multiple attributes than RDBs would need in this case.
If multiple attributes are touched by a join or query, this may also lead to column storage experiencing slower performance (but other factors come into play as well). It’s also slower when deleting rows from columnar systems, as a record needs to be deleted from each of the record files.
Overall, columnar families work well for OLAP (Online Analytical Processing) but not well for OLTP (Online Transactional Processing). Let’s explore OLTP vs. OLAP scenarios in a bit more detail below.
Queries in OLTP applications
Typically, in this instance, single updates are being done on a very small part of the database, such as one or a few account tuples. Nevertheless, they will need to deal with multiple attributes, which will give RDBs an advantage in speed.
John Smith calls customer services, and you can pinpoint his information through his customer ID or phone number. While the phone number might not be unique, it will narrow down which accounts to select from. This is a transactional scenario rather than an analytical one.
So, are columnar databases preferable for OLTP systems? No — you should not attempt OLTP-type (single-row operation) transactions on columnar databases. When this process is carried out in a row-oriented system, it simply adds a new entry (row) to the end of your table (the last page).
In contrast, columnar systems need to append new values to each respective column file. The more rows your database has, the more of a performance killer this becomes. (Batch inserts are a possible fix when you do need to insert lots of data quickly.)
Queries in OLAP applications
Typically, if you’re doing queries that look for metadata insights, such as averages of all account values (sum) across the table, columnar databases can access the specific column much faster, and do aggregations and summaries much faster, than our RDB models.
Perhaps you want to know the average age of your male customers. This will typically result in a sequential scan, which is a performance killer. Let’s say you have 100 million rows of entries with 100 columns each. Either you will need to create composite indexes on the sex column or read all entries to filter for the target data, which could be gigabytes’ or terabytes’ worth of work.
Rather than reading countless rows/columns of tuples — containing tons of data — columnar systems let you narrow down the tuples that you need to investigate by scanning only the two or three columns actually relevant to your query.
Key takeaways and how to adapt your approach
Columnar databases give you improved automation with regards to vertical partitioning (filter out irrelevant columns in your queries — ideal for analytical queries), horizontal partitioning (improve efficiency by eliminating irrelevant extents), better compression, and auto-indexing of columns.
In systems similar to InfiniDB, you will be able to use standard MySQL syntax for most commands — insert, and so on. You will find some exceptions, such as the lack of Cartesian products and trigger support.
And finally, integrate your knowledge of standard SQL/MySQL with your frontend work.