Anarchy in the UK – Understanding database technologies and why NoSQL became an alternative to relational databases
Disclaimer: This article is written for non-IT people who are interested in this topic And it may contain generalization for better understanding.
During the 70’s, when most people had disco fever and did loads of cocaine, some very influential computer geeks sat in their basement and thought of systematizing data storage on computers. They developed relational databases. While this most likely was no fun, it was quite influential for the following decades and certainly earned them much money (which they could have used on feverish cocaine orgies, but that is not our topic here). To understand what nerds are talking about today with stuff like SQL, NoSQL and ACID compliancy, we first need to explain relational databases.
The short version: A relational database splits up your data into several different tables, which are connected via shared variables. But why would you split up your data? Because it makes our database faster and more flexible! It does so, because it minimizes redundancy. Data is redundant if it is not once, but several times saved in your database. That is bad, because if you were ever to update your data, you would have to do that at multiple points in your database. Also in the 70’s, digital storage space was VERY expensive and redundancy therefore a cost factor. And that is the reason why relational databases split up the data, which is called normalization. But let us make an example here to get a feeling for relational databases.
Let us understand relational databases for Pablo Escobars hypothetical database of smugglers into the US. Pablo has 3 tables in his database: 1 contains the names of his smugglers and their random ID-number. The second table contains their ID-number and the dates of their trips to the US. The third table contains the ID-number and the amount of cocaine each smuggler can transport per trip. All these tables share a relation: They use the ID-number. Pablo can now compute with his database, how much cocaine Fabio (one of his pilots) can get into the US in 1976. In order to compute this, our program would now search the database for the ID-number of Fabio, then check how many trips are planned in 1976 for his ID and multiply this by the amount of cocaine his ID can ship per trip.
If you don’t like to calculate cocaine or want more examples, take a look here!
To do all of this with data, we need a special-purpose programming language alias query language. The most common programming language to code relational databases and request data from them is SQL. Therefore requesting data and adding data to our database is called a SQL query.
So far, so good. But now let’s dive into that shiny new NoSQL stuff
Relational databases are pretty neat, but of course they are not perfect. And because relational databases are not the best choice for every possible situation, new ideas for storing data emerged. These ideas are grouped under the term NoSQL. NoSQL can be faster than relational databases and is definitely more flexible. In the following part, I will explain these two aspects in more detail. As you read through, the pros and cons of NoSQL and relational databases will become evident to you. As you can imagine, this will get quite technical. If you are not interested in that, just skip to the part about NewSQL. Just keep in mind that neither relational databases nor NoSQL are just “better”. It depends on the intended purpose your priorities.
Relational databases are very strict in their structure. As their structure is very important to their concept, they only allow certain data to be saved. For example, a table in a relational database which is designated to storing dates will never store anything else like texts or even just numbers. This is what is commonly referred to as data integrity. NoSQL is more forgiving and allows less strict data to be stored. This makes NoSQL very flexible. If it is not crystal clear from the start of a database project which kind of data shall be stored in the database, a relational database can be quite the hustle as it simply ignores unfit data and doesn’t save them. But, as you can imagine, data integrity is not that great in NoSQL databases. Imagine if you have a form at the doctors office. It requires you to write down since when you have symptom X. Some people will write comments on the side of the paper to include additional information which doesn’t fit the Symptom X. These comments can be quite useful. A relational database would just ignore all the comments, since they do not fit the answer fields (the data model). A NoSQL database can store these comments, but that will make the data more difficult to process.
NoSQL databases can be much faster than relational databases. This speed advantage can develop on the level of a single operation and the system as a whole.
First let’s discuss speed on the level of a single operation. SQL is a powerful language as it is able to make complex queries, which require to go through multiple tables and get data points from all of them to integrate into an answer. This is called a join (because you join information from different tables). Joins are very useful, but require quite much computing. NoSQL technology does not use joins, because it is not relational in its structure. All data points are stored in every place, which is called denormalization (in contrast to normalization in relational databases). That means data is redundant and needs more space, but needs less computation and is therefore faster available.
Now let’s take a look at the speed of the whole system. The key point here is that you can’t upgrade the speed of a system based on a relational databases easily. That means relational databases are not that scalable. You can scale your server up by buying better hardware (vertical scaling), but this will only get you so far - although that is pretty far, more on this topic below. If your data is Big-Data-huge, even the best server will not be able to do all the computing in time. You will need several servers in a row to process all the data (horizontal scaling). But why can’t you distribute work in a relational database between different servers? The reason is the cryptic acronym ACID. Atomicity, Consistency, Isolation and Durability are key aspects of modern relational databases.
- - Atomicity means that every operation that is done in the database has to be successful or fail as a whole. Think of it like this: If you fill out your form at the doctor’s office and you give the nurse your incomplete sheet, she will return it to you with the request to fill it out completely, otherwise she won’t put you in the system. That’s basically Atomicity.
- - Consistency is what we talked about in the data integrity part.
- - Isolation means that several parallel computation processes will not influence each other. This is important if you have a read data and update data parallel. At the moment the data read process is started it will return the original data point, even if the update of the data is started and finished in between.
- - Durability means that all processes are kept track of in a central protocol. Therefore in the case of a hardware failure, every operation can be retrieved even if it was not finished.
Because of the fact that Atomicity and Durability are not easy to keep track off if you split up your database and its processing between different servers, the horizontal scalability of relational databases is not easy. NoSQL databases don’t have these restrictions and therefore can scale better horizontally.
Trying the best of both worlds – NewSQL
Of course there were ambitions to unify the pros of both relational databases and NoSQL. So obviously, ACID compliancy is part of NewSQL, as well as good scalability. The problem here is that this system is expensive and that the databases can’t be that big, otherwise the system will have to trade off either its data consistency across the database (not all data is always up-to-date) or the availability (not everything is available in an instant). Also, the NewSQL systems inherit the lacking flexibility of classic relational database systems. To conclude, NewSQL is a promising development, but not suited to be the go-to-alternative for every situation.
Ok, but where do we stand now?
To talk about the best practices for today's databases, we actually have to look at a different trend, aside from database concepts: In-memory technology. You can read about this hardware trend in another article, but in short it allows data to be stored in the much faster RAM rather than classic HDDs. This allows even computation-heavy relational databases to be quite fast. SAP HANA offers this solution. HANA is accessed easily via SQL and therefore offers a common language and experience basis for database specialists. As the current limit of processable data is around 75 terabytes, this allows vertical scaling to still be an effective measure if only relevant business data is included in the system and less “noise” data.
The main point of this article was to show you that all database systems have their advantages and that in the position to choose, you should always think about your requirements. No technology suits every purpose here. And a last remark on this topicr: Most professional software which works with Big Data does use several databases of different kinds.
Back to the overview on all Big Data topics
Thanks to 8icons for the icons!