Published at 06.11.2015
Welcome back to database land. Today we will examine another document-based database. The first question is: what is the difference between Mongo and Couch? Well, a lot. Even though both store data the same way, as JSON structures, they are more like cousins than siblings. CouchDB is sometimes also called the database made of the internet. We will soon see what that means.
In terms of the CAP theorem, Couch is an AP database (available and partition tolerant), but what does that mean?
In comparison to Mongo, Couch prefers availability over consistency. The database leans further into “eventual consistency” than Mongo does. It works a bit like Git: when you update a document, Couch appends the changed document to the database as a new revision. This way you can go back some versions if something went wrong, at least until compaction removes the old revisions. This is called Multi-Version Concurrency Control (MVCC) and it’s a core feature of the database. It is like going back three commits in Git to find out what you did last week.
Let us look at how Couch works under the hood. Someone produces an update or a new document in your database. To avoid read locks, Couch appends the new data to the end of the database file instead of locking the document. If one of your applications is reading from the database while another is writing, the reader sees the version of the document that was already there while the writer appends its changes.
As soon as the new version is appended, readers see the latest consistent state again. If such stale reads are not a problem for your application, Couch can handle it. Because of this design, the database also avoids inconsistencies after a crash: it just looks at the end of the file to find the last completely written document. The database is so thoroughly designed for handling crashes that it has no graceful exit. It just crashes.
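To make the append-only idea more concrete, here is a minimal sketch in Python. It is a toy model, not CouchDB's real file format: the `AppendOnlyStore` class, the `recover` function and the one-JSON-record-per-entry layout are assumptions made up for illustration.

```python
import json


class AppendOnlyStore:
    """Toy append-only document log; not CouchDB's real file format."""

    def __init__(self):
        self.log = []  # each entry is one complete record as JSON text

    def write(self, doc_id, body):
        # Never overwrite: every update becomes a new record at the end.
        rev = 1 + sum(1 for line in self.log if json.loads(line)["id"] == doc_id)
        self.log.append(json.dumps({"id": doc_id, "rev": rev, "body": body}))
        return rev

    def read(self, doc_id, upto=None):
        # A reader that started earlier only looks at the prefix it saw.
        latest = None
        for line in self.log[:upto]:
            record = json.loads(line)
            if record["id"] == doc_id:
                latest = record
        return latest


def recover(raw_records):
    """Keep everything up to the last record that parses completely."""
    good = []
    for line in raw_records:
        try:
            json.loads(line)
        except ValueError:
            break  # a half-written record marks the crash point
        good.append(line)
    return good


store = AppendOnlyStore()
store.write("car", {"color": "green"})
snapshot = len(store.log)                      # a reader starts here
store.write("car", {"color": "blue"})
old_view = store.read("car", upto=snapshot)    # the reader still sees green
new_view = store.read("car")                   # later reads see blue

# Simulate a crash in the middle of a third write.
crashed_file = store.log + ['{"id": "car", "rev": 3, "bo']
recovered = recover(crashed_file)              # the broken tail is dropped
```

The reader never waits for the writer, and after the simulated crash the two complete records survive while the half-written one is ignored.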
Probably the coolest feature of CouchDB is that you can bring the database to the user. Normally, database replication is a one-way street: a primary node syncs to the secondaries. CouchDB can replicate in both directions at the same time.
So, how close can you bring a database to the user? How about onto their machine? Even into a web browser. You can sync a database everywhere. Your users can use the database while they are offline and then sync all changes once the network is available again. To save space, you can define a subset of the data that should be replicated.
Sure, your users won’t see database changes from others while offline, but at least work is possible. Even in the era of cheap broadband internet, users can end up offline pretty fast: construction workers cutting a cable, a restarting router, a crashing server. If your users need the database to work and can’t reach it, they have nothing to do until the network is restored. So prepare yourself for such scenarios by distributing the database in a communist fashion. Everyone gets the parts of the database that they need. Since the database supports bidirectional replication, everyone has the same data eventually.
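The offline-then-sync idea can be sketched in a few lines of Python. This is a simulation only: `Replica`, `replicate`, `sync` and the `keep` filter are hypothetical names for illustration and not part of any CouchDB client.

```python
class Replica:
    """Toy replica: maps (doc_id, rev) to a document body."""

    def __init__(self, name):
        self.name = name
        self.revisions = {}

    def put(self, doc_id, rev, body):
        self.revisions[(doc_id, rev)] = body


def replicate(source, target, keep=lambda doc_id: True):
    """Copy revisions the target lacks; `keep` models a replication filter."""
    copied = 0
    for (doc_id, rev), body in source.revisions.items():
        if keep(doc_id) and (doc_id, rev) not in target.revisions:
            target.put(doc_id, rev, body)
            copied += 1
    return copied


def sync(a, b):
    # Bidirectional replication is simply one replication per direction.
    replicate(a, b)
    replicate(b, a)


server = Replica("server")
laptop = Replica("laptop")
server.put("order-1", "1-a", {"item": "sofa"})
server.put("invoice-1", "1-c", {"total": 300})
laptop.put("order-2", "1-b", {"item": "couch"})   # written while offline

sync(server, laptop)                              # network is back

# A phone with little storage only takes the orders with it.
phone = Replica("phone")
copied_to_phone = replicate(server, phone, keep=lambda doc_id: doc_id.startswith("order-"))
```

After `sync`, server and laptop hold the same revisions, while the phone only carries the two order documents.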
You have probably thought about the problems that arise when you apply the Communist Manifesto to your database. How do you get all of the copies to be the same? Well, that is in your power. In case of conflicts, the database picks a winning document, which is the one the views will see.
But the losing documents will still be there. You can merge the winner and the losers either manually or with an automated agent. If you have multiple databases that replicate with each other, all of them will show the same document as the winner. So you don’t get awkward moments where different users see different winners.
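Why do all replicas agree on the winner without talking to each other? Because the choice only depends on the conflicting revisions themselves. The sketch below is a simplified model of such a deterministic rule (live revisions beat deleted ones, then the longer history wins, then the revision hash breaks the tie). `rev_key` and `pick_winner` are made-up names, and the rule is my reading of the idea, not a copy of CouchDB's implementation.

```python
def rev_key(rev, deleted=False):
    """Sort key: live beats deleted, then longer history, then rev hash."""
    generation, _, digest = rev.partition("-")
    return (not deleted, int(generation), digest)


def pick_winner(revisions):
    """revisions: list of (rev, deleted) leaf revisions of one document."""
    ordered = sorted(revisions, key=lambda r: rev_key(*r), reverse=True)
    winner = ordered[0][0]
    losers = [rev for rev, _ in ordered[1:]]
    return winner, losers


# Two replicas see the same conflict, but in a different order.
on_server = pick_winner([("2-abc", False), ("2-def", False)])
on_laptop = pick_winner([("2-def", False), ("2-abc", False)])

# A deleted revision loses even if its history is longer.
with_delete = pick_winner([("3-aaa", True), ("2-zzz", False)])
```

Both replicas end up with `2-def` as the winner and keep `2-abc` around as the loser, so nobody has to negotiate.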
Most databases have a binary driver that handles communication with the database. That is a weird thing, since sometimes the drivers are not able to use the full feature set of the database. Take an Oracle database as an example. You can define PL/SQL functions in the database that take a boolean as a parameter. The driver allows you to call functions, except it won’t let you call this one, since SQL itself knows no booleans; only PL/SQL does.
So the driver does not know them either. Now you have to convert the boolean into an integer, send it over, convert it back again and call the function through a wrapper. Not very neat, is it?
Couch uses a REST API over HTTP, so you can access the whole feature set of CouchDB with clients like curl. You have the usual CRUD operations of REST available. You can insert data by sending a JSON document to the database, and the database answers with JSON. You can see whether your request was successful by simply looking at the status line of the response, because the database sends standard HTTP status codes back to you.
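Since everything is plain HTTP, any HTTP client will do. Here is a small sketch using only Python's standard library. The address `http://localhost:5984`, the database name `articles` and the document id `couchdb-intro` are assumptions for illustration; a real installation will usually also ask for credentials, which are left out here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5984"  # assumption: local CouchDB on its default port


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request against the assumed BASE_URL."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def send(request):
    """Send the request; return the HTTP status code and the decoded JSON body."""
    with urllib.request.urlopen(request) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


# Hypothetical database and document names.
create_db = build_request("PUT", "/articles")
create_doc = build_request("PUT", "/articles/couchdb-intro", {"title": "Time to relax"})
read_doc = build_request("GET", "/articles/couchdb-intro")
```

With a server running, `send(create_db)` followed by `send(create_doc)` would do the actual work. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx answers, and that updating an existing document requires sending its current `_rev` along.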
On the downside, the lack of strong consistency can be a deal breaker for Couch. You would not want an eventually consistent bank account. Another point to keep in mind is that writes are only ACID compliant at the document level, not when updating multiple documents. If you don’t pay attention to how much you replicate to your users’ local databases, you can get angry mails asking why the application is so huge.
The marketing is generally a bit lacking, which leads to a lack of case studies about the database. That is more of a failing on Apache’s side than a shortcoming of the database. The last point is merging, which, depending on your data structure, can be far from trivial. Alice changes the color to blue, Bob changes it to red: how do you merge color in that case?
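The Alice and Bob problem can be shown with a field-level three-way merge. This is a minimal sketch of one possible automated agent, not something CouchDB ships; `merge_fields` and the example documents are made up for illustration.

```python
def merge_fields(base, ours, theirs):
    """Field-level three-way merge; returns (merged, conflicting_fields).

    Limitation of this sketch: None stands for "field is missing".
    """
    merged, conflicts = {}, []
    for field in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(field), ours.get(field), theirs.get(field)
        if o == t:
            value = o                # both sides agree
        elif o == b:
            value = t                # only their side changed
        elif t == b:
            value = o                # only our side changed
        else:
            conflicts.append(field)  # both changed it differently
            value = b                # keep the base value as a placeholder
        if value is not None:
            merged[field] = value
    return merged, conflicts


base = {"color": "green", "size": "M"}
alice = {"color": "blue", "size": "M"}
bob = {"color": "red", "size": "L"}

merged, conflicts = merge_fields(base, alice, bob)
```

Bob's new size merges cleanly, but `color` ends up in `conflicts`. No algorithm can know whether blue or red is right, so a human or a business rule has to decide.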
• The HTTP API makes it easy to debug, since HTTP is pretty simple and well understood.
• MVCC makes locks unnecessary.
• Bidirectional Replication
• Partial replication: never take more data with you than necessary.
• Pretty much every language can read and produce JSON.
• The database isolates failures and prevents them from cascading into certain doom.
• When you start it, you get “Time to Relax” as an indicator that the database has started.