Published at 25.09.2015
Time to take a look at a database implemented in Java. Yes, Java! Contrary to popular belief, it is possible to produce a fast Java application. Let’s get specific! You have probably heard about graphs. If not, check out the Wikipedia entry on graphs. Entities are represented by nodes, relations by edges. This allows you to quickly access the relations between entities. For example, friends of friends are easy to find: you look at all the entities connected to you via a “friend” edge, and then do the same again for each of those friends. The result is a collection of friends of friends, without using a single join table. Neo4j sits on the CA side of the CAP theorem, together with your usual RDBMS.
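The friends-of-friends lookup described above can be sketched in plain Java. This is only an illustration of the two-hop idea, not how the database implements it: the class name `FriendsOfFriends`, the adjacency map and the sample names are all made up for this example. Leaving out the start person and their direct friends is a design choice of the sketch.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of any database API.
class FriendsOfFriends {

    // Each person maps to the people reachable over one "friend" edge.
    static Set<String> friendsOfFriends(Map<String, List<String>> friends, String start) {
        Set<String> result = new LinkedHashSet<>();
        List<String> direct = friends.getOrDefault(start, List.of());
        for (String friend : direct) {
            // Second hop: follow the "friend" edges of every direct friend.
            for (String candidate : friends.getOrDefault(friend, List.of())) {
                // Skip the start person and people who are already direct friends.
                if (!candidate.equals(start) && !direct.contains(candidate)) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> friends = Map.of(
                "alice", List.of("bob", "carol"),
                "bob", List.of("alice", "dave"),
                "carol", List.of("alice", "dave", "erin"),
                "dave", List.of("bob", "carol"),
                "erin", List.of("carol"));
        System.out.println(friendsOfFriends(friends, "alice")); // prints [dave, erin]
    }
}
```

No join table appears anywhere: each hop is a direct lookup of the edges attached to one node.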
A property graph probably reminds you of an object model or an entity-relationship diagram. Properties are key:value pairs that can be attached to nodes or relations. You can also label nodes or attach metadata like indexes to them. Relationships never point to nothing: a node cannot disappear and leave its relations dangling, so the attached relations are deleted along with it. You therefore never have to check that there really is something on the other side of a relation. Relations have a direction, but are traversable from both sides. Two nodes can have multiple relations pointing to each other without a performance hit. This is used often, because nodes can have a multitude of different relationships to each other. For example, you can have a friend who is also your boss. A coworker can be your friend and your football teammate at the same time.
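To make the property graph model more concrete, here is a minimal in-memory sketch in Java. It is not Neo4j's storage model or API. The names `PropertyGraph`, `Node`, `Relationship`, `relate` and `deleteNode` are assumptions made for this example. It shows labeled nodes with properties, directed relationships that can be read from both ends, several relationships between the same two nodes, and a delete that never leaves a relationship dangling.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model, for illustration only.
class PropertyGraph {

    static class Node {
        final String label;
        final Map<String, Object> properties = new HashMap<>();
        Node(String label) { this.label = label; }
    }

    static class Relationship {
        final Node start;
        final String type;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();
        Relationship(Node start, String type, Node end) {
            this.start = start;
            this.type = type;
            this.end = end;
        }
    }

    final List<Node> nodes = new ArrayList<>();
    final List<Relationship> relationships = new ArrayList<>();

    Node addNode(String label) {
        Node node = new Node(label);
        nodes.add(node);
        return node;
    }

    // A relationship has a direction: it goes from start to end.
    Relationship relate(Node start, String type, Node end) {
        Relationship relationship = new Relationship(start, type, end);
        relationships.add(relationship);
        return relationship;
    }

    // Relationships are visible from both ends, whatever their direction.
    List<Relationship> relationshipsOf(Node node) {
        List<Relationship> result = new ArrayList<>();
        for (Relationship r : relationships) {
            if (r.start == node || r.end == node) {
                result.add(r);
            }
        }
        return result;
    }

    // Removing a node removes its relationships too, so nothing points to nothing.
    void deleteNode(Node node) {
        relationships.removeIf(r -> r.start == node || r.end == node);
        nodes.remove(node);
    }

    public static void main(String[] args) {
        PropertyGraph graph = new PropertyGraph();
        Node you = graph.addNode("Person");
        Node ann = graph.addNode("Person");
        ann.properties.put("name", "Ann");
        graph.relate(you, "FRIEND", ann);
        graph.relate(ann, "BOSS_OF", you); // a second relationship between the same two nodes
        System.out.println(graph.relationshipsOf(you).size()); // prints 2
        graph.deleteNode(ann);
        System.out.println(graph.relationships.size()); // prints 0
    }
}
```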
To walk your graph, you need a starting node, like a user. You can then find related entities by traversing edges. Edge types are named, so you can define a “has” edge type and a “wants” edge type. This way you don’t need to filter out unwanted related entities. Best practice is to add a new edge type whenever you add new functionality. If you need two edge types in a query, you can query both at once. Because of the graph design, aggregations over the whole dataset are very slow, since the database has to look at each element. It is better to cache such values. Examples where you would probably want caching are a list of people with more than 1 million followers, or the most popular videos.
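A traversal that starts at one node and follows only named edge types might look like the following Java sketch. The class `TypedTraversal`, its methods and the sample data are hypothetical. The type check in `follow` stands in for what the database does when a query names one or more edge types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of traversing typed edges from a starting node.
class TypedTraversal {

    static class Edge {
        final String type;
        final String target;
        Edge(String type, String target) {
            this.type = type;
            this.target = target;
        }
    }

    // Edges are stored per node, so a traversal only touches the start node's edges.
    final Map<String, List<Edge>> outgoing = new HashMap<>();

    void connect(String from, String type, String to) {
        outgoing.computeIfAbsent(from, key -> new ArrayList<>()).add(new Edge(type, to));
    }

    // Follow every edge of the given types that leaves the start node.
    List<String> follow(String start, Set<String> types) {
        List<String> targets = new ArrayList<>();
        for (Edge edge : outgoing.getOrDefault(start, List.of())) {
            if (types.contains(edge.type)) {
                targets.add(edge.target);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        TypedTraversal graph = new TypedTraversal();
        graph.connect("user", "HAS", "book");
        graph.connect("user", "WANTS", "bike");
        graph.connect("user", "FRIEND", "bob");
        System.out.println(graph.follow("user", Set.of("HAS")));          // prints [book]
        System.out.println(graph.follow("user", Set.of("HAS", "WANTS"))); // prints [book, bike]
    }
}
```

Note that nothing in this sketch looks at nodes other than the starting one, which is also why a whole-dataset aggregation gets no help from this structure.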
The query language of Neo4j is called Cypher. It is a bit reminiscent of SQL and is made for querying relations. Cypher uses ASCII art to make queries more visual and easier to understand. To match a node connected over the relation “LIKE” you would write (a)-[:LIKE]->(b). There is also a native Java API available, if you don’t like ASCII art. It allows you to specify the maximum number of node levels a query should traverse, and it also lets you find shortest paths inside the graph.
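To illustrate what a depth limit and a shortest-path search mean, here is a plain-Java breadth-first search. It does not use the Neo4j API. The class `BoundedShortestPath`, the `maxDepth` parameter and the sample graph are assumptions made for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: breadth-first search with a maximum traversal depth.
class BoundedShortestPath {

    // Returns the shortest path from 'from' to 'to' using at most maxDepth edges,
    // or an empty list if no such path exists.
    static List<String> shortestPath(Map<String, List<String>> graph,
                                     String from, String to, int maxDepth) {
        Map<String, String> parent = new HashMap<>();
        Map<String, Integer> depth = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        parent.put(from, null);
        depth.put(from, 0);
        queue.add(from);

        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(to)) {
                // Walk the parent links back to the start to rebuild the path.
                List<String> path = new ArrayList<>();
                for (String node = to; node != null; node = parent.get(node)) {
                    path.add(node);
                }
                Collections.reverse(path);
                return path;
            }
            if (depth.get(current) == maxDepth) {
                continue; // depth limit reached, do not expand further
            }
            for (String next : graph.getOrDefault(current, List.of())) {
                if (!depth.containsKey(next)) {
                    depth.put(next, depth.get(current) + 1);
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c"),
                "c", List.of("d"));
        System.out.println(shortestPath(graph, "a", "d", 3)); // prints [a, c, d]
        System.out.println(shortestPath(graph, "a", "d", 1)); // prints []
    }
}
```

Breadth-first search visits nodes level by level, so the first time it reaches the target it has found a path with the fewest edges.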
Internally, all records are mapped to Java objects. Relations and properties are each stored in lists. You should use nodes for categorical data, like employers or companies. If you store the employer as a plain property instead, a name change forces you either to go over all the data, or to change your code so that it recognizes the old name and replaces it whenever you access a user with that employer. Using a node also makes your life easier when you want to find all employees who work at a given company, and it lets you attach properties to that data. Neo4j offers transaction management on par with an RDBMS, so you can entrust data that needs transaction protection to the database without a second thought.
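The advice to model categorical data as nodes can be sketched in Java as follows. The classes `CompanyAsNode`, `Company` and `Employee` are hypothetical and only mimic the idea: every employee points at one shared company object instead of carrying a copy of the company name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the employer is its own node, not a string copied into every employee.
class CompanyAsNode {

    static class Company {
        String name;
        // Plays the role of the relationships pointing at the company node.
        final List<Employee> employees = new ArrayList<>();
        Company(String name) { this.name = name; }
    }

    static class Employee {
        final String name;
        final Company employer;
        Employee(String name, Company employer) {
            this.name = name;
            this.employer = employer;
            employer.employees.add(this);
        }
    }

    public static void main(String[] args) {
        Company company = new Company("Acme");
        Employee ann = new Employee("Ann", company);
        Employee bob = new Employee("Bob", company);

        // One update is enough, no employee record has to be touched.
        company.name = "Acme International";

        System.out.println(ann.employer.name);        // prints Acme International
        System.out.println(bob.employer.name);        // prints Acme International
        System.out.println(company.employees.size()); // prints 2
    }
}
```

Finding everyone who works at the company is a walk over the links attached to the company, and the company itself has room for its own properties.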
Technically, a graph database fits pretty much everywhere you think about denormalizing to improve performance, or where you have interconnected data. Joins create an overhead on an RDBMS that graph databases avoid by simply traversing edges. A graph database helps you suggest to users people they might know or products they probably want, because you can easily analyze the network around the user. The ability to simply add new relation types helps when you want to add functionality. Neo4j also has optional support for schemas. Most queries can run in real time, so your application can make decisions on the fly based on your current data.
Well, for starters, a graph database is a poor fit if your application has very few relations between objects, log entries for example. Since aggregation over the whole dataset is neither optimized nor a designed use case, you don’t want it for a collection that you just search without a starting point. Look at Elasticsearch if you want to search large data sets efficiently. Just remember: if your use case does not include many relations between your entities, then a graph database might not be the right thing for you.