In the previous post, about importing the Tate's open metadata into a Neo4j graph database, I mentioned that I was using Python. I quickly discovered a major performance problem: to use Python, I had to run the Neo4j database as a server and access it over REST. This introduced a huge overhead, and the import crawled along at a handful of artworks per minute. With nearly 70,000 artworks, it might have taken days for the import to run. Rather than spend too much time investigating whether this approach could be optimised to the point of being usable, I decided to go with the obvious choice of development language for Neo4j: Java.
Neo4j is a Java-based database. One doesn't always associate the words Java and performance in the same sentence; isn't that the realm of C and C++ (Redis, for example, is written in C)? But in Neo4j's case, Java doesn't seem to hold it back: no other method of access to Neo4j is as fast as using its Java libraries directly. However, while Neo4j is lightning fast at retrieving nodes (graph traversal), there is a big overhead when writing data. In a relational database, the links between tables are logical, expressed through foreign keys, so writing data is a relatively inexpensive operation. In Neo4j, the links between nodes are physical, requiring their own data structures on disk, which generally results in higher write latency. On top of that, Neo4j is transactional, which adds yet more overhead when writing. For a batch process handling around 100,000 nodes and up to a million or more relationships, this is a non-runner.
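To put the write overhead in context, this is what a single write looks like through the standard embedded API (a minimal sketch against the Neo4j 2.x embedded API; the store path and property value are purely illustrative):

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class TransactionalWriteSketch {
    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("target/tate.db");
        // Every write has to happen inside a transaction like this one --
        // fine for an application, but a lot of per-write machinery when
        // you have a million relationships to create.
        try (Transaction tx = db.beginTx()) {
            Node artist = db.createNode(DynamicLabel.label("Artist"));
            artist.setProperty("name", "Joseph Mallord William Turner");
            tx.success(); // mark the transaction to be committed on close
        }
        db.shutdown();
    }
}

It is exactly this transactional machinery that a bulk import wants to avoid.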
The solution is Neo4j's BatchInserter. This effectively bypasses Neo4j's database management system, with all its transaction handling, and writes directly to the store files on disk. It is orders of magnitude faster for writes. However, because I was creating relationships between nodes, I needed a way to store the physical node ids for later lookup. The basic algorithm of my Tate batch importer is as follows:
For each artist read from its JSON file:
    Add the artist node
    For each movement within the artist:
        If the movement node was not already added, add the new movement node
        Link the movement node to the artist
    End For
    If the artist's place of birth was not already added, add the new place node
    Link the place of birth node to the artist
End For

For each artwork read from its JSON file:
    Add the artwork node
    Connect the artwork to the artist
    If the catalogue group was not already added, add the new catalogue group
    Link the artwork to the catalogue group
    For each movement within the artwork:
        If the movement node was not already added, add the new movement node
        Link the artwork to the movement
    End For
    For each subject within the artwork (having traversed from top level to second level to third level):
        If the subject node was not already added, add the new subject node (add as a person node if a "named individual")
        Link the artwork to the subject
    End For
    If the classification was not already added, add the new classification node
    Link the artwork to the classification node
    Parse the medium into its elements based on separators (commas, spaces, "and", "on")
    For each medium element:
        If the medium node was not already added, add the new medium node
        Link the artwork to the medium
    End For
End For
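The "add" and "link" steps above map almost directly onto BatchInserter calls. Here is a minimal sketch of the pattern, again assuming the Neo4j 2.x API (the store path, labels, and property values are illustrative rather than lifted from the importer):

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BatchImportSketch {

    private static final Label ARTIST = DynamicLabel.label("Artist");
    private static final Label MOVEMENT = DynamicLabel.label("Movement");
    private static final RelationshipType PART_OF =
            DynamicRelationshipType.withName("PART_OF_MOVEMENT");

    public static void main(String[] args) {
        // Writes straight to the store files on disk; the database itself
        // must not be running while the BatchInserter is open.
        BatchInserter inserter = BatchInserters.inserter("target/tate.db");
        try {
            Map<String, Object> props = new HashMap<String, Object>();
            props.put("name", "Joseph Mallord William Turner");
            long artistNode = inserter.createNode(props, ARTIST);

            props = new HashMap<String, Object>();
            props.put("name", "Romanticism");
            long movementNode = inserter.createNode(props, MOVEMENT);

            // createNode() returns the physical node id -- this is what
            // gets stored for the "was it already added?" lookups.
            inserter.createRelationship(artistNode, movementNode, PART_OF, null);
        } finally {
            inserter.shutdown(); // flushes and closes the store files
        }
    }
}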
The main issue with the algorithm is the frequent lookups to see if a node was already added; there are currently close to a million of these. Because I am using Neo4j's BatchInserter, there is no facility available to search for a node. However, what the BatchInserter does provide is a physical node id for each added node. The solution is to store the ids from the JSON file (each artist, artwork, movement, and subject has an id; others, such as medium and classification, just use the name as a key) along with the physical node ids generated by the BatchInserter in some kind of lookup store. The store I chose was Redis, a very fast key-value store. I built the latest version from source code and ran it on the local host. It's a very simple storage mechanism: I just prefix each of my keys with something like "artist:" or "movement:" and store the key and the value in one giant lookup table. For example, using redis-cli to issue a command to look up the physical node for the artist with the id of 1234:
127.0.0.1:6379> get "artist:1234"
"4235"
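Inside the importer, the same lookups and inserts happen from Java. Here is a sketch of the idea using the Jedis client (the wrapper class and its method names are hypothetical, not the importer's actual code):

import redis.clients.jedis.Jedis;

public class NodeIdLookup {

    private final Jedis jedis = new Jedis("localhost"); // default port 6379

    // Remember the physical node id the BatchInserter returned.
    public void put(String prefix, String key, long nodeId) {
        jedis.set(prefix + ":" + key, String.valueOf(nodeId));
    }

    // Return the physical node id, or null if the node was never added.
    public Long get(String prefix, String key) {
        String value = jedis.get(prefix + ":" + key);
        return value == null ? null : Long.valueOf(value);
    }
}

So when a node is created, the importer does something like lookup.put("artist", "1234", nodeId), and each "was it already added?" check is a lookup.get("artist", "1234") that either returns the node id to link to or signals that the node still needs to be created.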
While it is possible to run my importer on a Windows-based machine (e.g. using Vagrant to host the Redis service within a headless Ubuntu instance), the best option is a Unix-based machine. For example, using the Vagrant solution on Windows [Core i3 laptop with 12GB RAM and an SSD], the process took about 23 minutes, which is more than tolerable. [Update: I installed Microsoft's Windows-compiled version of Redis, which reduced the overall time to about 8 minutes.] However, the same process took about 3 minutes on a similarly specced Linux Mint PC [Core i3 with 8GB RAM]. [Note: on a 7-year-old Linux laptop with 4GB RAM it took 10 minutes.]
Three minutes to read through about 73,000 JSON files, map them to Java objects (using Jackson), and create 93,631 nodes and 719,766 relationships (edges) in Neo4j, using Redis to store 24,766 key-value pairs and serve almost 1.2 million key lookups. Doesn't sound like too many bottlenecks there!
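For completeness, the JSON-to-object step is plain Jackson data binding, roughly like this (the Artist class below is a cut-down, hypothetical stand-in for the importer's real mapping classes, and the field names are placeholders):

import java.io.File;
import java.io.IOException;

import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ArtistReader {

    // Hypothetical mapping class; the Tate JSON has far more fields.
    public static class Artist {
        public String id;
        public String name;
    }

    private final ObjectMapper mapper = new ObjectMapper()
            // ignore the many JSON fields we don't map
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

    public Artist read(File jsonFile) throws IOException {
        return mapper.readValue(jsonFile, Artist.class);
    }
}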
I am making my importer publicly available on GitHub: https://github.com/dlarkinc/tate2neo4j (still very unstable at this point)
If you have improvements you can suggest, please feel free to get in touch.