In previous posts I mentioned importing the Tate metadata into two databases – Neo4j and MongoDB. Why am I doing this?
Martin Fowler popularised the term polyglot persistence, meaning choosing the right database model and engine for each particular data requirement, even within a single application. Take for example my Tate-to-Neo4j import program (written in Java). I needed a lookup store to cross-reference the graph node ids that had been generated when inserting artists, movements, subjects, etc., so that they could be reused later in the import. I chose Redis, a lightning-fast key-value store. One application, two different data requirements, two databases – key-value and graph.
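The shape of that lookup store is easy to sketch. During the import, each inserted entity's source identifier is mapped to the node id Neo4j generated for it, so later passes can wire up relationships to the right nodes. In the real program Redis holds this map (via a Java client); in the sketch below a plain HashMap stands in for Redis, and the key format and class name are my own invention.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

// Sketch of the import-time cross-reference store. A source identifier
// (e.g. "artist:1234" for a Tate artist) maps to the node id that Neo4j
// generated when the entity was inserted. In the real importer Redis
// plays this role; an in-memory HashMap stands in here.
public class NodeIdLookup {
    private final Map<String, Long> nodeIdsByKey = new HashMap<>();

    // Called once per inserted entity, straight after node creation.
    public void record(String sourceKey, long nodeId) {
        nodeIdsByKey.put(sourceKey, nodeId);
    }

    // Called in later passes when creating relationships between nodes.
    public OptionalLong find(String sourceKey) {
        Long id = nodeIdsByKey.get(sourceKey);
        return id == null ? OptionalLong.empty() : OptionalLong.of(id);
    }
}
```

With Redis the two methods would simply become SET and GET calls against the same keys.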
With regard to the social web app, it is entirely conceivable that I could handle all of the data storage in the Neo4j graph database. However, I don’t see that as a practical architecture – for performance, it is best to run Neo4j in embedded mode, which makes the database unavailable to other applications, so trying to use Neo4j as a general-purpose database isn’t ideal. It is ideal, of course, for the highly connected data: friend-of-a-friend queries, recommendations derived from user “appreciations”, personal galleries, similarity between artworks and artists using shortest paths, and so on. Each node would contain just a label and an identifier, with no additional properties (yes, Neo4j is a property-graph database, but that doesn’t mean I have to store lots of properties). Relationships, on the other hand, will often have properties – for example, a user has been a friend of another user since a given date and time, or a user appreciated an artwork, rating it four stars. But for the mundane work, like retrieving a complete artist or artwork metadata record, something else is likely to be better suited.
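That modelling decision – lean nodes, descriptive relationships – can be illustrated with a couple of plain Java records. All names, ids and property values here are invented for the example; in the application itself the nodes and relationships would of course be created through Neo4j's API.

```java
import java.util.Map;

// Illustration of the modelling described above: a node carries only a
// label and an external identifier; the descriptive detail lives on the
// relationship. All identifiers and values here are hypothetical.
public class GraphModelSketch {
    record Node(String label, String externalId) {}
    record Relationship(Node from, String type, Node to, Map<String, Object> properties) {}

    public static void main(String[] args) {
        Node user = new Node("User", "user-42");
        Node artwork = new Node("Artwork", "artwork-7");

        // Two lean nodes, with the interesting data on the relationship:
        Relationship appreciated = new Relationship(
                user, "APPRECIATED", artwork,
                Map.of("rating", 4, "at", "2014-06-01T10:15:00Z"));

        System.out.println(appreciated.from().label() + " -" + appreciated.type()
                + "-> " + appreciated.to().label() + " " + appreciated.properties());
    }
}
```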
In my previous post, I updated the instructions for importing Tate metadata into MongoDB on Linux, an update of an older post for Windows. For the aggregated data of the Tate JSON documents, MongoDB is ideal – the import is a direct one. Simple retrievals, for example to display a record on a web page, are efficient: everything comes back in one fast get operation, with no joins to worry about. For the supplementary data, such as other recommended artworks, a service based on the Neo4j database can be used.
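The point about aggregate retrieval is worth showing. A Tate artwork document arrives with its related data already embedded, so the store hands the whole thing back from a single lookup; there is nothing to join. In the sketch below an in-memory map stands in for MongoDB, and the record fields are simplified stand-ins for the real Tate JSON structure.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of aggregate-oriented retrieval: the complete artwork record,
// with contributors and subjects embedded, comes back from one get.
// An in-memory map stands in for MongoDB; the fields are simplified.
public class ArtworkStore {
    record Artwork(String acno, String title,
                   List<String> contributors, List<String> subjects) {}

    private final Map<String, Artwork> byAccessionNumber = new HashMap<>();

    public void save(Artwork artwork) {
        byAccessionNumber.put(artwork.acno(), artwork);
    }

    // One lookup returns the whole aggregate, no joins required.
    public Optional<Artwork> findByAcno(String acno) {
        return Optional.ofNullable(byAccessionNumber.get(acno));
    }
}
```

A relational design would instead reassemble the same record from artwork, contributor and subject tables on every request.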
And the Tate themselves will use their own software and databases to manage their data – that’s not the focus of my project, which imports data from other sources.
This approach allows me to wrap the Neo4j-based social / recommendations application in a number of web services, e.g. exposed via REST. That the Neo4j database is embedded and inaccessible from other applications is then not such a concern – I just have to code the exposed services using, for example, Spring MVC.
What if I want direct access to the graph database for data analytics or visualisation? Shutting down the web services application and taking a copy of the database for that purpose would be the way to go. The copy would be fast (a few hundred MB at most) and the web services application could be restarted with little downtime.
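The copy step itself is just a recursive directory copy of the store, done while the embedding application is stopped so the files are consistent. A sketch with java.nio, where the store path is an example only:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Sketch of copying a shut-down Neo4j store directory to a working
// location for offline analytics. Run only while the embedding
// application is stopped; the target directory must not already exist.
public class StoreCopy {
    public static void copyStore(Path source, Path target) throws IOException {
        // Files.walk yields parents before children, so directories are
        // created before the files they contain are copied.
        try (Stream<Path> paths = Files.walk(source)) {
            for (Path p : (Iterable<Path>) paths::iterator) {
                Files.copy(p, target.resolve(source.relativize(p)));
            }
        }
    }
}
```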
I know many developers in the digital humanities will be familiar and comfortable with the relational data model. And that’s fine – an RDBMS will get the job done, but it does lead to difficulties when developing applications, such as the impedance mismatch between relations and objects. This is something NoSQL databases like MongoDB and Neo4j address, allowing for more rapid development by taking advantage of direct data-to-object mapping.