Building a social cultural heritage website using Spring Data Neo4j

In previous posts, I have discussed importing the Tate’s metadata into a Neo4j database using Neo4j’s BatchInserter. I have been able to run some interesting Cypher queries to visualise the connectedness of some of the Tate metadata, including shortest paths between artists to illustrate what they have in common. But all of that is foundational stuff. The bulk of my research (for my PhD in particular) is concerned with building a social media platform for cultural heritage objects. The Tate metadata can be viewed as an exemplar of an open data set, and its high quality makes it an excellent seed for my project. I may come to treat the Tate’s approach to metadata as my standard, or I may come up with a compromise standard spanning several museums and galleries and map the Tate’s to mine, bearing in mind existing metadata standards such as Dublin Core.

In the last couple of weeks, though, I have turned to the development of this new social media platform, notwithstanding the fact that I still have some work to do on the import, such as a little bit of data cleaning. So this post is a discussion of the technological side of the new platform.

I have used the Spring Framework for a number of years. While I used it a little in a job I had a good 8 or 9 years ago, I first became somewhat expert with it when I started to teach it to higher diploma students about 2.5 years ago. The pace of development by the contributors to the Spring open source project is quite staggering. In the past 2.5 years, the introduction of the likes of Spring Boot has massively simplified the configuration of projects. It is much easier now to get straight to the implementation of business logic, which is where the real value lies.

Having said all that, I have not been immune to the frustrations involved in getting to grips with a new enterprise-grade product. As far as programming goes, this is pretty high-level stuff. I have been getting to grips with things like Spring Data repositories, Neo4j object-graph mapping, and getting Spring Boot to work to the extent that I have eliminated all XML configuration from the project. I think it’s fair to say that I am developing this product the right way from the ground up. If others were to join the project, I’m confident that they would be able to find their way around it easily enough. I’m also confident that it will scale.

Let’s talk a little bit about the social aspects of this new platform. It will include the following:

  • Appreciating artworks and artists (much like Facebook likes, only I use the term appreciate)
  • Building personal, shareable galleries
  • Crowdsourcing of image tags
  • Gamification elements, such as getting points for contributing image tags or having others favourite your gallery, badges for physically visiting museums and objects
  • Mobile device support using NFC – view metadata while in the museum / gallery, earn additional points and badges for being physically on site and swiping the device near an NFC tag
  • Following and / or friending other users
  • Recommendations of artworks and artists based on your viewing habits

Using a graph database (Neo4j) and Spring’s support for object-graph mapping (OGM) makes implementing the above list much easier than it would be using, for example, a relational database. While an RDBMS can be used, the value of being able to link two objects in a graph database using 1 or 2 lines of code cannot be overstated.

Here is a code snippet from my UserServiceImpl class:

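    public void addFriend(String friendLogin, final User me) {
        // Look up the prospective friend; this may be null if no user has that login.
        User friend = repo.findByLogin(friendLogin);
        // Ignore unknown logins, and never let a user befriend themselves.
        if (friend != null && !me.equals(friend)) {
            me.befriend(friend, true);
            repo.save(me); // saving the user also persists the new relationship
        }
    }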

And in my User entity class:

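    // Bidirectional FRIEND_OF relationship between users, fetched eagerly with the user.
    @RelatedTo(type = FRIEND_OF, direction = Direction.BOTH)
    @Fetch
    private Set<User> friends;

    // Adds the friendship when flag is true and removes it when flag is false.
    public void befriend(User friend, boolean flag) {
        if (flag) {
            this.friends.add(friend);
        } else {
            this.friends.remove(friend);
        }
    }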

When I retrieve a user, it will eagerly fetch the set of friends for that user (note the bidirectional relationship, FRIEND_OF, from user to user). The user service class retrieves the user object for the new friend and, as long as the two users are not the same (you cannot friend yourself, of course), calls the befriend method in the User entity, which switches the friendship on or off according to the flag passed in. The call to repo.save (i.e. asking the user repository object to persist the user) handles all of the connecting in the background.

Here’s what I have coded to date:

  • Authentication and authorisation are now set up
    • a user can log in as a regular user or an admin user
    • certain resources are hidden from regular users, e.g. a list of all users
    • a single Java configuration class is used to handle all authorisation across the entire system (for the techies out there, it is of course using AOP in the background to handle the authentication and authorisation cross-cutting concerns)
    • users and their login details are stored within the graph database
  • Artists can be listed and pagination is in place to go to the next or previous page
  • An artist record can be viewed
  • A user can “appreciate” an artist – i.e. much like liking a post in Facebook
  • Artworks can be listed and pagination is in place to go to the next or previous page
  • An artwork record can be viewed
  • A user can appreciate an artwork
  • A user can view the record of another user
  • A user can “befriend” another user (right now, this is more like a Twitter follow because it is an instant befriending by another user, so some work on this needs to be done to introduce an approval on the other side of the friend request)
  • Some of the unit test classes required for test-driven development (TDD), using JUnit and Mockito. Unit testing of Neo4j repositories requires a bit of test data seeding, so it requires a substantial upfront investment
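
The approval step mentioned in the befriending item above could be sketched roughly as follows. This is plain, in-memory Java with no Spring or Neo4j involved, and it is only an illustration of the two-step flow, not code from the project: the Member class and the names requestFriendship, approve, decline and hasPendingRequestFrom are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory sketch of a two-step friend request.
// None of these names come from the actual project code.
final class Member {
    private final String login;
    private final Set<Member> friends = new HashSet<>();
    // Requests that other members have sent to this member and that await approval.
    private final Set<Member> pendingRequests = new HashSet<>();

    Member(String login) {
        this.login = login;
    }

    String getLogin() {
        return login;
    }

    // Step 1: the requester asks. Nothing is linked yet.
    boolean requestFriendship(Member other) {
        if (other == null || other == this || friends.contains(other)) {
            return false;
        }
        return other.pendingRequests.add(this);
    }

    // Step 2: the recipient approves, which creates the link in both directions.
    boolean approve(Member requester) {
        if (requester == null || !pendingRequests.remove(requester)) {
            return false; // there was no such request to approve
        }
        friends.add(requester);
        requester.friends.add(this);
        // Clear a crossing request in the other direction, if one exists.
        requester.pendingRequests.remove(this);
        return true;
    }

    // The recipient can also decline, which simply discards the request.
    boolean decline(Member requester) {
        return pendingRequests.remove(requester);
    }

    boolean isFriendOf(Member other) {
        return friends.contains(other);
    }

    boolean hasPendingRequestFrom(Member other) {
        return pendingRequests.contains(other);
    }

    public static void main(String[] args) {
        Member alice = new Member("alice");
        Member bob = new Member("bob");

        alice.requestFriendship(bob);
        System.out.println("friends before approval: " + alice.isFriendOf(bob)); // false

        bob.approve(alice);
        System.out.println("friends after approval: " + alice.isFriendOf(bob)); // true
    }
}
```

In the graph itself, a pending request would presumably become a relationship type of its own, separate from FRIEND_OF, but that is a design decision I have not made yet.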

As yet, the website has no styling and I am not displaying full artist or artwork records. In fact, I am not even displaying artwork images just yet. I am concentrating right now on the fabric of the system.

My next port of call is personal galleries, represented by the nodes and edges below:

[Image: screenshot of the personal gallery nodes and edges, taken 2015-02-08 15:18:56]

Beyond that, the next major piece of development may well be a recommendation engine.

Downloading and Importing the Tate Dataset

The Tate is probably most famous for the Tate Modern in London, but has other galleries too. They recently made available a dataset of all their artworks and artists.

I decided to investigate. I discovered that the dataset is stored on GitHub and is available in CSV and JSON formats. The JSON version is the most comprehensive.

This was perfect for my teaching requirements. This coming week I am covering the concept of the document-oriented data model. JSON is ideally suited to this and can be loaded into most document-oriented databases, such as the one I will be using by way of example, MongoDB.

I encountered a slight issue, however. The dataset is organised into subfolders, with each artist and artwork in its own individual JSON file. MongoDB’s import command only works with single files. So I had to devise a way to recursively drill down through the folder structure downloaded from GitHub.

I achieved this by searching for a Windows shell command (would have been easier to find a Linux solution) that would recursively list the files in a folder. I then had to prepend the mongoimport command to import the documents into a collection in a new database. I did this to create an artworks collection and an artists collection.

If you are using Windows, open the command prompt, change directory to the artworks folder and issue the following command (this assumes that MongoDB is installed in C:\mongodb and that you want to use the database tate):

(for /r %i in (*) do @echo C:\mongodb\bin\mongoimport --collection artworks --file %i --db tate --jsonArray) > out.txt

[Note: because out.txt is created in the same folder, the loop lists out.txt itself as well. Afterwards, you need to delete that line from out.txt.]

Repeat for the artists collection, changing the collection name to artists and running the command from the artists folder in the collection downloaded from GitHub.

The easiest thing to do then is to copy the commands from out.txt, paste them into the command prompt and watch it add thousands of documents, although that is difficult with the size of the artworks file. Otherwise, develop it into a batch script. Linux is the better platform for that.
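
One platform-independent alternative to the shell loop is to generate the same command lines from a few lines of Java. What follows is a rough sketch rather than something I ran against the full dataset: the class name MongoImportCommands and its build method are my own invention, it assumes that every file to import ends in .json, and it simply mirrors the mongoimport flags from the command above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: prints one mongoimport command per JSON file under a folder.
final class MongoImportCommands {

    // Walks root recursively and builds a command line for each .json file found.
    // Assumption: all files to import have a .json extension.
    static List<String> build(Path root, String mongoimport, String db, String collection)
            throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".json"))
                    .sorted()
                    // Same flags as the shell version; the file path is quoted in case it contains spaces.
                    .map(p -> String.format("%s --db %s --collection %s --file \"%s\" --jsonArray",
                            mongoimport, db, collection, p.toAbsolutePath()))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // First argument: the folder to scan (defaults to the current folder).
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String command : build(root, "mongoimport", "tate", "artworks")) {
            System.out.println(command);
        }
    }
}
```

Because only .json files are listed, redirecting the output to out.txt in the same folder no longer produces a stray line for out.txt itself. The printed lines can be saved as a batch or shell script in the same way as before.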

Is there an easier way to do this?

I’ll post again on the type of analysis I perform on the data.