Downloading and Importing the Tate Dataset

The Tate is probably most famous for the Tate Modern in London, but has other galleries too. They recently made available a dataset of all their artworks and artists.

I decided to investigate. I discovered that the dataset is stored on GitHub and is available in CSV and JSON formats. The JSON version is the most comprehensive.

This was perfect for my teaching requirements. This coming week I am covering the concept of the document-oriented data model. JSON is ideally suited to this and can be loaded into most document-oriented databases, such as the one I will be using by way of example, MongoDB.

I encountered a slight issue, however. The dataset is organised into subfolders, with each artist and artwork in its own individual JSON file. MongoDB’s import command only works with single files, so I had to devise a way to recursively drill down through the folder structure downloaded from GitHub.

I achieved this by searching for a Windows shell command that would recursively list the files in a folder (a Linux solution would have been easier to find). I then prepended the mongoimport command to each file name, so that each line imports one document into a collection in a new database. I did this twice, to create an artworks collection and an artists collection.

If using Windows, open the command prompt, change into the artworks folder and issue the following command (this assumes MongoDB is installed in C:\mongodb and that you want to use the database tate):

(for /r %i in (*) do @echo C:\mongodb\bin\mongoimport --collection artworks --file %i --db tate --jsonArray) > out.txt

[Note: the command also lists out.txt itself, so afterwards you need to delete the line that refers to out.txt from out.txt]
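For anyone who would rather not use the Windows shell, the same idea can be sketched in Python. This is only a rough equivalent of the one-liner above, not a tested replacement: the function name build_import_commands is made up for illustration, and the mongoimport path and flags are simply copied from the command above. Because it only picks up files ending in .json, the output file never ends up in its own listing.

```python
from pathlib import Path


def build_import_commands(root, collection, db="tate",
                          mongoimport=r"C:\mongodb\bin\mongoimport"):
    """Return one mongoimport command line per .json file found under root.

    build_import_commands is a hypothetical helper; the default mongoimport
    path and the flags mirror the Windows command shown in the post.
    """
    commands = []
    # rglob walks root and all of its subfolders recursively.
    for path in sorted(Path(root).rglob("*.json")):
        # Quote the file path in case a folder name contains spaces.
        commands.append(
            f'{mongoimport} --collection {collection} '
            f'--file "{path}" --db {db} --jsonArray'
        )
    return commands
```

Calling build_import_commands("artworks", "artworks") and writing the returned lines to a text file should give roughly the same result as out.txt, minus the stray out.txt entry.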

Repeat for the artists collection: change into the artists folder in the folder structure downloaded from GitHub, and change the collection name in the command to artists.

The easiest thing to do then is to copy the commands from out.txt, paste them into the command prompt and watch it add thousands of documents. That is difficult given the size of the artworks file. Otherwise, develop it into a batch script. Linux is the better platform for that.

Is there an easier way to do this?
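One possible approach is to merge all the individual JSON files into a single file first, so that mongoimport only needs to be run once per collection. The sketch below does this in Python and makes a few assumptions: that each file holds exactly one JSON document, that the files are UTF-8 encoded, and that the output name (artworks.ndjson here) is free to choose. The function name merge_json_files is hypothetical.

```python
import json
from pathlib import Path


def merge_json_files(root, out_file):
    """Write every .json document under root to out_file, one per line.

    Assumes each input file contains a single JSON document. Give out_file
    an extension other than .json so it is not picked up as an input.
    Returns the number of documents written.
    """
    count = 0
    with open(out_file, "w", encoding="utf-8") as out:
        # rglob walks root and all of its subfolders recursively.
        for path in sorted(Path(root).rglob("*.json")):
            with open(path, encoding="utf-8") as f:
                doc = json.load(f)
            # json.dumps emits no raw newlines, so each document stays on one line.
            out.write(json.dumps(doc) + "\n")
            count += 1
    return count
```

After running merge_json_files("artworks", "artworks.ndjson"), a single command along the lines of mongoimport --db tate --collection artworks --file artworks.ndjson should load the whole collection. The --jsonArray flag is left out here because the merged file has one document per line rather than one large array.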

I’ll post again on the type of analysis I perform on the data.
