On this page you'll find information relating to my 4th Year Project, an automated dataset tagging system for datasets contained within a CKAN instance.
Progress for this project can be followed via its Blog Entries.
Current Working Title
Creating a hypothesised data tagging system using analysis of datasets through Bayesian belief networks
The University of Bristol is developing a petascale research data repository as part of its data.bris initiative. Research datasets published through data.bris will be accessible through a web portal running the Open Knowledge Foundation's CKAN software.
In the CKAN software, datasets are stored with associated tags that describe them. Currently the uploader of a dataset is required to select these tags when performing the upload. Whilst this works in theory, it has several shortcomings. The most obvious is that it requires the uploader to understand the dataset. It also limits the description of the dataset to the tags the uploader can think of at the time of uploading, so relevant tags are likely to be missed.
This project will use machine learning techniques to create a set of hypothesised tags to describe each dataset. By considering all possible tags with respect to all the datasets in the catalogue, the system can discover the underlying relationships between them. This will improve the discoverability of related datasets, a problem which appears to be largely unsolved in all but the most highly curated instances of CKAN.
The Long Form Motivation for This Project, as Taken From the Initial Specification
The open source CKAN software is a data portal platform that serves as a catalog and repository for data sets. It has been adopted by various institutions including HM Government in the form of "data.gov.uk".
The aim of these initiatives is to open up the data produced by an institution to the public. One of the core problems with this is discovering the underlying relationships between data sets once there is a significant number of them. The current approach revolves around the use of searchable titles, metadata and tags. Whilst this works in certain circumstances, it does not allow for links between data sets that were not thought of by the person uploading the data into the system.
The aim of my project is to create a system that uses probabilistic modelling to generate tags for each data set, based on the type of data it contains. Because the tags are drawn from a canonical, hierarchical set, data sets that receive the same or related tags can be linked, which identifies the underlying relationships between them.
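To make the idea of probabilistically generated tags concrete, here is a minimal sketch in Python. It is not the project's actual model: it assumes a very simple two-layer Bayesian network (each tag independently influences a few observable features of the data), and every tag name, feature name and probability below is a hypothetical value chosen purely for illustration.

```python
# Hypothetical per-tag model (illustrative values only, not from the project).
# For each tag: a prior P(tag), and for each feature a pair
# (P(feature present | tag), P(feature present | not tag)).
MODEL = {
    "geography": {
        "prior": 0.2,
        "features": {"has_latitude": (0.8, 0.05), "has_postcode": (0.6, 0.1)},
    },
    "finance": {
        "prior": 0.3,
        "features": {"has_currency": (0.9, 0.1)},
    },
}


def tag_posterior(tag, observed):
    """Return P(tag | observed features) using Bayes' rule.

    Features are assumed conditionally independent given the tag, which is
    a simplification of a full Bayesian belief network.
    """
    spec = MODEL[tag]
    p_tag = spec["prior"]
    p_not = 1.0 - spec["prior"]
    for feature, (given_tag, given_not) in spec["features"].items():
        if feature in observed:
            p_tag *= given_tag
            p_not *= given_not
        else:
            # An absent feature is also evidence.
            p_tag *= 1.0 - given_tag
            p_not *= 1.0 - given_not
    return p_tag / (p_tag + p_not)


def hypothesised_tags(observed, threshold=0.5):
    """Return the tags whose posterior probability meets the threshold."""
    return sorted(t for t in MODEL if tag_posterior(t, observed) >= threshold)


if __name__ == "__main__":
    print(hypothesised_tags({"has_latitude"}))
```

Treating each tag as its own yes/no question, rather than picking a single best tag, allows a data set to receive several hypothesised tags at once. The threshold is likewise an assumption and would need tuning against real data.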
The hierarchical nature of the tags will allow linking up and down the hierarchy so that subtypes can be recognised, further increasing the discovery of related data sets. For example, both a latitude and longitude pair and an address represent a geographical location.
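The geographical example above can be sketched as a small tag hierarchy in Python. This is only an illustration under assumed names: the tag identifiers and the child-to-parent dictionary are hypothetical, and the "postcode" entry is an extra assumption added to show more than one level.

```python
# Hypothetical canonical hierarchy, stored as child tag -> parent tag.
PARENT = {
    "latitude_longitude": "geographical_location",
    "address": "geographical_location",
    "postcode": "address",  # assumed extra level, for illustration only
}


def ancestors(tag):
    """Return the chain of parent tags, nearest first."""
    chain = []
    while tag in PARENT:
        tag = PARENT[tag]
        chain.append(tag)
    return chain


def shared_ancestor(tag_a, tag_b):
    """Return the nearest tag that both inputs equal or descend from.

    Returns None when the two tags have nothing in common.
    """
    lineage_b = {tag_b, *ancestors(tag_b)}
    for candidate in [tag_a, *ancestors(tag_a)]:
        if candidate in lineage_b:
            return candidate
    return None


if __name__ == "__main__":
    print(shared_ancestor("latitude_longitude", "address"))
```

With a structure like this, two data sets tagged with different subtypes can still be linked whenever shared_ancestor finds a common tag further up the hierarchy.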
Once the tagging system has been implemented and the underlying relationships between the data sets have been exposed, a range of opportunities opens up within the existing system. The most obvious of these is improved discoverability of data sets, which is largely unsolved in the current system. Social networking is also encouraged in a variety of ways. Firstly, identifying related data sets also exposes related researchers and data providers. Secondly, users of the system will be able to provide feedback on the accuracy of tags, which will in turn be fed back into the system. In addition, the social graph associated with the data contributors and consumers will be explored as a possible source of further signals for the system. Another benefit is that the system will reduce both the time taken to add data and the level of understanding of the data required of the person entering it, as they will no longer need to come up with the tags themselves.