The end goal of this project is to hypothesise tags for each dataset stored within a CKAN instance. As part of this it is necessary to look at the data within every dataset. To do that, one must first understand what kind of data is being looked at, that is, the primitive type of each piece of data.
Primitive Type Inference System
Anyone who is familiar with basic programming understands the basic concept of a type. "String", "Float" and "Integer" are all well-known examples. To aid the later processing of the data it is useful to provide as much information as possible, so additional types have been added to these well-known datatypes. Currently these are "Currency" and "Date", but more will be added to the system as time goes on.
Hierarchical Data Types
At this point the hierarchical nature of datatypes can already be seen. For example:
Date -> String
Currency -> Float -> String or Currency -> Integer -> String or Currency -> String
(Where the leftmost datatype is the highest level.)
The next iteration of this system will represent these hierarchies; as it stands at the moment, it does not. They do, however, pose a challenge to the inference system. If we were to first consider whether a value is a string, the answer would always be true: anything that we can load in from a file can be represented as a string, as that is, after all, the type the value is read from the file as.
After doing some cursory reading in this area it was decided to attempt a very simplistic prototype to better understand the problem.
To do this a range of methods were created, each able to state whether a value is of a particular type. Each of these has been implemented in a different way, ranging from regular expressions (dates) to simply attempting a cast and seeing if it succeeds (integers).
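As a rough illustration of the two approaches just mentioned, here is a minimal sketch of a regex-based decider and a cast-based decider. The function names and the ISO-style date pattern are assumptions made for this example and are not taken from the project's code.

```python
import re

# Assumption: only ISO-style dates such as 2014-03-27 are recognised here.
# The pattern checks the shape of the string, not whether the date exists.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_date(value: str) -> bool:
    """Regex-based decider: True if the whole value looks like YYYY-MM-DD."""
    return DATE_PATTERN.fullmatch(value.strip()) is not None


def is_integer(value: str) -> bool:
    """Cast-based decider: attempt the conversion and see if it succeeds."""
    try:
        int(value)
        return True
    except ValueError:
        return False
```

The cast-based approach leans on the language's own parsing rules, which keeps the decider short but also means it accepts whatever `int()` accepts, such as surrounding whitespace.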
By taking into consideration the hierarchical nature of types discussed above, the system can then simply test a value against each of the decider methods in order, from the highest level type to the lowest, and return the value's type as soon as one of the methods returns True.
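The ordered test described above could be sketched roughly as follows. Everything here is illustrative: the decider names, the currency and date patterns, and the choice to test Integer before Float (since every integer string also casts to a float) are assumptions for this example, not the project's actual implementation.

```python
import re

# Hypothetical patterns, chosen only for this sketch.
CURRENCY_PATTERN = re.compile(r"[£$€]\d+(\.\d{2})?")
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_currency(value: str) -> bool:
    return CURRENCY_PATTERN.fullmatch(value) is not None


def is_date(value: str) -> bool:
    return DATE_PATTERN.fullmatch(value) is not None


def _casts(value: str, caster) -> bool:
    """Return True if caster(value) succeeds without a ValueError."""
    try:
        caster(value)
        return True
    except ValueError:
        return False


def is_integer(value: str) -> bool:
    return _casts(value, int)


def is_float(value: str) -> bool:
    # Note: float() also accepts strings such as "nan" and "inf".
    return _casts(value, float)


def is_string(value: str) -> bool:
    # Anything read from a file can be represented as a string.
    return True


# Ordered from the highest level type to the lowest; String is the catch-all.
DECIDERS = [
    ("Currency", is_currency),
    ("Date", is_date),
    ("Integer", is_integer),
    ("Float", is_float),
    ("String", is_string),
]


def infer_type(value: str) -> str:
    """Return the name of the first type whose decider accepts the value."""
    for name, decider in DECIDERS:
        if decider(value):
            return name
    return "String"  # Not reached while is_string is the last decider.
```

Keeping the order in a single list makes the dependency on ordering explicit: moving "String" to the front would cause every value to be reported as a string.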
- Additional type information can be used to help later processing.
- Type data can be represented in a hierarchical fashion.
- A simplistic type inference implementation.