Training Bayesian Belief Networks

As previously discussed Bayesian Belief Networks are being used within this project to perform high level type inference. To do this inference will be performed using the constructed graph.

However before inference can take place it is first necessary to train the Bayesian Network.

The graph being used for this initial test contains the following nodes:

LatLon, £, $, Int, Money, Float, Location, Number, €, String, PostCode,
Date, Name

It can be represented graphically as follows:


The labeling of data for training is implemented in two ways.

The first method uses the basic type inference system as previously described. For each piece of data each data type is considered and if the method returns true it can be determined that a positive observation for that datatype has occurred.

The second of method involves the hand labeling of data columns. When the caching system (as previously described) reads in the data it produces an additional text file that stores the column labels and allows for the columns to be labeled by hand. When the cached data is read in for processing at a later date the labels are loaded into to the data structure that contains the data.

When the training method is called it can then look to see if there is a label for this column and hence for every piece of data in that column mark a positive observation for that datatype.

Storing the Evidence

Currently conditional probability tables (CPTs) are used to store the training data. This has the limitation in that only features that are boolean in nature, such as "is it raining?", can be represented. Meaning continuous data, such as quantity of rainfall, can not be represented at this time.

With this said the CPTs are represented internally in the following way.

For each row in the CPT there is the following:

[Array of dependency nodes], [Array representing the boolean values of the dependencies], [Positive Observations, Total Observations]

Intuitively it can be seen that a probability can be derived by the following,

P = Positive Observations/Total Observations

And similarly the probability of it not being a particular type can be derived by the following,

¬P = 1-Positive Observations/Total Observations

The probabilities are not stored as a float internally for efficiency reasons.

The number of rows in a CPT is equal to to 2^(# of dependency nodes). To derive all the combinations of True and False values for all of these rows is simple. One simply counts from 0 to the number of rows and converts the integer number to a binary string and then converts it to an array of True and Falses.

When the training system identifies a positive observation it must be attributed to the correct row within the CPT. To do this it traverses all the nodes linked into the current node, for each it get a True or False value which represent their state in regards to this piece of data. These are assembled into an array matching the format of that within the CPT. The CPT is then traversed until a matching array is found and hence the correct row has been found. The positive observation is then attributed to this row.


Please note that 1. the training set is very limited and 2. the basic type inference is not complete so the actual values here are inaccurate but it serves to demonstrate the training system.

Int   *   Present   Observed
F     *   0     598   
T     *   0     12    

Present   Observed
0     610   

Present   Observed
0     610   

Present   Observed
12    610   

£   Number   $   €   *   Present   Observed
F     F     F     F        *   61       610      
F     F     F     T        *   0        0        
F     F     T     F        *   0        0        
F     F     T     T        *   0        0        
F     T     F     F        *   0        0        
F     T     F     T        *   0        0        
F     T     T     F        *   0        0        
F     T     T     T        *   0        0        
T     F     F     F        *   0        0        
T     F     F     T        *   0        0        
T     F     T     F        *   0        0        
T     F     T     T        *   0        0        
T     T     F     F        *   0        0        
T     T     F     T        *   0        0        
T     T     T     F        *   0        0        
T     T     T     T        *   0        0

Present   Observed
73    610   

LatLon   PostCode   *   Present    Observed
F       F          *   0          610        
F       T          *   0          0          
T       F          *   0          0          
T       T          *   0          0          

Int   Float   *   Present   Observed
F    F       *   171     537     
F    T       *   0       61      
T    F       *   0       0       
T    T       *   12      12      

Present   Observed
0     610   

Present   Observed
610   610   

String   *   Present   Observed
F        *   0        0        
T        *   0        610      

Int   *   Present   Observed
F     *   61    598   
T     *   0     12    

String   *   Present   Observed
F        *   0        0        
T        *   0        610

Leave a Reply

Your email address will not be published. Required fields are marked *