In the last post on this blog various python libraries were listed as candidates for use in the data tagging project. After some further research 'pymc' has been selected.
The library is hosted on github at - https://github.com/pymc-devs/pymc
It is a feature rich library and looks to be a good fit for the project. The last couple of weeks have involved becoming acquainted with the library and attempting to implement the functionality which had existed with the custom Bayesian network code. Whilst throwing away progress is always painful the investment is looking to be a good one.
The following is a whistle stop tour which includes creating a Bayesian Network with PYMC and then querying data from it.
Once again the example network and data being used is the 'sprinkler' example taken from the wikipedia page.
Shown below is just a reminder of the layout of the network and the CPTs associated with the nodes.
T F ********* 0.2 0.8 A * T F ******************* F * 0.4 0.6 T * 0.1 0.99 B A * T F *********************** F F * 0.0 1.0 F T * 0.8 0.2 T F * 0.9 0.1 T T * 0.99 0.01
Setting up the network
To create the internal representation of the network is relatively simple.
Defining a simple node
To define a simple node with no conditions (such as node A) is simple.
nodename = pymc.Bernoulli('nodename', 0.2)
Defining a conditional node
When there is conditions (such as node b) it is still quite simple. This time one must first setup the conditional probability.
probNodeName = mc.Lambda('probNodeName', lambda conditionalNode = conditionalNode: pylab.where(conditionalNode, 0.01, 0.4))
The next step is to create the node using the conditional probability defined above. Here it is being modeled using the Bernoulli distribution, however there is support for a range of distributions.
nodename = mc.Bernoulli('nodename', probNodeName)
P( A | C)
This is one of the most basic questions to answer with a Bayesian network. With the network setup it is simple to answer this question.
First we must set C as observed. To do this the following parameters are added during the node setup.
value=[1.0], observed=True
The network is then sampled. In this example the Markov chain Monte Carlo sampling technique is used. This is down with the following code were model is the model that contains the network.
model.sample(60000)
When this has been completed the answer to the above question can be simply read out of the graph.
m.nodename.stats()['mean']
Which gives an answer of 0.32658333333333334 or ~= 32.65% which is not 100% accurate but would improve with further sampling.