Dynamic Bayesian Combination of Multiple Imperfect Classifiers

Using the new Voxcharta.org system, I was the only physicist at my institute to upvote this paper, Dynamic Bayesian Combination of Multiple Imperfect Classifiers (pdf), which sits more in the realm of machine learning or computer science than traditional astrophysics or astronomy. As such I was nominated to discuss it at our weekly journal club. Here I give a brief review of the concepts needed to follow the paper, and then go in depth into how we can use the opinions of multiple lay people as to whether an object is a supernova or not to achieve a highly accurate classification at the expert level.

Some review to understand the paper

Markov Chains

Markov Chain 

  • sequence of random variables with the Markov Property: given the present state, future and past states are independent
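To make the Markov property concrete, here is a minimal sketch (my own illustration, not anything from the paper) of a two-state Markov chain simulated in Python; note that each draw depends only on the current state:

    import numpy as np

    # Transition matrix: row s holds P(next state | current state = s).
    P = np.array([[0.9, 0.1],   # from state 0: usually stay put
                  [0.5, 0.5]])  # from state 1: stay or leave with equal probability

    rng = np.random.default_rng(0)
    state, visits = 0, [0]
    for _ in range(10000):
        # Markov property: the next state depends only on the current
        # state, never on the rest of the chain's history.
        state = rng.choice(2, p=P[state])
        visits.append(state)

    print("fraction of time in state 1:", np.mean(visits))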

MCMC

  • Markov Chain Monte Carlo: an algorithm for sampling from a probability distribution based on constructing a Markov Chain which has the desired distribution as its equilibrium distribution
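As a bare-bones illustration of the idea (a toy Metropolis sampler targeting a standard normal; again my own sketch, not the paper's method):

    import numpy as np

    rng = np.random.default_rng(1)

    def log_target(x):
        return -0.5 * x**2  # unnormalized log density of a standard normal

    x, samples = 0.0, []
    for _ in range(50000):
        proposal = x + rng.normal()  # symmetric random-walk proposal
        # Accept with probability min(1, target(proposal) / target(x));
        # on rejection the chain stays where it is.
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal
        samples.append(x)

    # The chain's equilibrium distribution is the target distribution.
    print(np.mean(samples), np.std(samples))  # roughly 0 and 1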

Gibbs Sampling

  • MCMC algorithm for obtaining random samples from a multivariate probability distribution
  • Good for e.g. sampling posterior distribution of a Bayesian Network
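A minimal sketch of Gibbs sampling for a correlated bivariate normal (illustrative only), where each variable is redrawn from its full conditional given the other:

    import numpy as np

    rng = np.random.default_rng(2)
    rho = 0.8                   # target correlation
    sd = np.sqrt(1 - rho**2)    # conditional standard deviation

    x = y = 0.0
    xs, ys = [], []
    for _ in range(50000):
        # Full conditionals of a standard bivariate normal:
        # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
        x = rng.normal(rho * y, sd)
        y = rng.normal(rho * x, sd)
        xs.append(x)
        ys.append(y)

    print("empirical correlation:", np.corrcoef(xs, ys)[0, 1])  # close to 0.8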

Bayes

Bayes Theorem

P(A|B)P(B)=P(B|A)P(A)

or

P(A|B)=P(B|A)P(A)/P(B)

or

posterior = likelihood * prior / marginal likelihood

Bayesian interpretation of probability

  • The equations above are thought of as modeling a hypothesis and evidence which supports the hypothesis.
  • P(H|E)=P(E|H)P(H)/P(E)

Bayesian inference

  • Derivation from Bayes Theorem if H1 is one hypothesis and H2 is another:
  1. P(H1|E)=P(E|H1)P(H1)/P(E)
  2. P(H2|E)=P(E|H2)P(H2)/P(E)
  3. dividing 1./2. => P(H1|E)/P(H2|E)=P(E|H1)/P(E|H2)*P(H1)/P(H2)
  • Think of the left hand side as giving us the odds of hypothesis 1 over hypothesis 2 after the evidence is seen, and the terms on the right hand side as the corresponding probabilities prior to the evidence being seen
  • Thus Bayes' Rule helps us relate probability models before and after new evidence is observed in the Bayesian interpretation.
  • Example: If we use this model for a binary classifier that decides whether something is a funny picture of a cat or not: H1 could be “this object is a funny picture of a cat” and H2 could be “this object is NOT a funny picture of a cat”. Bayes rule lets us see new evidence and compute the new probabilities based on that evidence.
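Plugging in some invented numbers (purely illustrative): say one in five objects is a funny cat picture, and the evidence E is "whiskers detected", which is far more likely under H1 than under H2:

    # All numbers here are made up for illustration.
    p_h1 = 0.2                 # prior: funny picture of a cat
    p_h2 = 1 - p_h1            # prior: not a funny picture of a cat
    p_e_h1, p_e_h2 = 0.9, 0.1  # likelihood of the evidence under each hypothesis

    p_e = p_e_h1 * p_h1 + p_e_h2 * p_h2  # marginal likelihood P(E)
    print(p_e_h1 * p_h1 / p_e)           # posterior P(H1|E) = 0.18/0.26 ≈ 0.69

So even with a modest prior, strong evidence pushes the posterior well above one half.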

Variational Bayes Methods

  • an alternative to MCMC methods such as Gibbs sampling for approximating the posterior distribution

Misc

Receiver Operating Characteristic Curves

  • graphical plot which illustrates performance of binary classification
  • true positive rate vs. false positive rate is plotted
  • we get a curve because each point on the plot corresponds to a different decision threshold of the classifier.
  • the closer the area under the curve is to one, the better the classifier performance
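Here is a quick sketch (synthetic labels and scores, just to illustrate) of tracing out the curve by sweeping the decision threshold:

    import numpy as np

    rng = np.random.default_rng(3)
    labels = rng.integers(0, 2, size=500)             # true binary labels
    scores = labels + rng.normal(0, 0.8, size=500)    # noisy classifier scores

    # Each threshold gives one (false positive rate, true positive rate) point.
    points = []
    for threshold in np.sort(scores)[::-1]:
        predicted = scores >= threshold
        tpr = predicted[labels == 1].mean()
        fpr = predicted[labels == 0].mean()
        points.append((fpr, tpr))

    fpr, tpr = np.array(points).T
    print("area under the curve:", np.trapz(tpr, fpr))  # closer to 1 is better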

Dirichlet Prior

  • multivariate generalization of the beta distribution
  • often used as a prior distribution in Bayesian statistics
  • its shape is controlled by its concentration parameters (hyperparameters)
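A quick numpy illustration of how the concentration parameters shape the draws (each draw is a probability vector summing to one):

    import numpy as np

    rng = np.random.default_rng(4)

    print(rng.dirichlet([1, 1, 1]))     # spread widely over the 3-simplex
    print(rng.dirichlet([50, 50, 50]))  # concentrated near (1/3, 1/3, 1/3)
    print(rng.dirichlet([10, 2, 2]))    # skewed toward the first component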

The paper itself

Zooniverse

http://www.galaxyzoo.org/ is an amazing project trying to use the judgement of hundreds of thousands of citizen scientists on questions of astronomical relevance (classifying an object observed by a telescope as a particular type of galaxy, for example), questions on which humans still outperform computers. However, expert humans, professional astronomers, still do better than lay people. So the question is twofold: how to best use the judgement of the lay person to get to the scientific truth as quickly as possible, and, relatedly, how to help them more quickly approach the accuracy of a professional.

This paper addresses how to combine the judgements of multiple people for the Galaxy Zoo Supernovae project. Supernovae are violent stellar explosions, and volunteers are asked to classify an object as "definitely a supernova", "possibly a supernova", or "definitely not a supernova"; for many of the objects multiple volunteers' judgements are collected. The vast majority of objects are not in fact supernovae. For the first run of the project professional astronomers checked the work, so the paper is in the unique position of knowing the "right" answer and being able to test different methods of combining volunteer judgements to see which method is most likely to get it. The authors' model is called "independent Bayesian classifier combination" (IBCC) and is detailed in equation 1 of the paper.
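Reconstructing that equation here as best I can (this is my transcription; consult the paper's equation 1 for the authoritative notation), it is the joint distribution over the class proportions κ, the true labels t, the confusion matrices π, and the observed classifications c, where i = 1…N runs over objects, k = 1…K over classifiers, and j = 1…J over true classes:

    p(\kappa, t, \pi, c \mid \alpha, \nu)
      = p(\kappa \mid \nu)
        \prod_{i=1}^{N} \Big( \kappa_{t_i} \prod_{k=1}^{K} \pi^{(k)}_{t_i,\, c_i^{(k)}} \Big)
        \prod_{j=1}^{J} \prod_{k=1}^{K} p\big(\pi^{(k)}_{j} \mid \alpha^{(k)}_{j}\big)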

Looks a bit complicated, eh? Well, let's tease it out graphically using the terms I reviewed above.

In this model, c is the group of classifications for a given object indexed by i. The classifiers themselves are indexed by k. So by specifying a k we pick a classifier, say Alice, and by specifying an i we pick an object that that classifier classified, say SuperNovaCandidateA. t_i is the actual true classification of the object; in this case we know the answer, as we have the object classified by a professional astronomer. So far so good. We have a bit more in our model: κ are the parameters on which the true classification depends. Presumably the professional astronomer considered many factors in her or his judgement and didn't pull it out of thin air; this is an attempt to capture the fact that the correct classification depends on several factors.

Finally, there is π. We can see π carries a k index, which gives us a clue that it is dependent on who classified the object. It is in fact something called the confusion matrix, a classifier-dependent function indicating how confident we are in that classifier's classifications. If classifier Alice never mislabels other objects as supernovae but sometimes misses real ones (no false positives, several false negatives), and classifier Bob catches every supernova but sometimes says something is a supernova which isn't (a few false positives, no false negatives), this is the factor which is able to model that. The last two things in our model are the α and the ν. These come from the fact that we are assuming Dirichlet priors on the distributions of both π and κ, and α and ν are the hyperparameters of these distributions.
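To tie the symbols together, here is a toy generative version of the model in Python (my own sketch with invented sizes and hyperparameters, not the authors' code): κ is drawn from a Dirichlet with hyperparameters ν, each row of each classifier's confusion matrix π from a Dirichlet with hyperparameters α, and then the true labels t and classifications c follow:

    import numpy as np

    rng = np.random.default_rng(5)
    n_classes, n_classifiers, n_objects = 2, 3, 1000  # invented sizes

    # kappa ~ Dirichlet(nu): class proportions; most objects are not supernovae.
    nu = [10, 1]
    kappa = rng.dirichlet(nu)

    # pi[k, j] ~ Dirichlet(alpha[j]): row j of classifier k's confusion matrix,
    # i.e. the probabilities of each response when the true class is j.
    alpha = np.array([[8.0, 2.0],   # row prior for true class "not a supernova"
                      [3.0, 7.0]])  # row prior for true class "supernova"
    pi = np.array([[rng.dirichlet(alpha[j]) for j in range(n_classes)]
                   for _ in range(n_classifiers)])

    # t_i ~ kappa, and classifier k's label c[i, k] ~ pi[k, t_i].
    t = rng.choice(n_classes, size=n_objects, p=kappa)
    c = np.array([[rng.choice(n_classes, p=pi[k, t[i]])
                   for k in range(n_classifiers)]
                  for i in range(n_objects)])

    print(c.shape)  # (n_objects, n_classifiers): the data the model inverts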

OK, so now that we have a model, what do we do with it?

Variational Bayesian IBCC

Using the model we perform inference for the unknown variables t, π, and κ, plugging in the classification from each of the various classifiers for each of the various objects. In the end we are mainly interested in t, which we can use to compute the efficacy of our method of combining the various judgements of amateur classifiers by comparing to the true classification by the professional astronomer. If our model is perfect, t will always match the professional's answer! Moreover, as the model is a tad complex, we'd also want to see if the complexity is worth it; namely, do we get better results with it than with a simpler model (say, having the individual classifiers vote on whether the object is a supernova or not, or taking the average of their scores)?
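The simple baselines are easy to state in code; continuing the toy t and c from the sketch above, majority vote is a one-liner:

    import numpy as np

    # Continuing the toy data above: t are true labels, c the classifications.
    majority_vote = (c.mean(axis=1) > 0.5).astype(int)  # per-object majority
    print("majority-vote accuracy:", np.mean(majority_vote == t))
    # VB-IBCC instead infers t, pi, and kappa jointly, so each classifier's
    # vote is weighted by its inferred confusion matrix rather than equally.

These results are shown in figure 2 of the paper: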

As we can see, the ROC curve of VB-IBCC is the best of all the methods. Moreover, the authors show that it is quicker than its closest competitor, Gibbs-sampling IBCC. Great news! But we can do even more; the model has also inferred the confusion matrix π of each classifier.

Communities of Decision Makers Based on Confusion Matrices

The authors group decision makers by applying an approach they detail in a previous paper, Overlapping community detection using Bayesian non-negative matrix factorization, Psorakis et al. 2011. Its methodology is best explored by going into the details of that original paper, so I'll just present the results here: figure 3 of the paper shows prototypical confusion matrices for each of the 5 communities inferred with this methodology, the confusion matrices themselves coming out of the results of the current paper.

Here we can see, for example, that the members of group 1 are clear when an object is not a supernova, but are less certain in their classification when it is. Group 2 are extremists: mostly they classify objects as "definitely not" or "definitely is" a supernova, and when they are uncertain the object tends not to be a supernova after all. Definitely neat to see such easy-to-describe groups come out of a big chunk of math.

Conclusions

Pretty cool: we have a method of combining information from multiple decision makers with varying degrees of expertise that is, of the tested methods, the most likely to give us the correct classification of an object. Moreover, for free we get information about what type of classifier our Alice or our Bob is, meaning that information could potentially be used to guide classifiers into improving their accuracy even further. The rest of the paper explores these themes, developing models for how confusion matrices change over time (which involves tweaking the model above, since it assumes a static confusion matrix for each classifier), and then exploring these changes for the Galaxy Zoo Supernovae classification communities. Next up for the authors is exploring ways to select users for a task based on the results of this work, either for training purposes (to improve the performance of the classifiers themselves) or, when a correct classification is especially important, to use the confusion matrix information to select a high-quality classifier.
