• Clinical Informatics
My group is dedicated to developing machine learning algorithms that leverage the large-scale medical data stored in Electronic Health Records. The ultimate goal is to extract clinically useful medical knowledge toward better diagnosis and prevention of diseases and conditions. Our approach brings to bear new methods for deriving accurate, multi-dimensional models from large collections of observational data. Our research team brings together machine learning and natural language processing experts, database architects and programmers from the Center for Computational Learning Systems (CCLS) at Columbia University, along with clinicians from Columbia University Medical Center (CUMC). Currently, our research focuses on two health outcomes: preterm birth and infantile colic. For more information, please see: http://www1.ccls.columbia.edu/~ansaf/CING/
• Crowd Sourcing
Crowd labeling emerged from the need to label large-scale and complex data, a tedious, expensive and time-consuming task. Each object to label is generally annotated by multiple crowd labelers, and the collected labels are combined to infer one final estimated label. One open problem is the quality and integration of the different labels, especially when the labelers participating in the task are of unknown expertise. To address this challenge, we propose a new framework that automatically combines and boosts bulk crowd labels with a limited number of ground truth labels from experts. We show through extensive experiments that, unlike other state-of-the-art approaches, our method remains robust at estimating true labels even in the presence of a large proportion of not-so-good labelers in the crowd.
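As a much-simplified illustration of the general idea (not our actual framework), the sketch below estimates each labeler's reliability from a handful of expert-labeled items and then takes a reliability-weighted vote; all names and data are hypothetical:

```python
from collections import defaultdict

def labeler_weights(crowd, expert):
    """Estimate each labeler's reliability from items with expert labels.

    crowd:  {item: {labeler: label}}
    expert: {item: true_label} for a small gold subset.
    """
    hits, seen = defaultdict(int), defaultdict(int)
    for item, truth in expert.items():
        for labeler, label in crowd.get(item, {}).items():
            seen[labeler] += 1
            hits[labeler] += (label == truth)
    # Laplace-smoothed accuracy; unseen labelers later default to 0.5
    return {l: (hits[l] + 1) / (seen[l] + 2) for l in seen}

def aggregate(crowd, weights):
    """Weighted vote: each labeler's vote counts by estimated reliability."""
    result = {}
    for item, votes in crowd.items():
        score = defaultdict(float)
        for labeler, label in votes.items():
            score[label] += weights.get(labeler, 0.5)
        result[item] = max(score, key=score.get)
    return result
```

Even this toy version shows how a few expert labels can down-weight unreliable labelers and flip a naive majority vote.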
Ranking problems arise in a wide range of real world applications where an ordering on a set of
examples is preferred to a classification model. These applications include collaborative filtering,
information retrieval and ranking components of a system by susceptibility to failure.
• Ranking by Susceptibility to Failure
In the context of a project funded by the Consolidated Edison Company of New York, I led a project on ranking feeders and network protectors by susceptibility to failure. We describe an application of machine learning techniques to the problem of predicting which network protector switch is the cause of an Alive on Back-Feed (ABF) event in the New York City power distribution system. When an electrical feeder is shut down, all network protector switches connected to the feeder should open to isolate the feeder. When a switch malfunctions and does not open, electrical current flows into the feeder, which remains energized. This causes the feeder to be alive on back-feed current, and maintenance cannot proceed. Our goal was to provide a ranking of network protector switches according to their susceptibility to such malfunction. Such a ranking can help prioritize which switches to repair when an ABF event occurs. We compare three methods for computing a ranking: a Support Vector Machines (SVM) classification approach, a maximum entropy density estimation approach, and an SVM-ranking approach. In an online setting, we present a project to rank the underground primary feeders of New York City's electrical grid according to their susceptibility to outages. We describe our framework and our online application of machine learning ranking methods, using scores from Support Vector Machines (SVM), RankBoost and Martingale Boosting.
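To give a flavor of learning-to-rank from preference pairs, here is a minimal pairwise linear ranker in the spirit of SVM-ranking (a hedged sketch with toy feature vectors, not the Con Edison system or its data):

```python
def train_ranker(pairs, dim, lr=0.1, epochs=200):
    """Linear pairwise ranker (a simplified stand-in for SVM-rank):
    for each preference pair (hi, lo), nudge the weights until
    score(hi) exceeds score(lo) by a margin."""
    w = [0.0] * dim
    for _ in range(epochs):
        for hi, lo in pairs:
            margin = sum(wj * (h - l) for wj, h, l in zip(w, hi, lo))
            if margin <= 1.0:               # hinge-style margin violation
                for j in range(dim):
                    w[j] += lr * (hi[j] - lo[j])
    return w

def rank(items, w):
    """Order items by the learned linear score, most susceptible first."""
    score = lambda x: sum(wj * xj for wj, xj in zip(w, x))
    return sorted(items, key=score, reverse=True)
```

A preference pair here would encode "switch A failed before switch B", and the learned score induces the final ranking.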
For many ranking applications, however, we would like to understand not only which items are
top-ranked, but also why they are top-ranked. Unfortunately, many of the best ranking algorithms (e.g.,
SVMs) are black boxes that give little information about the factors behind their rankings. We
describe and demonstrate a new approach that can work in conjunction with any ranking algorithm
to discover explanations for the items at the top of the rankings. These explanations take the form
of rules expressed as boolean combinations of attribute-value expressions. The rules are discovered
by contrasting attributes of items drawn from the top and bottom of a ranking list, looking for
attributes with high leverage, corresponding to rules with broad coverage and sharp differentiations.
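A bare-bones version of this contrast idea might score single attribute-value rules by their leverage, the difference in coverage between the top and bottom of the list (hypothetical data; the real method discovers boolean combinations of conditions):

```python
def explain_top(ranked_items, top_k, bottom_k):
    """Score single attribute=value rules by leverage:
    coverage among the top items minus coverage among the bottom items.

    ranked_items: list of attribute dicts, best-ranked first.
    """
    top, bottom = ranked_items[:top_k], ranked_items[-bottom_k:]
    # Candidate rules: every attribute=value seen among the top items.
    candidates = {(a, v) for item in top for a, v in item.items()}

    def coverage(rule, items):
        a, v = rule
        return sum(item.get(a) == v for item in items) / len(items)

    scored = [(coverage(r, top) - coverage(r, bottom), r) for r in candidates]
    return sorted(scored, reverse=True)
```

A rule with leverage near 1.0 covers nearly all top items and almost no bottom items, which is exactly the kind of sharp differentiation an explanation needs.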
Some of my current research involves rule learning for recommender systems, multilabel learning
for word sense disambiguation, and epilepsy prediction.
• Rules for recommender systems
An exciting application for rule learning is recommender systems: customers at online grocery stores often order items that have
been previously purchased. This allows for the possibility of exploiting patterns in a customer's past
transactions to design a personalized recommender system. In collaboration with Professor Cynthia
Rudin (MIT) and other colleagues at Columbia University, we proposed a novel recommender system
based on an association rule framework, and provided a generalization analysis of this method based on
the statistical learning notion of algorithmic stability. Experimental results indicate the advantages
of our method.
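As an illustrative sketch only (not the published method, which comes with a stability-based generalization analysis), a confidence-driven rule recommender over past baskets could look like this, with hypothetical transactions:

```python
from itertools import combinations
from collections import defaultdict

def recommend(transactions, basket, min_support=2):
    """Recommend the item maximizing the confidence of a single-antecedent
    association rule a => b, mined from past transactions (sets of items)."""
    count = defaultdict(int)   # item counts
    pair = defaultdict(int)    # co-occurrence counts, both directions
    for t in transactions:
        for a in t:
            count[a] += 1
        for a, b in combinations(sorted(t), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    best, best_conf = None, 0.0
    for a in basket:
        for b in list(count):
            if b in basket:
                continue
            if pair[(a, b)] >= min_support:
                conf = pair[(a, b)] / count[a]   # confidence of a => b
                if conf > best_conf:
                    best, best_conf = b, conf
    return best, best_conf
```

The minimum support threshold filters out rules backed by too few past baskets, the same role MinSupp plays in classical association rule mining.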
• Multilabel Learning for Word Sense Disambiguation
In a different research direction, in collaboration with Dr. Becky Passoneau at Columbia University, we started investigating multiple annotators in word sense disambiguation. We presented a pilot study of multilabel learning for word sense disambiguation using multiple annotators, relatively polysemous words, and a heterogeneous corpus. Annotators selected senses for words in context, using an annotation interface that presented WordNet senses. Interannotator agreement (IA) results show that how well annotators agree depends primarily on the individual words and their general usage properties. Our focus is on identifying systematic differences across words and annotators that can account for IA variation. We discuss systematic differences in sense selection across annotators, and present the use of association rules to mine the data for such differences.
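For readers unfamiliar with agreement measures, Cohen's kappa is one standard chance-corrected IA statistic for a pair of annotators (shown here purely as background; our study involved more annotators and finer-grained analyses):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each annotator's label distribution."""
    n = len(labels_a)
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # observed
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)              # by chance
    return (po - pe) / (1 - pe)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, which is why raw percent agreement alone can be misleading for skewed sense distributions.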
• Epilepsy prediction
I'm also a member of the Ewarn research team. Created in June 2008, the Columbia University Ewarn team gathers researchers and staff from the Center for Computational Learning Systems and two neurologists from the Columbia University Medical School. The goal of Ewarn is to develop a wearable "early warning" device to allow epilepsy patients to live a more normal life. Such a device would be attached to an implantable microelectrode array collecting data while the patient experiences seizures. The trained machine learning system would then monitor the patient continuously, using a small computer carried by the patient, and reliably warn the patient in advance of seizures. Ewarn was initially seed-funded by the Research Initiatives in Sciences and Engineering (RISE) program at Columbia University, 2008-2010.
My current research investigates whether high frequency oscillations can play a role in
seizure prediction. High frequency oscillations (HFOs), or brief bursts in the high gamma band (80-500
Hz), have been studied as potential biomarkers of epileptic activity. My current work addresses HFO
detection by applying machine learning to human-labeled HFOs in order to learn models that output
the probability that an event is an HFO.
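A minimal sketch of this kind of detector: a logistic model trained on hand-labeled events that outputs a probability of being an HFO (toy one-dimensional features and data; our actual feature set and models differ):

```python
import math

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit a logistic model by stochastic gradient descent.
    X: per-event feature vectors (e.g., amplitude, duration, band power);
    y: 1 if human labelers marked the event an HFO, else 0."""
    w = [0.0] * (len(X[0]) + 1)          # feature weights + bias (last slot)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1 / (1 + math.exp(-z))
            g = p - yi                   # gradient of the log-loss
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w

def hfo_probability(w, x):
    """Probability that the event with features x is an HFO."""
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, x))
    return 1 / (1 + math.exp(-z))
```

The probabilistic output matters here: downstream seizure-prediction pipelines can threshold or weight candidate HFOs rather than committing to hard detections.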
• Maximum Entropy Density Estimation with Incomplete Presence-Only Data
Our interest in the Maxent approach comes from the idea of casting the ranking problem as a presence-only one, that is, one with no negative examples, which may be a better representation of reality.
For instance, in the problem of ranking network protectors, there is no need in the presence-only
framework to consider network protectors that have never been culprits in an event as healthy, negative
examples. We would only consider the failed network protectors as positive examples. In that context,
we demonstrated a generalization of Maximum Entropy Density Estimation that elegantly handles
incomplete presence-only data. We provide a formulation that is able to learn from known values
of incomplete data without having to learn imputed values, which may be inaccurate. This saves
the effort needed to perform accurate imputation while observing the principle of maximum entropy
throughout the learning process. We provide analysis and examples of our algorithm under different
settings of missing data.
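To illustrate the maximum entropy principle itself (only the complete-data, single-constraint case, not our incomplete presence-only formulation), the distribution maximizing entropy subject to a mean constraint has exponential form, and its parameter can be found by bisection:

```python
import math

def maxent_discrete(features, target_mean, lo=-50.0, hi=50.0, tol=1e-9):
    """Max-entropy distribution over discrete states under one constraint
    E[f] = target_mean. The solution has the form p_i proportional to
    exp(lam * f_i); lam is found by bisection, since the induced mean is
    monotonically increasing in lam. target_mean must lie strictly
    between min(features) and max(features)."""
    def induced_mean(lam):
        w = [math.exp(lam * f) for f in features]
        z = sum(w)
        return sum(wi * f for wi, f in zip(w, features)) / z

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if induced_mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    w = [math.exp(lam * f) for f in features]
    z = sum(w)
    return [wi / z for wi in w]
```

With no binding constraint the principle yields the uniform distribution; our generalization extends this machinery so the feature-expectation constraints can be computed from known values of incomplete records only.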
My dissertation work focused on rule learning, more precisely on association, characterization and
exception rules, with geographic information systems as a target application. In the following, I
review my main contributions from my Ph.D. thesis and postdoctoral work.
• Rule Learning
Rule induction has attracted a great deal of attention in Machine Learning and Data Mining.
An association rule is an expression C1 ⇒ C2, where C1 and C2 express conditions on features describing the instances in a dataset. The strength of a rule is usually evaluated by means of statistical measures such as support and confidence: the support of C1 ⇒ C2 is the fraction of instances satisfying both C1 and C2, while its confidence is the fraction of instances satisfying C1 that also satisfy C2.
Given two thresholds MinSupp (for minimum support) and MinConf (for minimum confidence), a rule is strong when its support is greater than MinSupp and its confidence greater than MinConf. Discovering strong rules is usually a two-step process. The first step consists in mining frequent itemsets that meet MinSupp. The second step relies on this set to discover strong rules that meet MinConf.
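The two-step process can be sketched as follows for single-item conditions (a toy illustration with hypothetical transactions; real miners such as Apriori handle itemsets of arbitrary size):

```python
from itertools import combinations

def strong_rules(transactions, min_supp, min_conf):
    """Two-step strong-rule discovery restricted to single-item conditions:
    step 1 mines frequent itemsets meeting min_supp,
    step 2 keeps rules a => b meeting min_conf."""
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    universe = sorted({i for t in transactions for i in t})
    # Step 1: frequent 1- and 2-itemsets.
    frequent = [frozenset(c) for k in (1, 2)
                for c in combinations(universe, k)
                if supp(frozenset(c)) >= min_supp]
    # Step 2: strong rules from the frequent pairs.
    rules = []
    for s in (f for f in frequent if len(f) == 2):
        for a in s:
            (b,) = s - {a}
            conf = supp(s) / supp(frozenset({a}))
            if conf >= min_conf:
                rules.append((a, b, supp(s), conf))
    return rules
```

Note that step 2 only ever evaluates confidence on itemsets that survived step 1, which is what makes the frequent-itemset step the computational bottleneck of rule mining.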
During my Ph.D. thesis, I contributed to rule learning by extending the framework of association rules to Geographic Information Systems, where data is stored not as a flat table but as a set of geographical layers, each described by a set of features. Another main contribution was algorithms and data structures relying on Binary Decision Diagrams for frequent pattern mining, a crucial step in many data mining tasks, including association rule mining.
With co-workers, we also proposed QuantMiner, a system for mining quantitative association rules. The system is based on a genetic algorithm that dynamically discovers good intervals in association rules by optimizing both the support and the confidence. Experiments on real and artificial databases have shown the usefulness of QuantMiner as an interactive data mining tool.
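As a much-simplified stand-in for QuantMiner's genetic algorithm, a randomized search for a good interval under a confidence constraint conveys the idea (hypothetical data; the real system evolves populations of intervals across both sides of the rule):

```python
import random

def find_interval(values, labels, min_conf=0.9, trials=2000, seed=0):
    """Randomized search for an interval [lo, hi] maximizing the support
    of the rule 'attr in [lo, hi] => label=1' subject to a minimum
    confidence, using the attribute's observed values as endpoints."""
    rng = random.Random(seed)
    best, best_supp = None, 0.0
    for _ in range(trials):
        lo, hi = sorted(rng.sample(values, 2))
        inside = [l for v, l in zip(values, labels) if lo <= v <= hi]
        if not inside:
            continue
        supp = len(inside) / len(values)
        conf = sum(inside) / len(inside)
        if conf >= min_conf and supp > best_supp:
            best, best_supp = (lo, hi), supp
    return best, best_supp
```

The tension the search resolves is the same one QuantMiner's genetic operators navigate: widening an interval raises support but usually lowers confidence.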
I also addressed the characterization task, which aims at finding properties shared by a
set of objects, called the target set. With co-workers, we proposed a general framework for learning
characterization rules in relational databases, modeling not only the characteristics of the objects
but also the properties of the objects linked to them. The patterns of the rules are expressed with
existential and universal quantifiers, but also with more flexible quantifiers, namely "there exist at
least n objects" and "for at least p%". We extended our framework to Geographic Information
Systems by introducing buffers modeling the spatial relations between objects. Other research
falling under the rule mining umbrella is a survey on exception rules, rare but strong
associations that we call nuggets.
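The flexible quantifiers can be evaluated directly on linked objects; a small sketch, with hypothetical links and predicate:

```python
def at_least_n(objects, links, pred, n):
    """Objects satisfying 'there exist at least n linked objects with pred'."""
    return {o for o in objects
            if sum(pred(x) for x in links.get(o, [])) >= n}

def at_least_pct(objects, links, pred, p):
    """Objects satisfying 'at least p (fraction) of linked objects have pred'
    (objects with no linked objects fail the quantifier)."""
    out = set()
    for o in objects:
        linked = links.get(o, [])
        if linked and sum(pred(x) for x in linked) / len(linked) >= p:
            out.add(o)
    return out
```

The classical existential and universal quantifiers are the special cases n=1 and p=100%, which is what makes these counting quantifiers a strictly more expressive pattern language.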
• On the usefulness of rules
Generating rules is not an end in itself, because applying them is not
straightforward, especially when the user has to analyze or use a large number of rules.
During my postdoc, I proposed with co-workers an approach to lighten this burden when
the user wishes to exploit such rules to decide which actions to take given an unsatisfactory situation.
The method consists in comparing a situation to a set of classification rules. This is achieved using a
suitable distance, which makes it possible to suggest action recommendations with minimal changes to improve
that situation. We proposed a new algorithm for learning action recommendations and presented
an application to an environmental protection issue. Our experiments show the usefulness of our
contribution in decision-making.
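A toy version of the recommendation step: compare the current situation to rules known to yield a good outcome and return the rule requiring the fewest attribute changes (hypothetical attributes; the actual method relies on a more suitable distance than this simple change count):

```python
def recommend_actions(situation, good_rules):
    """Pick the 'good-outcome' rule closest to the current situation,
    measured by the number of attribute conditions it violates, and
    return the changes needed to satisfy that rule."""
    best = None
    for rule in good_rules:                 # rule: {attribute: required value}
        changes = {a: v for a, v in rule.items() if situation.get(a) != v}
        if best is None or len(changes) < len(best):
            best = changes
    return best
```

Returning the minimal set of changes is what turns a static rule base into actionable recommendations for improving an unsatisfactory situation.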