## Dr. Aron Culotta

Assistant Professor
Department of Computer Science and Industrial Technology
Southeastern Louisiana University
Hammond, LA 70402
985-549-5314
Fayard Hall, Room 327F

Research interests: Machine learning methods to discover knowledge from text.
Keywords: machine learning, graphical models, conditional random fields, probabilistic logic, natural language processing, information extraction, data mining, social media analytics

### Teaching

Spring 2012
• CMPS280: Algorithm Design & Implementation II
• CMPS415: Integrated Technologies for Enterprise Systems
• CMPS470: Machine Learning
Fall 2011
• CMPS280: Algorithm Design & Implementation II
• CMPS482: Current Trends in Computer Science
Spring 2011
• ET221: Programming for Technologists
• CMPS280: Algorithm Design & Implementation II
• CMPS479: Automata & Formal Languages
Fall 2010
• CMPS280: Algorithm Design & Implementation II
• CMPS443: Simulation and Modeling
• CMPS479: Automata & Formal Languages
Spring 2010
• CMPS280: Algorithm Design & Implementation II
• CMPS470: Machine Learning
Fall 2009
• CMPS280: Algorithm Design & Implementation II
• CMPS393: Fundamental Algorithms
Spring 2009
• CMPS280: Algorithm Design & Implementation II
• CMPS479: Automata & Formal Languages

### Research Findings

Social Media Analysis
• Demographic differences in attitudes towards Hurricane Irene were observed in Twitter [HLT12].
• Twitter analytics provides accurate estimates of both influenza-like illnesses and alcohol sales volume [LRE12,KDD10].
Scaling Probabilistic Models
• The parameters of an ungroundable graphical model can be estimated by learning to rank pairs of partial predictions [ICML11,ICML06,HLT/NAACL07,IIWeb07].
• Unwieldy prediction problems can be solved efficiently by splitting them into subproblems, solving these independently, then reassembling the solutions with message passing [NESCAI07].
Identity Uncertainty
• Probabilistic ranking can automate the process of normalizing sets of duplicate records [KDD07].
• Learning the compatibility of sets of mentions rather than pairs of mentions can improve coreference resolution [NIPS05,HLT/NAACL06].
• Making deduplication decisions for related record types (e.g. authors, papers, venues) jointly can outperform making them independently [CIKM05,IR-443].
Relation Extraction
• Modeling field compatibility can improve record extraction [EMNLP06].
• Mining patterns among relations can improve relation extraction (e.g. you're more likely to attend Harvard if your father did) [HLT/NAACL06].
• Customizing SVM kernels for dependency parse trees can improve relation extraction [ACL04].
Interactive Learning
• Labeling effort can be reduced by immediately propagating any corrections a user makes to system predictions [AAAI04,AI06].
• Active learning methods should consider not only how many examples a user must label, but also how difficult each example is to label [AAAI05,AI06].
Confidence Estimation
• In sequence models for information extraction, field marginals are often more reliable confidence estimators than token marginals [HLT04].
Bioinformatics
• Discriminative models for gene finding often better capture domain knowledge (e.g. BLAST search results) than generative models [TR2005-028].

### Publications

2015
Inferring latent attributes of Twitter users with label regularization. Ehsan Mohammady Ardehaly, Aron Culotta. NAACL/HLT, 2015.
 @inproceedings{ehsan2015inferring, author = {Ehsan Mohammady Ardehaly and Aron Culotta}, title = {Inferring latent attributes of Twitter users with label regularization}, booktitle = {NAACL/HLT}, note = {(26\% accepted)}, mytype = {Refereed Conference Publications}, year = {2015}, }

Using Matched Samples to Estimate the Effects of Exercise on Mental Health from Twitter. Virgile Landeiro Dos Reis, Aron Culotta. Twenty-ninth National Conference on Artificial Intelligence (AAAI), 2015.
 @inproceedings{virgile2015using, author = {Virgile Landeiro Dos Reis and Aron Culotta}, title = {Using Matched Samples to Estimate the Effects of Exercise\\ on Mental Health from Twitter}, booktitle = {Twenty-ninth National Conference on Artificial Intelligence (AAAI)}, mytype = {Refereed Conference Publications}, note = {(27\% accepted)}, year = {2015}, }

Predicting the Demographics of Twitter Users from Website Traffic Data. Aron Culotta, Nirmal Ravi Kumar, Jennifer Cutler. Twenty-ninth National Conference on Artificial Intelligence (AAAI), 2015.
@inproceedings{culotta2015predicting, author = {Aron Culotta and Nirmal Ravi Kumar and Jennifer Cutler}, title = {Predicting the Demographics of Twitter Users from Website Traffic Data}, booktitle = {Twenty-ninth National Conference on Artificial Intelligence (AAAI)}, year = {2015}, mytype = {Refereed Conference Publications}, note = {(27\% accepted); {\bf Best Paper (Honorable Mention)}}, }

2014
Reducing Sampling Bias in Social Media Data for County Health Inference. Aron Culotta. JSM Proceedings, 2014.
 @inproceedings{culotta2014reducing, author = {{\bf Aron Culotta}}, title = {Reducing Sampling Bias in Social Media Data for County Health Inference}, booktitle = {JSM Proceedings}, year = {2014}, mytype = {Refereed Conference Publications}, url = {http://cs.iit.edu/~culotta/pubs/culotta14reducing.pdf}, }

Anytime Active Learning. Maria E Ramirez-Loaiza, Aron Culotta, Mustafa Bilgic. AAAI, 2014.
 @inproceedings{ramirez14any, author = {Maria E Ramirez-Loaiza and {\bf Aron Culotta} and Mustafa Bilgic}, title = {Anytime Active Learning}, booktitle = {AAAI}, year = {2014}, mytype = {Refereed Conference Publications}, url = {http://cs.iit.edu/~culotta/pubs/ramirez14any.pdf}, note = {(28\% accepted)}, }

Using county demographics to infer attributes of Twitter users. Ehsan Mohammady, Aron Culotta. ACL Joint Workshop on Social Dynamics and Personal Attributes in Social Media, 2014.
 @inproceedings{ehsan2014using, author = {Ehsan Mohammady and {\bf Aron Culotta}}, title = {Using county demographics to infer attributes of Twitter users}, booktitle = {ACL Joint Workshop on Social Dynamics and Personal Attributes in Social Media}, year = {2014}, mytype = {Refereed Workshop Publications}, url = {http://cs.iit.edu/~culotta/pubs/ehsan14using.pdf}, }

Tweedr: Mining Twitter to Inform Disaster Response. Zahra Ashktorab, Christopher Brown, Manojit Nandi, Aron Culotta. 11th International Conference on Information Systems for Crisis Response and Management (ISCRAM), 2014.
 @inproceedings{ashktorab14tweedr, author = {Zahra Ashktorab and Christopher Brown and Manojit Nandi and {\bf Aron Culotta}}, title = {Tweedr: {M}ining {T}witter to Inform Disaster Response}, booktitle = {the 11th International Conference on Information Systems for Crisis Response and Management (ISCRAM)}, year = {2014}, url = {http://cs.iit.edu/~culotta/pubs/ashktorab14tweedr.pdf}, note = {(46\% accepted)}, mytype = {Refereed Conference Publications}, }

Estimating County Health Statistics with Twitter. Aron Culotta. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2014.
 @inproceedings{culotta2014estimating, author = {{\bf Aron Culotta}}, title = {Estimating County Health Statistics with Twitter}, booktitle = {Proceedings of the {SIGCHI} Conference on Human Factors in Computing Systems}, mytype = {Refereed Conference Publications}, note = {(23\% accepted)}, url = {http://cs.iit.edu/~culotta/pubs/culotta14estimating.pdf}, year = {2014}, }

Inferring the origin location of tweets with quantitative confidence. Reid Priedhorsky, Aron Culotta, Sara Y. Del Valle. 17th ACM Conference on Computer Supported Cooperative Work and Social Computing, 2014.
Social Internet content plays an increasingly critical role in many domains, including public health, disaster management, and politics. However, its utility is limited by missing geographic information; for example, fewer than 1.6% of Twitter messages (*tweets*) contain a geotag. We propose a scalable, content-based approach to estimate the location of tweets using a novel yet simple variant of Gaussian mixture models. Further, because real-world applications depend on quantified uncertainty for such estimates, we propose novel metrics of accuracy, precision, and calibration, and we evaluate our approach accordingly. Experiments on 13 million global, comprehensively multi-lingual tweets show that our approach yields reliable, well-calibrated results competitive with previous computationally intensive methods. We also show that a relatively small number of training data are required for good estimates (roughly 30,000 tweets), and models are quite time-invariant (effective on tweets many weeks newer than the training set). Finally, we show that toponyms and languages with small geographic footprint provide the most useful location signals.
@inproceedings{priedhorsky2014inferring, author = {Reid Priedhorsky and {\bf Aron Culotta} and Sara Y. Del Valle}, title = {Inferring the origin location of tweets with quantitative confidence}, booktitle = {17th ACM Conference on Computer Supported Cooperative Work and Social Computing}, mytype = {Refereed Conference Publications}, year = {2014}, url = {http://cs.iit.edu/~culotta/pubs/priedhorsky14inferring.pdf}, note = {(27\% accepted); {\bf Best Paper (Honorable Mention)}}, }

2013
Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages. Aron Culotta. Language Resources and Evaluation, Special Issue on Analysis of Short Texts on the Web, 2013. Preprint. Final version available at springer.com.
 We analyze over 570 million Twitter messages from an eight month period and find that tracking a small number of keywords allows us to estimate influenza rates and alcohol sales volume with high accuracy. We validate our approach against government statistics and find strong correlations with influenza-like illnesses reported by the U.S. Centers for Disease Control and Prevention (r(14) = .964, p < .001) and with alcohol sales volume reported by the U.S. Census Bureau (r(5) = .932, p < .01). We analyze the robustness of this approach to spurious keyword matches, and we propose a document classification component to filter these misleading messages. We find that this document classifier can reduce error rates by over half in simulated false alarm experiments, though more research is needed to develop methods that are robust in cases of extremely high noise.
 @article{culotta13lightweight, author = {{\bf Aron Culotta}}, title = {Lightweight methods to estimate influenza rates and alcohol sales volume from {T}witter messages}, journal = {Language Resources and Evaluation, Special Issue on Analysis of Short Texts on the Web}, year = {2013}, mytype = {Journal Publications}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta12light.pdf}, }

Towards Anytime Active Learning: Interrupting Experts to Reduce Annotation Costs. Maria E. Ramirez-Loaiza, Aron Culotta, Mustafa Bilgic. KDD Workshop on Interactive Data Exploration and Analytics (IDEA), 2013.
 Many active learning methods use annotation cost or expert quality as part of their framework to select the best data for annotation. While these methods model expert quality, availability, or expertise, they have no direct influence on any of these elements. We present a novel framework built upon decision-theoretic active learning that allows the learner to directly control label quality by allocating a time budget to each annotation. We show that our method is able to improve performance efficiency of the active learner through an interruption mechanism trading off the induced error with the cost of annotation. Our simulation experiments on three document classification tasks show that some interruption is almost always better than none, but that the optimal interruption time varies by dataset.
 @inproceedings{ramirez13towards, author = {Maria E. Ramirez-Loaiza and {\bf Aron Culotta} and Mustafa Bilgic}, title = {Towards Anytime Active Learning: Interrupting Experts to Reduce Annotation Costs}, booktitle = {KDD Workshop on Interactive Data Exploration and Analytics (IDEA)}, mytype = {Refereed Workshop Publications}, url = {http://cs.iit.edu/~culotta/ramirez13towards.pdf}, year = {2013}, }

Too Neurotic, Not too Friendly: Structured Personality Classification on Textual Data. Francisco Iacobelli, Aron Culotta. ICWSM Workshop on Personality Classification, 2013.
 Personality plays a fundamental role in human interaction. With the increasing posting of text on the internet, automatic detection of a person's personality based on the text she produces is an important step to labeling and analyzing human behavior at a large scale. To date, most approaches to personality classification have modeled feature representations of the text to produce output classifications. In this paper we use structured classification approaches that learn and model both feature representations of text and dependencies between output labels (i.e. personality traits). Our study finds that there seems to be a correlation between Agreeableness and Emotional Stability and that it may be helping boost accuracy for Agreeableness when compared to more traditional approaches for supervised classification.
 @inproceedings{iacobelli13too, author = {Francisco Iacobelli and {\bf Aron Culotta}}, title = {Too Neurotic, Not too Friendly: Structured Personality Classification on Textual Data}, booktitle = {ICWSM Workshop on Personality Classification}, mytype = {Refereed Workshop Publications}, url = {http://cs.iit.edu/~culotta/iacobelli13too}, year = {2013}, }

2012
A demographic analysis of online sentiment during Hurricane Irene. Benjamin Mandel, Aron Culotta, John Boulahanis, Danielle Stark, Bonnie Lewis, Jeremy Rodrigue. NAACL-HLT Workshop on Language in Social Media, 2012.
 We examine the response to the recent natural disaster Hurricane Irene on Twitter.com. We collect over 65,000 Twitter messages relating to Hurricane Irene from August 18th to August 31st, 2011, and group them by location and gender. We train a sentiment classifier to categorize messages based on level of concern, and then use this classifier to investigate demographic differences. We report three principal findings: (1) the number of Twitter messages related to Hurricane Irene in directly affected regions peaks around the time the hurricane hits that region; (2) the level of concern in the days leading up to the hurricane's arrival is dependent on region; and (3) the level of concern is dependent on gender, with females being more likely to express concern than males. Qualitative linguistic variations further support these differences. We conclude that social media analysis provides a viable, real-time complement to traditional survey methods for understanding public perception towards an impending disaster.
 @inproceedings{mandel12demo, author = {Benjamin Mandel and {\bf Aron Culotta} and John Boulahanis and Danielle Stark and Bonnie Lewis and Jeremy Rodrigue}, title = {A demographic analysis of online sentiment during {H}urricane {I}rene}, booktitle = {NAACL-HLT Workshop on Language in Social Media}, shortbooktitle = {HLT/NAACL}, year = {2012}, mytype = {Refereed Workshop Publications}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/mandel12demo.pdf}, }

2011
SampleRank: Training factor graphs with atomic gradients. Michael Wick, Khashayar Rohanimanesh, Kedar Bellare, Aron Culotta, Andrew McCallum. Proceedings of the International Conference on Machine Learning (ICML), 2011.
We present SampleRank, an alternative to contrastive divergence (CD) for estimating parameters in complex graphical models. SampleRank harnesses a user-provided loss function to distribute stochastic gradients across an MCMC chain. As a result, parameter updates can be computed between arbitrary MCMC states. SampleRank is not only faster than CD, but also achieves better accuracy in practice (up to 23% error reduction on noun-phrase coreference).
 @inproceedings{wick11sample, author = {Michael Wick and Khashayar Rohanimanesh and Kedar Bellare and {\bf Aron Culotta} and Andrew McCallum}, title = {SampleRank: Training factor graphs with atomic gradients}, booktitle = {Proceedings of the International Conference on Machine Learning (ICML)}, shortbooktitle = {ICML}, mytype = {Refereed Conference Publications}, year = {2011}, note = {(26\% accepted; 14 citations in Google Scholar)}, gscholar = {14667386768009965293}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/wick11sample.pdf}, }

2010
Detecting influenza epidemics by analyzing Twitter messages. Aron Culotta. arXiv:1007.4748v1 [cs.IR], 2010. Updates the KDD 2010 workshop paper.
We analyze over 500 million Twitter messages from an eight-month period and find that tracking a small number of flu-related keywords allows us to forecast future influenza rates with high accuracy, obtaining a 95% correlation with national health statistics. We then analyze the robustness of this approach to spurious keyword matches, and we propose a document classification component to filter these misleading messages. We find that this document classifier can reduce error rates by over half in simulated false alarm experiments, though more research is needed to develop methods that are robust in cases of extremely high noise.
 @techreport{culotta10detecting, author = {{\bf Aron Culotta}}, title = {Detecting influenza epidemics by analyzing {T}witter messages}, howpublished = {arXiv:1007.4748v1 [cs.IR]}, mytype = {Technical Reports}, month = {July}, year = {2010}, gscholar = {16130235992031244410}, note = {(23 citations in Google Scholar)}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta10detecting.pdf}, }

Towards detecting influenza epidemics by analyzing Twitter messages. Aron Culotta. KDD Workshop on Social Media Analytics, 2010. Please also see ongoing work by Lampos and Cristianini, Tracking the flu epidemic by monitoring the Social Web (CIP 2010), and Lampos, De Bie, and Cristianini, Flu detector - Tracking epidemics on Twitter (ECML PKDD 2010).
 Rapid response to a health epidemic is critical to reduce loss of life. Existing methods mostly rely on expensive surveys of hospitals across the country, typically with lag times of one to two weeks for influenza reporting, and even longer for less common diseases. In response, there have been several recently proposed solutions to estimate a population's health from Internet activity, most notably Google's Flu Trends service, which correlates search term frequency with influenza statistics reported by the Centers for Disease Control and Prevention (CDC). In this paper, we analyze messages posted on the micro-blogging site Twitter.com to determine if a similar correlation can be uncovered. We propose several methods to identify influenza-related messages and compare a number of regression models to correlate these messages with CDC statistics. Using over 500,000 messages spanning 10 weeks, we find that our best model achieves a correlation of .78 with CDC statistics by leveraging a document classifier to identify relevant messages.
@inproceedings{culotta10towards, author = {{\bf Aron Culotta}}, title = {Towards detecting influenza epidemics by analyzing {T}witter messages}, booktitle = {KDD Workshop on Social Media Analytics}, shortbooktitle = {KDD}, gscholar = {16823435859885561648}, mytype = {Refereed Workshop Publications}, note = {(46 citations in Google Scholar)}, year = {2010}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta10towards.pdf}, }

2009
SampleRank: Learning preferences from atomic gradients. Michael Wick, Khashayar Rohanimanesh, Aron Culotta, Andrew McCallum. Neural Information Processing Systems (NIPS) Workshop on Advances in Ranking, 2009.
Large templated factor graphs with complex structure that changes during inference have been shown to provide state-of-the-art experimental results on tasks such as identity uncertainty and information integration. However, learning parameters in these models is difficult because computing the gradients requires expensive inference routines. In this paper we propose an online algorithm that instead learns preferences over hypotheses from the gradients between the atomic steps of inference. Although there are a combinatorial number of ranking constraints over the entire hypothesis space, a connection to the frameworks of sampled convex programs reveals a polynomial bound on the number of rankings that need to be satisfied in practice. We further apply ideas of passive-aggressive algorithms to our update rules, enabling us to extend recent work in confidence-weighted classification to structured prediction problems. We compare our algorithm to structured perceptron, contrastive divergence, and persistent contrastive divergence, demonstrating substantial error reductions on two real-world problems (20% over contrastive divergence).
 @inproceedings{wick09sample, author = {Michael Wick and Khashayar Rohanimanesh and {\bf Aron Culotta} and Andrew McCallum}, title = {SampleRank: Learning preferences from atomic gradients}, booktitle = {Neural Information Processing Systems (NIPS) Workshop on Advances in Ranking}, shortbooktitle = {NIPS}, gscholar = {1187022239284508187}, mytype = {Refereed Workshop Publications}, year = {2009}, note = {(19 citations in Google Scholar)}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/wick09sample.pdf}, }

An entity-based model for coreference resolution. Michael Wick, Aron Culotta, Khashayar Rohanimanesh, Andrew McCallum. SIAM International Conference on Data Mining, 2009.
Recently, many advanced machine learning approaches have been proposed for coreference resolution; however, all of the discriminatively-trained models reason over mentions rather than entities. That is, they do not explicitly contain variables indicating the 'canonical' values for each attribute of an entity (e.g., name, venue, title, etc.). This canonicalization step is typically implemented as a post-processing routine to coreference resolution prior to adding the extracted entity to a database. In this paper, we propose a discriminatively-trained model that jointly performs coreference resolution and canonicalization, enabling features over hypothesized entities. We validate our approach on two different coreference problems: newswire anaphora resolution and research paper citation matching, demonstrating improvements in both tasks and achieving an error reduction of up to 62% when compared to a method that reasons about mentions only.
 @inproceedings{wick09entity, author = {Michael Wick and {\bf Aron Culotta} and Khashayar Rohanimanesh and Andrew McCallum}, title = {An entity-based model for coreference resolution}, booktitle = {SIAM International Conference on Data Mining}, shortbooktitle = {SDM}, mytype = {Refereed Conference Publications}, gscholar = {14148053855681269176}, note = {(16\% accepted; 26 citations in Google Scholar)}, year = {2009}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/wick09entity.pdf}, }

2008
Learning and inference in weighted logic with application to natural language processing. Aron Culotta. University of Massachusetts, 2008. Ph.D. Thesis.
 Over the past two decades, statistical machine learning approaches to natural language processing have largely replaced earlier logic-based systems. These probabilistic methods have proven to be well-suited to the ambiguity inherent in human communication. However, the shift to statistical modeling has mostly abandoned the representational advantages of logic-based approaches. For example, many language processing problems can be more meaningfully expressed in first-order logic rather than propositional logic. Unfortunately, most machine learning algorithms have been developed for propositional knowledge representations. In recent years, there have been a number of attempts to combine logical and probabilistic approaches to artificial intelligence. However, their impact on real-world applications has been limited because of serious scalability issues that arise when algorithms designed for propositional representations are applied to first-order logic representations. In this thesis, we explore approximate learning and inference algorithms that are tailored for higher-order representations, and demonstrate that this synthesis of probability and logic can significantly improve the accuracy of several language processing systems.
@phdthesis{culotta08learning, author = {{\bf Aron Culotta}}, title = {Learning and inference in weighted logic with application to natural language processing}, school = {University of Massachusetts}, year = {2008}, month = {May}, gscholar = {9770247215786675414}, mytype = {Thesis}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta08learning.pdf}, note = {(14 citations in Google Scholar)}, }

2007
Canonicalization of Database Records using Adaptive Similarity Measures. Aron Culotta, Michael Wick, Robert Hall, Matthew Marzilli, Andrew McCallum. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2007.
It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as *canonicalization*. Despite its importance, there is very little existing work on canonicalization. In this paper, we explore the use of edit distance measures to construct a canonical representation that is "central" in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different *styles* of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g. *KDD* versus *Conference on Knowledge Discovery and Data Mining*). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. We empirically evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.
 @inproceedings{culotta07canonicalization, author = {{\bf Aron Culotta} and Michael Wick and Robert Hall and Matthew Marzilli and Andrew McCallum}, title = {Canonicalization of Database Records using Adaptive Similarity Measures}, year = {2007}, shortbooktitle = {KDD}, booktitle = {Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)}, address = {San Jose, CA}, mytype = {Refereed Conference Publications}, note = {(19\% accepted; 13 citations in Google Scholar)}, gscholar = {16644907311816950085}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta07canonicalization.pdf}, }

Sparse Message Passing Algorithms for Weighted Maximum Satisfiability. Aron Culotta, Andrew McCallum, Bart Selman, Ashish Sabharwal. New England Student Colloquium on Artificial Intelligence (NESCAI), 2007.
Weighted maximum satisfiability is a well-studied problem that has important applicability to artificial intelligence (for instance, MAP inference in Bayesian networks). General-purpose stochastic search algorithms have proven to be accurate and efficient for large problem instances; however, these algorithms largely ignore structural properties of the input. For example, many problems are highly *clustered*, in that they contain a collection of loosely coupled subproblems (e.g. pipelines of NLP tasks). In this paper, we propose a message passing algorithm to solve weighted maximum satisfiability problems that exhibit this clustering property. Our algorithm fuses local solutions to each subproblem into a global solution by iteratively passing summary information between clusters and recomputing local solutions. Because the size of these messages can become unwieldy for large problems, we explore several message compression techniques to transmit the most valuable information as compactly as possible. We empirically compare our algorithm against a state-of-the-art stochastic solver and show that for certain classes of problems our message passing algorithm finds significantly better solutions.
 @inproceedings{culotta07sparse, author = {{\bf Aron Culotta} and Andrew McCallum and Bart Selman and Ashish Sabharwal}, title = {Sparse Message Passing Algorithms for Weighted Maximum Satisfiability}, booktitle = {New England Student Colloquium on Artificial Intelligence (NESCAI)}, address = {Ithaca, NY}, year = {2007}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta07sparse.pdf}, mytype = {Unrefereed Workshop Publications}, }

Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function. Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, Andrew McCallum. Sixth International Workshop on Information Integration on the Web (IIWeb-07), 2007.
Author disambiguation is the problem of determining whether records in a publications database refer to the same person. A common supervised machine learning approach is to build a classifier to predict whether a *pair* of records is coreferent, followed by a clustering step to enforce transitivity. However, this approach ignores powerful evidence obtainable by examining *sets* (rather than *pairs*) of records, such as the number of publications or co-authors an author has. In this paper we propose a representation that enables these *first-order features* over sets of records. We then propose a training algorithm well-suited to this representation that is (1) *error-driven* in that training examples are generated from incorrect predictions on the training data, and (2) *rank-based* in that the classifier induces a *ranking* over candidate predictions. We evaluate our algorithms on three author disambiguation datasets and demonstrate error reductions of up to 60% over the standard binary classification approach.
 @inproceedings{culotta07author, author = {{\bf Aron Culotta} and Pallika Kanani and Robert Hall and Michael Wick and Andrew McCallum}, title = {Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function}, booktitle = {Sixth International Workshop on Information Integration on the Web (IIWeb-07)}, gscholar = {2889598203454667489}, year = {2007}, address = {Vancouver, Canada}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta07author.pdf}, mytype = {Refereed Workshop Publications}, note = {(31 citations in Google Scholar)}, }
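 The pairwise baseline mentioned in the abstract above (classify pairs, then cluster to enforce transitivity) can be sketched as follows. This is a minimal illustration that assumes transitive closure as the clustering step and takes the classifier's positive pairs as given; it is not the error-driven, rank-based method the paper proposes, and the function name and record indices are hypothetical.

```python
def cluster_by_transitive_closure(num_records, coreferent_pairs):
    """Group records so that any pair predicted coreferent shares a cluster.

    coreferent_pairs stands in for the output of a pairwise classifier
    (assumed here); union-find merges the pairs transitively.
    """
    parent = list(range(num_records))

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in coreferent_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for record in range(num_records):
        clusters.setdefault(find(record), []).append(record)
    return sorted(clusters.values())

# Records 0-1 and 1-2 are predicted coreferent, so 0 and 2 end up together
# even though that pair was never classified directly.
clusters = cluster_by_transitive_closure(5, [(0, 1), (1, 2), (3, 4)])
print(clusters)
```

 Because each merge relies on a single pairwise decision, one false positive can join two large clusters, which is one motivation for scoring sets of records rather than pairs.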

 First-Order Probabilistic Models for Coreference Resolution. Aron Culotta, Michael Wick, Robert Hall, Andrew McCallum. Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), 2007.
 Traditional noun phrase coreference resolution systems represent features only of pairs of noun phrases. In this paper, we propose a machine learning method that enables features over sets of noun phrases, resulting in a first-order probabilistic model for coreference. We outline a set of approximations that make this approach practical, and apply our method to the ACE coreference dataset, achieving a 45% error reduction over a comparable method that only considers features of pairs of noun phrases. This result demonstrates an example of how a first-order logic representation can be incorporated into a probabilistic model and scaled efficiently.
 @inproceedings{culotta07first, author = {{\bf Aron Culotta} and Michael Wick and Robert Hall and Andrew McCallum}, title = {First-Order Probabilistic Models for Coreference Resolution}, shortbooktitle = {HLT/NAACL}, booktitle = {Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL)}, year = {2007}, location = {Rochester, NY}, pages = {81--88}, gscholar = {18088379118519954125}, note = {(24\% accepted; \textbf{105 citations} in Google Scholar)}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta07first.pdf}, mytype = {Refereed Conference Publications}, }

2006
 Corrective Feedback and Persistent Learning for Information Extraction. Aron Culotta, Trausti Kristjansson, Andrew McCallum, Paul Viola. Artificial Intelligence, 2006. Preprint. Journal version of AAAI04/AAAI05 papers with new experiments.
 To successfully embed statistical machine learning models in real world applications, two post-deployment capabilities must be provided: (1) the ability to solicit user corrections and (2) the ability to update the model from these corrections. We refer to the former capability as corrective feedback and the latter as persistent learning. While these capabilities have a natural implementation for simple classification tasks such as spam filtering, we argue that a more careful design is required for structured classification tasks. One example of a structured classification task is information extraction, in which raw text is analyzed to automatically populate a database. In this work, we augment a probabilistic information extraction system with corrective feedback and persistent learning components to assist the user in building, correcting, and updating the extraction model. We describe methods of guiding the user to incorrect predictions, suggesting the most informative fields to correct, and incorporating corrections into the inference algorithm. We also present an active learning framework that minimizes not only how many examples a user must label, but also how difficult each example is to label. We empirically validate each of the technical components in simulation and quantify the user effort saved. We conclude that more efficient corrective feedback mechanisms lead to more effective persistent learning.
 @article{culotta06corrective, author = {{\bf Aron Culotta} and Trausti Kristjansson and Andrew McCallum and Paul Viola}, title = {Corrective Feedback and Persistent Learning for Information Extraction}, journal = {Artificial Intelligence}, year = {2006}, volume = {170}, gscholar = {12331814406613788118}, pages = {1101--1122}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta06corrective.pdf}, mytype = {Journal Publications}, note = {(40 citations in Google Scholar)}, }

 Tractable Learning and Inference with High-Order Representations. Aron Culotta, Andrew McCallum. International Conference on Machine Learning Workshop on Open Problems in Statistical Relational Learning, 2006.
 Representing high-order interactions in data often results in large models with an intractable number of hidden variables. In these models, inference and learning must operate without instantiating the entire set of variables. This paper presents a Metropolis-Hastings sampling approach to address this issue, and proposes new methods to discriminatively estimate the proposal and target distribution of the sampler using a ranking function over configurations. We demonstrate our approach on the task of paper and author deduplication, showing that our method enables complex, advantageous representations of the data while maintaining tractable learning and inference procedures.
 @inproceedings{culotta06tractable, author = {{\bf Aron Culotta} and Andrew McCallum}, title = {Tractable Learning and Inference with High-Order Representations}, booktitle = {International Conference on Machine Learning Workshop on Open Problems in Statistical Relational Learning}, address = {Pittsburgh, PA}, year = {2006}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta06tractable.pdf}, mytype = {Refereed Workshop Publications}, gscholar = {7673238583217369958}, note = {(11 citations in Google Scholar)}, }

 Learning field compatibilities to extract database records from unstructured text. Michael Wick, Aron Culotta, Andrew McCallum. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006.
 Named-entity recognition systems extract entities such as people, organizations, and locations from unstructured text. Rather than extract these mentions in isolation, this paper presents a record extraction system that assembles mentions into records (i.e. database tuples). We construct a probabilistic model of the compatibility between field values, then employ graph partitioning algorithms to cluster fields into cohesive records. We also investigate compatibility functions over sets of fields, rather than simply pairs of fields, to examine how higher representational power can impact performance. We apply our techniques to the task of extracting contact records from faculty and student homepages, demonstrating a 53% error reduction over baseline approaches.
 @inproceedings{wick06learning, author = {Michael Wick and {\bf Aron Culotta} and Andrew McCallum}, title = {Learning field compatibilities to extract database records from unstructured text}, shortbooktitle = {EMNLP}, booktitle = {Conference on Empirical Methods in Natural Language Processing (EMNLP)}, year = {2006}, gscholar = {11137875592707978964}, address = {Sydney, Australia}, note = {(18\% accepted; 21 citations in Google Scholar)}, pages = {603--611}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/wick06learning.pdf}, mytype = {Refereed Conference Publications}, }

 Practical Markov logic containing first-order quantifiers with application to identity uncertainty. Aron Culotta, Andrew McCallum. Human Language Technology Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing (HLT/NAACL), 2006. Workshop version of tech report IR-430.
 Markov logic is a highly expressive language recently introduced to specify the connectivity of a Markov network using first-order logic. While Markov logic is capable of constructing arbitrary first-order formulae over the data, the complexity of these formulae is often limited in practice because of the size and connectivity of the resulting network. In this paper, we present approximate inference and estimation methods that incrementally instantiate portions of the network as needed to enable first-order existential and universal quantifiers in Markov logic networks. When applied to the problem of identity uncertainty, this approach results in a conditional probabilistic model that can reason about objects, combining the expressivity of recently introduced BLOG models with the predictive power of conditional training. We validate our algorithms on the tasks of citation matching and author disambiguation.
 @inproceedings{culotta06practical, author = {{\bf Aron Culotta} and Andrew McCallum}, title = {Practical Markov logic containing first-order quantifiers with application to identity uncertainty}, booktitle = {Human Language Technology Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing (HLT/NAACL)}, year = {2006}, month = {June}, gscholar = {1816738532431989883}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta06practical.pdf}, mytype = {Refereed Workshop Publications}, note = {(10 citations in Google Scholar)}, }

 Integrating probabilistic extraction models and data mining to discover relations and patterns in text. Aron Culotta, Andrew McCallum, Jonathan Betz. Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), 2006.
 In order for relation extraction systems to obtain human-level performance, they must be able to incorporate relational patterns inherent in the data (for example, that one's sister is likely one's mother's daughter, or that children are likely to attend the same college as their parents). Hand-coding such knowledge can be time-consuming and inadequate. Additionally, there may exist many interesting, unknown relational patterns that both improve extraction performance and provide insight into text. We describe a probabilistic extraction model that provides mutual benefits to both "top-down" relational pattern discovery and "bottom-up" relation extraction.
 @inproceedings{culotta06integrating, author = {{\bf Aron Culotta} and Andrew McCallum and Jonathan Betz}, title = {Integrating probabilistic extraction models and data mining to discover relations and patterns in text}, shortbooktitle = {HLT-NAACL}, booktitle = {Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL)}, year = {2006}, pages = {296--303}, address = {New York, NY}, month = {June}, note = {(25\% accepted; {\bf 85 citations} in Google Scholar)}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta06integrating.pdf}, mytype = {Refereed Conference Publications}, gscholar = {7373044105135117265}, }

2005
 Learning clusterwise similarity with first-order features. Aron Culotta, Andrew McCallum. Neural Information Processing Systems (NIPS) Workshop on the Theoretical Foundations of Clustering, 2005.
 Many clustering problems can be reduced to the task of partitioning a weighted graph into highly-connected components. The weighted edges indicate pairwise similarity between two nodes and can often be estimated from training data. However, in many domains, there exist higher-order dependencies not captured by pairwise metrics. For example, there may exist soft constraints on aggregate features of an entire cluster, such as its size, mean or mode. We propose clusterwise similarity metrics to directly measure the cohesion of an entire cluster of points. We describe ways to learn a clusterwise metric from labeled data, using weighted, first-order features over clusters. Extending recent work equating graph partitioning with inference in graphical models, we frame this approach within a discriminatively-trained Markov network. The advantages of our approach are demonstrated on the task of coreference resolution.
 @inproceedings{culotta05learning, author = {{\bf Aron Culotta} and Andrew McCallum}, title = {Learning clusterwise similarity with first-order features}, booktitle = {Neural Information Processing Systems (NIPS) Workshop on the Theoretical Foundations of Clustering}, year = {2005}, address = {Whistler, B.C.}, month = {December}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta05learning.pdf}, mytype = {Refereed Workshop Publications}, }

 A conditional model of deduplication for multi-type relational data. Aron Culotta, Andrew McCallum. University of Massachusetts IR-443, 2005.
 Record deduplication is the task of merging database records that refer to the same underlying entity. In relational databases, accurate deduplication for records of one type is often dependent on the merge decisions made for records of other types. Whereas nearly all previous approaches have merged records of different types independently, this work models these inter-dependencies explicitly to collectively deduplicate records of multiple types. We construct a conditional random field model of deduplication that captures these relational dependencies, and then employ a novel relational partitioning algorithm to jointly deduplicate records. We evaluate the system on two citation matching datasets, for which we deduplicate both papers and venues. We show that by collectively deduplicating paper and venue records, we obtain up to a 30% error reduction in venue deduplication, and up to a 20% error reduction in paper deduplication over competing methods.
 @techreport{culotta05conditional, author = {{\bf Aron Culotta} and Andrew McCallum}, title = {A conditional model of deduplication for multi-type relational data}, institution = {University of Massachusetts}, month = {September}, year = {2005}, gscholar = {11884128999388128677}, number = {IR-443}, mytype = {Technical Reports}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta05conditional.pdf}, note = {(11 citations in Google Scholar)}, }

 Joint deduplication of multiple record types in relational data. Aron Culotta, Andrew McCallum. 2005 ACM CIKM International Conference on Information and Knowledge Management, 2005. Poster version of Technical Report IR-443.
 Record deduplication is the task of merging database records that refer to the same underlying entity. In relational databases, accurate deduplication for records of one type is often dependent on the merge decisions made for records of other types. Whereas nearly all previous approaches have merged records of different types independently, this work models these inter-dependencies explicitly to collectively deduplicate records of multiple types. We construct a conditional random field model of deduplication that captures these relational dependencies, and then employ a novel relational partitioning algorithm to jointly deduplicate records. We evaluate the system on two citation matching datasets, for which we deduplicate both papers and venues. We show that by collectively deduplicating paper and venue records, we obtain up to a 30% error reduction in venue deduplication, and up to a 20% error reduction in paper deduplication over competing methods.
 @inproceedings{culotta05joint, author = {{\bf Aron Culotta} and Andrew McCallum}, title = {Joint deduplication of multiple record types in relational data}, shortbooktitle = {CIKM}, booktitle = {2005 ACM CIKM International Conference on Information and Knowledge Management}, year = {2005}, pages = {257--258}, note = {(25\% accepted; 48 citations in Google Scholar)}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta05joint.pdf}, mytype = {Refereed Conference Publications}, gscholar = {11197552443106103089}, }

 Reducing labeling effort for structured prediction tasks. Aron Culotta, Andrew McCallum. The Twentieth National Conference on Artificial Intelligence (AAAI), 2005.
 A common obstacle preventing the rapid deployment of supervised machine learning algorithms is the lack of labeled training data. This is particularly expensive to obtain for structured prediction tasks, where each training instance may have multiple, interacting labels, all of which must be correctly annotated for the instance to be of use to the learner. Traditional active learning addresses this problem by optimizing the order in which the examples are labeled to increase learning efficiency. However, this approach does not consider the difficulty of labeling each example, which can vary widely in structured prediction tasks. For example, the labeling predicted by a partially trained system may be easier to correct for some instances than for others. We propose a new active learning paradigm that reduces not only how many instances the annotator must label, but also how difficult each instance is to annotate. The system also leverages information from partially correct predictions to efficiently solicit annotations from the user. We validate this active learning framework in an interactive information extraction system, reducing the total number of annotation actions by 22%.
 @inproceedings{culotta05reducing, author = {{\bf Aron Culotta} and Andrew McCallum}, title = {Reducing labeling effort for structured prediction tasks}, shortbooktitle = {AAAI}, booktitle = {The Twentieth National Conference on Artificial Intelligence (AAAI)}, address = {Pittsburgh, PA}, pages = {746--751}, year = {2005}, note = {(27\% accepted; 53 citations in Google Scholar)}, gscholar = {8538497340439286949}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta05reducing.pdf}, mytype = {Refereed Conference Publications}, }

 Gene prediction with conditional random fields. Aron Culotta, David Kulp, Andrew McCallum. University of Massachusetts, Amherst UM-CS-2005-028, 2005.
 Given a sequence of DNA nucleotide bases, the task of gene prediction is to find subsequences of bases that encode proteins. Reasonable performance on this task has been achieved using generatively trained sequence models, such as hidden Markov models. We propose instead the use of a discriminatively trained sequence model, the conditional random field (CRF). CRFs can naturally incorporate arbitrary, non-independent features of the input without making conditional independence assumptions among the features. This can be particularly important for gene finding, where including evidence from protein databases, EST data, or tiling arrays may improve accuracy. We evaluate our model on human genomic data, and show that CRFs perform better than HMM-based models at incorporating homology evidence from protein databases, achieving a 10% reduction in base-level errors.
 @techreport{culotta05gene, author = {{\bf Aron Culotta} and David Kulp and Andrew McCallum}, title = {Gene prediction with conditional random fields}, month = {April}, year = {2005}, gscholar = {9410986896755606720}, number = {UM-CS-2005-028}, institution = {University of Massachusetts, Amherst}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta05gene.pdf}, mytype = {Technical Reports}, note = {(41 citations in Google Scholar)}, }

2004
 Dependency tree kernels for relation extraction. Aron Culotta, Jeffrey Sorensen. 42nd Annual Meeting of the Association for Computational Linguistics (ACL), 2004.
 We extend previous work on tree kernels to estimate the similarity between the dependency trees of sentences. Using this kernel within a Support Vector Machine, we detect and classify relations between entities in the Automatic Content Extraction (ACE) corpus of news articles. We examine the utility of different features such as WordNet hypernyms, parts of speech, and entity types, and find that the dependency tree kernel achieves a 20% F1 improvement over a "bag-of-words" kernel.
 @inproceedings{culotta04dependency, author = {{\bf Aron Culotta} and Jeffrey Sorensen}, title = {Dependency tree kernels for relation extraction}, shortbooktitle = {ACL}, booktitle = {42nd Annual Meeting of the Association for Computational Linguistics (ACL)}, address = {Barcelona, Spain}, year = {2004}, gscholar = {13678483547299399330}, note = {(25\% accepted; {\bf 419 citations} in Google Scholar)}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta04dependency.pdf}, mytype = {Refereed Conference Publications}, }

 Interactive information extraction with constrained conditional random fields. Trausti Kristjansson, Aron Culotta, Paul Viola, Andrew McCallum. Nineteenth National Conference on Artificial Intelligence (AAAI), 2004. Best Paper Award (Honorable Mention).
 Information Extraction methods can be used to automatically "fill in" database forms from unstructured data such as Web documents or email. State-of-the-art methods have achieved low error rates but invariably make a number of errors. The goal of an interactive information extraction system is to assist the user in filling in database fields while giving the user confidence in the integrity of the data. The user is presented with an interactive interface that allows both the rapid verification of automatic field assignments and the correction of errors. In cases where there are multiple errors, our system takes into account user corrections, and immediately propagates these constraints such that other fields are often corrected automatically. Linear-chain conditional random fields (CRFs) have been shown to perform well for information extraction and other language modelling tasks due to their ability to capture arbitrary, overlapping features of the input in a Markov model. We apply this framework with two extensions: a constrained Viterbi decoding which finds the optimal field assignments consistent with the fields explicitly specified or corrected by the user; and a mechanism for estimating the confidence of each extracted field, so that low-confidence extractions can be highlighted. Both of these mechanisms are incorporated in a novel user interface for form filling that is intuitive and speeds the entry of data, providing a 23% reduction in error due to automated corrections.
 @inproceedings{kristjannson04interactive, author = {Trausti Kristjansson and {\bf Aron Culotta} and Paul Viola and Andrew McCallum}, title = {Interactive information extraction with constrained conditional random fields}, shortbooktitle = {AAAI}, booktitle = {Nineteenth National Conference on Artificial Intelligence (AAAI)}, address = {San Jose, CA}, year = {2004}, note = {(26\% accepted; {\bf Best Paper Honorable Mention}; {\bf 114 citations} in Google Scholar)}, gscholar = {4191142324560574413}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/kristjannson04interactive.pdf}, mytype = {Refereed Conference Publications}, }
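 The constrained Viterbi decoding mentioned in the abstract above can be illustrated with a minimal sketch: standard Viterbi over a linear chain, except that positions the user has corrected are restricted to the corrected label. The score tables, function name, and toy values are assumptions for illustration; in a CRF the scores would come from learned feature weights, and this is not the paper's implementation.

```python
def constrained_viterbi(emission, transition, constraints=None):
    """Return the highest-scoring label path that respects the constraints.

    emission[t][s]:   log-score of label s at position t (assumed given).
    transition[p][s]: log-score of moving from label p to label s.
    constraints:      dict mapping position -> required label, standing in
                      for user corrections.
    """
    constraints = constraints or {}
    n, k = len(emission), len(emission[0])
    neg_inf = float("-inf")

    def allowed(t, s):
        return t not in constraints or constraints[t] == s

    score = [[neg_inf] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    for s in range(k):
        if allowed(0, s):
            score[0][s] = emission[0][s]
    for t in range(1, n):
        for s in range(k):
            if not allowed(t, s):
                continue  # labels that violate a constraint keep score -inf
            best_p = max(range(k), key=lambda p: score[t - 1][p] + transition[p][s])
            score[t][s] = score[t - 1][best_p] + transition[best_p][s] + emission[t][s]
            back[t][s] = best_p

    # Trace back from the best final label.
    path = [max(range(k), key=lambda s: score[n - 1][s])]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy chain: label 0 is preferred everywhere, and switching labels is costly.
emission = [[0.0, -1.0], [0.0, -1.0], [0.0, -1.0]]
transition = [[0.0, -2.0], [-2.0, 0.0]]
unconstrained = constrained_viterbi(emission, transition)
corrected = constrained_viterbi(emission, transition, {1: 1})
print(unconstrained, corrected)
```

 In the toy example, fixing only the middle position to label 1 also changes the labels chosen for its neighbors, which mirrors the correction propagation described in the abstract.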

 Confidence estimation for information extraction. Aron Culotta, Andrew McCallum. Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL), 2004.
 Information extraction techniques automatically create structured databases from unstructured data sources, such as the Web or newswire documents. Despite the successes of these systems, accuracy will always be imperfect. For many reasons, it is highly desirable to accurately estimate the confidence the system has in the correctness of each extracted field. The information extraction system we evaluate is based on a linear-chain conditional random field (CRF), a probabilistic model which has performed well on information extraction tasks because of its ability to capture arbitrary, overlapping features of the input in a Markov model. We implement several techniques to estimate the confidence of both extracted fields and entire multi-field records, obtaining an average precision of 98% for retrieving correct fields and 87% for multi-field records.
 @inproceedings{culotta04confidence, author = {{\bf Aron Culotta} and Andrew McCallum}, title = {Confidence estimation for information extraction}, shortbooktitle = {HLT}, booktitle = {Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT/NAACL)}, address = {Boston, MA}, year = {2004}, note = {(26\% accepted; {\bf 83 citations} in Google Scholar)}, gscholar = {14437617679615248208}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta04confidence.pdf}, mytype = {Refereed Conference Publications}, }

 Extracting social networks and contact information from email and the Web. Aron Culotta, Ron Bekkerman, Andrew McCallum. First Conference on Email and Anti-Spam (CEAS), 2004.
 We present an end-to-end system that extracts a user's social network and its members' contact information given the user's email inbox. The system identifies unique people in email, finds their Web presence, and automatically fills the fields of a contact address book using conditional random fields, a type of probabilistic model well-suited for such information extraction tasks. By recursively calling itself on new people discovered on the Web, the system builds a social network with multiple degrees of separation from the user. Additionally, a set of expertise-describing keywords is extracted and associated with each person. We outline the collection of statistical and learning components that enable this system, and present experimental results on the real email of two users; we also present results with a simple method of learning transfer, and discuss the capabilities of the system for address-book population, expert-finding, and social network analysis.
 @inproceedings{culotta04extracting, author = {{\bf Aron Culotta} and Ron Bekkerman and Andrew McCallum}, title = {Extracting social networks and contact information from email and the Web}, shortbooktitle = {CEAS}, booktitle = {First Conference on Email and Anti-Spam (CEAS)}, address = {Mountain View, CA}, year = {2004}, gscholar = {16952041749780843284}, note = {(35\% accepted; {\bf 200 citations} in Google Scholar)}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta04extracting.pdf}, mytype = {Refereed Conference Publications}, }

2003
 Maximizing cascades in social networks. Aron Culotta. University of Massachusetts, 2003.
 Cascades are a network phenomenon in which small local shocks can result in wide-spread fads, strikes, innovations, or power failures. Determining which initial shocks will result in the largest cascades is of interest to marketing, epidemiology, and computer networking. This paper surveys work in maximizing the size of cascades in social networks.
 @techreport{culotta03maximizing, author = {{\bf Aron Culotta}}, title = {Maximizing cascades in social networks}, institution = {University of Massachusetts}, year = {2003}, url = {http://www2.selu.edu/Academics/Faculty/aculotta/pubs/culotta03maximizing.pdf}, mytype = {Technical Reports}, }