Topics in Machine Learning
CSCI 3907-82/6907-82 | George Washington University | FALL 2011
Instructor: Claire Monteleoni
When: Tuesdays, 6:10pm-8:40pm
Wiki: Please consult the course wiki for further information.
This course will introduce students to several active and cutting-edge areas of machine learning research. Machine learning research involves designing principled algorithms to learn from real data sources. Profuse amounts of digital data are being generated from a myriad of sources. Satellites record vast sequences of high-dimensional images, and environmental sensors continuously generate measurements of temperature and atmospheric gases. The increasing reliance on the internet for daily tasks also fuels this rapid data growth. Machine learning has made profound impacts on applications as varied as web search, DNA analysis, computer vision, and natural language processing, to name just a few. Real data sources increasingly pose interesting and urgent challenges for machine learning algorithm design; the data can be vast, high-dimensional, streaming, noisy, time-varying, or it may combine these and other attributes. The primary topics studied in this course will be clustering, learning from data streams, and Climate Informatics.*
Clustering: Most real data sources produce raw data (e.g. speech signals or images on the web) that is not yet labeled for any classification task, which motivates the study of unsupervised learning. Clustering refers to a broad class of unsupervised learning tasks aimed at partitioning the data into “clusters” that are appropriate to the specific application. Clustering techniques are widely used in practice to summarize large quantities of data (e.g. aggregating similar online news stories).
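As a rough illustration of what partitioning data into clusters can look like, here is a minimal sketch of Lloyd's heuristic for k-means, one common clustering method. It is not taken from the course materials; the function name `kmeans`, its parameters, and the toy data in the usage note are assumptions made for this example only.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Lloyd's heuristic for k-means on a list of equal-length numeric tuples.

    Illustrative sketch only: fixed iteration count, random initial centers
    drawn from the data, squared Euclidean distance.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct data points as initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, clusters
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` separates the three points near the origin from the three points near (10, 10). In general, Lloyd's heuristic only finds a local optimum, and the result can depend on the initial centers.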
Learning from data streams: As data sources continue to grow at an unprecedented rate, it is increasingly important that algorithms for analyzing this data operate in online, or streaming, settings. These settings arise in a variety of applications, including forecasting, real-time decision making, and resource-constrained learning. Data streams can take many forms, such as stock prices, weather measurements, and internet transactions, or any data set so large relative to the available computational resources that algorithms must access it sequentially. The goal is to design algorithms for learning from data streams that are very lightweight with respect to computation and memory usage.
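To make "lightweight with respect to computation and memory" concrete, the sketch below keeps a running mean and variance of a stream in a single pass using constant memory (Welford's update). It is an illustrative example rather than course material; the class name `RunningStats` and the sample stream in the usage note are assumptions.

```python
class RunningStats:
    """One-pass running mean and sample variance (Welford's update).

    Stores three numbers regardless of how long the stream is.
    """

    def __init__(self):
        self.n = 0        # number of items seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Consume one stream item in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of the items seen so far (0.0 if fewer than two)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```

A typical use is to call `update` once per arriving item, for instance `for x in [2, 4, 4, 4, 5, 5, 7, 9]: stats.update(x)`, and read `stats.mean` and `stats.variance` at any point without revisiting earlier items.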
Climate Informatics: The threat of climate change is one of the greatest challenges currently facing society. With an ever-growing supply of climate data from satellites and sensors, machine learning is poised to make a profound impact on climate science, just as it has on other natural sciences to which it has been applied (e.g. Bioinformatics). Climate Informatics is a new research field involving collaborations between machine learning (as well as data mining and statistics) and climate science, in order to accelerate progress in answering pressing questions in climate science.
Along the way, the course will also discuss unsupervised learning, semi-supervised learning, active learning, and online learning, including learning with expert predictors and learning from time-varying data. Lectures will also expose students to randomized algorithms and approximation algorithms.
There will be a final project, which can be theoretical or applied and could lead to publication. Topics and open questions will be introduced throughout the course, and students will receive supervision on their projects during the semester.
Preparation: Coursework in algorithms, probability theory or statistics, and linear algebra is recommended. Previous exposure to machine learning is desirable but not required. Students with non-traditional preparation should ask the instructor.
*NOTE: The topics covered this semester are subject to change, and student input is welcomed.
