INLS 613: Text Mining

Objective: Gain experience with both the theoretical and practical aspects of text mining. Learn how to build and evaluate computer programs that generate new knowledge from natural language text.
Description: Changes in technology and publishing practices have eased the task of recording and sharing textual information electronically. This increased quantity of information has spurred the development of a new field called text mining. The overarching goal of this new field is to use computers to automatically learn new things from textual data.

The course is divided into three modules: basics, principles, and applications (see details below). The third part of the course will focus on several applications of text mining: methods for automatically organizing textual documents for sense-making and navigation (clustering and classification), methods for detecting opinion and bias, methods for detecting and resolving specific entities in text (information extraction and resolution), and methods for learning new relations between entities (relation extraction). Throughout the course, a strong emphasis will be placed on evaluation. Students will develop a deep understanding of one particular method through a course project.

Prerequisites: Students should have a reasonable background in programming in a structured or object-oriented programming language, such as Java or C++. "Reasonable" means either coursework or equivalent practical experience. You should be able to design, implement, debug, and test small to medium-sized programs. If you would like to take this course but do not know whether you meet these prerequisites, please send me an email.
Time & Location: M,W 2:00-3:15 pm, Manning 304
Instructor: Jaime Arguello (email, web)
Office Hours: T, Th 9:30-10:30am, Manning 305
Required Textbook: Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Ian H. Witten, Eibe Frank, and Mark A. Hall. 2011. Morgan Kaufmann. ISBN 978-0-12-374856-0. Available online or in the campus bookstore.
Additional Resources: TBD
Course Policies: Attendance, Participation, Collaboration, Plagiarism & Cheating, Late Policy
Grading: 10% Class participation
15% Midterm Exam
40% Homework (10% each)
35% Final project (5% project proposal, 20% project report, 10% project presentation)
Grade Assignments: Letter grades will be assigned using the following scale: H 95-100%, P 80-94%, L 60-79%, and F 0-59%.
Topics: Subject to change! Readings from the required textbook (Witten, Frank, and Hall) are marked with WFH below.
Lecture Date Events Topic Readings Due
1 Wed. 8/22   Course Overview  
2 Mon. 8/27   Text Mining: Predictive and Exploratory Analysis of Text WFH Ch. 1, Mitchell '06, Hearst '99
3 Wed. 8/29 HW1 out Predictive Analysis: Concepts, Features, and Instances WFH Ch. 2
4 Mon. 9/3 Labor Day (no class) No class  
5 Wed. 9/5   Predictive Analysis: Concepts, Features, and Instances  
6 Mon. 9/10   Text Representation  
7 Wed. 9/12 HW1 due, Accuracy Histogram Text Representation  
8 Mon. 9/17 Lab: Manning 117, HW2 out, train.csv, test.csv LightSIDE Tutorial LightSIDE User's Manual
9 Wed. 9/19   Basic Machine Learning Algorithms (Naïve Bayes) WFH Ch. 4.2
10 Mon. 9/24 Lab: Manning 117, datasets Weka Tutorial WFH Ch. 10 and 11
11 Wed. 9/26 Guest Lecture: Patricia Amaral Linguistic Analysis of Text  
12 Mon. 10/1 HW2 due Basic Machine Learning Algorithms (Instance-based Classification) WFH Ch. 4.7
13 Wed. 10/3   Predictive Analysis: Experimentation and Evaluation WFH Ch. 5
14 Mon. 10/8 Term Project Proposal Due Predictive Analysis: Experimentation and Evaluation  
15 Wed. 10/10   Predictive Analysis: Experimentation and Evaluation  
16 Mon. 10/15   Predictive Analysis with Noisy Labels Sheng et al., '08
17 Wed. 10/17 Midterm, Midterm Solutions Midterm  
18 Mon. 10/22 HW 3 out, HW3 data Exploratory Analysis: Clustering I Manning Ch. 16
19 Wed. 10/24   Exploratory Analysis: Clustering II  
20 Mon. 10/29   Class Cancelled (work on your term projects!)  
21 Wed. 10/31   Class Cancelled (work on your term projects!)  
22 Mon. 11/5 HW 3 due Sentiment Analysis I Pang and Lee, '08 (skip Section 5 and only skim Section 6), Pang and Lee, '02
23 Wed. 11/7 Guest Lecture: Annie Chen Sentiment Analysis II Somasundaran and Wiebe '10 (optional), Naveed et al. '11
24 Mon. 11/12 HW 4 out, HW4 data Detecting Viewpoint and Perspective Yano et al., '10, Wiebe '10
25 Wed. 11/14   Text-based Forecasting O'Connor et al., '10, Lerman et al., '08
26 Mon. 11/19   Information Extraction McCallum '05
27 Wed. 11/21 No Class (Thanksgiving)
28 Mon. 11/26 HW4 due Course Conclusion  
29 Wed. 11/28   Project Presentations  
30 Mon. 12/3   Project Presentations  
31 Wed. 12/5   Project Presentations  
32 Fri. 12/7 Project Report due