INLS 613: Text Mining

Objective: Gain experience with both the theoretical and practical aspects of text mining. Learn how to build and evaluate computer programs that generate new knowledge from natural language text.
Description: Changes in technology and publishing practices have eased the task of recording and sharing textual information electronically. This increased quantity of information has spurred the development of a new field called text mining. The overarching goal of this new field is to use computers to automatically learn new things from textual data.

The course is divided into three modules: basics, principles, and applications (see details below). The third part of the course will focus on several applications of text mining: methods for automatically organizing textual documents for sense-making and navigation (clustering and classification), methods for detecting opinion and bias, methods for detecting and resolving specific entities in text (information extraction and resolution), and methods for learning new relations between entities (relation extraction). Throughout the course, a strong emphasis will be placed on evaluation. Students will develop a deep understanding of one particular method through a course project.

Prerequisites: There are no prerequisites for this course. We will be using a tool called LightSIDE to train and test machine-learned models for different predictive tasks. LightSIDE has a graphical user interface that makes it easy to do this without knowing how to program. That being said, knowing how to program (and manipulate text) may enable you to conduct more interesting experiments as part of your final project.
This course will involve understanding mathematical concepts and procedures. I will cover the basics in order for you to understand these. However, if you strongly dislike math and are unwilling to grapple with and ultimately conquer mathematical concepts and procedures, this may not be a good course for you.
Time & Location: T, Th 12:30pm-1:45pm, Manning 304 (In Person).
Instructor: Jaime Arguello (email, web)
Office Hours: By Appointment
Required Textbook: Data Mining: Practical Machine Learning Tools and Techniques (Fourth Edition). Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal. 2017. Morgan Kaufmann. ISBN 978-0128042915. Available online
Additional Resources: Foundations of Statistical Natural Language Processing. C. Manning and H. Schutze. 1999.

Introduction to Information Retrieval. C. Manning, P. Raghavan and H. Schutze. 2008.
Course Policies: Laptops, Attendance, Participation, Collaboration, Plagiarism & Cheating, Late Policy, Use of Generative AI Tools
Grading: 10% Class participation
20% Midterm Exam
30% Homework (10% each)
40% Final project (5% project proposal, 25% project report, 10% project presentation)
Grade Assignments: Undergraduate grading scale: A+ 97-100%, A 94-96%, A- 90-93%, B+ 87-89%, B 84-86%, B- 80-83%, C+ 77-79%, C 74-76%, C- 70-73%, D+ 67-69%, D 64-66%, D- 60-63%, F 0-59%

Graduate grading scale: H 95-100%, P 80-94%, L 60-79%, and F 0-59%.
Topics: Subject to change! Readings from the required textbook (Witten, Frank, Hall, and Pal) are marked with WFH below.
Lecture | Date | Events | Topic | Reading Due
1 | Thu. 1/9 | | Introduction to Text Mining: The Big Picture |
2 | Tue. 1/14 | | Course Overview: Roadmap and Expectations | WFH Ch. 1, Mitchell '06
3 | Thu. 1/16 | | Predictive Analysis: Concepts, Features, and Instances I | WFH Ch. 2, Domingos '12
4 | Tue. 1/21 | HW1 Out | Predictive Analysis: Concepts, Features, and Instances II |
5 | Thu. 1/23 | | Text Representation I |
6 | Tue. 1/28 | | Text Representation II |
7 | Thu. 1/30 | | LightSIDE Tutorial | LightSIDE User Manual
8 | Tue. 2/4 | HW1 Due | Machine Learning Algorithms: Naïve Bayes I | WFH Ch. 4.2, Mitchell Sections 1 and 2
9 | Thu. 2/6 | | Machine Learning Algorithms: Naïve Bayes II |
10 | Tue. 2/11 | | Machine Learning Algorithms: Instance-based Classification I | WFH Ch. 4.7
11 | Thu. 2/13 | | Machine Learning Algorithms: Instance-based Classification II |
12 | Tue. 2/18 | HW2 Out | Machine Learning Algorithms: Linear Classifiers I | WFH 3.2 and 4.6
13 | Thu. 2/20 | Literature Review Proposal Due | Machine Learning Algorithms: Linear Classifiers II |
14 | Tue. 2/25 | | Midterm Review |
15 | Thu. 2/27 | | Midterm |
16 | Tue. 3/4 | HW2 Due | Final Project Breakout Group Discussion I |
17 | Thu. 3/6 | | Predictive Analysis: Experimentation and Evaluation I | WFH Ch. 5
18 | Tue. 3/11 | Spring Break (No Class) | |
19 | Thu. 3/13 | Spring Break (No Class) | |
20 | Tue. 3/18 | | Predictive Analysis: Experimentation and Evaluation II | Smucker et al., '07, Cross-Validation, Parameter Tuning and Overfitting
21 | Thu. 3/20 | | Predictive Analysis: Experimentation and Evaluation III |
22 | Tue. 3/25 | HW3 Out | Final Project Breakout Group Discussion II |
23 | Thu. 3/27 | | Exploratory Analysis: Clustering I | Manning Ch. 16
24 | Tue. 4/1 | | Exploratory Analysis: Clustering II |
25 | Thu. 4/3 | | Sentiment Analysis | Pang and Lee, '08 (skip Section 5 and only skim Section 6), Pang and Lee, '02
26 | Tue. 4/8 | HW3 Due | Discourse Analysis | Arguello '15
27 | Thu. 4/10 | | Detecting Viewpoint | Wiebe '10
28 | Tue. 4/15 | | Text-based Forecasting | Lerman et al., '08
29 | Thu. 4/17 | Well-being Day (No Class) | |
30 | Tue. 4/22 | | Final Project Presentations I |
31 | Thu. 4/24 | Literature Review Due | Final Project Presentations II |
32 | Sat. 5/3 | Project Due | |