Homework 1: Analyzing Plain Text
Due date: See class schedule
In this assignment you’ll gain skills with basic Python control flow (conditionals, iteration, and so forth), data structures (lists and dictionaries), and file I/O. You’ll also get experience with integrating external packages into Python.
The goal of this assignment is to create a program that will be passed a folder name as a parameter. Your program will get the contents of this folder (using os.listdir) and iterate through all of the text files (which you can assume are the files whose names end in .txt). Your program will read in the contents of these text files and perform the following analysis on them, printing the results when it finishes running (a rough sketch of this structure appears after the list below):
- For each file, the number of times each unique word appears in that file, along with the percentage of the file's total word count that each word represents. Be sure to ignore differences in capitalization as well as any punctuation that may be in the file. For example, if the file contains “A man, a plan, Panama” your program should print out something similar to:
File: panama.txt
a: 2 (40% of total words)
man: 1 (20% of total words)
plan: 1 (20% of total words)
panama: 1 (20% of total words)
- For each file, determine the sentiment and the subjectivity of the text it contains. To do this, you should install and import the TextBlob package (we’ll discuss this in class), and then use its API to determine whether the tone of the text is positive, negative, or neutral, and whether it’s more subjective or objective. For example:
Sentiment is STRONGLY POSITIVE (0.511132334)
Subjectivity is MILDLY SUBJECTIVE (0.59322321)
- Finally, for the entire set of documents in the folder, your program should calculate and print the total number of words and the number of unique words across all documents. For example:
Document corpus contains 2,397 total words (1,486 unique words)
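
To make the expected structure a little more concrete, here is a minimal sketch of the counting portion (the sentiment piece is covered under HINTS below). The helper name, the exact output formatting, and the use of string.punctuation to strip punctuation are illustrative choices in this sketch, not requirements:

import os
import string
import sys

def words_in(path):
    # Read a file, lowercase it, strip punctuation, and split it into words.
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

def main():
    folder = sys.argv[1]                      # folder name passed as a parameter
    corpus_total = 0
    corpus_unique = set()
    for name in os.listdir(folder):
        if not name.endswith(".txt"):         # skip any non-text files
            continue
        words = words_in(os.path.join(folder, name))
        counts = {}                           # word -> number of occurrences
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        print("File:", name)
        for w, c in counts.items():
            print("{}: {} ({:.0f}% of total words)".format(w, c, 100 * c / len(words)))
        corpus_total += len(words)
        corpus_unique.update(words)
    print("Document corpus contains {:,} total words ({:,} unique words)".format(
        corpus_total, len(corpus_unique)))

if __name__ == "__main__":
    main()
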
For the sentiment and subjectivity analysis, you’ll need to install the TextBlob package: this is an external package (meaning one that doesn’t ship with Python) that you’ll need to install on your computer before you can use it.
This page has a tutorial on TextBlob and details about installing it. Note that you’ll need to install the corpora (the language training dataset) in order to get the best results. We’ll walk through how to install it in class, and the link just above has details as well.
HINTS:
Remember that your program should take as an input parameter the name of the folder containing the text files. There may be other, non-text files in this folder, so your program should only process files ending in .txt.
There is a folder of sample news articles posted on T-Square in the resources folder, called sampletext.zip.
For the word count portion of the assignment, the “trick” is to use a dictionary in which each word is a key and its count is the associated value. Since keys are unique, this will let you keep track of the count for each unique word in a given file.
You should be able to install the textblob package via pip. Once it’s installed on your computer, use “from textblob import TextBlob” to access the sentiment analysis functions.
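
For illustration, the TextBlob calls you’ll need look roughly like this. The thresholds used to map the numeric scores to labels such as “STRONGLY POSITIVE” are an assumption of this sketch; pick whatever cutoffs you can justify (and refine the labels, e.g. MILDLY vs. STRONGLY, as you see fit):

from textblob import TextBlob

text = "I love this wonderful, happy example."   # in your program, use text read from a file
blob = TextBlob(text)

polarity = blob.sentiment.polarity           # ranges from -1.0 (negative) to 1.0 (positive)
subjectivity = blob.sentiment.subjectivity   # ranges from 0.0 (objective) to 1.0 (subjective)

# Example mapping from score to label; choose your own cutoffs.
if polarity > 0.5:
    sentiment_label = "STRONGLY POSITIVE"
elif polarity > 0:
    sentiment_label = "MILDLY POSITIVE"
elif polarity < -0.5:
    sentiment_label = "STRONGLY NEGATIVE"
elif polarity < 0:
    sentiment_label = "MILDLY NEGATIVE"
else:
    sentiment_label = "NEUTRAL"

subjectivity_label = "SUBJECTIVE" if subjectivity > 0.5 else "OBJECTIVE"

print("Sentiment is {} ({})".format(sentiment_label, polarity))
print("Subjectivity is {} ({})".format(subjectivity_label, subjectivity))

Note that blob.sentiment returns both values together as a named tuple, so you only need to analyze each file once.
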
TO SUBMIT:
Create and submit on T-Square a ZIP file containing:
- Your Python program
- A README.txt file containing anything we may need to know about your program (e.g., extra functionality you implemented).
COMMENTARY AND EXTRA CREDIT:
This homework demonstrates some rudimentary text processing and file I/O. Using this program, you can determine the words that occur most frequently in a given corpus of text files. You may notice, however, that simply knowing the most commonly occurring words does not tell you much about the contents of a file: simple words like conjunctions (“and”) and determiners (“the”) are often the most common words in an English text.
What if you wanted to determine which words were the most important in a text (for some definition of important)? One way could be to look for words that occur commonly in a given text, but are not common across the corpus as a whole. This would eliminate words like “and” and “the” and focus only on the unusual words that are the most specific to that given text.
One of the foundational ways to do this is a technique called term frequency-inverse document frequency (TF-IDF for short). The idea is to find terms that occur frequently in a document but infrequently across the rest of the corpus. The algorithm is a cornerstone of information retrieval and text analysis.
For up to 10 points of extra credit, extend your program to implement TF-IDF in addition to the basic word counting and sentiment analysis capabilities. The output of the TF-IDF portion of your program should be, for each file, the top 10 most “distinctive” terms in that file relative to the corpus overall. In other words, the code should print out, for each file, the terms with the highest computed TF-IDF. The Wikipedia article gives a concise outline of the algorithm, if you want to try it.
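
If you attempt the extra credit, one common formulation looks roughly like the sketch below. It assumes you already keep one word-to-count dictionary per file (as in the hint above); the function name and the log-based weighting are just one standard variant:

import math

def tf_idf(word, counts, all_counts):
    # counts: word -> count dictionary for one document
    # all_counts: list of such dictionaries, one per document in the corpus
    tf = counts[word] / sum(counts.values())                  # term frequency in this document
    docs_with_word = sum(1 for c in all_counts if word in c)  # document frequency
    idf = math.log(len(all_counts) / docs_with_word)          # inverse document frequency
    return tf * idf

# Top 10 most distinctive terms for one document:
# sorted(counts, key=lambda w: tf_idf(w, counts, all_counts), reverse=True)[:10]
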