Python Text Analysis With the Schrutepy Package
Following the success of the {schrute} R package, many requests came in for the same dataset ported over to Python. The schrute and schrutepy packages serve one purpose only: to load the complete transcripts from The Office, so you can perform NLP, text analysis, or whatever you like with this fun dataset.
Quick start
Install the package with pip:
pip install schrutepy
Then import the dataset into a dataframe:
from schrutepy import schrutepy

df = schrutepy.load_schrute()
That’s it. Now you’re ready.
Long example
Now we’ll quickly work through some common elementary text analysis functions.
from schrutepy import schrutepy
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import nltk
from nltk.corpus import stopwords
from PIL import Image
import numpy as np
import collections
import pandas as pd
Load the entire transcript with the load_schrute function
df = schrutepy.load_schrute()
Inspect the data
df.head()
|  | index | season | episode | episode_name | director | writer | character | text | text_w_direction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | All right Jim. Your quarterlies look very good… | All right Jim. Your quarterlies look very good… |
| 1 | 2 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Jim | Oh, I told you. I couldnt close it. So… | Oh, I told you. I couldnt close it. So… |
| 2 | 3 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | So youve come to the master for guidance? Is … | So youve come to the master for guidance? Is … |
| 3 | 4 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Jim | Actually, you called me in here, but yeah. | Actually, you called me in here, but yeah. |
| 4 | 5 | 1 | 1 | Pilot | Ken Kwapis | Ricky Gervais;Stephen Merchant;Greg Daniels | Michael | All right. Well, let me show you how its done. | All right. Well, let me show you how its done. |
Some of the records don’t contain dialogue, so drop them:

df = df.dropna()
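Before plotting anything, it can help to know who actually talks the most. Pandas’ value_counts on the character column gives you that in one line; here is a minimal sketch using a made-up toy dataframe in place of the real transcript:

```python
import pandas as pd

# Toy stand-in for the transcript dataframe (made-up rows, not real data)
df = pd.DataFrame({
    "character": ["Michael", "Jim", "Michael", "Dwight", "Michael"],
    "text": ["line 1", "line 2", "line 3", "line 4", "line 5"],
})

# Count dialogue lines per character, most talkative first
top_speakers = df["character"].value_counts()
print(top_speakers)
```

On the real dataframe the same call ranks every character in the show by line count.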
Create a wordcloud of all the text in the entire series
= " ".join(review for review in df.text) text
print ("There are {} words in the combination of all review.".format(len(text)))
There are 3001517 words in the combination of all review.
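Note that len(text) counts characters, not words; a rough word count just needs a whitespace split first. A quick sketch with a made-up sample line:

```python
# Made-up sample line standing in for the joined transcript
text = "Bears. Beets. Battlestar Galactica."

n_chars = len(text)          # character count: what len() gives you directly
n_words = len(text.split())  # rough word count: split on whitespace

print(n_chars, n_words)
```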
# Create stopword list:
nltk.download('stopwords')
stopWords = set(stopwords.words('english'))

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopWords, background_color="white").generate(text)

# Display the generated image:
# the matplotlib way:
plt.figure(figsize=[30, 15])
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
[nltk_data] Downloading package stopwords to /home/xps/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Let’s do this same thing for a few of the characters. Might as well make a function at this point…
def plotDunder(character, df):
    mydf = df[df.character == character]
    text1 = " ".join(review for review in mydf.text)

    # Generate a word cloud image
    wordcloud = WordCloud(stopwords=stopWords, background_color="white").generate(text1)

    # Display the generated image:
    # the matplotlib way:
    plt.figure(figsize=[15, 7])
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(character)
    plt.axis("off")
    plt.show()
= ["Michael", "David Wallace", "Dwight", "Jim", "Pam", "Oscar", "Phyllis", "Creed", "Ryan",] fav
for i in fav:
plotDunder(i, df)
Let’s make one in the shape of Dwight’s large head
dwight_mask = np.array(Image.open("schrutepy.png"))

# Create a word cloud image
wc = WordCloud(background_color="white", max_words=1000, mask=dwight_mask,
               stopwords=stopWords, contour_width=1, contour_color='grey')

# Generate a wordcloud
wc.generate(text)

# show
plt.figure(figsize=[30, 15])
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

wc.to_file("final_schrute.png")
<wordcloud.wordcloud.WordCloud at 0x7fa1036a8b00>
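If you want to see how the mask steers the layout, wordcloud’s convention is that white (255) pixels in the mask array are excluded and words are only drawn on the remaining pixels. A tiny numpy sketch with a synthetic mask, so no image file is needed:

```python
import numpy as np

# Synthetic mask instead of schrutepy.png: 255 (white) pixels are excluded,
# words get drawn only where the mask is non-white
mask = np.full((4, 4), 255, dtype=np.uint8)
mask[1:3, 1:3] = 0  # carve out a 2x2 drawable region in the middle

drawable_cells = int((mask == 0).sum())
print(drawable_cells)
```

With a real image, Image.open plus np.array produces exactly this kind of array, just bigger.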
Now let’s find and plot the most common word spoken by my favorite characters
def commonWord(character, df):
    mydf = df[df.character == character]
    text = " ".join(review for review in mydf.text)
    wordcount = {}
    # Lowercase and strip punctuation so duplicates collapse to one key
    for word in text.lower().split():
        word = word.replace(".", "")
        word = word.replace(",", "")
        word = word.replace(":", "")
        word = word.replace("\"", "")
        word = word.replace("!", "")
        word = word.replace("“", "")
        word = word.replace("‘", "")
        word = word.replace("*", "")
        if word not in stopWords:
            if word not in wordcount:
                wordcount[word] = 1
            else:
                wordcount[word] += 1

    # Keep the ten most common words
    n_print = 10
    word_counter = collections.Counter(wordcount)

    # Draw a bar chart
    lst = word_counter.most_common(n_print)
    df = pd.DataFrame(lst, columns=['Word', 'Count'])
    df.plot.bar(x='Word', y='Count', title=character)
for i in fav:
    commonWord(i, df)
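As an aside, the chain of str.replace calls in commonWord can be collapsed into a single regex pass that keeps only word characters. A minimal stdlib sketch, using a made-up sample line rather than the real transcript:

```python
import re
import collections

# Made-up sample line standing in for a character's dialogue
text = "All right Jim. Your quarterlies look very good. All right."

# \b\w+\b pulls out runs of word characters, dropping punctuation in one pass
words = re.findall(r"\b\w+\b", text.lower())
counts = collections.Counter(words)

print(counts.most_common(2))
```

Pairing this with a stopword filter gives the same ranking as the replace chain, with less code to maintain.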
Star this repo on Github?
Want more content like this? Subscribe here