Here are some terms I’ll ask you to define on the final:
topic modelling (bag of words)
directed vs. […]
Why do you think you got slightly different results with these two methods? I’m somewhat surprised there are completely separate groups, but I like that you connected this to your other oppose visualization, using the two to reflect against each other.
Lol, nodes are the points, edges are the connections. In our case, the edges are “undirected” (equal in both directions).
You don’t have to size the nodes, but it might help you see the relationships. It’s sort of like a dendrogram, where you can measure the distance between items . . . a larger-sized node means it has higher…
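To make the nodes/edges vocabulary concrete, here is a minimal sketch of an undirected network in Python. The comment above is cut off before it says which measure node size reflects, so this sketch assumes degree (the number of connections a node has), one common choice. The edge list and node names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical undirected edges between four texts (illustrative names only).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]


def degree_by_node(edge_list):
    """Count connections per node; an undirected edge counts for both of its ends."""
    degree = defaultdict(int)
    for left, right in edge_list:
        degree[left] += 1
        degree[right] += 1
    return dict(degree)


# Assumption: node size is proportional to degree.
degrees = degree_by_node(edges)
```

Because the edges are undirected, writing an edge as ("B", "A") instead of ("A", "B") leaves every count unchanged.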
I think this is an interesting data set. I’d be curious to see how the role of different speechwriters has an effect, since the SOTUs are heavily “scripted” (as an aside, looking at transcripts of presidents’ unrehearsed comments would also be illuminating).
I think our assumption would be that presidents should cluster together based on…
I’m curious about your expectations for comparing Twain and Dickens. Were you looking for national differences? And what would testing Jane Austen help you see?
Why do you think Austen is more like Dickens than Twain? And why is Twain more like Dickens than Austen? And do those results contradict each other? It’s unclear to me how you’re…
I think you’re absolutely right to reflect on how making decisions about dicing your corpus sort of determines what exactly you are hoping to test. It also is the moment where our preconceived ideas about literary style can sneak in . . . such as the relative importance of nationality, gender, etc. Looking at protagonist gender is interesting, and…
I wish I could see your preferred/avoided words better–I’m wondering if they were surprising to you or if anything stood out from that list?
You should also run oppose() again and select “markers” in order to get the plots pointed in those two overlapping shapes, with plusses for your test set.
One challenge for interpreting your results is…
It’s interesting to see some of your assumptions be revealed around the “consistency” so to speak of a male authorial voice over time.
You’re correct about interpreting the PCA space: the closer to zero, the more similar in usage of the preferred word list. One thing that would make it easier to interpret would be to remove the leading “f_” so…
This method identifies words that are distinctive between two corpora: which words one prefers and which it avoids.
This method will always find difference, so you have to tread lightly.
Can compare author to author, group of […]
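As a rough illustration of the "prefers and avoids" idea, here is a simplified sketch in Python. It is not oppose() itself. It assumes a segment-based score in the spirit of Zeta-style measures: the share of primary-corpus segments that contain a word, minus the share of secondary-corpus segments that contain it. The function names, segment size, and sample sentences are all hypothetical.

```python
def split_into_segments(words, size):
    """Cut a word list into consecutive chunks and keep the distinct words of each."""
    return [set(words[i:i + size]) for i in range(0, len(words), size)]


def segment_proportions(segments):
    """For each word, the fraction of segments in which it appears at least once."""
    if not segments:
        return {}
    counts = {}
    for segment in segments:
        for word in segment:
            counts[word] = counts.get(word, 0) + 1
    return {word: count / len(segments) for word, count in counts.items()}


def preference_scores(primary_words, secondary_words, size):
    """Positive score: preferred by the primary set. Negative score: avoided by it."""
    primary = segment_proportions(split_into_segments(primary_words, size))
    secondary = segment_proportions(split_into_segments(secondary_words, size))
    vocabulary = set(primary) | set(secondary)
    return {w: primary.get(w, 0.0) - secondary.get(w, 0.0) for w in vocabulary}


# Made-up sample "corpora" of eight words each, cut into segments of four words.
primary_sample = "the whale and the sea and the whale".split()
secondary_sample = "the ball and the dance and the ball".split()
scores = preference_scores(primary_sample, secondary_sample, size=4)
```

Words that both sets use equally (here "the" and "and") score zero, while any word used unevenly gets a non-zero score. That is why a score like this will nearly always report some difference between any two sets of texts.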
Don’t forget that you can also add the stop-words back in to see if that would make a difference in terms of the results you see. Along those lines, I bet “footnote” is something you could remove to try to get better results, since it is clearly skewing your y-axis.
It seems like you’re seeing nations spread across the x-axis. Is there any…
You need to add citations for any material you are quoting or paraphrasing.
I wanted to see you better explain your topics, and if you think you had enough or too many to gain insights into your corpus.
I think you may have only done this with stopwords turned on? Don’t forget that this approach, while fine, is also excluding the actual most frequent words.
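To see why the stop-word setting matters, here is a small Python sketch that lists the most frequent words in a text with and without a stop-word filter. The stop-word list, the tokenizing rule, and the sample sentence are all illustrative assumptions, not the settings of any particular tool.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list.
STOP_WORDS = {"the", "and", "of", "a", "to", "in"}


def most_frequent_words(text, n, remove_stop_words):
    """Return the n most frequent lowercase word tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if remove_stop_words:
        tokens = [token for token in tokens if token not in STOP_WORDS]
    return [word for word, _ in Counter(tokens).most_common(n)]


sample = "The captain and the crew sailed to the island, and the captain slept."
top_with_stop_words = most_frequent_words(sample, 1, remove_stop_words=False)
top_without_stop_words = most_frequent_words(sample, 1, remove_stop_words=True)
```

With the filter off, the top word is a function word ("the"). With it on, a content word ("captain") takes its place. Running an analysis both ways shows whether a pattern depends on that choice.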
I’m guessing the word “gutenberg” is so prominent since it’s part of the boilerplate language for most of your corpus? “TM” and work/works are also likely in the same group, and should be…
I’m not sure I understand the difference between the first and second visualization. What did you do differently there?
Don’t forget that you should run these results both with and without stop words, to see if your argument holds at both levels.
I was hoping you could do better in explaining what you see along the two principal component…
I hope you find PCA in R easier! Not as many options, but much less fiddly with final results.
The results you got here, though, look very compelling and ordered. The pink and green clusters are rather distinct, as is the more dominant upper left group. Looking at the MFW, clearly the use of “the” is a good predictor of how to separate the…
Did you try adding stop words in? I wonder if you would notice the same results at both levels, particularly the question of why King Arthur stands out (and what distinguishes fairy tales from history).
I was hoping you could do a better job explaining what you see with the two principal components. Is principal component 1 capturing the…
I’m not sure I understand what the two principal components are. In the third visualization, there is a strong effect on the y-axis, but only a little variance on the x-axis. How do you explain that? I wonder why “bones” is so important in accounting for difference? You note how it and “captain” are unique outliers, but I want to know…
I think your corpus provides a great opportunity for topic modeling, since we would expect different politicians to value, and talk about, certain concepts more than others.
Do you think ten topics was the right amount? Or were there perhaps more ways to split?
I see what you mean about your texts perhaps being too homogenous, but that could be a strength if you do see difference, since you’re working in a fairly well-defined genre.
I’m curious about how we can interpret your two principal components. Did you try adding stop words back in? It looks like a handful of Obama speeches really tilt the…
Great observation about how the “nation signal” seems to take over the gender one in your corpus. I’m curious if you would have had the same results adding the stop words back in? That might also reduce the effect of “pickwick” somewhat.
I was hoping you could say a little more about what was being expressed by the two principal components. You…