One of the most basic applications of text mining is simply counting words. I began by stripping out punctuation (to avoid treating “mend” and “mend.” as two separate words), converting every word to lowercase, and ignoring a list of stop words (the, and, for, etc.). By writing a program to count occurrences of the 500 most common words, I could get a broad (and more quantitative) sense of what topics Martha Ballard wrote about in her diary. Unsurprisingly, her vocabulary followed the familiar pattern of steep, long-tailed decay: like most people, she used a relatively small number of words with extreme frequency. For example, the most common word (mr) occurred 10,050 times, while her 500th most common word (relief) occurred 67 times:
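The program itself isn't reproduced here, but a minimal Python sketch of the counting steps described above might look like this. The stop-word list and the sample sentence are placeholders I made up for illustration, not the actual list or diary text.

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated stop-word list (the real list is not given).
STOP_WORDS = {"the", "and", "for", "a", "to", "of", "at", "in"}

def count_words(text, stop_words=STOP_WORDS, top_n=500):
    """Return the top_n most common words as (word, count) pairs."""
    # Lowercase first, then delete every character that is not a word
    # character or whitespace, so "mend." and "mend" become the same token.
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    words = [w for w in cleaned.split() if w not in stop_words]
    return Counter(words).most_common(top_n)

# Invented sample text, only to show the shape of the output.
sample = "Mr Ballard went to mend the fence. I did mend stockings, and mr Savage called."
top_words = count_words(sample)
```

Deleting punctuation outright also fuses contractions and possessives into single tokens, which is a simplification that may or may not suit a given text.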
Because each word has information attached to it – specifically what date it was written – we can look at long-term patterns for a particular word’s usage. However, looking at only raw word frequencies can be problematic. For example, if Ballard wrote the word yarn twice as often in 1801 as in 1791, it could mean that she was doing a lot more knitting in her old age. But it could also mean that she was simply writing a lot more words in her diary overall. To address this issue, I normalized the frequency of any word I was examining – first by dividing it by the total word count for that year, then by dividing it by the average usage of the word over the entire diary. This allowed me to visualize how a word’s relative frequency changed from year to year.
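To make the two-step normalization concrete, here is a rough sketch. It assumes that "average usage over the entire diary" means the word's share of all words in the diary, which is one plausible reading of the description. The function name, the input format, and the sample data are all invented for illustration.

```python
from collections import Counter

def relative_frequency(word, entries):
    """entries: iterable of (year, list_of_words). Returns {year: relative frequency}."""
    word_counts = Counter()   # occurrences of `word` per year
    total_counts = Counter()  # all words written per year
    for year, words in entries:
        total_counts[year] += len(words)
        word_counts[year] += words.count(word)

    # Assumed baseline: the word's share of every word in the whole diary.
    all_words = sum(total_counts.values())
    overall_share = sum(word_counts.values()) / all_words if all_words else 0.0

    result = {}
    for year in sorted(total_counts):
        # Step 1: divide by the total word count for that year.
        share = word_counts[year] / total_counts[year] if total_counts[year] else 0.0
        # Step 2: divide by the diary-wide baseline.
        result[year] = share / overall_share if overall_share else 0.0
    return result

# Made-up data: "yarn" appears twice as often in 1801, but so does everything else.
entries = [
    (1791, ["yarn", "clear", "clear", "mr"]),
    (1801, ["yarn", "yarn", "clear", "mr", "mr", "mr", "clear", "clear"]),
]
yarn = relative_frequency("yarn", entries)
```

In this toy case the raw count of yarn doubles between the two years, yet the normalized value is the same for both, because the diary as a whole doubled in length. A value above 1 would mean the word was used more than usual that year, and below 1 less than usual.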
In order to visualize the information, I settled on trying out sparklines: “small, intense, simple datawords” advocated by infographics guru Edward Tufte and meant to give a quick, somewhat qualitative snapshot of information. To test my method, I used a theme that Laurel Ulrich describes in A Midwife’s Tale: land surveying. In particular, during the late 1790s Martha’s husband Ephraim became heavily involved in surveying property. In the raw word count list, both survey and surveying appear in the top 500 words, so I combined the two and looked at how Martha’s use of them in her diary changed over the years (1785-1812):
Looking at the sparkline, we get a visual sense for when surveying played a larger role in Martha’s diary – around the middle third, or roughly 1795-1805, which corresponds relatively well to Ulrich’s description of Ephraim’s surveying adventures. As a basis for comparison, the word clear appeared with numbing regularity (almost always in reference to the weather):
Using word frequencies and sparklines, I could investigate and visualize other themes in the diary as well.
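For readers who want to try something similar, the sketch below shows one simple way to combine two word forms (as with survey and surveying) and draw a rough text-based sparkline from Unicode block characters. It is only an approximation of the idea, not the tool used for the graphics in this post, and the yearly counts are invented.

```python
BARS = "▁▂▃▄▅▆▇█"

def combine(*series):
    """Add up per-year values for several word forms, e.g. survey + surveying."""
    years = sorted(set().union(*series))
    return {year: sum(s.get(year, 0) for s in series) for year in years}

def sparkline(values):
    """Map each value onto one of eight bar heights, scaled between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A perfectly flat series has no range to scale, so draw the lowest bar.
        return BARS[0] * len(values)
    scale = (len(BARS) - 1) / (hi - lo)
    return "".join(BARS[round((v - lo) * scale)] for v in values)

# Invented yearly counts, purely for illustration.
survey = {1794: 1, 1795: 4, 1796: 6}
surveying = {1795: 2, 1796: 3, 1797: 1}
combined = combine(survey, surveying)
line = sparkline(list(combined.values()))
```

Because each line is scaled to its own minimum and maximum, sparklines drawn this way show the shape of a trend rather than absolute magnitudes, so two words should not be compared by bar height alone.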
Out of the 500 most frequent words in the diary, only three of them relate directly to religion: meeting (#28), worship (#143), and god (#220).
Meeting, which was used largely in a religious context (going to a church meeting) but also in a socio-political context (attending town meetings), had a relatively consistent rate of use, although it trended slightly upwards over time. Worship (which Martha largely used in the sense of “went to publick worship”), meanwhile, was more erratic and trended slightly downwards. Finally, and perhaps most interesting, is Martha’s use of the word god. Almost non-existent in the first third of her diary, it then occurred much more frequently, but also more erratically, over the final two-thirds of the diary. Not only was it a relatively infrequent word overall (flax, horse, and apples occur more often), but its usage pattern suggests that Martha Ballard did not directly invoke a higher power on a personal level with any kind of regularity (at least in her diary). Instead, she was much more comfortable referring to the more social, community-based activity of attending a religious service. While a qualitative close reading of the text would give a richer impression of Martha’s spirituality, a quantitative approach demonstrates how little “real estate” she dedicates to religious themes in her diary.
Most of the words related to death show an erratic pattern. There are peaks and valleys across the years without much correlation between the different words, and the only word that appears with any kind of consistency is interd (interred). In this case, word frequency and sparklines are relatively weak analytical tools. They don’t speak to any kind of coherent pattern, and at most they vaguely point towards additional questions for study – what causes the various extreme peaks in usage? Is there a common context in which Martha uses each of the words? Why was interd so much flatter than the others?
In this final section, I’ll offer up a small taste of how analyzing word frequency can reveal interpersonal relationships. I used the particular example of Dolly (Martha’s youngest daughter):
The sparkline does a phenomenal job of driving home a drastic change in how Martha refers to her daughter. In a matter of a year or two in the mid-1790s, she goes from writing about Dolly frequently to almost never mentioning her. Why? Some quick detective work (or reading page 145 in A Midwife’s Tale) shows that the plummet coincides almost perfectly with Dolly’s marriage to a man named Barnabas Lambard in 1795. But why on earth would Martha go from mentioning Dolly all the time in her diary to going entire years without writing her name? Did Martha disapprove of her daughter’s marriage? Was it a shotgun wedding?
The answer, while not so scandalous, is an interesting one nonetheless, and one that text analysis and visualization help to elucidate. In short, Martha still writes about her daughter after 1795, but instead of referring to her as Dolly, she begins to refer to her as Dagt Lambd (Daughter Lambard). This is a fascinating shift, and one whose full significance might get lost in a traditional reading. A human poring over these detailed entries might get a vague impression that Martha has started calling her daughter something different, but the sparkline above drives home just how abrupt and dramatic that transformation really was. Martha, by and large, stopped calling her youngest daughter by her first name and instead adopted the new husband’s surname. Such a vivid symbolic shift opens up a window into an array of broader issues, including marriage patterns, familial relationships, and gender dynamics.
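One way to check a renaming like this is to count each form of a person's name separately, year by year, and see whether one rises as the other falls. The sketch below is a hypothetical illustration of that idea. The function name and the sample entries are invented, and real diary spellings vary far more than two fixed aliases can capture.

```python
import re
from collections import defaultdict

def mentions_by_year(entries, aliases):
    """entries: iterable of (year, text). Returns {year: {alias: count}}."""
    # Whole-word, case-insensitive match for each alias (may contain spaces).
    patterns = {
        alias: re.compile(r"\b" + re.escape(alias) + r"\b", re.IGNORECASE)
        for alias in aliases
    }
    counts = defaultdict(lambda: dict.fromkeys(aliases, 0))
    for year, text in entries:
        for alias, pattern in patterns.items():
            counts[year][alias] += len(pattern.findall(text))
    return dict(counts)

# Invented entries, only to show the shape of the result.
entries = [
    (1794, "Dolly washed. Dolly went to Mr Pollards."),
    (1796, "Dagt Lambd here. Dagt Lambd sleeps here."),
]
mentions = mentions_by_year(entries, ["Dolly", "Dagt Lambd"])
```

Summing the aliases for each year would then give a single series for the person rather than for any one name.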
Counting word frequency is a somewhat blunt instrument that, if used carefully, can certainly yield meaningful results. In particular, utilizing sparklines to visualize individual word frequencies offers up two advantages for historical inquiry:
- Coherently display general trends
- Reveal outliers and anomalies
First, sparklines are a great way to get a quick impression of how a word’s use changes over time. For example, we can see above that the frequency of the word expired steadily increases throughout the diary. While this can often simply reiterate suspected trends, it can ground these hunches in refreshingly hard data. By the end of the diary, a reader might have a general sense for how certain themes appear, but a text analysis can visualize meaningful patterns and augment a close reading of the text.
Second, sparklines can vividly reveal outliers. When reading hundreds of thousands of words across nearly 10,000 entries, it’s quite easy to lose sight of the forest for the trees (to use a tired metaphor). Visualizing word frequencies allows historians to gain a broader perspective on a piece of text, and the resulting sparklines also act as signposts pointing the viewer towards a specific area for further investigation (such as the red-flag-raising rupture in how frequently Dolly appears). Relatively basic word frequency analysis by itself (such as what I’ve done here) does not necessarily explain anomalies, but it can do an impressive job of highlighting important ones.