Text Analysis of Martha Ballard’s Diary (Part 1)

“mr Ballard left home bound for Oxford. I had been Sick with the Collic. mrs Savage went home. mrs foster Came at Evening. it snowd a little.”

This is the first entry in the diary of Martha Ballard, a rural Maine midwife who kept an extensive diary between 1785 and 1812 and whose life was immortalized in 1990 by the historian Laurel Thatcher Ulrich’s award-winning A Midwife’s Tale. Over the course of nearly three decades, Ballard kept a meticulous, near-daily accounting of her life spanning almost 10,000 entries.

When reading A Midwife’s Tale, I was struck by how readily the text would seem to lend itself to digital analysis. In an interview, Ulrich noted, “The very thing that had attracted me to the diary in the first place was also the thing that made it difficult to work with. I mean there’s just so much.” To ground herself, she began by simply counting things: “And I would go day by day for every other year of the diary, and I would tick off what was in each entry: baking or brewing, spinning or washing, or trading, sewing, mending, deliveries, general medical accounts, going to church, visitors, people coming for meals, etc.” Because of the sprawling scope, she took this quantitative approach only for the even-numbered years in the diary. The fact that she was working in the late eighties without a computer makes her work even more impressive.

After poking around online I came across DoHistory.org, a website developed and maintained by the Film Study Center at Harvard University and hosted by (who else, really) George Mason’s CHNM. The website presents the diary to the public in two formats: the viewer can either browse through photographed pages of the diary or read the transcript of the pages (transcribed through a monumental effort by Robert R. McCausland and Cynthia MacAlman McCausland):

[Images: a photographed diary page and its transcribed text]

When I realized the entire diary was online, it got me thinking about possibilities for text mining. As an aspiring digital humanist with few “hard” skills beyond basic GIS, I had been meaning to learn how to program for quite some time. In Martha Ballard’s diary, I had an intriguing source of data with which to do so. Now I just had to learn how to program. With the patient help of several programming-savvy family members, I gradually learned the basics of Python and how to apply it to Martha Ballard’s diary. What follows are the first steps we took to process the diary’s raw data into an accessible digital format.

Process

At first, I briefly considered learning how to scrape the text of the diary off the website. After some investigation, I decided that was a little beyond my abilities, so I copped out and took the much easier route of sending an email to Kelly Schrum at CHNM. She kindly forwarded my request to Ammon Shepherd, who emailed me a zip file containing 1,431 html documents, one for each page of the diary. Each html file of the transcribed diary is a basic three-column table. My first step was to find a way to strip out the html tags and organize the text into a systematic database of individual entries. Fortunately, Ballard’s meticulousness and consistency lent themselves well to such an approach.
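As a rough illustration of that first step (not the original program), the sketch below uses Python's standard html.parser module to pull the text out of each table cell and group the cells by row. The class name, the function name, and the sample markup are all assumptions made up for this example; the real files may be laid out differently.

```python
from html.parser import HTMLParser


class TableCellExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell; other tags are simply ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the fragments and collapse leftover runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html_text):
    """Return a list of rows, each row a list of cell strings."""
    parser = TableCellExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Hypothetical three-column row; the real markup is an assumption here.
sample = """
<table>
  <tr>
    <td>3</td>
    <td>mrs. Foster went <i>home</i>.</td>
    <td></td>
  </tr>
</table>
"""
rows = extract_rows(sample)
print(rows)
```

Each row comes back as a plain list of three strings, with the formatting tags gone, which is a convenient starting point for building the entries.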

The diary’s format translates quite nicely into creating a list of lists – the “main” diary being a list of all the entries, and each entry being a list in and of itself. The first program we wrote was to open each html file and begin extracting the different sections of text (which were conveniently marked by html tags). Iterating through each entry allowed us to separate the different columns in her diary into different items in the list. Here is the breakdown of our “list of lists”:

  1. Diary
    1. Entry
      1. Date
        1. Month
        2. Day
        3. Year
      2. Day of the Week
      3. Main Text of Entry
      4. Day Summaries (Column 3 of actual diary entry)
      5. Birth(s) (Recorded in Column 1 of actual diary entry)

In creating the list, we had to separate the raw data from the html tags that formatted it. Fortunately, the folks who built the html files used an extremely systematic formatting process, which made the job of distilling one from the other quite straightforward. A Python module called pickle allowed us to export the list of entries as a single manageable file that we could then easily import into future programs to manipulate.
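For readers who have not used pickle, here is a minimal sketch of the export and import round trip. The file name is hypothetical, and storing empty columns as empty strings is an assumption of this example rather than a description of the original program.

```python
import pickle
import tempfile
from pathlib import Path

# One entry in the five-slot layout: date, weekday, main text, summaries, births.
# Empty columns are stored as "" here, which is an assumption.
diary = [
    [[1, 3, 1785], 3,
     "Tuesday. mrs. Foster went home. I had threats of thee Collic; "
     "by takein peper found releif.",
     "", ""],
]

# Hypothetical file name, written to the system's temporary directory.
path = Path(tempfile.gettempdir()) / "ballard_diary.pickle"

# Export the whole list of lists as a single file.
with open(path, "wb") as f:
    pickle.dump(diary, f)

# Later, in another program, load it back unchanged.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == diary)
```

One caution: pickle files should only be loaded when they come from a source you trust, since loading one can execute arbitrary code.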

For example, the third entry in the diary would translate into something like this:

  1. Diary
    1. Entry (3)
      1. Date
        1. 1 (January)
        2. 3
        3. 1785
      2. 3 (Tuesday – Ballard numbered the weekdays, beginning with Sunday as 1)
      3. “Tuesday. mrs. Foster went home. I had threats of thee Collic; by takein peper found releif.”
      4. Empty
      5. Empty

The list allows us to access pieces of information by “calling” their position. It helped me to think of the entire diary list as a warehouse containing almost 10,000 boxes (entries) inside it, with each box containing five compartments, and the first of those compartments divided into three sub-compartments. If you were to open any of the boxes (entries) and look inside the first compartment, then inside sub-compartment number one, you would always find a number that represented the month of that particular entry. If you were to look inside the third compartment of the entry/box, you would always find the main text for that day’s entry.
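In Python terms, “calling” a position means indexing into the list. The sketch below follows the breakdown given earlier (month, day, year inside the date). Note that Python counts from zero, so the first compartment is index 0 and the third is index 2. The variable names are illustrative, and empty columns are again assumed to be empty strings.

```python
# A single entry (box) with its five compartments.
entry = [
    [1, 3, 1785],   # compartment 1: date as month, day, year
    3,              # compartment 2: day of the week
    "Tuesday. mrs. Foster went home. I had threats of thee Collic; "
    "by takein peper found releif.",   # compartment 3: main text
    "",             # compartment 4: day summaries (empty)
    "",             # compartment 5: births (empty)
]

# The diary (warehouse) is a list of such entries.
diary = [entry]

first = diary[0]        # open the first box
month = first[0][0]     # first compartment, first sub-compartment
day = first[0][1]       # first compartment, second sub-compartment
year = first[0][2]      # first compartment, third sub-compartment
main_text = first[2]    # third compartment

print(month, day, year)
print(main_text)
```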

The advantage of setting up the data in a list structure is the ability to access these specific pieces of information easily and to compare them across entries. In many ways, processing the text to make it readable and programmable is one of the biggest challenges of text mining. Deciding on the most logical way to organize and break down over 1,400 files lays the groundwork for the fun part: writing programs to actually analyze the diary of Martha Ballard.
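To give a sense of what comparing across entries can look like, here is a small sketch that tallies entries per month and collects the dates of entries mentioning a given word. The three entries are invented placeholders, not real diary data, and the search word is arbitrary.

```python
from collections import Counter

# Invented placeholder entries in the five-slot layout; not real diary text.
diary = [
    [[1, 3, 1785], 3, "placeholder text about snow", "", ""],
    [[1, 4, 1785], 4, "placeholder text", "", ""],
    [[2, 1, 1785], 3, "more placeholder text about Snow", "", ""],
]

# How many entries fall in each month? entry[0][0] is the month.
entries_per_month = Counter(entry[0][0] for entry in diary)

# Which dates have main text (entry[2]) that mentions "snow", ignoring case?
snow_dates = [tuple(entry[0]) for entry in diary if "snow" in entry[2].lower()]

print(entries_per_month)
print(snow_dates)
```

Because every entry has the same shape, the same two index lookups work on all of them, which is what makes this kind of loop so short.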

***Special-edition sneak preview of future posts in this series***

A simple counting program reveals that the main text of Martha Ballard’s diary alone contains 377,315 words, spanning I-couldn’t-make-this-number-up 9,999 entries. That is a lot of data to play with.
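A counting program along these lines might look like the sketch below. It treats any run of characters between whitespace as a word, which is an assumption; the figure above may have been produced with different rules. The function name is made up for this example, and the second sample entry is an invented placeholder.

```python
def count_words(diary):
    """Total whitespace-separated words across every entry's main text."""
    # entry[2] is the main text; split() with no arguments splits on any whitespace.
    return sum(len(entry[2].split()) for entry in diary)


diary = [
    [[1, 3, 1785], 3, "Tuesday. mrs. Foster went home.", "", ""],
    [[1, 4, 1785], 4, "", "", ""],   # invented placeholder with no main text
]

total = count_words(diary)
print(total, "words in", len(diary), "entries")
```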

24 thoughts on “Text Analysis of Martha Ballard’s Diary (Part 1)”

    1. Ben,
      Thanks for the comment – I think it’s pretty normal for methodological approaches to get shaded by the text, which I admit can be both a good and bad thing. From what I understand you have a far more sophisticated grasp of programming than I do, so I welcome any and all advice from your own experience.
      -Cameron

  1. That’s a bit comforting. I’d love to chat with you about this, but suggest that we wait until you’ve posted your observations. I’m very interested to see what sort of data you think is extractable.

    1. You boys have taken on one h— of a job
      and I for one would consider it a personal favor if
      you would alert me when you have this project well under way.
      I’m the guy (with Cyn) that transcribed the 9,999 words that Martha wrote.
      Robert McCausland

