The Perpetual Sunrise of Methodology

[The following is the text of a talk I prepared for a panel discussion about authoring digital scholarship for history with Adeline Koh, Lauren Tilton, Yoni Appelbaum, and Ed Ayers at the 2015 American Historical Association Conference.]

 
I’d like to start with a blog post that was written almost seven years ago now, titled “Sunset for Ideology, Sunrise for Methodology?” In it, Tom Scheinfeldt argued that the rise of digital history represented a disciplinary shift away from big ideas about ideology or theory and towards a focus on “forging new tools, methods, materials, techniques, and modes of work.” Tom’s post was a big reason why I applied to graduate school. I found this methodological turn thrilling – the idea that tools like GIS, text mining, and network analysis could revolutionize how we study history. Seven years later the digital turn has, in fact, revolutionized how we study history. Public history has unequivocally led the charge, using innovative approaches to archiving, exhibiting, and presenting the past in order to engage a wider public. Other historians have built powerful digital tools, explored alternative publication models, and generated online resources to use in the classroom.
 
But there is one area in which digital history has lagged behind: academic scholarship. To be clear: I’m intentionally using “academic scholarship” in its traditional, hidebound sense of marshaling evidence to make original, explicit arguments. This is an artificial distinction in obvious ways. One of digital history’s major contributions has, in fact, been to expand the disciplinary definition of scholarship to include things like databases, tools, and archival projects. The scholarship tent has gotten bigger, and that’s a good thing. Nevertheless there is still an important place inside that tent for using digital methods specifically to advance scholarly claims and arguments about the past.
 
In terms of argument-driven scholarship, digital history has over-promised and under-delivered. It’s not that historians aren’t using digital tools to make new arguments about the past. It’s that there is a fundamental imbalance between the proliferation of digital history workshops, courses, grants, institutes, centers, and labs over the past decade, and the impact this has had in terms of generating scholarly claims and interpretations. The digital wave has crashed headlong into many corners of the discipline. Argument-driven scholarship has largely not been one of them.
 
There are many reasons for this imbalance, including the desire to reach a wider audience beyond the academy, the investment in collection and curation needed for electronic sources, or the open-ended nature of big digital projects. All of these are laudable. But there is another, more problematic, reason for the comparative inattention to scholarly arguments: digital historians have a love affair with methodology. We are infatuated with the power of digital tools and techniques to do things that humans cannot, such as dynamically mapping thousands of geo-historical data points. The argumentative payoffs of these methodologies are always just over the horizon, floating in the tantalizing ether of potential and possibility. At times we exhibit more interest in developing new methods than in applying them, and in touting the promise of digital history scholarship rather than its results. 
 
What I’m going to do in the remaining time is to use two examples from my own work to try and concretize this imbalance between methods and results. The first example is a blog post I wrote in 2010. At the time I was analyzing the diary of an eighteenth-century Maine midwife named Martha Ballard, made famous by Laurel Ulrich’s prize-winning A Midwife’s Tale. The blog post described how I used a process called topic modeling to analyze about 10,000 diary entries written by Martha Ballard between 1785 and 1812. To grossly oversimplify, topic modeling is a technique that automatically generates groups of words more likely to appear with each other in the same documents (in this case, diary entries). So, for instance, the technique grouped the following words together:
 
gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds
 
As a human reader it’s pretty clear that these are words about gardening. Once I generated this topic, I could track it across all 10,000 entries. When I mashed twenty-seven years together, it produced this beautiful thumbprint of a New England growing season.
 
Seasonal Presence of GARDENING topic in Martha Ballard’s Diary
 
Interest in topic modeling took off right around the time that I wrote this post, and pretty soon it started getting referenced again and again in digital humanities circles. Four and a half years later, it has been viewed more than ten thousand times and been assigned on the syllabi of at least twenty different courses. It’s been cited in books, journal articles, conference presentations, grant applications, government reports, white papers, and, of course, other blogs. It is, without a doubt, the single most widely read piece of historical writing I have ever produced. But guess what? Outside of the method, there isn’t anything new or revelatory in it. The post doesn’t make an original argument and it doesn’t further our understanding of women’s history, colonial New England, or the history of medicine. It largely shows us things we already know about the past – like the fact that people in Maine didn’t plant beans in January.
 
People seized on this blog post not because of its historical contributions, but because of its methodological contributions. It was like a magic trick, showing how topic modeling could ingest ten thousand diary entries and, in a matter of seconds, tell you what the major themes were in those entries and track them over time, all without knowing the meaning of a single word. The post made people excited for what topic modeling could do, not necessarily what it did do; the methodology’s potential, not its results.
 
About four years after I published my blog post on Martha Ballard, I published a very different piece of writing. This was an article that appeared in last June’s issue of the Journal of American History, the first digital history research article published by the journal. In many ways it was a traditional research article, one that followed the journal’s standard peer review process and advanced an original argument about American history. But the key distinction was that I made my argument using computational techniques. 
 
The starting premise for my argument was that the late nineteenth-century United States has typically been portrayed as a period of integration and incorporation. Think of the growth of railroad and telegraph networks, or the rise of massive corporations like Standard Oil. In nineteenth-century parlance: “the annihilation of time and space.” This existing interpretation of the period hinges on geography – the idea that the scale of locality and region were getting subsumed under the scale of nation and system. I was interested in how these integrative forces actually played out in the way people may have envisioned the geography of the nation. 
 
So I looked at a newspaper printed in Houston, Texas, during the 1890s and wrote a computer script that counted the number of times the paper mentioned different cities or states. In effect, how one newspaper crafted an imagined geography of the nation. What I found was that instead of creating a standardized, nationalized view of the world we might expect, the newspaper produced space in ways that centered on the scale of region far more than nation. It remained overwhelmingly focused on the immediate sphere of Texas, and even more surprisingly, on the American Midwest. Places like Kansas City, Chicago, and St. Louis were far more prevalent than I was expecting, and from this newspaper’s perspective Houston was more of a midwestern city than a southern one. 
 
Cameron Blevins, “Space, Nation, and the Triumph of Region: A View of the World from Houston,” Journal of American History, 101, no. 1 (June 2014), 127.
 
I would have never seen these patterns without a computer. And in trying to account for this pattern I realized that, while historians might enjoy reading stuff like this…
 
[Image: maine_zoom]
 
…newspapers often look a lot more like this:
 
[Image: rr_timetable_crop]
 
All of this really boring stuff – commodity prices, freight rates, railroad timetables, classified ads – made up a shockingly large percentage of content. Once you include the boring stuff, you get a much different view of the world from Houston in the 1890s. I ended up arguing that it was precisely this fragmentary, mundane, and overlooked content that explained the dominance of regional geography over national geography. I never would have been able to make this argument without a computer.
 
The article offers a new interpretation about the production of space and the relationship between region and nation. It issues a challenge to a long-standing historical narrative about integration and incorporation in the nineteenth-century United States. By publishing it in the Journal of American History, with all of the limitations of a traditional print journal, I was trying to reach a different audience from the one who read my blog post on topic modeling and Martha Ballard. I wanted to show a broader swath of historians that digital history was more than simply using technology for the sake of technology. Digital tools didn’t just have the potential to advance our understanding of American history – they actually did advance our understanding of American history.
 
To that end, I published an online component that charted the article’s digital approach and presented a series of interactive maps. But in emphasizing the methodology of my project I ended up shifting the focus away from its historical contributions. In the feedback and conversations I’ve had about the article since its publication, the vast majority of attention has focused on the method rather than the result: How did you select place-names? Why didn’t you differentiate between articles and advertisements? Can it be replicated for other sources? These are all important questions, but they skip right past the arguments that I’m making about the production of space in the late nineteenth century. In short: the method, not the result. 
 
I ended my article with a familiar clarion call:
Technology opens potentially transformative avenues for historical discovery, but without a stronger appetite for experimentation those opportunities will go unrealized. The future of the discipline rests in large part on integrating new methods with conventional ones to redefine the limits and possibilities of how we understand the past.
This is the rhetorical style of digital history. While reading through the conference program I was struck by just how many abstracts about digital history used the words “potential,” “promise,” “possibilities,” or in the case of our own panel, “opportunities.” In some ways 2015 doesn’t feel that different from 2008, when Tom Scheinfeldt wrote about the sunrise of methodology and the Journal of American History published a roundtable titled “The Promise of Digital History.” I think this is telling. Academic scholarship’s engagement with digital history seems to operate in a perpetual future tense. I’ve spent a lot of my career talking about what digital methodology can do to advance scholarly arguments. It’s time to start talking in the present tense.

The County Problem in the West

Happy GIS Day! Below is a version of a lightning talk I’m giving today at Stanford’s GIS Day.

Historians of the American West have a county problem. It’s primarily one of geographic size: counties in the West are really, really big. A “List of the Largest Counties in the United States” might as well be titled “Counties in the Western United States (and a few others)” – you have to go all the way to #30 before you find one that falls east of the 100th meridian. The problem this poses to historians is that a lot of historical data was captured at a county level, including the U.S. Census.

[Map: San Bernardino County]

San Bernardino County is famous for this – the nation’s largest county by geographic area, it includes the densely populated urban sprawl of the greater Los Angeles metropolis along with vast swathes of the uninhabited Mojave Desert. Assigning a single count of anything to San Bernardino County is to teeter on geographic absurdity. But, for nineteenth-century population counts in the national census, that’s all we’ve got.

[Map: Population of the West in the 1870 census]

Here’s a basic map of population figures from the 1870 census. You can see some general patterns: central California is by far the most heavily populated area, with some moderate settlement around Los Angeles, Portland, Salt Lake City, and Santa Fe. But for anything more detailed, it’s not terribly useful. What if there was a way to get a more fine-grained look at settlement patterns in these gigantic western counties? This is where my work on the postal system comes in. There was a post office in (almost) every nineteenth-century American town. And because the department kept records for all of these offices – the name of the office, its county and state, and the date it was established or discontinued – a post office becomes a useful proxy to study patterns over time and space. I assembled this data for a single year (1871) and then wrote a program to geocode each office, or to identify its location by looking it up in a large database of known place-names. I then supplemented it with the salaries of postmasters at each office for 1871. From there, I could finally put it all onto a map:

[Map: Post offices in the West, 1871]

The result is a much more detailed regional geography than that of the U.S. Census. Look at Wyoming in both maps. In 1870, the territory was divided into five giant rectangular counties, all of them containing fewer than 5,000 people. But its distribution of post offices paints a different picture: rather than vertical units, it consisted largely of a single horizontal stripe along its southern border.

[Maps: Wyoming counties in the 1870 census and Wyoming post offices, 1871]

Similarly, our view of Utah changes from a population core of Salt Lake City to a line of settlement running down the center of the territory, with a cluster in the southwestern corner completely obscured in the census map.

[Maps: Utah in the 1870 census and Utah post offices, 1871]

Post offices can also reveal transportation patterns: witness the clear skeletal arc of a stage-line that ran from the Oregon/Washington border southeast to Boise, Idaho.

[Map: Post offices along the stage line from the Oregon/Washington border to Boise]

Connections that didn’t mirror the geographic unit of a state or county tended to get lost in the census. One instance of this was the major cross-border corridor running from central Colorado into New Mexico. A map of post offices illustrates its size and shape; the 1870 census map can only gesture vaguely at both.

[Maps: Colorado and New Mexico in the 1870 census and in post offices, 1871]

The following question, of course, should be asked of my (and any) map: what’s missing? Well, for one, a few dozen post offices. This speaks to the challenges of geocoding more than 1,300 historical post offices, many of which might have only been in existence for a year or two. I used a database of more than 2 million U.S. place-names and wrote a program that tried to account for messy data (spelling variations, altered state or county boundaries, etc.). The program found locations for about 90% of post offices, while the remaining offices I had to locate by hand. Not surprisingly, they were missing from the database for a reason: these post offices were extremely obscure. Finding them entailed searching through county histories, genealogy message boards, and ghost town websites – a process that is simply not scalable beyond a single year. By 1880, the number of post offices in the West had doubled. By 1890, it had doubled again. I could conceivably spend years trying to locate all of these offices. So, what are the implications of incomplete data? Is automated, 90% accuracy “good enough”?
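
For the curious, the automated matching step boiled down to something like the following sketch. This is not the original program; the file layout, column names, and similarity cutoff are all hypothetical, but the logic (an exact lookup first, then fuzzy matching within the same state to absorb spelling variations) is the general idea:

```python
import csv
import difflib

def load_gazetteer(path):
    """Build a lookup of (place name, state) -> (lat, lon) from a place-name database."""
    lookup = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = (row["name"].strip().lower(), row["state"].strip().lower())
            lookup[key] = (float(row["lat"]), float(row["lon"]))
    return lookup

def geocode(office, state, lookup, cutoff=0.85):
    """Return coordinates for a post office, falling back to fuzzy matching."""
    key = (office.strip().lower(), state.strip().lower())
    if key in lookup:
        return lookup[key]                       # exact match
    # fuzzy match against place-names in the same state to catch spelling variants
    candidates = [name for (name, st) in lookup if st == key[1]]
    close = difflib.get_close_matches(key[0], candidates, n=1, cutoff=cutoff)
    if close:
        return lookup[(close[0], key[1])]
    return None                                  # left over for manual research

# hypothetical usage:
# gazetteer = load_gazetteer("us_placenames.csv")
# print(geocode("Chyenne", "Wyoming", gazetteer))
```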

What else is missing? Differentiation. The salary of a postmaster partially addresses this problem, as the department used a formula that based compensation in part on the amount of business an office conducted. But it was not perfectly proportional. If it was, the map would be one giant circle covering everything: San Francisco conducted more business than any other office by several orders of magnitude. As it is, the map downplays urban centers while highlighting tiny rural offices. A post office operates in a kind of binary schema: no office, no people (well, at least very few). If there was an office, there were people there. We just don’t know how many. The map isn’t perfect, but it does start to tackle the county problem in the West.

*Note: You can download a CSV file containing post offices, postmaster salaries, and latitude/longitude coordinates here.*

Coding a Middle Ground: ImageGrid

Openness is the sacred cow of the digital humanities. Making data publicly available, writing open-source code, or publishing in open-access journals are not just ideals, but often the very glue that binds the field together. It’s one of the aspects of digital humanities that I find most appealing. Despite this, I have only slowly begun to put this ideal into practice. Earlier this year, for instance, I posted over one hundred book summaries I had compiled while studying for my qualifying exams. Now I’m venturing into the world of open-source by releasing a program I used in a recent research project.

The program tries to tackle one of the fundamental problems facing many digital humanists who analyze text: the gap between manual “close reading” and computational “distant reading.” In my case, I was trying to study the geography within a large corpus of nineteenth-century Texas newspapers. First I wrote Python scripts to extract place-names from the papers and calculate their frequencies. Although I had some success with this approach, I still ran into the all-too-familiar limit of historical sources: their messiness. Namely, nineteenth-century newspapers are extremely challenging to translate into machine-readable text. When performing Optical Character Recognition (OCR), the smorgasbord nature of newspapers poses real problems. Inconsistent column widths, a potpourri of advertisements, vast disparities in text size and layout, stories running from one page to another – the challenges go on and on and on. Consequently, extracting the word “Havana” from OCR’d text is not terribly difficult, but writing a program that identifies whether it occurs in a news story versus an advertisement is much harder. Given the quality of the OCR’d text in my particular corpus, deriving this kind of context proved next-to-impossible.
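
For what it’s worth, that first stage (extracting place-names and counting their frequencies) is conceptually simple. A stripped-down Python sketch of the idea, with a tiny placeholder list standing in for a real gazetteer, might look like this:

```python
import re
from collections import Counter

# a handful of placeholder place-names; the actual scripts drew on a much longer list
PLACE_NAMES = {"houston", "galveston", "chicago", "kansas city", "st. louis", "havana"}

def place_name_frequencies(ocr_text):
    """Count how often each known place-name appears in a page of OCR'd text."""
    text = ocr_text.lower()
    counts = Counter()
    for place in PLACE_NAMES:
        # word boundaries keep "havana" from matching inside another word
        counts[place] = len(re.findall(r"\b" + re.escape(place) + r"\b", text))
    return counts

# frequencies = place_name_frequencies(open("issue_1894_03_12.txt").read())
```

The hard part, as the rest of this post suggests, is everything the sketch ignores: where on the page each match occurred and what kind of content surrounded it.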

The messy nature of digitized sources illustrates a broader criticism I’ve heard of computational distant reading: that it is too empirical, too precise, and too neat. Messiness, after all, is the coin of the realm in the humanities – we revel in things like context, subtlety, perspective, and interpretation. Computers are good at generating numbers, but not so good at generating all that other stuff. My computer program could tell me precisely how many times “Chicago” was printed in every issue of every newspaper in my corpus. What it couldn’t tell me was the context in which it occurred. Was it more likely to appear in commercial news? Political stories? Classified ads? Although I could read a sample of newspapers and manually track these geographic patterns, even this task proved daunting: the average issue contained close to one thousand place-names and stretched to more than 67,000 words (longer than Mrs. Dalloway, Fahrenheit 451, and All Quiet on the Western Front).

I needed a middle ground. I decided to move backwards, from the machine-readable text of the papers to the images of the newspapers themselves. What if I could broadly categorize each column of text according both to its geography (local, regional, national, etc.) and its type of content (news, editorial, advertisement, etc.)? I settled on the idea of overlaying a grid onto the page image. A human reader could visually skim across the page and select cells in the grid to block off each chunk of content, whether it was a news column or a political cartoon or a classified ad. Once the grid was divided up into blocks, the reader could easily calculate the proportions of each kind of content.
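
Conceptually the bookkeeping behind such a grid is simple. Here is a minimal Python sketch of the idea (not the actual program, which is introduced below); the category names are just illustrations:

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Cell:
    primary: Optional[str] = None                 # e.g. "news" or "advertisement"
    secondary: set = field(default_factory=set)   # e.g. {"national", "international"}

class PageGrid:
    """A page image divided into rows x cols cells, each carrying content labels."""

    def __init__(self, rows, cols):
        self.cells = [[Cell() for _ in range(cols)] for _ in range(rows)]

    def tag(self, r0, c0, r1, c1, primary, secondary=()):
        """Label a rectangular block of cells (inclusive corner coordinates)."""
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                self.cells[r][c].primary = primary
                self.cells[r][c].secondary.update(secondary)

    def proportions(self):
        """Share of the tagged page space devoted to each primary category."""
        counts = Counter(cell.primary for row in self.cells for cell in row
                         if cell.primary is not None)
        total = sum(counts.values())
        return {category: n / total for category, n in counts.items()}

# grid = PageGrid(rows=40, cols=30)
# grid.tag(0, 0, 39, 6, "advertisement", {"local"})
# grid.tag(0, 7, 20, 29, "news", {"national", "international"})
# print(grid.proportions())
```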

My collaborator, Bridget Baird, used the open-source programming language Processing to develop a visual interface to do just that. We wrote a program called ImageGrid that overlaid a grid onto an image, with each cell in the grid containing attributes. This “middle-reading” approach allowed me a new access point into the meaning and context of the paper’s geography without laboriously reading every word of every page. A news story on the debate in Congress over the Spanish-American War could be categorized primarily as “News” and secondarily as both “National” and “International” geography. By repeating this process across a random sample of issues, I began to find spatial patterns.

Grid with primary categories as colors and secondary categories as letters

For instance, I discovered that a Texas paper from the 1840s dedicated proportionally more of its advertising “page space” to local geography (such as city grocers, merchants, or tailors) than did a later paper from the 1890s. This confirmed what we might expect, as a growing national consumer market by the end of the century gave rise to more and more advertisements originating from outside of Texas. More surprising, however, was the pattern of international news. The earlier paper contained three times as much foreign news (relative “page space” categorized as news content and international geography) as did the later paper in the 1890s. This was entirely unexpected. The 1840s should have been a period of relative geographic parochialism compared to the ascendant imperialism of the 1890s that marked the United States’s noisy emergence as a global power. Yet the later paper dedicated proportionally less of its news to the international sphere than the earlier paper. This pattern would have been otherwise hidden if I had used either a close-reading or distant-reading approach. Instead, a blended “middle-reading” through ImageGrid brought it into view.

We realized that this “middle-reading” approach could be readily adapted not just to my project, but to other kinds of humanities research. A cultural historian studying American consumption might use the program to analyze dozens of mail-order catalogs and quickly categorize the various kinds of goods – housekeeping, farming, entertainment, etc. – marketed by companies such as Sears-Roebuck. A classicist could analyze hundreds of Roman mosaics to quantify the average percentage of each mosaic dedicated to religious or military figures and the different colors used to portray each one.

Inspired by the example set by scholars such as Bethany Nowviskie, Jeremy Boggs, Julie Meloni, Shane Landrum, Tim Sherratt, and many, many others, we released ImageGrid as an open-source program. A more detailed description of the program is on my website, along with a web-based applet that provides an interactive introduction to the ImageGrid interface. The program itself can be downloaded either on my website or on its GitHub repository, where it can be modified, improved, and adapted to other projects.

The Launch of Tooling Up

Today marks the public launch of a project called Humanities 3.0: Tooling Up for Digital Humanities. Over the past several months I’ve been working on Tooling Up at the Bill Lane Center for the American West. The project was originally conceived in conversation with Jon Christensen, director of the center, as an outreach initiative that would offer an accessible introduction to the realm of digital humanities. With generous funding from the University’s Presidential Fund for Innovation in the Humanities, Andrew Robichaud, Rio Akasaka, Jon, and I began work last summer on a two-track project.

The first track is a series of online essays that explore different themes and issues within digital humanities, written in a journalistic style and aimed at a graduate student or faculty member with little to no exposure to digital scholarship or research. Each essay (there will eventually be a total of seven) deals with a particular topic within digital humanities – file and data management, digital archives, text analysis, etc. The essays are written primarily by Andy, a fellow history graduate student and DH-newcomer who did a phenomenal job of tackling topics that were outside of his comfort zone. Andy’s presence brought the added benefit of helping us all to better tailor the essays towards their intended audience: the humanities scholar who, for instance, doesn’t know what XML stands for, has only vaguely heard of Zotero, and is puzzled as to how Twitter would ever be useful for an historian. The second track of Tooling Up will take place in the spring quarter through a seminar/workshop series specifically for Stanford students and faculty. The workshops will mirror the essays by providing an in-person introduction to some of “the basics” of digital humanities.

Conceptualizing and then implementing Tooling Up forced us to grapple with a lot of issues. First, what was the project’s audience? We settled on not trying to be all things to all people. The content of Tooling Up is going to be painfully basic for the majority of people that identify themselves as digital humanists. Meanwhile, those in the #alt-ac world might be disappointed in its audience tilt towards traditional academics. And, of course, there are an inordinate number of references to Stanford examples and projects. But in the end we felt that focusing on the crowd that we knew best would allow us to deliver the most effective and coherent content.

The second issue that emerged was one of ephemerality. In a way that is markedly different from other fields, digital humanities are most commonly linked to tools, whether building them or using them, and this is reflected in the very name of our project. It is difficult to avoid ArcGIS when talking about spatial analysis or Zotero when talking about file management. But in the digital age, tools rapidly become obsolete. When Andy and I were discussing what to include in an essay section on building an online community, Delicious came to mind as an example of social bookmarking. As of the end of 2010, however, the site’s entire existence is up in the air. Ephemerality. Instead of emphasizing specific tools, therefore, we decided to use broader strokes: the basic concepts, themes, or issues surrounding different topics that will (hopefully) prove more enduring.

Finally, the issue of authority. None of us working on the project would consider ourselves experts in any one of the topics discussed in Tooling Up, much less all of them. We did our best to consult other people at Stanford who we did consider experts in those areas, but the nature of this kind of project is that it is going to always feel somewhat incomplete. In that vein, we have tried to make the project fluid and ongoing. Essays will be posted as they are finished and we encourage any and all readers to leave feedback on the site’s pages – commentary that we hope will become crucial components of the essays themselves.

Digital Humanities Labs and Undergraduate Education

Over the past few months I was lucky enough to do research in Stanford’s Spatial History Lab. Founded three years ago through funding from the Andrew Mellon Foundation, the lab has grown into a multi-faceted space for conducting different projects and initiatives dealing with spatial history. Having worked in the lab as a graduate affiliate over the past nine months as well, I can attest to what a fantastic environment it provides: computers, a range of software, wonderful staff, and an overarching collaborative setting. There are currently 6-8 ongoing projects in various stages at the lab under the direction of faculty and advanced graduate students, which focus on areas ranging from Brazil to Chile to the American West. Over ten weeks this summer, eight undergraduate research assistants worked under these projects. I had the opportunity to work alongside them from start to finish, and came away fully convinced of the potential for this kind of lab setting in furthering undergraduate humanities education.

The eight students ranged from freshmen to recent graduates and majored in everything from history to environmental studies to computer science. Some entered the program with technical experience in ArcGIS software; others had none. Each of them worked under an existing project and was expected both to perform traditional RA duties for the project’s director and to develop their own research agenda for the summer. Under this second track, they worked towards the end goal of producing an online publication for the website based on their own original research. Guided by a carefully planned curriculum, they each selected a topic within the first few weeks, conducted research during the bulk of the summer, went through a draft phase followed by a peer-review process, and rolled out a final publication and accompanying visualizations by the end of the ten weeks. Although not all of them reached the final point of publication in that time, by the tenth and final week each of them had produced a coherent historical argument or theme (which is often more than I can say about my own work).

The results were quite impressive, especially given the short time frame. For instance, rising fourth-year Michael DeGroot documented and analyzed the shifting national borders in Europe during World War II. Part of his analysis included a dynamic visualization that allows the reader to see major territorial changes between 1938 and 1945. DeGroot concludes that one major consequence of all of these shifts was the creation of broadly ethnically homogeneous states. In “Wildlife, Neoliberalism, and the Pursuit of Happiness,” Julio Mojica, a rising junior majoring in Anthropology and Science, Technology, and Society, analyzed survey data from the late twentieth century on the island of Chiloé in order to examine links between low civic participation and environmental degradation. Mojica concludes that reliance on the booming salmon industry resulted in greater tolerance for pollution, a pattern that manifested itself more strongly in urban areas. As a final example, senior history major Cameron Ormsby studied late nineteenth-century land speculation in Fresno County and impressively waded into a historiographical debate over the issue. Instead of speculators serving as necessary “middle-men” between small farmers and the state, Ormsby convincingly argues that they in fact handicapped the development of rural communities.

The success of the summer program speaks not only to the enthusiasm and quality of Stanford undergraduates, but more centrally to the direction of the lab and its overall working environment. By fostering an attitude of exploration, creativity, and collaboration, the lab not only encouraged but expected the students to participate in projects as intellectual peers. The dynamic in the lab was not a traditional one of a faculty member dictating the agenda for the RAs. In many cases, the students had far greater technical skills and knew more about their specific subjects than the project instructor. The program was structured to give the students flexibility and freedom to develop their own ideas, which placed the onus on them to take a personal stake in the wider projects. In doing so, they were exposed to the joys, challenges, and nitty-gritty details of digital humanities research: false starts and dead-ends were just as important as the pivotal, rewarding “aha!” moments that come with any project. Thinking back on internships or research assistant positions, it’s difficult for me to imagine another undergraduate setting that would encourage this kind of wonderfully productive hand-dirtying process. And while I think digital humanities labs hold great potential for advancing humanities scholarship, I have grown more and more convinced that some of their greatest potential lies in the realm of pedagogy.

Topic Modeling Martha Ballard’s Diary

In A Midwife’s Tale, Laurel Ulrich describes the challenge of analyzing Martha Ballard’s exhaustive diary, which records daily entries over the course of 27 years: “The problem is not that the diary is trivial but that it introduces more stories than can be easily recovered and absorbed.” (25) This fundamental challenge is the one I’ve tried to tackle by analyzing Ballard’s diary using text mining. There are advantages and disadvantages to such an approach – computers are very good at counting the instances of the word “God,” for instance, but less effective at recognizing that “the Author of all my Mercies” should be counted as well. The question remains, how does a reader (computer or human) recognize and conceptualize the recurrent themes that run through nearly 10,000 entries?

One answer lies in topic modeling, a method of computational linguistics that attempts to find words that frequently appear together within a text and then group them into clusters. I was introduced to topic modeling through a separate collaborative project that I’ve been working on under the direction of Matthew Jockers (who also recently topic-modeled posts from Day in the Life of Digital Humanities 2010). Matt, ever-generous and enthusiastic, helped me to install MALLET (Machine Learning for LanguagE ToolkiT), developed by Andrew McCallum at UMass as “a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.” MALLET allows you to feed in a series of text files, which the program then processes in order to generate a user-specified number of word clusters that it thinks are related topics. I don’t pretend to have a firm grasp on the inner statistical/computational plumbing of how MALLET produces these topics, but in the case of Martha Ballard’s diary, it worked. Beautifully.
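
For anyone who wants to experiment with the technique but not with MALLET itself, the same basic idea can be sketched in Python using the gensim library, which implements a comparable LDA topic model. This is not the MALLET workflow I used, just an illustration of what goes in (tokenized documents) and what comes out (clusters of words):

```python
from gensim import corpora, models

# each diary entry is one "document": a list of lowercased word tokens
entries = [
    "gardin sett beens corn planted cucumbers potatoes sowd squash seeds".split(),
    "birth deld safe calld labour infant born patient shee".split(),
    "meeting attended worship famely attend public service lecture".split(),
    # ... in practice, all of the roughly 10,000 entries
]

dictionary = corpora.Dictionary(entries)                   # maps each word to an id
corpus = [dictionary.doc2bow(entry) for entry in entries]  # bag-of-words vectors

# ask for a user-specified number of topics, just as with MALLET
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=3,
                      passes=20, random_state=1)

for topic_id in range(3):
    print(topic_id, [word for word, _ in lda.show_topic(topic_id, topn=10)])
```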

With some tinkering, MALLET generated a list of thirty topics comprised of twenty words each, which I then labeled with a descriptive title. Below is a quick sample of what the program “thinks” are some of the topics in the diary:

  • MIDWIFERY: birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient
  • CHURCH: meeting attended afternoon reverend worship foren mr famely performd vers attend public supper st service lecture discoarst administred supt
  • DEATH: day yesterday informd morn years death ye hear expired expird weak dead las past heard days drowned departed evinn
  • GARDENING: gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds
  • SHOPPING: lb made brot bot tea butter sugar carried oz chees pork candles wheat store pr beef spirit churnd flower
  • ILLNESS: unwell mr sick gave dr rainy easier care head neighbor feet relief made throat poorly takeing medisin ts stomach

When I first ran the topic modeler, I was floored. A human being would intuitively lump words like attended, reverend, and worship together based on their meanings. But MALLET is completely unconcerned with the meaning of a word (which is fortunate, given the difficulty of teaching a computer that, in this text, discoarst actually means discoursed). Instead, the program is only concerned with how the words are used in the text, and specifically what words tend to be used similarly.

Besides a remarkably impressive ability to recognize cohesive topics, MALLET also allows us to track those topics across the text. With help from Matt and using the statistical package R, I generated a matrix with each row as a separate diary entry, each column as a separate topic, and each cell as a “score” signaling the relative presence of that topic. For instance, on November 28, 1795, Ballard attended the delivery of Timothy Page’s wife. Consequently, MALLET’s score for the MIDWIFERY topic jumps up significantly on that day. In essence, topic modeling accurately recognized, in a mere 55 words (many abbreviated into a jumbled shorthand), the dominant theme of that entry:

“Clear and pleasant. I am at mr Pages, had another fitt of ye Cramp, not So Severe as that ye night past. mrss Pages illness Came on at Evng and Shee was Deliverd at 11h of a Son which waid 12 lb. I tarried all night She was Some faint a little while after Delivery.”
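
In outline, the bookkeeping behind this matrix and the charts that follow is straightforward. Here is a pandas sketch rather than the original R code; the file and column names are hypothetical:

```python
import pandas as pd

# hypothetical export of the document-topic matrix: one row per diary entry,
# a date column plus one score column for each labeled topic
doc_topics = pd.read_csv("doc_topics.csv", parse_dates=["date"])

# the MIDWIFERY score for a single entry, such as November 28, 1795
print(doc_topics.loc[doc_topics["date"] == "1795-11-28", "MIDWIFERY"])

# average topic score by calendar month, mashing all twenty-seven years together
monthly = doc_topics.groupby(doc_topics["date"].dt.month)["GARDENING"].mean()

# or track a topic year by year across the diary
yearly = doc_topics.groupby(doc_topics["date"].dt.year)["COLD WEATHER"].mean()
```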

The power of topic modeling really emerges when we examine thematic trends across the entire diary. As a simple barometer of its effectiveness, I used one of the generated topics that I labeled COLD WEATHER, which included words such as cold, windy, chilly, snowy, and air. When its entry scores are aggregated into months of the year, it shows exactly what one would expect over the course of a typical year:

[Chart: COLD WEATHER topic by month of the year]

As a barometer, this made me a lot more confident in MALLET’s accuracy. From there, I looked at other topics. Two topics seemed to deal largely with HOUSEWORK:

1. house work clear knit wk home wool removd washing kinds pickt helping banking chips taxes picking cleaning pikt pails

2. home clear washt baked cloaths helped washing wash girls pies cleand things room bak kitchen ironed apple seller scolt

When charted over the course of the diary, these two topics trace how frequently Ballard mentions these kinds of daily tasks:

[Chart: HOUSEWORK topics over the course of the diary]

Both topics moved in tandem, with a high correlation coefficient of 0.83, and both steadily increased as she grew older (excepting a curious divergence in the last several years of the diary). This is somewhat counter-intuitive, as one would think the household responsibilities for an aging grandmother with a large family would decrease over time. Yet this pattern bolsters the argument made by Ulrich in A Midwife’s Tale, in which she points out that the first half of the diary was “written when her family’s productive power was at its height.” (285) As her children married and moved into different households, and her own husband experienced mounting legal and financial troubles, her daily burdens around the house increased. Topic modeling allows us to quantify and visualize this pattern, a pattern not immediately visible to a human reader.

Even more significantly, topic modeling allows us a glimpse not only into Martha’s tangible world (such as weather or housework topics), but also into her abstract world. One topic in particular leaped out at me:

feel husband unwel warm feeble felt god great fatagud fatagued thro life time year dear rose famely bu good

The most descriptive label I could assign this topic would be EMOTION – a tricky and elusive concept for humans to analyze, much less computers. Yet MALLET did a largely impressive job in identifying when Ballard was discussing her emotional state. How does this topic appear over the course of the diary?

[Chart: EMOTION topic over the course of the diary]

Like the housework topic, there is a broad increase over time. In this chart, the sharp changes are quite revealing. In particular, we see Martha more than double her use of EMOTION words between 1803 and 1804. What exactly was going on in her life at this time? Quite a bit. Her husband was imprisoned for debt and her son was indicted by a grand jury for fraud, causing a cascade effect on Martha’s own life – all of which Ulrich describes as “the family tumults of 1804-1805.” (285) Little wonder that Ballard increasingly invoked “God” or felt “fatagued” during this period.

I am absolutely intrigued by the potential for topic modeling in historic source material. In many ways, it seems that Martha Ballard’s diary is ideally suited for this kind of analysis. Short, content-driven entries that usually touch upon a limited number of topics appear to produce remarkably cohesive and accurate topics. In some cases (especially in the case of the EMOTION topic), MALLET did a better job of grouping words than a human reader. But the biggest advantage lies in its ability to extract unseen patterns in word usage. For instance, I would not have thought that the words “informed” or “hear” would cluster so strongly into the DEATH topic. But they do, and not only that, they do so more strongly within that topic than the words dead, expired, or departed. This speaks volumes about the spread of information – in Martha Ballard’s diary, death is largely written about in the context of news being disseminated through face-to-face interactions. When used in conjunction with traditional close reading of the diary and other forms of text mining (for instance, charting Ballard’s social network), topic modeling offers a new and valuable way of interpreting the source material.

I’ll end my post with a topic near and dear to Martha Ballard’s heart: her garden. To a greater degree than any other topic, GARDENING words boast incredible thematic cohesion (gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds) and over the course of the diary’s average year they also beautifully depict the fingerprint of Maine’s seasonal cycles:

[Chart: GARDENING topic by month of the year]

Note: this post is part of an ongoing series detailing my work on text mining Martha Ballard’s diary.

Chasing the “Perfect Data” Dragon

Whenever I put on my proselytizing robes to explain the potential of digital humanities to a layperson, I usually point to the familiar data deluge trope. “If you read a book a day for the rest of your life, it would take you 30-something lifetimes to read one million books. Google has already digitized several times that number.” etc. etc. The picture I end up painting is one where the DH community is better-positioned than traditional academics to access, manipulate, and draw out meaning from the growing mountains of digital data. Basically, now that all this information is digitized, we can feed the 1’s and 0’s into a machine and, presto, innovative scholarship.

Of course, my proselytizing is a bit disingenuous. The dirty little secret is that not all data is created equal. And especially on the humanist’s turf, digitized sources are rarely “machine-ready.” The more projects I work on, the more convinced I become that there is one real constant to them: I always spend far more time than I expect preparing, cleaning, and improving my data. Why? Because I can.

A crucial advantage to digital information is that it’s dynamic and malleable. You can clean up a book’s XML tags, or tweak the coordinates of a georectified map, or expand the shorthand abbreviations in a digitized letter. Which is all well and good, but comes with a price tag. In a way that is fundamentally different from the analog world, perfection is theoretically attainable. And that’s where an addictive element creeps into the picture. When you can see mistakes and know you can fix them, the temptation to both find and fix every single one is overwhelming.

In many respects, cleaning your data is absolutely crucial to good scholarship. The historian reading an 18th-century newspaper might know that “Gorge Washington” refers to the first president of the United States, but unless the spelling error gets fixed, that name probably won’t get identified correctly by a computer. Of course, it’s relatively easy to change “Gorge” to “George”, but what happens when you are working with 30,000 newspaper pages? Manually going through and fixing spelling mistakes (or, more likely, OCR mistakes) defeats the purpose and neuters the advantage of large-scale text mining. While there are ways to automate this kind of data cleaning, most methods are going to be surprisingly time-intensive. And once you start down the path of data cleaning, it can turn into whack-a-mole, with five “Thoms Jefferson”s poking their heads up out of the hole for every one “Gorge Washington” you fix.
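
The simplest way to automate this kind of cleaning is a lookup table of known misreadings applied across the whole corpus; a minimal sketch (with made-up corrections) might look like this:

```python
# known OCR misreadings mapped to their corrections (illustrative, not exhaustive)
CORRECTIONS = {
    "Gorge Washington": "George Washington",
    "Thoms Jefferson": "Thomas Jefferson",
}

def clean(text):
    """Apply every known correction to a page of OCR'd text."""
    for wrong, right in CORRECTIONS.items():
        text = text.replace(wrong, right)
    return text

# cleaned = clean(open("newspaper_page_0001.txt").read())
```

The catch, of course, is that a table like this only fixes the errors you already know about, which is exactly how the whack-a-mole game gets started.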

Chasing the “perfect data” dragon becomes an addictive cycle, one fueled by equal parts optimism and fear. Having a set of flawlessly-encoded Gothic novels could very well lead to the next big breakthrough in genre classification. On the other hand, what if all those missed “Gorge Washingtons” are the final puzzle pieces that will illuminate early popular conceptions of presidential power? The problem is compounded by the fact that, in many cases, the specific errors can be fixed. But in breathlessly attempting to meet the “data deluge” problem, the number and kind of specific errors get multiplied by several orders of magnitude over increasingly larger and larger bodies of information and material – which severely complicates the ability to both locate and rectify all of them.

At some point, the digital material has to simply be “good enough”. But breaking out of the “perfect data” dragon-chasing is easier said than done. “How accurate does my dataset have to be in order to be statistically relevant?” “How do I even know how clean my data actually is?” “How many hours of my time is it worth to bump up the data accuracy from 96% to 98%?” These are the kinds of questions that DH researchers suddenly struggle with – questions that a background in the humanities ill-prepares them to answer. Just like so many aspects of doing this kind of work, there is a lot to learn from other disciplines.

Certain kinds of data quality issues get mitigated by the “safety in numbers” approach. Pinpointing the exact cross-streets of a rail depot is pretty important if you’re creating a map of a small city. But if you’re looking at all the rail depots in, say, the Midwest, the “good enough” degree of locational error gets substantially bigger. Over the course of thirty million words, the number of “George Washingtons” is going to far outweigh and balance out the number of “Gorge Washingtons”. With large-scale digital projects, it’s easier to see that chasing the “perfect data” dragon is both impossible and unnecessary. On the other hand, certain kinds of data quality problems get magnified at a larger scale. Small discrepancies get flattened out in bigger datasets, but foundational or commonly-repeated errors get exaggerated, particularly if some errors have been fixed and others not. For instance, if you fixed every “Gorge Washington” but didn’t catch the more frequently misspelled “Thoms Jefferson”, comparing the textual appearances of the two presidents over those thirty million words is going to be heavily skewed in George’s direction.

As non-humanities scholars have been demonstrating for years, these problems aren’t new and they aren’t unmanageable. But as digital humanists sort through larger and larger sets of data, it will become increasingly important to know when to ignore the dragon and when to give chase.

Text Analysis of Martha Ballard’s Diary (Part 3)

One of the most basic applications of text mining is simply counting words. I began by stripping out punctuation (in order to avoid differentiating mend and mend. as two separate words), putting every word into lowercase, and then ignoring a list of stop words (the, and, for, etc.). By writing a program to count occurrences of the 500 most common words, I could get a general (and more quantitative) sense for what general topics Martha Ballard wrote about in her diary. Unsurprisingly, her vocabulary usage followed a standard path of exponential decay: like most people, she utilized a relatively small number of words with extreme frequency. For example, the most common word (mr) occurred 10,050 times, while her 500th most common word (relief) occurred 67 times:

[Chart: frequency of the 500 most common words in the diary]

Because each word has information attached to it – specifically what date it was written – we can look at long-term patterns for a particular word’s usage. However, looking at only raw word frequencies can be problematic. For example, if Ballard wrote the word yarn twice as often in 1801 as 1791, it could mean that she was doing a lot more knitting in her old age. But it could also mean that she was writing a lot more words in her diary overall. In order to address this issue, for any word I was examining I made sure to normalize its frequency – first by dividing it by the total word count for that year, then by dividing it by the average usage of the word over the entire diary. This allowed me to visualize how a word’s relative frequency changed from year to year.
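
Concretely, the counting and normalization described above amount to something like the following sketch, with an abbreviated stop-word list; the real analysis works from the full set of dated entries:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "and", "for", "a", "of", "to", "at", "was"}  # abbreviated list

def tokenize(entry_text):
    """Lowercase an entry, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z]+", entry_text.lower())
    return [w for w in words if w not in STOP_WORDS]

def most_common_words(all_entries, n=500):
    """The n most common words across the whole diary."""
    counts = Counter()
    for text in all_entries:
        counts.update(tokenize(text))
    return counts.most_common(n)

def relative_frequency(entries_by_year, word):
    """Yearly frequency of a word, divided by that year's total word count and
    then by the word's average rate across the entire diary."""
    yearly_rate = {}
    for year, texts in entries_by_year.items():
        tokens = [t for text in texts for t in tokenize(text)]
        yearly_rate[year] = tokens.count(word) / len(tokens)
    mean_rate = sum(yearly_rate.values()) / len(yearly_rate)
    return {year: rate / mean_rate for year, rate in yearly_rate.items()}

# entries_by_year = {1785: ["Clear and pleasant. ...", ...], ..., 1812: [...]}
# print(relative_frequency(entries_by_year, "survey"))
```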

In order to visualize the information, I settled on trying out sparklines: “small, intense, simple datawords” advocated by infographics guru Edward Tufte and meant to give a quick, somewhat qualitative snapshot of information. To test my method, I used a theme that Laurel Ulrich describes in A Midwife’s Tale: land surveying. In particular, during the late 1790s Martha’s husband Ephraim became heavily involved in surveying property. In the raw word count list, both survey and surveying appear in the top 500 words, so I combined the two and looked at how Martha’s use of them in her diary changed over the years (1785-1812):

[Sparkline: survey(ing)]

Looking at the sparkline, we get a visual sense for when surveying played a larger role in Martha’s diary – around the middle third, or roughly 1795-1805, which corresponds relatively well to Ulrich’s description of Ephraim’s surveying adventures. As a basis for comparison, the word clear appeared with numbing regularity (almost always in reference to the weather):

[Sparkline: clear]
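
The sparklines themselves are easy to reproduce. A minimal matplotlib version, assuming a dictionary of yearly relative frequencies like the one sketched earlier, is just a small line plot with every axis and frame stripped away:

```python
import matplotlib.pyplot as plt

def sparkline(yearly_values, path):
    """Save a Tufte-style sparkline: a small, unadorned line of yearly values."""
    years = sorted(yearly_values)
    fig, ax = plt.subplots(figsize=(3, 0.4))     # long and short, like a word
    ax.plot(years, [yearly_values[y] for y in years], linewidth=1, color="black")
    ax.axis("off")                               # no axes, ticks, or frame
    fig.savefig(path, bbox_inches="tight", dpi=150)
    plt.close(fig)

# sparkline(relative_frequency(entries_by_year, "survey"), "survey.png")
```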

Using word frequencies and sparklines, I could investigate and visualize other themes in the diary as well.

Religion

Out of the 500 most frequent words in the diary, only three of them relate directly to religion: meeting (#28), worship (#143), and god (#220).

[Sparkline: meeting]

[Sparkline: worship]

[Sparkline: god]

Meeting, which was used largely in a religious context (going to a church meeting), but also in a socio-political context (attending town meetings), had a relatively consistent rate of use, although it trended slightly upwards over time. Worship (which Martha largely used in the sense of “went to publick worship”), meanwhile, was more erratic and trended slightly downwards. Finally, and perhaps most interestingly, there was Martha’s use of the word god. Almost non-existent in the first third of her diary, it then occurred much more frequently, but also more erratically, over the final two-thirds of the diary. Not only was it a relatively infrequent word overall (flax, horse, and apples occur more often), but its usage pattern suggests that Martha Ballard did not directly invoke a higher power on a personal level with any kind of regularity (at least in her diary). Instead, she was much more comfortable referring to the more social and community-based activity of attending a religious service. While a qualitative close reading of the text would give a richer impression of Martha’s spirituality, a quantitative approach demonstrates how little “real estate” she dedicates to religious themes in her diary.

Death

[Sparkline: death]

[Sparkline: dead]

[Sparkline: funeral]

[Sparkline: expired]

[Sparkline: interd]

Most of the words related to death show an erratic pattern. There are peaks and valleys across the years without much correlation between the different words, and the only word that appears with any kind of consistency is interd (interred). In this case, word frequency and sparklines are relatively weak as an analytical tool. They don’t speak to any kind of coherent pattern, and at most they vaguely point towards additional questions for study – what causes the various extreme peaks in usage? Is there a common context with which Martha uses each of the words? Why was interd so much flatter than the others?

Family

In this final section, I’ll offer up a small taste of how analyzing word frequency can reveal interpersonal relationships. I used the particular example of Dolly (Martha’s youngest daughter):

[Sparkline: dolly]

The sparkline does a phenomenal job of driving home a drastic change in how Martha refers to her daughter. In a matter of a year or two in the mid 1790s, she goes from writing about Dolly frequently to almost never mentioning her. Why? Some quick detective work (or reading page 145 in A Midwife’s Tale) shows that the plummet coincides almost perfectly with Dolly’s marriage to a man named Barnabas Lambart in 1795. But why on earth would Martha go from mentioning Dolly all the time in her diary to going entire years without writing her name? Did Martha disapprove of her daughter’s marriage? Was it a shotgun wedding?

The answer, while not so scandalous, is nonetheless an interesting one that text analysis and visualization help to elucidate. In short, Martha still writes about her daughter after 1795, but instead of referring to her as Dolly, she begins to refer to her as Dagt Lambd (Daughter Lambert). This is a fascinating shift, and one whose full significance might get lost in a traditional reading. A human poring over these detailed entries might get a vague impression that Martha has started calling her daughter something different, but the sparkline above drives home just how abrupt and dramatic that transformation really was. Martha, by and large, stopped calling her youngest daughter by her first name and instead adopted her new husband’s surname. Such a vivid symbolic shift opens up a window into an array of broader issues, including marriage patterns, familial relationships, and gender dynamics.

Conclusions

Counting word frequency is a somewhat blunt instrument that, if used carefully, can certainly yield meaningful results. In particular, utilizing sparklines to visualize individual word frequencies offers up two advantages for historical inquiry:

  1. Coherently display general trends
  2. Reveal outliers and anomalies

First, sparklines are a great way to get a quick impression of how a word’s use changes over time. For example, we can see above that the frequency of the word expired steadily increases throughout the diary. While this can often simply reiterate suspected trends, it can ground these hunches in refreshingly hard data. By the end of the diary, a reader might have a general sense for how certain themes appear, but a text analysis can visualize meaningful patterns and augment a close reading of the text.

Second, sparklines can vividly reveal outliers. In the course of reading hundreds of thousands of words across nearly 10,000 entries, it’s quite easy to lose sight of the forest for the trees (to use a tired metaphor). Visualized word frequencies allow historians to gain a broader perspective on a piece of the text, and they also act as signposts pointing the viewer towards a specific area for further investigation (such as the red-flag-raising rupture in how frequently Dolly appears). Relatively basic word frequency by itself (such as what I’ve done here) does not necessarily explain anomalies, but it can do an impressive job of highlighting important ones.

Playing Well With Others

One of the sharper distinctions between digital humanists and traditional scholars is their acceptance of and emphasis on collaboration. Lisa Spiro has written several convincing posts that detail how scholars in the digital humanities are far more likely to work together and co-author essays, along with some examples of collaborative projects. At the NEH’s Office of Digital Humanities, the first requirement for applying to a fellowship grant at a Digital Humanities Center is to “support innovative collaboration on outstanding digital research projects.” Meanwhile, many disciplines within the humanities cling to the notion of the individual scholar. Cathy Davidson of HASTAC tells the story of job-seeking and being told that collaborative work didn’t “count” as legitimate scholarship: “I felt like Hester Prynne wearing her Scarlet A . . . for Adulterous Authorship.” The academy remains enamored with putting a single face and a single name to research; the vast majority of the annual prizes given by the AHA are presented to individual historians for individual work.

The reasons for this distinction are easy to understand. Most digital humanities initiatives are inherently multidisciplinary. There are those among us lucky or hard-working enough to possess both “soft” humanistic talent and “hard” technical skills, but for the majority of us it is much more efficient and effective to split the workload of multiple, and often very different, approaches between more than one person. Why spend six months trying to master the intricacies of MySQL when you can team up with a colleague who already knows how to implement it? Teaming up with other people across disciplines is a form of self-preservation that saves everyone time and energy.

Another reason for the distinction often stems from the basic nature of the projects – many digital humanists have focused on building tools, online collections, and interactive media. Whereas most academic monographs are aimed at an audience of fellow academics, these projects are inherently designed with a broader public in mind. With that overarching goal, collaboration during the production phase becomes an almost instinctive (and necessary) pursuit. Similarly, scholarly specialization leads to (often) intense intellectual turf wars. If you are struggling to make your academic mark on a very specific focus within a very specific sub-field, other people working in that same field can often seem more like a threat than a resource. These jealously guarded barriers are less prevalent within the digital humanities community, given its emphasis on greater transparency and a broader scope of study.

This is not to say that traditional humanists are allergic to collaboration. Established (read: tenured) professors are often much more willing to edit volumes, co-author essays, and work together on research projects. When you are a successful author and Harvard historian like Jill Lepore, you can afford to take a chance and co-write a work of historical fiction. An assistant professor at a small state school struggling to get tenure? Not so much. Younger scholars are still plagued by the never-ending issue of digital scholarship not “counting” as a valid accomplishment.

Most graduate programs (particularly Ph.D. programs) in the humanities simply do not train their students to play well (or at all) with others. Writing a dissertation is still viewed as an infamously lonesome pursuit. Doing so establishes your credentials as an individual scholar capable of producing original work. Unfortunately, this not only reinforces the notion that anything other than individual research is somehow less valued, but it also does a terrible job of preparing students for any kind of future collaborative work. Learning how to take notes in an archive or how to write manuscript chapters is a critical skill, but so is learning how to delegate tasks to research partners or co-author a grant proposal.

There is no reason why the traditional humanities cannot begin to embrace scholarly collaboration. Even for those with no interest in digital initiatives, increased collaboration creates a ripple effect. There are the obvious benefits: different perspectives add richness and depth to studies, a division of labor and specialization can lead to greater efficiency, and more collaborators often facilitate future connections across otherwise-insular academic networks. Almost every scholar has a story of a single conversation, comment, or idea from a colleague, friend, or family member sparking a revelation or major advance in their work. Formal collaboration only magnifies this effect, and the academy as a whole would benefit.

Collaboration is not a cure-all, and it presents its own set of quite formidable challenges. As every high-schooler working on a group project or cubicle-dweller sitting in a meeting can tell you, working with other people can be a frustrating experience. How do you divide up responsibilities, reconcile different opinions, and share both criticism and credit? A professor of literature and a computer scientist sitting across the table from each other will probably have a lot of trouble communicating effectively. All of these issues have the potential to be even sharper inside the humanities, where most scholars have been given little to no formal instruction or practical experience in how to work together. Nevertheless, the potential for concerted collaboration to spur academic discovery within the humanities is simply too great to ignore.

Reflections on Blogging

It’s now been over a year since I started history-ing and over a month since my last post, so I thought I’d ease back into writing by reflecting on a year in the blogosphere.

1. Intellectual stimulation

One of the most jarring changes in going from a college lifestyle to the workforce was the lack of daily academic stimulation, as thinking and problem-solving shifted from the classroom to the office. Having a blog gave me an impetus to engage with ideas again. It forced me to write (semi-)regularly, to think through issues, and to take part in at least a limited conversation on intellectual topics I cared about. Instead of being a passive consumer of ideas, posts, articles, essays, and books, I became an active one.

The knowledge that my writing would be open and available for anyone to read and judge pushed me to think even harder and to develop my own ideas and opinions. If you write a shitty paper in a college seminar, the professor gives you a shitty grade and you file it away. If you write a shitty post, it’s out there for anyone to read. Employers, colleagues, professors, admissions officers – all of them now have a growing body of my writing to read, disagree with, and critique if they’re so inclined. For an unestablished scholar like myself, this provides some major motivation to really think about and work at what I write.

2. Joining a community

Blogging also let me jump into a vibrant online community of digital historians and humanists. Instead of being something of a sideline observer, I laced up and joined the fray. Doing so not only exposed me to a wide range of new ideas and possibilities, but also introduced me to a number of fascinating and inspiring people – many of whom I met in person at the AAHC and THATCamp conferences. Especially for a younger scholar like myself, having a blog gave me confidence in my credentials and allowed me to participate in a wider dialogue.

Moving forward, the connections I’ve made through blogging (and, on a noisier level, Twitter) will serve me for a long time to come. I’ve been lucky in that, before even setting foot inside a graduate classroom, I have had the opportunity to interact with so many people who I hope will be my future colleagues and collaborators. In the insular world of traditional academia, this is a relative rarity.

3. Feedback

I’m a firm believer that there’s no point in writing into a void. While much of my blogging was “for myself,” in that I wrote about what interested me, the most rewarding part by far has been the response I’ve received. There is certainly an egotistical and superficial element to checking site-visit stats. But there is some validity to the point that my writing has already reached a larger audience in a year than all of my undergraduate writing put together. By a long shot. For example, my most popular post, by a factor of almost two, is a rudimentary text analysis of Venture Smith’s narrative. As of today, it has been viewed over a thousand times. This metric might be a tiny drop in the blogosphere bucket, but it will certainly eclipse any audience I’ll have for my traditional academic research, at least in the near future.

One of the more rewarding episodes occurred recently, when a local Connecticut writer contacted me through my blog because she was interested in Venture Smith. She had stumbled across my posts about my undergraduate research on Venture Smith and had been inspired to do some truly remarkable research of her own. We met yesterday, and I was thrilled to find not only that she had uncovered a fascinating new development, but also that it directly related to work I had done. I was humbled to hear that my blog had been an impetus for her to get involved in the Venture Smith community. It served as a great reminder of how blogs can increase transparency and lower barriers between academics and the wider public.

I’m not sure what the future will hold for history-ing. There are bad as well as good aspects of maintaining a blog, and it remains to be seen whether it will survive the time-drain of graduate school. Regardless, blogging at history-ing has been, and I hope will continue to be, an enriching experience.