American Panorama: Part II

This is the second half of a review of American Panorama (you can read Part I here). Together, the two posts are a follow-up to my earlier call for digital historians to more actively engage with the historical contributions of each other’s projects.

Part II. The Overland Trails, 1840-1860

Between 1840 and 1860 several hundred thousand people traveled westward across the United States, most of them ending up in California, Oregon, and Utah. Their migration has become a foundational element of American history, conjuring up visions of covered wagons and hardy pioneers. Or, if you grew up playing the educational computer game The Oregon Trail: floppy disks, pixelated oxen, and exciting new words like “dysentery.” The topic has been exhaustively studied by genealogists, historians, and millions of schoolchildren over the years. American Panorama attempts to break new ground on what is, like the trail itself, well-trodden soil.

The Overland Trails follows a visual layout similar to that of The Forced Migration of Enslaved People, with multiple panes showing a map, a timeline, aggregated data, and the expandable text from twenty-two trail diaries. Far more so than The Forced Migration of Enslaved People, however, it puts these written narratives into the spotlight. The visualization includes the full text of each diary rather than brief excerpts. Clicking on a specific diarist allows you to read all of their entries, with a linked footnote to the original source. As you scroll through the entries, clusters of dots track the progress of the emigrant’s journey on the map, passing between landmarks like Courthouse Rock or Fort Laramie.

[Figure: OverlandTrailAnimation]

Two other panes provide context for that particular year: a short summary of trail activity and a small map breaking down the estimated annual migration to California, Oregon, and Utah. The timeline uses small multiples for each year that plot the seasonal progression of emigrant journeys on the horizontal axis and, somewhat confusingly, the (horizontal) longitude coordinates of these journeys on the vertical axis. Timeline aside, the overall reading experience is both intuitive and seamless. More importantly, the visualization strikes a balance between detail and context, weaving the full text of individual sources within a larger spatial and historical tapestry. In many ways, this is digital design at its best. But why does this elegant design matter? What is the historical payoff? The Overland Trails makes two contributions to the topic of westward migration – one archival and the other interpretive.

First, The Overland Trails gives us not just a new but a better platform for reading and understanding the topic’s source base. The trail diary was a genre unto itself during the mid-nineteenth century. Diarists often wrote to provide a kind of guide for family or friends who might follow them westward, recording daily mileage, landmarks, trail quality, and the availability of water and grass. These details made the diaries immensely helpful for future emigrants, but immensely boring for future historians. Take an entry written by James Bennett on July 12th, 1850:

Friday 12th-After ten miles travel this day over a heavy, sandy and barren road, we reached Sweet Water river, where we took dinner. Here we found the grass very short and as our cattle were nearly exhausted by hard work and scant feed, we drove off the road five miles to the right, where we found excellent grass and a good spring.

Now imagine reading thousands of entries exactly like this one. You start to get hungry for anything that breaks the monotony of the trail: white-knuckled river crossings, exchanges with passing Indians, or fiery arguments about whether or not to travel on the Sabbath. Moreover, as readers we often don’t care all that much about where these juicy episodes took place – does it really matter if they occurred in western Nebraska, northern Utah, or eastern Oregon? The nebulous space of “The Trail Experience” serves as a stand-in for the specific geography of where things happened. But the loss of geographic context risks distorting the lived reality of nineteenth-century emigrants. For them, trail life was overwhelmingly defined by geography: boring, repetitive, grinding travel along an established trail itinerary, with mileage tallies or landmark notations acting as a means of marking their progress through that geography. American Panorama captures the experience of overland travel far more effectively than simply reading trail diaries on their own. As simple as it sounds, linking individual entries to their location on a map illustrates the small-scale, incremental geography that made up this massive, large-scale migration.

The second historical contribution of The Overland Trails involves a broader spatial reinterpretation of westward expansion. The phrase itself – “westward expansion” – conjures up the image of a wave of Anglo-American settlers washing over the continent. This was the geography embedded in Manifest Destiny iconography and Frederick Jackson Turner’s famous frontier thesis.

[Figure: Emanuel Leutze, Westward the Course of Empire Takes Its Way (Smithsonian). Source: Wikimedia Commons]

American Panorama presents a much different geography. Western migration was not a wave; it was a narrow river. Hundreds of thousands of people may have traveled across the western interior between 1840 and 1860, but they did so along a severely restricted corridor of travel. This might seem obvious; the Overland Trail was, after all, a trail. But the trail’s meaning has come to embody a certain idea of mobility, not just in terms of traveling westward to Oregon or California, but of experiencing and claiming the vast swath of land that lay in between. When mapped, however, the journeys of twenty-two emigrants resemble tightly braided cords that only gradually fray as they approach the Pacific Coast. Overland travelers operated in a tightly constrained space.

[Figure: OverlandTrail_1]

To take one example: although emigrants technically crossed Nebraska Territory from one side to the other, most travelers didn’t see very much of it. The grinding necessity of daily travel kept them pinned along the Platte River. American Panorama illustrates just how narrow this pathway was and how infrequently emigrants deviated from it.

[Figure: OverlandTrail_Zoom1]

In the mid-nineteenth century, the interior of the western United States was seen as a region to pass through as quickly as possible, an area that had long been labeled “The Great American Desert,” or in historian Elliott West’s words, “a threatening void.” (The Contested Plains, 122) Much of the western interior was made up of territory that was ostensibly claimed by the United States but that remained largely ungoverned and unsettled by Anglo-Americans. American Panorama effectively recreates this geography through visual design: bright, sharp lines track the emigrants’ journeys along the trail, interspersed with landmarks and forts shown in equally bright colors. This tightly demarcated trail geography pops out from the map as it snakes across a minimalist base layer entirely devoid of the familiar political boundaries of states or territories. Instead, the underlying map consists of terrain, sparse water features, and the locations of Indian groups such as the Cheyenne in the central plains or the Goshute near Great Salt Lake. The Overland Trails manages to capture the experience of traversing a semi-arid, mountainous region still occupied by native people, one that was seen as largely off-limits for Anglo-American settlement.

The project’s cartographic achievement comes with a cost, however. The presence of native groups played a crucial role in shaping mid-century views of the interior. As historian Susan Schulten notes, “erasing Native Americans from both mental and actual maps” (29) was a central process in the eventual shift toward seeing the western interior as an inviting area to settle rather than a forbidding area to traverse. To their credit, the designers of The Overland Trails put native people back on the map. The problem comes from the way in which they do so. The mapmakers label Indian groups using a muted gray color that is nearly identical to the map’s base terrain. Moreover, changing the zoom level causes some labels to shift locations or disappear entirely in order to avoid overlapping with the trail and its landmarks. The overall effect is to weave native groups into the natural landscape, making them visually analogous to the map’s rivers or mountains. This cartographic design ends up conflating native people and the environment – a deeply problematic notion that remains stubbornly lodged in the popular imagination. The visualization builds a marvelous stage for overland emigrants, but its set design turns Indians into a backdrop.

[Figure: OverlandTrail_Zoom]

I don’t mean to quibble over (literal) shades of gray. After all, the map’s creators made a concerted effort to include Indian groups – the same can’t be said of many other historical projects, digital or otherwise. But the project’s cartography highlights a common tension between digital design and historiography. From a design standpoint, the creators of The Overland Trails make all the right decisions. Brightly colored overland routes are foregrounded against a muted base map, including unobtrusive gray labels of Indian groups that give readers contextual information while keeping their attention firmly focused on the emigrant journeys themselves. When those same labels disappear or change locations depending on the zoom level, it helps avoid visual clutter. The problem is that effective digital design can run headlong into fraught historiographical issues, including the contentious idea of the “ecological Indian” and a longstanding cartographic tradition of using maps to marginalize and erase native claims to territory in the West.

Visual design is not the only sticking point for The Overland Trails and its place within western historiography. The visualization is, at its core, a digital archive of primary sources. As I’ve already noted, its interface contributes a new and fascinating way of reading and understanding these sources. What troubles me is the privileging of this particular archive. To be blunt: do we really need a new way of reading and understanding the experience of mostly white, mostly male pioneers whose stories already occupy such a central place in American mythology?

The historical commemoration of overland emigrants began almost as soon as their wagons reached the Pacific Coast. Western pioneer associations held annual conventions and published nostalgic reminiscences that romanticized their journeys. Historians, meanwhile, largely followed the blueprint of Frederick Jackson Turner, who immortalized the march of pioneer-farmers carrying the mantle of civilization westward. Nearly a century passed before historians began to reassess this framework, from uncovering the ways that gender shaped life on the trail to, more recently, interpreting overland migration as a “sonic conquest” (to use Sarah Keyes’s formulation).

More often than not, however, historical treatments of the Overland Trail still gravitate toward book titles like Wagons West: The Epic Story of America’s Overland Trails and passages like: “An army of nearly half a million ragged, sunburned civilians marched up the Platte in the vanguard of empire…they emerge from their collective obscurity to illuminate a heroic age in American history.” (Merrill Mattes, Platte River Road Narratives, xiv) The Overland Trails doesn’t explicitly advance this viewpoint, but neither does it move away from it in any substantive way. The informational text accompanying the visualization’s timeline can, at times, read like a “greatest hits” of western lore: the Donner Party, the Gold Rush, Indian fighting, and the Pony Express (its freshest material centers on Mormon migration). The visualization’s space constraints leave precious little room for important historical nuance, leading to generalizations such as “White settlement in the West was disastrous for Indians everywhere.”

To reiterate a point I made in the first part of my review of American Panorama: prioritizing user exploration over authorial interpretation comes with risks. I don’t want to minimize the significance of The Overland Trails, because it contributes a truly valuable new interface for conceptualizing nineteenth-century historical geography and the experience of overland travel. But the project uses a novel framework to deliver largely tired content. My guess is that its selection of content was based on the fact that these particular diaries were already digitized. This kind of pragmatism is a necessary part of digital history. But explaining the interpretive implications of these decisions, not just the nitty-gritty methodological details, often requires a more robust and explicit authorial voice than many digital history projects seem willing to provide.

My hope is that The Overland Trails will serve as a prototype for visualizing other movement-driven sources. To that end, American Panorama has given outside researchers the ability to build on this framework by making the project’s source code available on GitHub. The GitHub repository highlights the open-ended nature of the project, as its creators continue to improve its visualizations. In a similar vein, American Panorama’s team has several forthcoming visualizations that examine redlining, urban renewal, and presidential voting. I have high expectations, and I hope that other historians will join me in giving them the substantive engagement they deserve.

 

American Panorama: Part I

I recently wrote about the wave of digital history reviews currently washing over print journals like the American Historical Review, The Western Historical Quarterly, and The Journal of American History. This wave brings into focus the odd reticence of digital historians to substantively review digital history projects in open, online venues. I ended the post with a call for the field to more actively engage with the work of our peers and, in particular, to evaluate the historical contributions of these digital projects if and when they fall within our areas of subject expertise. The following is my attempt to do just that.

[Figure: AmericanPanorama_Landing]

American Panorama: An Atlas of United States History was released in December 2015 by the University of Richmond’s Digital Scholarship Lab. It is a collection of four map-based visualizations focusing on different topics in American history: slave migration, immigration to the U.S., canal construction, and the Overland Trails. Each of these visualizations revolves around an interactive map, with surrounding panes of charts, timelines, contextual data, and primary sources related to the topic. If I could summarize the project’s historical contributions in a single sentence, it would be this one: American Panorama incorporates movement into the history of the United States. To be even more specific, the project shines a new light on the historical movement of people. Its three most compelling visualizations (foreign immigration, slave migration, and the Overland Trails) illustrate some of the most monumental shifts of people in American history. There are certainly other episodes of travel and migration worth studying – Indian Removal or the Great Migration immediately jump to mind – but those selected by American Panorama are among the most consequential.

Like most digital history projects, American Panorama is a collaboration. Unlike most digital history projects, it’s a collaboration between academic historians and a private company. The Digital Scholarship Lab’s Robert Nelson, Ed Ayers, Scott Nesbit (now at the University of Georgia), Justin Madron, and Nathaniel Ayers make up the academic half of the project. The private half of the partnership is Stamen Design, a renowned data visualization and design studio that has worked with clients ranging from Toyota and Airbnb to the U.S. Agency for International Development. Stamen is also, in the words of tech journalist Alexis Madrigal, “perhaps the leading creator of cool-looking maps.” Stamen’s fingerprints are all over American Panorama. The visualizations are beautifully structured, deeply immersive, and packed with information. In fact, data depth and data density are the hallmarks of these visualizations – I don’t think I’ve ever seen this much historical content visualized in this many different ways, all within a single browser window. Furthermore, the project’s visual interface presents a new and valuable framework for understanding the scale of these movements of people in a way that written narratives can struggle to convey. Writing about thousands or even millions of people moving around over the course of years and decades can often devolve into an abstract swirl of numbers, states, regions, and dates. American Panorama makes that swirl intelligible.

The project encapsulates many of the current hallmarks of digital history. It is aimed at a broad public audience and was “designed for anyone with an interest in American history or a love of maps.” Relatedly, the project is exploratory and descriptive rather than explicitly interpretive, offering only hints at how the reader should understand and interpret patterns. Outside of brief and rather modest textual asides, readers are largely left to make their own discoveries, construct their own narratives, and draw their own conclusions. The common justification for creating exploratory visualizations rather than argumentative or narrative-driven ones is that they encourage participatory engagement. Empowering readers to control how they interact with a visualization nudges them to delve deeper into the project and emerge with a richer understanding of the topic. But an exploratory framework hinges on a reader’s ability and willingness to discover, narrate, and interpret the project for themselves.

To take one example, American Panorama’s Foreign-Born Population, 1850-2010 offers by far the strongest interpretive stance out of the project’s four visualizations: “American history can never be understood by just looking within its borders.” Even so, the creators consign their interpretation to a short, solitary paragraph in the About This Map section, leaving readers to draw their own conclusions about the meaning and implications of this message. The tech blog Gizmodo, for instance, covered the project’s release under the headline: “See The US Welcome Millions Of Immigrants Over 150 Years In This Interactive Map.” Internet headlines have never exactly been a bastion of nuance, but to say that the U.S. “welcomed” immigrants is, well, not very accurate. It’s also an example of the kind of historical mischaracterization that can arise when projects push authorial interpretation into the background.

Full disclosure: I know and deeply admire the work of Rob Nelson, Scott Nesbit, and Ed Ayers. They are very, very smart historians, which is why I found myself wanting to hear more of their voices. What new patterns have they discovered? What stories and interpretations have they drawn from these patterns? How has the project changed their understanding of these topics? The creators of American Panorama do not answer these questions explicitly. Instead, they allow patterns, stories, and interpretations to swim just beneath the surface. This was likely a deliberate choice, and I don’t want to critique the project for failing to accomplish something that it never set out to do in the first place. American Panorama is not an academic monograph and it shouldn’t be treated as one. Nevertheless, the project left me hungry for a more explicit discussion of how it engages with historical interpretation and the existing literature.

I’d like to offer my own take on American Panorama using equal parts review and riff, one that combines an evaluation of the project’s strengths and weaknesses with a discussion of how it fits into themes and topics in U.S. history. To do so, I’ve focused on two visualizations: The Forced Migration of Enslaved People, 1810-1860 and The Overland Trails. Fair warning: in true academic fashion, I had far too much to say about the two visualizations, so I split the piece into two separate posts. The first is below, and the second will follow soon. (Update: you can read Part II here.)

Part I. The Forced Migration of Enslaved People, 1810-1860

In some ways, Americans remember slavery through the lens of movement. This begins with the Middle Passage, the horrifying transportation of millions of human beings from Africa to the Americas. The focus on movement then shifts to escape, perhaps best embodied in the Underground Railroad and its stirring biblical exodus from bondage to freedom. But there was a much darker, and less familiar, counterweight to the Underground Railroad: being “sold down the river” to new planting frontiers in the Deep South. The sheer volume of this movement dwarfed the trickle of runaways: between 1810 and 1860 southern planters and slave traders forced nearly one million enslaved people to move southward and westward. The Forced Migration of Enslaved People, 1810-1860 helps us understand the scale and trajectory of this mass movement of human beings.

The visualization uses a map and timeline to illustrate a clear decade-by-decade pattern: enslaved people streaming out of the Upper South and the eastern seaboard and into the cotton-growing regions of the Black Belt (western Georgia, Alabama, and Mississippi), the Mississippi River Valley, and eastern Texas and Arkansas. It shows that this shift was not uninterrupted, but came in fits and starts. The reverberations of the 1837 financial panic, for instance, dampened and diffused this movement during the 1840s. An accompanying data pane charts the in-migration and out-migration on a state and county level: during the 1830s more than 120,000 slaves left Virginia, even as 108,000 slaves streamed into Alabama. None of these findings are especially new for historians of the period, but The Forced Migration of Enslaved People brings them into sharp focus.

[Figure: ForcedMigration_Data]

On an interpretive level, The Forced Migration of Enslaved People helps reorient the locus of American slavery away from The Plantation and towards The Slave Market. This is part of a larger historiographical pivot, one that can be seen in Walter Johnson’s book Soul by Soul (1999). Johnson reminds us that American slavery depended not just on the coerced labor of black bodies, but on the commodification of those same bodies. It wasn’t enough to force people to work; the system depended first and foremost on the ability to buy and sell human beings. Because of this, Johnson argues that the primary sites of American slavery were slave markets in places like Charleston, Natchez, and New Orleans. Soul by Soul was an early landmark in the now flourishing body of literature exploring the relationship between slavery and capitalism. The book’s argument rested in large part on the underlying mass movement of black men, women, and children, both through slave markets and into the expanding planter frontier of the Southwest. American Panorama lays bare the full geography of this movement in all of its spatial and temporal detail.

There is a certain irony in using Walter Johnson’s Soul by Soul to discuss The Forced Migration of Enslaved People. After all, Johnson’s book includes a critique that might as well have been addressed directly to the project’s creators. He bluntly asserts that the use of maps and charts to illustrate the slave trade hides the lives and experience of the individuals that made up these aggregated patterns. Instead, Johnson calls for the kind of history “where broad trends and abstract totalities thickened into human shape.” (8) His critique echoes the debates that swirled around Robert Fogel and Stanley Engerman’s Time on the Cross (1974) and continue to swirl around the digital project Voyages: The Trans-Atlantic Slave Trade Database.

The creators of The Forced Migration of Enslaved People gesture towards the larger historiographical divide between quantification and dehumanization in an accompanying text: “Enslaved people’s accounts of the slave trade powerfully testify to experiences that cannot be represented on a map or in a chart.” Instead, they attempt to bring these two modes of history together by incorporating excerpted slave narratives alongside its maps and charts. Clicking on icons embedded in the map or the timeline reveals quotes from individual accounts that mention some dimension of the slave trade. This interface allows the reader to shift back and forth between the visual language of bars, dots, and hexbins, and the written words of formerly enslaved people themselves. The Forced Migration of Enslaved People uses a digital medium to present both the “broad trends and abstract totalities” and the “human shape” of individual lives. One of the analytical and narrative payoffs of an interactive interface is the ability to seamlessly move between vastly different scales of reading. The Forced Migration of Enslaved People breaks important new ground in this regard by blending the macro scale of demographics with the micro scale of individuals.

[Figure: ForcedMigration_Expanded]

Ultimately, however, the project’s attempt to combine narrative accounts and quantitative data falls short of its potential. On the whole, the scale of the individuals recedes under the scale of the data. The problem lies in the way in which the project presents its excerpted quotes. Flurries of names, places, events, and emotions appear divorced from the broader context of a particular narrative. Reading these text fragments can often feel like driving past a crash on the side of a highway. You might glimpse the faces of some passengers or the severity of the wreck, but you don’t know how they got there or what happens to them next. Then you pass another crash. And another. And another. The cumulative weight of all these dozens of wrecks is undeniable, and part of what makes the visualization effective. But it’s also numbing. Human stories begin to resemble data points, presented in chronological, bulleted lists and physically collapsed into two-line previews. The very features that make narratives by enslaved people such powerful historical sources – detail, depth, emotional connection – fade away within this interface. Narratives give voice to the millions of individuals whose stories we’ll never hear; The Forced Migration of Enslaved People helps us to hear some of those voices, but only briefly, and only in passing.

[Figure: ForcedMigration_Collapsed1]

Historians characterize the years leading up to the Civil War as a period defined by sectional conflict between North and South. The abolition of slavery was not the major flashpoint for this conflict; rather, the expansion of slavery into western states and territories was the primary wedge between the two sides. The issue would come to define national politics by pitting two competing visions of the nation against one another. The Forced Migration of Enslaved People reminds us that this was not just an ideological or political issue, but a spatial issue rooted in the physical movement of hundreds of thousands of people into areas like the Black Belt and the Mississippi River Valley. By the 1850s, many northerners feared that this great heave of slaveholders and enslaved people would continue onwards into the Far West. The Forced Migration of Enslaved People forces us to take those fears seriously. What if the visualization’s red hexbins didn’t stop in the cotton fields of eastern Texas? What if its timeline didn’t end in 1860? Southern slavery did not stand still during the antebellum era and its demise was far from inevitable. This visualization gives us a framework with which to understand that trajectory.

I doubt that most Americans would put slave traders and shackled black bodies within the historical pantheon of great national migrations, but American Panorama injects this vast movement of people into the history of the antebellum United States. In the second part of my discussion, I’ll turn my attention to a much more familiar historical migration unfolding at the same time: The Overland Trails.

The Perpetual Sunrise of Methodology

[The following is the text of a talk I prepared for a panel discussion about authoring digital scholarship for history with Adeline Koh, Lauren Tilton, Yoni Appelbaum, and Ed Ayers at the 2015 American Historical Association Conference.]

 
I’d like to start with a blog post that was written almost seven years ago now, titled “Sunset for Ideology, Sunrise for Methodology?” In it, Tom Scheinfeldt argued that the rise of digital history represented a disciplinary shift away from big ideas about ideology or theory and towards a focus on “forging new tools, methods, materials, techniques, and modes of work.” Tom’s post was a big reason why I applied to graduate school. I found this methodological turn thrilling – the idea that tools like GIS, text mining, and network analysis could revolutionize how we study history. Seven years later the digital turn has, in fact, revolutionized how we study history. Public history has unequivocally led the charge, using innovative approaches to archiving, exhibiting, and presenting the past in order to engage a wider public. Other historians have built powerful digital tools, explored alternative publication models, and generated online resources to use in the classroom.
 
But there is one area in which digital history has lagged behind: academic scholarship. To be clear: I’m intentionally using “academic scholarship” in its traditional, hidebound sense of marshaling evidence to make original, explicit arguments. This is an artificial distinction in obvious ways. One of digital history’s major contributions has, in fact, been to expand the disciplinary definition of scholarship to include things like databases, tools, and archival projects. The scholarship tent has gotten bigger, and that’s a good thing. Nevertheless there is still an important place inside that tent for using digital methods specifically to advance scholarly claims and arguments about the past.
 
In terms of argument-driven scholarship, digital history has over-promised and under-delivered. It’s not that historians aren’t using digital tools to make new arguments about the past. It’s that there is a fundamental imbalance between the proliferation of digital history workshops, courses, grants, institutes, centers, and labs over the past decade, and the impact this has had in terms of generating scholarly claims and interpretations. The digital wave has crashed headlong into many corners of the discipline. Argument-driven scholarship has largely not been one of them.
 
There are many reasons for this imbalance, including the desire to reach a wider audience beyond the academy, the investment in collection and curation needed for electronic sources, or the open-ended nature of big digital projects. All of these are laudable. But there is another, more problematic, reason for the comparative inattention to scholarly arguments: digital historians have a love affair with methodology. We are infatuated with the power of digital tools and techniques to do things that humans cannot, such as dynamically mapping thousands of geo-historical data points. The argumentative payoffs of these methodologies are always just over the horizon, floating in the tantalizing ether of potential and possibility. At times we exhibit more interest in developing new methods than in applying them, and in touting the promise of digital history scholarship rather than its results. 
 
What I’m going to do in the remaining time is to use two examples from my own work to try and concretize this imbalance between methods and results. The first example is a blog post I wrote in 2010. At the time I was analyzing the diary of an eighteenth-century Maine midwife named Martha Ballard, made famous by Laurel Ulrich’s prize-winning A Midwife’s Tale. The blog post described how I used a process called topic modeling to analyze about 10,000 diary entries written by Martha Ballard between 1785 and 1812. To grossly oversimplify, topic modeling is a technique that automatically generates groups of words more likely to appear with each other in the same documents (in this case, diary entries). So, for instance, the technique grouped the following words together:
 
gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds
 
As a human reader it’s pretty clear that these are words about gardening. Once I generated this topic, I could track it across all 10,000 entries. When I mashed twenty-seven years together, it produced this beautiful thumbprint of a New England growing season.
 
[Figure: Seasonal presence of the GARDENING topic in Martha Ballard’s diary]
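For readers curious what this looks like in code, here is a minimal sketch of the same kind of analysis in Python with the gensim library (not necessarily the tool behind the original post). The entries variable, the number of topics, and the index of the gardening topic are all assumptions:

# A minimal topic-modeling sketch, assuming `entries` is a list of
# (date, text) tuples holding the dated diary entries.
from collections import defaultdict
from gensim import corpora, models

texts = [text.lower().split() for _, text in entries]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]

# Fit an LDA model that sorts the entries' words into thirty topics.
lda = models.LdaModel(corpus, num_topics=30, id2word=dictionary, passes=10)

# Track one topic (here, a hypothetical "gardening" topic) month by month.
topic_id = 7  # hypothetical index of the gardening topic
monthly = defaultdict(list)
for (date, _), bow in zip(entries, corpus):
    weights = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    monthly[date.month].append(weights[topic_id])

for month in sorted(monthly):
    print(month, sum(monthly[month]) / len(monthly[month]))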
 
Interest in topic modeling took off right around the time that I wrote this post, and pretty soon it started getting referenced again and again in digital humanities circles. Four and a half years later, it has been viewed more than ten thousand times and been assigned on the syllabi of at least twenty different courses. It’s gotten cited in books, journal articles, conference presentations, grant applications, government reports, white papers, and, of course, other blogs. It is, without a doubt, the single most widely read piece of historical writing I have ever produced. But guess what? Outside of the method, there isn’t anything new or revelatory in it. The post doesn’t make an original argument and it doesn’t further our understanding of women’s history, colonial New England, or the history of medicine. It largely shows us things we already know about the past – like the fact that people in Maine didn’t plant beans in January.
 
People seized on this blog post not because of its historical contributions, but because of its methodological contributions. It was like a magic trick, showing how topic modeling could ingest ten thousand diary entries and, in a matter of seconds, tell you what the major themes were in those entries and track them over time, all without knowing the meaning of a single word. The post made people excited for what topic modeling could do, not necessarily what it did do; the methodology’s potential, not its results.
 
About four years after I published my blog post on Martha Ballard, I published a very different piece of writing. This was an article that appeared in last June’s issue of the Journal of American History, the first digital history research article published by the journal. In many ways it was a traditional research article, one that followed the journal’s standard peer review process and advanced an original argument about American history. But the key distinction was that I made my argument using computational techniques. 
 
The starting premise for my argument was that the late nineteenth-century United States has typically been portrayed as a period of integration and incorporation. Think of the growth of railroad and telegraph networks, or the rise of massive corporations like Standard Oil. In nineteenth-century parlance: “the annihilation of time and space.” This existing interpretation of the period hinges on geography – the idea that the scale of locality and region were getting subsumed under the scale of nation and system. I was interested in how these integrative forces actually played out in the way people may have envisioned the geography of the nation. 
 
So I looked at a newspaper printed in Houston, Texas, during the 1890s and wrote a computer script that counted the number of times the paper mentioned different cities or states – in effect, measuring how one newspaper crafted an imagined geography of the nation. What I found was that instead of creating the standardized, nationalized view of the world we might expect, the newspaper produced space in ways that centered on the scale of region far more than nation. It remained overwhelmingly focused on the immediate sphere of Texas, and even more surprisingly, on the American Midwest. Places like Kansas City, Chicago, and St. Louis were far more prevalent than I was expecting, and from this newspaper’s perspective Houston was more of a midwestern city than a southern one.
 
[Figure: Cameron Blevins, “Space, Nation, and the Triumph of Region: A View of the World from Houston,” Journal of American History 101, no. 1 (June 2014), 127.]
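The counting script itself doesn’t need to be fancy. Here is a minimal sketch of the approach, where the issues and gazetteer variables are hypothetical stand-ins for the OCR’d newspaper text and the list of place-names:

import re
from collections import Counter

counts = Counter()
for text in issues:
    for place in gazetteer:
        # Count whole-word, case-insensitive mentions of each place-name.
        pattern = r"\b" + re.escape(place) + r"\b"
        counts[place] += len(re.findall(pattern, text, flags=re.IGNORECASE))

# The most frequently mentioned places in the newspaper.
for place, n in counts.most_common(10):
    print(place, n)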
 
I would have never seen these patterns without a computer. And in trying to account for this pattern I realized that, while historians might enjoy reading stuff like this…
 
[Image: maine_zoom]
 
…newspapers often look a lot more like this:
 
[Image: rr_timetable_crop]
 
All of this really boring stuff – commodity prices, freight rates, railroad timetables, classified ads – made up a shockingly large percentage of content. Once you include the boring stuff, you get a much different view of the world from Houston in the 1890s. I ended up arguing that it was precisely this fragmentary, mundane, and overlooked content that explained the dominance of regional geography over national geography. I never would have been able to make this argument without a computer.
 
The article offers a new interpretation about the production of space and the relationship between region and nation. It issues a challenge to a long-standing historical narrative about integration and incorporation in the nineteenth-century United States. By publishing it in the Journal of American History, with all of the limitations of a traditional print journal, I was trying to reach a different audience from the one who read my blog post on topic modeling and Martha Ballard. I wanted to show a broader swath of historians that digital history was more than simply using technology for the sake of technology. Digital tools didn’t just have the potential to advance our understanding of American history – they actually did advance our understanding of American history.
 
To that end, I published an online component that charted the article’s digital approach and presented a series of interactive maps. But in emphasizing the methodology of my project I ended up shifting the focus away from its historical contributions. In the feedback and conversations I’ve had about the article since its publication, the vast majority of attention has focused on the method rather than the result: How did you select place-names? Why didn’t you differentiate between articles and advertisements? Can it be replicated for other sources? These are all important questions, but they skip right past the arguments that I’m making about the production of space in the late nineteenth century. In short: the method, not the result. 
 
I ended my article with a familiar clarion call:
Technology opens potentially transformative avenues for historical discovery, but without a stronger appetite for experimentation those opportunities will go unrealized. The future of the discipline rests in large part on integrating new methods with conventional ones to redefine the limits and possibilities of how we understand the past.
This is the rhetorical style of digital history. While reading through the conference program I was struck by just how many abstracts about digital history used the words “potential,” “promise,” “possibilities,” or in the case of our own panel, “opportunities.” In some ways 2015 doesn’t feel that different from 2008, when Tom Scheinfeldt wrote about the sunrise of methodology and the Journal of American History published a roundtable titled “The Promise of Digital History.” I think this is telling. Academic scholarship’s engagement with digital history seems to operate in a perpetual future tense. I’ve spent a lot of my career talking about what digital methodology can do to advance scholarly arguments. It’s time to start talking in the present tense.

Making Numbers Legible

What do you do with numbers? I mean this in the context of writing, not research. How do you incorporate quantitative evidence into your writing in a way that makes it legible for your readers? I’ve been thinking more and more about this as I write my dissertation, which examines the role of the nineteenth-century Post in the American West. Much like today, the Post was massive. Its sheer size was part of what made it so important. And I find myself using the size of the Post to help answer the curmudgeonly “so what?” question that stalks the mental corridors of graduate students. On a very basic level, the Post mattered because so many Americans sent so many letters through such a large network operated by so many people. Answering the “so what?” question means that I have to incorporate numbers into my writing. But numbers are tricky.

Let’s begin with the amount of mail that moved through the U.S. Post. In 1880 Americans sent 1,053,252,876 letters. That number is barely legible for most readers. I mean this in two ways. In a mechanical sense we HATE having to actually read so many digits. A more conceptual problem is that this big of a number doesn’t mean all that much. If I change 1,053,252,876 to 1,253,252,876, would it lead you, the reader, to a fundamentally different conclusion about the size of the U.S. Post? I doubt it, even though the difference of 200 million letters is a pretty substantial one. And if instead of adding 200 million letters I subtract 200 million letters – 1,053,252,876 down to 853,252,876 – the reader’s perception is more likely to change. But this is only because the number shed one of its digits and crossed the magic cognitive threshold from “billion” to “million.” It’s not because of an inherent understanding of what those huge numbers actually mean.

[Figure: Actual and perceived differences among 853,252,876, 1,053,252,876, and 1,253,252,876]

One strategy to make a number like 1,053,252,876 legible is by reduction: to turn large numbers into much smaller ones. If we spread out those billion letters across the population over the age of ten, the average American sent roughly twenty-eight letters over the course of 1880, or one every thirteen days. A ten-digit monstrosity turns into something the reader can relate to. After all, it’s easier to picture writing a letter every two weeks than it is to picture a mountain of one billion letters. Numbers, especially big ones, are easier to digest when they’re reduced to a more personal scale.

1,053,252,876 letters / 36,761,607 Americans over the age of ten = 28.65 letters / person
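The arithmetic behind that reduction is trivial, which is the point; a few lines of Python reproduce it:

letters = 1_053_252_876
population = 36_761_607            # Americans over the age of ten, 1880
per_person = letters / population  # about 28.65 letters per person
print(round(per_person, 2))        # 28.65
print(round(365 / per_person, 1))  # about 12.7, i.e. one letter every ~13 days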

A second way to make numbers legible is by comparison. The most direct corollary to the U.S. Post was the telegraph industry. Put simply, the telegraph is a lot sexier than the Post, and nineteenth-century Americans and modern historians alike lionized the technology. A typical account goes something like this: “News no longer traveled at the excruciatingly slow pace of ships, horses, feet, or trains. It now moved at 670 million miles per hour.” In essence, “the telegraph liberated information.” But the telegraph only liberated information if you could afford to pay for it. In 1880 the cost of sending a telegram through Western Union from San Francisco to New York was $2.50, or 125 times the price to mail a two-cent letter. Not surprisingly, Americans sent roughly 35 times as many letters as telegrams. The enormous size of the Post was in part a product of how cheap it was to use.

[Figure: Cost of telegram vs. letter, San Francisco to New York (1880)]

This points to a third strategy to make numbers legible: visualization. In the above case the chart acts as a rhetorical device. I’m less concerned with the reader being able to precisely measure the difference between $2.50 and $0.02 than I am with driving home the point that the telegraph was really, really expensive and the U.S. Post was really, really cheap. A more substantive comparison can be made by looking at the size of the Post Office Department’s workforce. In 1880 it employed an army of 56,421 postmasters, clerks, and contractors to process and transport the mail. Just how large was this workforce? In fact, the “postal army” was more than twice the size of the actual U.S. Army. Fifteen years removed from the Civil War there were now more postmasters than soldiers in American society. Readers are a lot better at visually comparing different bars than they are at doing mental arithmetic with large, unwieldy numbers.

[Figure: PostOffice_Military – size of the postal workforce vs. the U.S. Army]

Almost as important as the sheer size of the U.S. Post was its geographic reach. Most postal employees worked in one of 43,012 post offices scattered across the United States. A liberal postal policy meant that almost any community could successfully petition the department for a new post office. Wherever people moved, a post office followed close on their heels. This resulted in a sprawling network that stretched from one corner of the country to the other. But what did the nation’s largest spatial network actually look like?

[Figure: 1880_PostOffices – post offices in the United States, 1880]

Mapping 43,012 post offices gives the reader an instant sense of both the size and scope of the U.S. Post. The map serves an illustrative purpose rather than an argumentative one. I’m not offering interpretations of the network or even pointing out particular patterns. It’s simply a way for the reader to wrap their minds around the basic geography of such a vast spatial system. But the map is also a useful cautionary tale about visualizing numbers. If anything, the map undersells the size and extent of the Post. It may seem like a whole lot of data, but it’s actually missing around ten thousand post offices, or 22% of the total number that existed in 1880. Some of those offices were so obscure or had such a short existence that I wasn’t able to automatically find their locations. And these missing post offices aren’t evenly distributed: about 99% of Oregon’s post offices appear on the map compared to only 47% of Alabama’s.

Disclaimers aside, compare the map to a sentence I wrote earlier: “Most postal employees worked in one of 43,012 post offices scattered across the United States.” In that context the specific number 43,012 doesn’t make much of a difference – it could just as well be 38,519 or 51,933 – and therefore doesn’t contribute all that much weight to my broader point that the Post was ubiquitous in the nineteenth-century United States. A map of 43,012 post offices is much more effective at demonstrating my point. The map also has one additional advantage: it beckons the reader to not only appreciate the size and extent of the network, but to ask questions about its clusters and lines and blank spaces.* A map can spark curiosity and act as an invitation to keep reading. This kind of active engagement is a hallmark of good writing and one that’s hard to achieve using numbers alone. The first step is to make numbers legible. The second is to make them interesting.

* Most obviously: what’s going on with Oklahoma? Two things. Mostly it’s a data artifact – the geolocating program I wrote doesn’t handle Oklahoma locations very well, so I was only able to locate 19 out of 95 post offices. I’m planning to fix this problem at some point. But even if every post office appeared on the map, Oklahoma would still look barren compared to its neighbors. This is because Oklahoma was still Indian Territory in 1880. Mail service didn’t necessarily stop at its borders but postal coverage effectively fell off a cliff; in 1880 Indian Territory had fewer post offices than any other state/territory besides Wyoming. The dearth of post offices is especially telling given the ubiquity of the U.S. Post in the rest of the country, showing how the administrative status of the territory and decades of federal Indian policy directly shaped communications geography.

The County Problem in the West

Happy GIS Day! Below is a version of a lightning talk I’m giving today at Stanford’s GIS Day.

Historians of the American West have a county problem. It’s primarily one of geographic size: counties in the West are really, really big. A “List of the Largest Counties in the United States” might as well be titled “Counties in the Western United States (and a few others)” – you have to go all the way to #30 before you find one that falls east of the 100th meridian. The problem this poses to historians is that a lot of historical data was captured at a county level, including the U.S. Census.

[Figure: Map of California highlighting San Bernardino County]

San Bernardino County is famous for this – the nation’s largest county by geographic area, it includes the densely populated urban sprawl of the greater Los Angeles metropolis along with vast swathes of the uninhabited Mojave Desert. Assigning a single count of anything to San Bernardino County is to teeter on geographic absurdity. But, for nineteenth-century population counts in the national census, that’s all we’ve got.

[Figure: TheWest_1871_Population – population of the West by county, 1870 census]

Here’s a basic map of population figures from the 1870 census. You can see some general patterns: central California is by far the most heavily populated area, with some moderate settlement around Los Angeles, Portland, Salt Lake City, and Santa Fe. But for anything more detailed, it’s not terribly useful. What if there was a way to get a more fine-grained look at settlement patterns in these gigantic western counties? This is where my work on the postal system comes in. There was a post office in (almost) every nineteenth-century American town. And because the department kept records for all of these offices – the name of the office, its county and state, and the date it was established or discontinued – a post office becomes a useful proxy to study patterns over time and space. I assembled this data for a single year (1871) and then wrote a program to geocode each office, or to identify its location by looking it up in a large database of known place-names. I then supplemented it with the salaries of postmasters at each office for 1871. From there, I could finally put it all onto a map:

[Figure: TheWest_1871_PostOffices – post offices in the West, 1871]

The result is a much more detailed regional geography than that of the U.S. Census. Look at Wyoming in both maps. In 1870, the territory was divided into five giant rectangular counties, all of them containing less than 5,000 people. But its distribution of post offices paints a different picture: rather than vertical units, it consisted largely of a single horizontal stripe along its southern border.

[Figures: Wyoming in the 1870 census vs. Wyoming post offices]

Similarly, our view of Utah changes from a population core of Salt Lake City to a line of settlement running down the center of the territory, with a cluster in the southwestern corner completely obscured in the census map.

[Figures: Utah in the 1870 census vs. Utah post offices]

Post offices can also reveal transportation patterns: witness the clear skeletal arc of a stage-line that ran from the Oregon/Washington border southeast to Boise, Idaho.

[Figure: Dalles_Boise – post offices along the stage line to Boise]

Connections that didn’t mirror the geographic unit of a state or county tended to get lost in the census. One instance of this was the major cross-border corridor running from central Colorado into New Mexico. A map of post offices illustrates its size and shape; the 1870 census map can only gesture vaguely at both.

[Figures: the Colorado–New Mexico corridor in the 1870 census vs. post office maps]

The following question, of course, should be asked of my (and any) map: what’s missing? Well, for one, a few dozen post offices. This speaks to the challenges of geocoding more than 1,300 historical post offices, many of which might have only been in existence for a year or two. I used a database of more than 2 million U.S. place-names and wrote a program that tried to account for messy data (spelling variations, altered state or county boundaries, etc.). The program found locations for about 90% of post offices, while the remaining offices I had to locate by hand. Not surprisingly, they were missing from the database for a reason: these post offices were extremely obscure. Finding them entailed searching through county histories, genealogy message boards, and ghost town websites – a process that is simply not scalable beyond a single year. By 1880, the number of post offices in the West had doubled. By 1890, it had doubled again. I could conceivably spend years trying to locate all of these offices. So, what are the implications of incomplete data? Is automated, 90% accuracy “good enough”?
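For the curious, the matching logic can be sketched in a few lines of Python. This is a simplified stand-in for the actual program, assuming a hypothetical gazetteer that maps (name, county, state) tuples to coordinates: an exact lookup first, then a fuzzy fallback to absorb spelling variations.

import difflib

def geocode(office, gazetteer):
    key = (office["name"], office["county"], office["state"])
    if key in gazetteer:  # exact match on name, county, and state
        return gazetteer[key]
    # Fuzzy fallback: compare against place-names in the same state to
    # catch spelling variants and altered county boundaries.
    candidates = {name: coords
                  for (name, county, state), coords in gazetteer.items()
                  if state == office["state"]}
    matches = difflib.get_close_matches(office["name"], candidates, n=1, cutoff=0.9)
    return candidates[matches[0]] if matches else None  # None: locate by hand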

What else is missing? Differentiation. The salary of a postmaster partially addresses this problem, as the department used a formula to determine compensation based partially on the amount of business an office conducted. But it was not perfectly proportional. If it was, the map would be one giant circle covering everything: San Francisco conducted more business than any other office by several orders of magnitude. As it is, the map downplays urban centers while highlighting tiny rural offices. A post office operates in a kind of binary schema: no office, no people (well, at least very few). If there was an office, there were people there. We just don’t know how many. The map isn’t perfect, but it does start to tackle the county problem in the West.

*Note: You can download a CSV file containing post offices, postmaster salaries, and latitude/longitude coordinates here.*

Who Picked Up The Check?

Adventures in Data Exploration

In November 2012 the United States Postal Service reported a staggering deficit of $15.9 billion. For the historian, this raises the question: was it always this bad? Others have penned far more nuanced answers to this question, but my starting point is a lot less sophisticated: a table of yearly expenses and income.

[Figure: US Postal Department surplus (gray) or deficit (red) by year]

So, was the postal department always in such terrible fiscal shape? No, not at first. But from the 1840s onward, putting aside the 1990s and early 2000s, deficits were the norm. The next question: What was the geography of deficits? Which states paid more than others? Essentially, who picked up the check?

Every year the Postmaster General issued a report containing a table of receipts and revenues broken down by state. Let’s take a look at 1871:

[Figure: 1871 Annual Report of the Postmaster General – receipts and expenditures by state]

Because it’s only one table, I manually transcribed the columns into a spreadsheet. At this point, I could turn to ArcGIS to start analyzing the data, maybe merging the table with a shapefile of state boundaries provided by NHGIS. But ArcGIS is a relatively high-powered tool better geared for sophisticated geospatial analysis. What I’m doing doesn’t require all that much horsepower. And, in fact, quantitative spatial relationships (e.g., measurements of distance or area) aren’t all that important for answering the questions I’ve posed. There are a number of different software packages for exploring data, but Tableau provides a quick-and-dirty, drag-and-drop interface. In keeping with the nature of data exploration, I’ve purposefully left the following visualizations rough around the edges. Below is a bar graph, for instance, showing the surplus or deficit of each state, grouped into rough geographic regions:

[Figure: Postal surplus or deficit by state – 1871]

Or, in map form:

[Figure: Postal surplus (black) or deficit (red) by state – 1871]

Between the map and the bar graph, it’s immediately apparent that:
a) Most states ran a deficit in 1871
b) The Northeast was the only region that emerged with a surplus

So who picked up the check? States with large urban, literate populations: New York, Pennsylvania, Massachusetts, Illinois. Who skipped out on the bill? The South and the West. But these are absolute figures. Maybe Texas and California simply spent more money than Arizona and Idaho because they had more people. So let’s normalize our data by analyzing it on a per-capita basis, using census data from 1870.
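I did this exploration in Tableau, but in code the normalization amounts to a merge and a division. A minimal pandas sketch, with hypothetical file and column names:

import pandas as pd

postal = pd.read_csv("postal_1871.csv")  # columns: state, receipts, expenditures
census = pd.read_csv("census_1870.csv")  # columns: state, population

df = postal.merge(census, on="state")
df["surplus"] = df["receipts"] - df["expenditures"]
df["per_capita"] = df["surplus"] / df["population"]

# The largest per-person deficits rise to the top of the list.
print(df.sort_values("per_capita").head(10))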

[Figure: Postal surplus or deficit per person by state – 1871]

The South and the West may have both skipped out on the bill, but it was the West that ordered prime rib and lobster before it left the table. Relative to the number of its inhabitants, western states bled the system dry. A new question emerges: how? What was causing this extreme imbalance of receipts and expenditures in the West? Were westerners simply not paying into the system?

ReceiptsExpendituresByRegion
Postal Receipts and Expenditures per Person by Region – 1871

Actually, no. The story was a bit more complicated. On a per-capita basis, westerners were paying slightly more money into the system than any other region. The problem was that providing service to each of those westerners cost substantially more than in any other region: $38 per person, or roughly 4-5 times the cost of service in the east. For all its lore of rugged individualism and mistrust of big government, the West received the most bloated government “hand-out” of any region in the country. This point has been driven home by a generation of “New Western” historians who demonstrated the region’s dependence on the federal government, ranging from massive railroad subsidies to the U.S. Army’s forcible removal of Indians and the opening of their lands to western settlers. Add the postal service to that long list of federal largesse in the West.

But what made mail service in the West so expensive? The original 1871 table further breaks down expenses by category (postmaster salaries, equipment, buildings, etc.). Some more mucking around in the data reveals a particular kind of expense that dominated the western mail system: transportation.

TransportationMap_PerCapita_Crop
Transportation Expenses per Person by State (State surplus in black, deficit in red) – 1871

High transport costs were partially a function of population density. Many western states like Idaho or Montana consisted of small, isolated communities connected by long mail routes. But there’s more to the story. Beginning in the 1870s, a series of scandals wracked the postal department over its “star” routes (designated as any non-steamboat, non-railroad mail route). A handful of “star” route carriers routinely inflated their contracts and defrauded the government of millions of dollars. These scandals culminated in the criminal trial of high-level postal officials, contractors, and a former United States Senator. In 1881, the New York Times printed a list of the ninety-three routes under investigation for fraud. Every single one of these routes lay west of the Mississippi.

1881_StarRouteFrauds_Crop
Annual Cost of “Star” Routes Under Investigation for Fraud – 1881 (Locations of Route Start/End Termini)

The rest of the country wasn’t just subsidizing the West. It was subsidizing a regional communications system steeped in fraud and corruption. The original question – “Who picked up the check?” – leads to a final cliffhanger: why did all of these frauds occur in the West?

Digital Humanities Labs and Undergraduate Education

Over the past few months I was lucky enough to do research in Stanford’s Spatial History Lab. Founded three years ago through funding from the Andrew W. Mellon Foundation, the lab has grown into a multi-faceted space for conducting different projects and initiatives dealing with spatial history. Having worked in the lab as a graduate affiliate over the past nine months as well, I can attest to what a fantastic environment it provides: computers, a range of software, wonderful staff, and an overarching collaborative setting. There are currently 6-8 ongoing projects in various stages at the lab under the direction of faculty and advanced graduate students, focusing on areas ranging from Brazil to Chile to the American West. Over ten weeks this summer, eight undergraduate research assistants worked on these projects. I had the opportunity to work alongside them from start to finish, and came away fully convinced of the potential of this kind of lab setting for furthering undergraduate humanities education.

The eight students ranged from freshmen to recent graduates and majored in everything from history to environmental studies to computer science. Some entered the program with technical experience in ArcGIS software; others had none. Each of them worked on an existing project and was expected both to perform traditional RA duties for the project’s director and to develop their own research agenda for the summer. Under this second track, they worked towards the end goal of producing an online publication for the website based on their own original research. Guided by a carefully planned curriculum, they each selected a topic within the first few weeks, conducted research during the bulk of the summer, went through a draft phase followed by a peer-review process, and rolled out a final publication and accompanying visualizations by the end of the ten weeks. Although not all of them reached the point of publication in that time, by the tenth and final week each of them had produced a coherent historical argument or theme (which is often more than I can say about my own work).

The results were quite impressive, especially given the short time frame. For instance, rising fourth-year Michael DeGroot documented and analyzed the shifting national borders in Europe during World War II. Part of his analysis included a dynamic visualization that allows the reader to see major territorial changes between 1938 and 1945. DeGroot concludes that one major consequence of all of these shifts was the creation of broadly ethnically homogeneous states. In “Wildlife, Neoliberalism, and the Pursuit of Happiness,” Julio Mojica, a rising junior majoring in Anthropology and Science, Technology, and Society, analyzed survey data from the late twentieth century on the island of Chiloé in order to examine links between low civic participation and environmental degradation. Mojica concludes that reliance on the booming salmon industry resulted in greater tolerance for pollution, a pattern that manifested itself more strongly in urban areas. As a final example, senior history major Cameron Ormsby studied late-nineteenth-century land speculation in Fresno County and impressively waded into a historiographical debate over the issue. Instead of speculators serving as necessary “middle-men” between small farmers and the state, Ormsby convincingly argues that they in fact handicapped the development of rural communities.

The success of the summer program speaks not only to the enthusiasm and quality of Stanford undergraduates, but more centrally to the direction of the lab and its overall working environment. By fostering an attitude of exploration, creativity, and collaboration, the lab not only encouraged but expected the students to participate in projects as intellectual peers. The dynamic in the lab was not a traditional one of a faculty member dictating the agenda for the RAs. In many cases, the students had far greater technical skills and knew more about their specific subjects than the project director. The program was structured to give the students flexibility and freedom to develop their own ideas, which placed the onus on them to take a personal stake in the wider projects. In doing so, they were exposed to the joys, challenges, and nitty-gritty details of digital humanities research: false starts and dead-ends were just as important as the pivotal, rewarding “aha!” moments that come with any project. Thinking back on internships or research assistant positions, it’s difficult for me to imagine another undergraduate setting that would encourage this kind of wonderfully productive hand-dirtying process. And while I think digital humanities labs hold great potential for advancing humanities scholarship, I have grown more and more convinced that some of their greatest potential lies in the realm of pedagogy.

Topic Modeling Martha Ballard’s Diary

In A Midwife’s Tale, Laurel Ulrich describes the challenge of analyzing Martha Ballard’s exhaustive diary, which records daily entries over the course of 27 years: “The problem is not that the diary is trivial but that it introduces more stories than can be easily recovered and absorbed.” (25) This fundamental challenge is the one I’ve tried to tackle by analyzing Ballard’s diary using text mining. There are advantages and disadvantages to such an approach – computers are very good at counting the instances of the word “God,” for instance, but less effective at recognizing that “the Author of all my Mercies” should be counted as well. The question remains, how does a reader (computer or human) recognize and conceptualize the recurrent themes that run through nearly 10,000 entries?

One answer lies in topic modeling, a method from computational linguistics that attempts to find words that frequently appear together within a text and then group them into clusters. I was introduced to topic modeling through a separate collaborative project that I’ve been working on under the direction of Matthew Jockers (who also recently topic-modeled posts from Day in the Life of Digital Humanities 2010). Matt, ever-generous and enthusiastic, helped me to install MALLET (Machine Learning for LanguagE ToolkiT), developed by Andrew McCallum at UMass as “a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.” MALLET allows you to feed in a series of text files, which it then processes to generate a user-specified number of word clusters that it thinks are related topics. I don’t pretend to have a firm grasp on the inner statistical/computational plumbing of how MALLET produces these topics, but in the case of Martha Ballard’s diary, it worked. Beautifully.
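
For readers who want to experiment with this approach themselves, roughly the same workflow can be sketched in Python using the gensim library’s LDA implementation. To be clear, this is an illustrative stand-in rather than the MALLET pipeline I actually ran, and it assumes the diary has already been read into a list of entry strings.

```python
# An illustrative stand-in for the MALLET workflow, using gensim's LDA
# implementation. `entries` is assumed to be a list of strings, one per
# diary entry; MALLET itself is a Java command-line tool.
from gensim import corpora
from gensim.models import LdaModel

texts = [entry.lower().split() for entry in entries]   # naive tokenization
dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors

lda = LdaModel(corpus, id2word=dictionary, num_topics=30, passes=10)

# The twenty most probable words for each of the thirty topics
for topic_id, words in lda.show_topics(num_topics=30, num_words=20):
    print(topic_id, words)
```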

With some tinkering, MALLET generated a list of thirty topics of twenty words each, which I then labeled with descriptive titles. Below is a quick sample of what the program “thinks” are some of the topics in the diary:

  • MIDWIFERY: birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient
  • CHURCH: meeting attended afternoon reverend worship foren mr famely performd vers attend public supper st service lecture discoarst administred supt
  • DEATH: day yesterday informd morn years death ye hear expired expird weak dead las past heard days drowned departed evinn
  • GARDENING: gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds
  • SHOPPING: lb made brot bot tea butter sugar carried oz chees pork candles wheat store pr beef spirit churnd flower
  • ILLNESS: unwell mr sick gave dr rainy easier care head neighbor feet relief made throat poorly takeing medisin ts stomach

When I first ran the topic modeler, I was floored. A human being would intuitively lump words like attended, reverend, and worship together based on their meanings. But MALLET is completely unconcerned with the meaning of a word (which is fortunate, given the difficulty of teaching a computer that, in this text, discoarst actually means discoursed). Instead, the program is only concerned with how the words are used in the text, and specifically what words tend to be used similarly.

Besides a remarkably impressive ability to recognize cohesive topics, MALLET also allows us to track those topics across the text. With help from Matt and using the statistical package R, I generated a matrix with each row as a separate diary entry, each column as a separate topic, and each cell as a “score” signaling the relative presence of that topic. For instance, on November 28, 1795, Ballard attended the delivery of Timothy Page’s wife. Consequently, MALLET’s score for the MIDWIFERY topic jumps up significantly on that day. In essence, topic modeling accurately recognized, in a mere 55 words (many abbreviated into a jumbled shorthand), the dominant theme of that entry:

“Clear and pleasant. I am at mr Pages, had another fitt of ye Cramp, not So Severe as that ye night past. mrss Pages illness Came on at Evng and Shee was Deliverd at 11h of a Son which waid 12 lb. I tarried all night She was Some faint a little while after Delivery.”
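
In code, the matrix-building step might look something like the sketch below, continuing the earlier gensim stand-in (again an illustration rather than the actual R workflow, with `dates` assumed to be the list of entry dates):

```python
# Continuing the gensim stand-in above: build an entries-by-topics
# matrix, where each cell scores the presence of a topic in an entry.
# `dates` is assumed to be a list of datetime objects, one per entry.
import pandas as pd

rows = []
for bow in corpus:
    dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    rows.append([dist[t] for t in range(30)])  # 30 topics, as above

matrix = pd.DataFrame(rows, index=pd.DatetimeIndex(dates))

# Average each topic's score by calendar month (1-12), the kind of
# seasonal aggregation used for the COLD WEATHER barometer below
monthly = matrix.groupby(matrix.index.month).mean()
print(monthly)
```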

The power of topic modeling really emerges when we examine thematic trends across the entire diary. As a simple barometer of its effectiveness, I used one of the generated topics that I labeled COLD WEATHER, which included words such as cold, windy, chilly, snowy, and air. When its entry scores are aggregated into months of the year, it shows exactly what one would expect over the course of a typical year:

Cold Weather

As a barometer, this made me a lot more confident in MALLET’s accuracy. From there, I looked at other topics. Two topics seemed to deal largely with HOUSEWORK:

1. house work clear knit wk home wool removd washing kinds pickt helping banking chips taxes picking cleaning pikt pails

2. home clear washt baked cloaths helped washing wash girls pies cleand things room bak kitchen ironed apple seller scolt

When charted over the course of the diary, these two topics trace how frequently Ballard mentions these kinds of daily tasks:

Housework

Both topics moved in tandem, with a high correlation coefficient of 0.83, and both steadily increased as she grew older (excepting a curious divergence in the last several years of the diary). This is somewhat counter-intuitive, as one would think the household responsibilities for an aging grandmother with a large family would decrease over time. Yet this pattern bolsters the argument made by Ulrich in A Midwife’s Tale, in which she points out that the first half of the diary was “written when her family’s productive power was at its height.” (285) As her children married and moved into different households, and her own husband experienced mounting legal and financial troubles, her daily burdens around the house increased. Topic modeling allows us to quantify and visualize this pattern, a pattern not immediately visible to a human reader.
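
For the curious: once such an entry-by-topic matrix exists, the correlation itself is a one-liner, though the column positions below are hypothetical and depend on the order in which the topics were generated.

```python
# The correlation between the two HOUSEWORK topics, given the matrix
# built above. Column positions are hypothetical placeholders.
import numpy as np

housework_a = matrix.iloc[:, 8]   # first HOUSEWORK topic
housework_b = matrix.iloc[:, 17]  # second HOUSEWORK topic
print(np.corrcoef(housework_a, housework_b)[0, 1])  # the 0.83 reported above
```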

Even more significantly, topic modeling allows us a glimpse not only into Martha’s tangible world (such as weather or housework topics), but also into her abstract world. One topic in particular leaped out at me:

feel husband unwel warm feeble felt god great fatagud fatagued thro life time year dear rose famely bu good

The most descriptive label I could assign this topic would be EMOTION – a tricky and elusive concept for humans to analyze, much less computers. Yet MALLET did an impressive job of identifying when Ballard was discussing her emotional state. How does this topic appear over the course of the diary?

Emotion

Like the housework topic, there is a broad increase over time. In this chart, the sharp changes are quite revealing. In particular, we see Martha more than double her use of EMOTION words between 1803 and 1804. What exactly was going on in her life at this time? Quite a bit. Her husband was imprisoned for debt and her son was indicted by a grand jury for fraud, causing a cascade effect on Martha’s own life – all of which Ulrich describes as “the family tumults of 1804-1805.” (285) Little wonder that Ballard increasingly invoked “God” or felt “fatagued” during this period.

I am absolutely intrigued by the potential of topic modeling for historical source material. In many ways, Martha Ballard’s diary seems ideally suited for this kind of analysis: short, content-driven entries that usually touch upon a limited number of subjects appear to produce remarkably cohesive and accurate topics. In some cases (especially the EMOTION topic), MALLET did a better job of grouping words than a human reader would. But the biggest advantage lies in its ability to extract unseen patterns in word usage. For instance, I would not have thought that the words “informed” or “hear” would cluster so strongly into the DEATH topic. But they do, and not only that, they do so more strongly within that topic than the words dead, expired, or departed. This speaks volumes about the spread of information – in Martha Ballard’s diary, death is largely written about in the context of news being disseminated through face-to-face interactions. When used in conjunction with traditional close reading of the diary and other forms of text mining (for instance, charting Ballard’s social network), topic modeling offers a new and valuable way of interpreting the source material.

I’ll end my post with a topic near and dear to Martha Ballard’s heart: her garden. To a greater degree than any other topic, GARDENING words boast incredible thematic cohesion (gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds) and over the course of the diary’s average year they also beautifully depict the fingerprint of Maine’s seasonal cycles:

Gardening

Note: this post is part of an ongoing series detailing my work on text mining Martha Ballard’s diary.

Chasing the “Perfect Data” Dragon

Whenever I put on my proselytizing robes to explain the potential of digital humanities to a layperson, I usually reach for the data deluge trope. “If you read a book a day for the rest of your life, it would take you 30-something lifetimes to read one million books. Google has already digitized several times that number.” Etc., etc. The picture I end up painting is one where the DH community is better positioned than traditional academics to access, manipulate, and draw out meaning from the growing mountains of digital data. Basically, now that all this information is digitized, we can feed the 1’s and 0’s into a machine and, presto, innovative scholarship.

Of course, my proselytizing is a bit disingenuous. The dirty little secret is that not all data is created equal. And especially within the humanist’s turf, digitized sources are rarely “machine-ready”. The more projects I work on, the more convinced I become that there is one real constant to them: I always spend far more time than I expect preparing, cleaning, and improving my data. Why? Because I can.

A crucial advantage of digital information is that it’s dynamic and malleable. You can clean up a book’s XML tags, or tweak the coordinates of a georectified map, or expand the shorthand abbreviations in a digitized letter. Which is all well and good, but it comes with a price tag. In a way that is fundamentally different from the analog world, perfection is theoretically attainable. And that’s where an addictive element creeps into the picture. When you can see mistakes and know you can fix them, the temptation to find and fix every single one is overwhelming.

In many respects, cleaning your data is absolutely crucial to good scholarship. The historian reading an eighteenth-century newspaper might know that “Gorge Washington” refers to the first president of the United States, but unless the spelling error gets fixed, that name probably won’t get identified correctly by a computer. Of course, it’s relatively easy to change “Gorge” to “George”, but what happens when you are working with 30,000 newspaper pages? Manually going through and fixing spelling mistakes (or, more likely, OCR mistakes) defeats the purpose and neuters the advantage of large-scale text mining. While there are ways to automate this kind of data cleaning, most methods are going to be surprisingly time-intensive. And once you start down the path of data cleaning, it can turn into whack-a-mole, with five “Thoms Jefferson”s poking their heads up out of the hole for every one “Gorge Washington” you fix.
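
To make that concrete, one naive automated pass might look like the sketch below; the hard, time-intensive part is discovering the corrections in the first place.

```python
# A naive automated pass: a lookup table of known corrections applied
# as whole-word substitutions. Assembling the table is the hard part.
import re

CORRECTIONS = {
    "Gorge Washington": "George Washington",
    "Thoms Jefferson": "Thomas Jefferson",
}

def clean(text: str) -> str:
    for wrong, right in CORRECTIONS.items():
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)
    return text

print(clean("Gorge Washington was the first president."))
```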

Chasing the “perfect data” dragon becomes an addictive cycle, one fueled by equal parts optimism and fear. Having a set of flawlessly-encoded Gothic novels could very well lead to the next big breakthrough in genre classification. On the other hand, what if all those missed “Gorge Washingtons” are the final puzzle pieces that would illuminate early popular conceptions of presidential power? The problem is compounded by scale: any specific error can usually be fixed, but as the “data deluge” swells our corpora by orders of magnitude, the number and variety of errors swell along with them, which severely complicates the ability to locate and rectify all of them.

At some point, the digital material simply has to be “good enough”. But breaking out of the “perfect data” dragon-chasing is easier said than done. “How accurate does my dataset have to be in order to be statistically relevant?” “How do I even know how clean my data actually is?” “How many hours of my time is it worth to bump up the data accuracy from 96% to 98%?” These are the kinds of questions that DH researchers suddenly struggle with – questions that a background in the humanities ill-prepares them to answer. Just like so many aspects of doing this kind of work, there is a lot to learn from other disciplines.

Certain kinds of data quality issues get mitigated by the “safety in numbers” approach. Pinpointing the exact cross-streets of a rail depot is pretty important if you’re creating a map of a small city. But if you’re looking at all the rail depots in, say, the Midwest, the “good enough” degree of locational error gets substantially bigger. Over the course of thirty million words, the number of “George Washingtons” is going to far outweigh and balance out the number of “Gorge Washingtons”. With large-scale digital projects, it’s easier to see that chasing the “perfect data” dragon is both impossible and unnecessary. On the other hand, certain kinds of data quality problems get magnified at scale. Small discrepancies get flattened out in bigger datasets, but foundational or commonly-repeated errors get exaggerated, particularly if some errors have been fixed and others not. For instance, if you fixed every “Gorge Washington” but didn’t catch the more frequently misspelled “Thoms Jefferson”, comparing the textual appearances of the two presidents over those thirty million words is going to be heavily skewed in George’s direction.

As non-humanities scholars have been demonstrating for years, these problems aren’t new and they aren’t unmanageable. But as digital humanists sort through larger and larger sets of data, it will become increasingly important to know when to ignore the dragon and when to give chase.

Valley of the Shadow and the Digital Database

Since its inception as a website in the early 1990s, the digital history project Valley of the Shadow has received awards from the American Historical Association, been profiled in Wired Magazine, and been termed a “milestone in American historiography” in Reviews in American History. The project is also widely regarded as one of the principal pioneers within the rough-and-tumble wilderness of early digital history.1 Conceived at the University of Virginia as the brainchild of Edward Ayers (historian of the American South and now president of the University of Richmond), the project examines two communities, one Northern and one Southern, in the Shenandoah Valley during the American Civil War. The initiative documented and digitized thousands upon thousands of primary source materials from Franklin County, Pennsylvania and Augusta County, Virginia, including letters, diaries, newspapers, speeches, census and government records, maps, images, and church records.

By any measure, Valley of the Shadow has been a phenomenal success. Over the course of a decade and a half, it has provided the catalyst for a host of books, essays, CD-ROMs, teaching aids, and articles – not to mention more than a few careers. At times it seems that everyone and their mother in the digital history world has some kind of connection to Valley of the Shadow. The impact the project has had, both within and outside the academy, is a bit overwhelming. In this light, I decided to revisit Valley of the Shadow with a more critical lens and examine how it has held up over the years.

The bottom of the Valley’s portal reads “Copyright 1993-2007.” Not many academic sites can claim that kind of longevity, but it also carries a price: the website already feels a bit dated. The structure of the website is linear, vertical, and tree-like. The parent portal opens up into a choice between three separate sections: The Eve of War (Fall 1859 – Spring 1861), The War Years (Spring 1861 – Spring 1865), and The Aftermath (Spring 1865 – Spring 1870). Each of these is divided into different repositories of source material, from church records to tax and census data to battle maps. Clicking on a repository leads to different links (for instance, two links leading to the two counties’ letters). A few more clicks can lead to, say, a letter from Benjamin Franklin Cochran to his mother in which he leads off with the delicious detail of lived experience that historians love: “I am now writing on a bucket turned wrong side up.”

In this sense, the database is geared towards a vertical experience, in which users “drill down” (largely through hyperlinks) to reach a fine-grained level of detail: Portal -> Time Period -> Source Material Type -> County -> Letter. What this approach lacks is the kind of flexible, horizontal experience that has become a hallmark of today’s online user experience. If one wanted to jump from Cochran’s letter to, for instance, battle maps of the skirmishes he referenced, or to local newspaper coverage of the events he wrote about, the process is disjointed, requiring the user to “drill up” to the appropriate level and then “drill down” again to find the battle maps or newspapers. This emphasis on verticality is largely due to the partitioned nature of the website, divided as it is into so many boxed categories. This makes finding a specific source a bit easier, but restricts a user’s ability to explore across the site’s different eras, geographies, and source types.

If different sections of the website are partitioned from one another, what kind of options exist for opening the database itself beyond the website’s own walls? In October 2009, NiCHE held a conference on Application Programming Interfaces (APIs) for the Digital Humanities, outlining the problem it was tackling as follows:

To date, however, most of these resources have been developed with human-friendly web interfaces. This makes it easy for individual researchers to access material from one site at a time, while hindering the kind of machine-to-machine exchange that is required for record linkage across repositories, text and data mining initiatives, geospatial analysis, advanced visualization, or social computing.

This description highlights the major weakness of Valley of the Shadow: its (relative) lack of interactivity and interoperability. A human researcher can access specific information from the website, but it remains a major challenge to employ more advanced digital research techniques on that information. Every database is inherently incomplete, but one way to mitigate this problem is to open up the contents of a database beyond the confines of the database itself. The following scenario might fall under the “pipe-dream” category, but it illustrates the potential of an online database: a researcher writes a script to pull out every letter in Valley of the Shadow written by John Taggart, searches both the Valley’s database and national census records to identify the letters’ recipients, captures each household’s location and income level, and uses that data to plot Taggart’s social world on a geo-referenced historical map or in a social network visualization. Again, this might be a pipe-dream, but it does highlight the possibilities for opening up Valley of the Shadow’s phenomenally rich historical content into a more interactive and interoperable database.
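
To make the pipe-dream slightly more concrete, such a script might look like the following sketch, in which every endpoint and field is imaginary; nothing like this currently exists.

```python
# Entirely hypothetical: this API root, its routes, and its fields do
# not exist. A sketch of the pipe-dream scenario described above.
import requests

API = "https://valley.example.edu/api"  # imaginary API root

# Pull every letter written by John Taggart
letters = requests.get(f"{API}/letters", params={"author": "John Taggart"}).json()

for letter in letters:
    recipient = letter["recipient"]
    # Imagined census lookup: match the recipient to a household record
    household = requests.get(f"{API}/census", params={"name": recipient}).json()
    # Location and income could then feed a map or network visualization
    print(recipient, household.get("location"), household.get("income"))
```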

At the end of the day, Valley of the Shadow deserves every ounce of acclaim it has received. Beyond making a staggering array of primary sources available and accessible to researchers, educators, and students, it helped pave the way for the current generation of digital humanists. Valley of the Shadow embodies many of the tenets of this kind of scholarship: multi-modal, innovative, and most importantly, collaborative. Its longevity and success speaks to the potential of digital history projects, and should continue to serve as a resource and model moving forward.


1 I, for one, imagine the early days of digital history to be a rough-and-tumble wilderness, resplendent with modem-wrangling Mosaic cowboys and Usenet bandits.