Cathy Hajo, my former advisor at the Margaret Sanger Papers at NYU, recently wrote a fascinating blog post on topic modeling and its applicability to the humanities. It’s exciting to learn about other scholars’ engagement with digital technologies! Maybe we could bring some of these folks to Duke to participate in whatever Digital History Speakers series we put together next year!
With Cathy’s permission, I have re-blogged her piece:
I recently attended the Women’s History in the Digital World conference, sponsored by Bryn Mawr College’s Albert M. Greenfield Digital Center for the History of Women’s Education. The sessions were packed with great papers and projects, many of which started the wheels turning on different ways that we might use digital research tools to better understand Sanger and her ideas.
In the very first panel I attended, Bridget Baird of Connecticut College and Cameron Blevins of Stanford University talked about topic modeling, the process of using a computer program to mine digital texts and build sets of words that frequently appear together. Their work compared the diaries of Martha Ballard and Elizabeth Drinker. The women lived about a century apart and in very different conditions, so there was an expectation that their diaries would describe very different lives. The sample comparisons shown at the panel demonstrated both similarity in word usage and contrasts that reflected differences in social class, location, and time period.
What topic modeling can offer a historian is an objective snapshot of the content of the collection. Rather than relying on our own readings of documents to group them into subject categories, we look instead to the words that appear together most frequently and then label those words in ways that make sense to us. In the case of Martha Ballard, one cluster of words (birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient) clearly related to her profession as a midwife. Others, such as those about gardening (see image above), fall into predictable seasonal patterns. Still other groupings of words are less easy to label, and some may not at first make any cohesive sense. Yet we can still study the frequencies with which certain groups of words occur.
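To make the idea of "words that appear together" concrete, here is a minimal sketch in Python. It is not what MALLET or any real topic-modeling tool does; it only counts how often pairs of words share an entry, which is a much cruder stand-in for the same intuition. The function name and the sample entries are invented for illustration and are not taken from Ballard's diary.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(entries):
    """Count how often each pair of distinct words shares an entry.

    A simplification for illustration: real topic models use
    probabilistic methods rather than raw pair counts.
    """
    pair_counts = Counter()
    for entry in entries:
        # Use each word once per entry, sorted so pairs have a stable order.
        words = sorted(set(entry.lower().split()))
        pair_counts.update(combinations(words, 2))
    return pair_counts


# Hypothetical entries, invented for this sketch.
entries = [
    "birth safe morn infant",
    "birth infant patient safe",
    "garden beans sowed",
    "garden beans peas",
]

counts = cooccurrence_counts(entries)
for pair, n in counts.most_common(4):
    print(pair, n)
```

Pairs that keep turning up together (here, the midwifery words and the gardening words) are the raw material that a historian would then label by hand.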
We cannot rely only on the computer-generated groups when analyzing texts. The next step is to look at the texts that contain repeating word patterns and conduct a close reading to see what we can learn about the topic. Plotting a topic over time enables us to locate trends in how important the topic was to the author. When we compare one author with another, we can also investigate differences in how the two valued these topics or in the ways that they expressed themselves.
An example from the Ballard study is instructive, as Cameron Blevins discussed in his blog:
. . . topic modeling allows us a glimpse not only into Martha’s tangible world (such as weather or housework topics), but also into her abstract world. One topic in particular leaped out at me:
feel husband unwel warm feeble felt god great fatagud fatagued thro life time year dear rose famely bu good
The most descriptive label I could assign this topic would be EMOTION – a tricky and elusive concept for humans to analyze, much less computers. Yet MALLET did a largely impressive job in identifying when Ballard was discussing her emotional state. How does this topic appear over the course of the diary?
Like the housework topic, there is a broad increase over time. In this chart, the sharp changes are quite revealing. In particular, we see Martha more than double her use of EMOTION words between 1803 and 1804. What exactly was going on in her life at this time? Quite a bit. Her husband was imprisoned for debt and her son was indicted by a grand jury for fraud, causing a cascade effect on Martha’s own life – all of which Ulrich describes as “the family tumults of 1804-1805.” (285) Little wonder that Ballard increasingly invoked “God” or felt “fatagued” during this period.
Adopting topic modeling tools for the Sanger Papers’ Speeches and Articles project will be interesting, as we have already spent a lot of time developing and affixing detailed subject terms to the texts in order to provide additional ways to search and display them. When you have over 600 speeches and articles, the vast majority of which discuss birth control, the trick is uncovering subtle differences between and among them. We create detailed index entries for each text in the edition, narrowing the focus so that our readers can use the subjects to cut through the documents to find the best ones on a specific issue. Topic modeling can offer us some new groupings of documents that we might have overlooked, and it will give us the capacity to analyze Sanger’s rhetoric over time, looking for key changes.
An example might be the belief among women’s historians that Sanger abandoned her feminist rationales for birth control in the late 1910s and early 1920s as she sought support from experts in the fields of medicine, social work and eugenics. This comes from a qualitative reading of Sanger’s writings, not a strict quantitative one. If we can identify a cluster of words as “feminist,” we can then trace how frequently those words appeared in Sanger’s writings and whether the findings match our assumptions.
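Tracing a cluster like this could look something like the sketch below: given a list of (year, text) pairs and a set of cluster words, it computes what share of each year's words belongs to the cluster. Everything here is an assumption made for illustration. The word list is hypothetical and not derived from Sanger's writings, the sample texts are invented, and a real analysis would also need to handle punctuation, spelling variants, and differing document lengths.

```python
from collections import defaultdict


def cluster_share_by_year(documents, cluster_words):
    """Return, for each year, the fraction of all words that are in the cluster."""
    cluster = {w.lower() for w in cluster_words}
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in documents:
        words = text.lower().split()
        totals[year] += len(words)
        hits[year] += sum(1 for w in words if w in cluster)
    # Skip years with no words to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Hypothetical cluster and invented texts, for illustration only.
feminist_cluster = ["women", "freedom", "liberty"]
documents = [
    (1914, "women freedom liberty body"),
    (1914, "freedom for women"),
    (1925, "doctors clinic health women"),
]

shares = cluster_share_by_year(documents, feminist_cluster)
for year, share in shares.items():
    print(year, round(share, 2))
```

A falling share across the years would be consistent with the historians' reading, though it would still call for a close reading of the texts behind the numbers.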
Will we find clusters of words we can describe with terms like “feminism,” “eugenics,” or “reproductive health”? What words will we find clumped with “abortion” or with “birth control”? Will we be able to trace these clusters over time to see how they change over the course of Sanger’s life? These are interesting questions, and ones that we hope to be able to ask of our digital edition.
Now just to find a programmer to work with!