Topic Modeling and the Margaret Sanger Papers

Cathy Hajo, my former advisor at the Margaret Sanger Papers at NYU, recently wrote a fascinating blog post on topic modeling and its applicability to the humanities. It’s exciting to learn about other scholars’ engagement with digital technologies! Maybe we could bring some of these folks to Duke to participate in whatever Digital History Speakers series we put together next year!

With Cathy’s permission, I have re-blogged her piece:

Margaret Sanger

I recently attended the Women’s History in the Digital World conference, sponsored by Bryn Mawr College’s Albert M. Greenfield Digital Center for the History of Women’s Education. The sessions were packed with great papers and projects, many of which started the wheels turning on different ways that we might use digital research tools to better understand Sanger and her ideas.

In the very first panel I attended, Bridget Baird of Connecticut College and Cameron Blevins of Stanford University talked about topic modeling, the process of using a computer program to mine digital texts and build sets of words that frequently appear together. Their work compared the diaries of Martha Ballard and Elizabeth Drinker. The women lived about a century apart and in very different conditions, so there was an expectation that their diaries would describe very different lives. The sample comparisons shown at the panel demonstrated both similarity in word usage and contrasts that reflected differences in social class, location, and time period.

A visualization of gardening terms by month in the Ballard diary.

What topic modeling can offer a historian is an objective snapshot of the content of the collection. Rather than relying on our own readings of documents to combine them into subject categories, we look instead to the words that appear together most frequently and then label those words in ways that make sense to us. In the case of Martha Ballard, one cluster of words (birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient) clearly related to her profession as a midwife. Others, regarding gardening (see image above), fall into predictable seasonal patterns. Still other groupings of words are less easy to label, and some may not at first make any cohesive sense. Yet we can study the frequencies with which certain groups of words occur.
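To make the idea of "words that appear together" concrete, here is a minimal Python sketch. It is not what a tool like MALLET does (that fits a probabilistic topic model); it only counts how often pairs of words share an entry. The entries are invented for illustration and are not quotations from the Ballard diary, and `cooccurrence_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# Invented sample entries for illustration only (not real diary text).
entries = [
    "calld to birth infant born safe",
    "sowd beans and peas in garden",
    "birth of infant safe delivery",
    "garden beans planted peas",
]

def cooccurrence_counts(docs):
    """Count how many documents each unordered word pair appears in."""
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc.split()))  # unique words, alphabetical order
        counts.update(combinations(words, 2))
    return counts

pair_counts = cooccurrence_counts(entries)

# Pairs that share more than one entry hint at a common theme.
frequent_pairs = sorted(pair for pair, n in pair_counts.items() if n > 1)
for pair in frequent_pairs:
    print(pair, pair_counts[pair])
```

Even in this toy version, the pairs that recur fall into two groups, one about births and one about the garden. Deciding what to call each group is still left to the reader.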

We cannot rely only on the computer-generated groups when analyzing texts. The next step is to look at the texts that contain repeating word patterns and conduct a close reading to see what we can learn about the topic. Plotting the topic over time enables us to locate trends in how important the topic was to the author. When we compare one author with another, we can investigate differences in how they valued these topics or in the ways that they expressed themselves.

An example from the Ballard study is instructive, as Cameron Blevins discussed in his blog:

. . . topic modeling allows us a glimpse not only into Martha’s tangible world (such as weather or housework topics), but also into her abstract world. One topic in particular leaped out at me:

feel husband unwel warm feeble felt god great fatagud fatagued thro life time year dear rose famely bu good

The most descriptive label I could assign this topic would be EMOTION – a tricky and elusive concept for humans to analyze, much less computers. Yet MALLET did a largely impressive job in identifying when Ballard was discussing her emotional state. How does this topic appear over the course of the diary?


Like the housework topic, there is a broad increase over time. In this chart, the sharp changes are quite revealing. In particular, we see Martha more than double her use of EMOTION words between 1803 and 1804. What exactly was going on in her life at this time? Quite a bit. Her husband was imprisoned for debt and her son was indicted by a grand jury for fraud, causing a cascade effect on Martha’s own life – all of which Ulrich describes as “the family tumults of 1804-1805.” (285) Little wonder that Ballard increasingly invoked “God” or felt “fatagued” during this period.

Adopting topic modeling tools for the Sanger Papers’ Speeches and Articles project will be interesting, as we have already spent a lot of time developing and affixing detailed subject terms to the texts in order to provide additional ways to search and display them. When you have over 600 speeches and articles, the vast majority of which discuss birth control, the trick is uncovering subtle differences between and among them. We create detailed index entries for each text in the edition, narrowing the focus so that our readers can use the subjects to cut through the documents to find the best ones on a specific issue. Topic modeling can offer us some new groupings of documents that we might have overlooked, and it will give us the capacity to analyze Sanger’s rhetoric over time, looking for key changes.

An example might be the belief among women’s historians that Sanger abandoned her feminist rationales for birth control in the late 1910s and early 1920s as she sought support from experts in the fields of medicine, social work and eugenics. This comes from a qualitative reading of Sanger’s writings, not a strict quantitative one. If we can identify a cluster of words as “feminist,” we can then trace how frequently those words appeared in Sanger’s writings and whether the findings match our assumptions.
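The tracing step described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the word list standing in for a “feminist” cluster, the sample texts, and the years are all invented and not drawn from the Sanger Papers. A real study would take the cluster from a topic model rather than pick it by hand.

```python
from collections import defaultdict

# Hypothetical cluster; a real one would come from a topic model.
cluster = {"freedom", "women", "rights", "motherhood"}

# Invented (year, text) pairs standing in for dated speeches and articles.
documents = [
    (1914, "women must claim freedom and rights"),
    (1914, "voluntary motherhood is freedom"),
    (1925, "physicians and clinics advise women"),
    (1925, "medical research and clinics"),
]

def cluster_share_by_year(docs, words):
    """Return, per year, the fraction of all tokens that belong to the cluster."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in docs:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token in words)
    return {year: hits[year] / totals[year] for year in sorted(totals)}

shares = cluster_share_by_year(documents, cluster)
for year, share in shares.items():
    print(year, round(share, 2))
```

Dividing by the total word count for each year keeps a year with many long documents from looking more “feminist” just because it contains more words.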

Will we find clusters of words we can describe with terms like “feminism,” “eugenics,” or “reproductive health”? What words will we find clumped with “abortion” or with “birth control”? Will we be able to trace these clusters over time to see how they change over the course of Sanger’s life? Interesting questions, and ones that we hope to be able to ask our digital edition.

Now just to find a programmer to work with!


“An Avalanche is Coming”

This is an essay that is floating around right now in the context of Duke’s decision to join a consortium of schools that will accept credit for undergraduate online courses offered by one of the member schools. More on that soon.

The essay, called “An avalanche is coming,” seems a little inflammatory to me. One of the promotional quotes on the website is:

‘Our belief is that deep, radical and urgent transformation is required in higher education as much as it is in school systems. Our fear is that, perhaps as a result of complacency, caution or anxiety, or a combination of all three, the pace of change is too slow and the nature of change too incremental.’

I completely agree that the environment we are working in is changing, but the tone of the piece is trying to bully people. Its basic stance is that even though we haven’t spent much time thinking about the short and long term consequences, everyone should jump onboard with their agenda or else they will be wiped out. What’s the rush? What’s gonna happen if we don’t completely transform academic pedagogy immediately? Have we really been suffering so much before now? Most of the leaders in this debate are themselves products of traditional elite university education. Have they been handicapped because of it? Can we talk about the digital turn without insulting people who still value the engaged pedagogy of a liberal education?

One of the big complaints in the piece is that the student consumer is “king” now and they rule with their money. But students aren’t getting the most for their money in universities that spend a bunch on research and influential scholars, since that supposedly doesn’t influence their learning. The focus needs to be on good teaching. So instead of funding more professors so the class sizes can be smaller, they argue that we should fund fewer professors and have students all over the world learn online from a few good teachers.

Why is Duke rushing to join this online courses consortium without running it through the traditional channels of faculty governance? Are they afraid that the faculty will shut the project down? Someone suggested to me today that Duke is rushing because they want to become one of the dominant schools that can then sell its classes to smaller and poorer universities, whose junior faculty and adjuncts will be out of luck.

I’m looking forward to our meeting this week so I can hear y’all’s thoughts on these issues.

A few thoughts on Evernote

The first event in our “Digital Tools Bootcamp” series was a success.
I was really pleased that I was able to share Evernote with people who
hadn’t already been introduced to it. Evernote, for those of you who
don’t know, is a cloud-based data organizing program. For the free
version there is a monthly upload limit, but I don’t think I have ever
come close to reaching it. You have to be connected to the internet to
use it, but you can upgrade to the paid version if you plan on needing
it offline.

The organizing principle of Evernote is “notes” (text, images, pdfs,
audio), which you organize into “notebooks” (Anthro class, Haiti
research, wedding planning). These can be organized further into
“notebook stacks” (school, event planning, teaching and research).
Right now I use it mostly for keeping track of all my class notes and
readings, as well as all the events and projects I am working on. The
whole database as well as individual notebooks are searchable, and
you can also use tags to label individual notes and search them that
way as well.

One of the coolest and most innovative features of Evernote, in my
opinion, is the seamless way a “web clipper” integrates into my
browser (I use Chrome). When I am doing research, looking at recipes,
or whatever, and I find something that I want to save and keep for
later, I can click the Evernote add-on in my browser and a little
window pops up, allowing me to choose a selection, the article,
or the full page of the website, and decide which notebook I want
to store it in. Then the webpage is stored in my Evernote account.
Even if the original website is removed, the clipping in my account
will remain.

I hope you get a chance to explore Evernote and decide if it is a
useful tool for you. I know it has made a big difference in the way I
organize my notes and my research.

Transcriva Review

I’ve been getting increasingly interested in oral history over the past few years, and I recently discovered a program that makes working with oral histories SO MUCH EASIER! It’s called Transcriva. I’m amazed that I ever tried to use any other system for listening to and transcribing, annotating, or taking notes on interviews. I was a little wary at first, because it costs $30.00, but they let you play around with the very limited free version and I decided to spring for it because I was so frustrated with the inadequacies of the system I was using before (word processor + iTunes = no fun). I’m very glad I did. Disclaimer: I think it might only be available for Macs.

It’s not a complicated program, so I was able to start using it quickly. To begin with, you link the transcript file to an audio file on your computer or hard drive (or online, but I haven’t tried that yet). Then you can assign different speakers, who each have their own shortcut keys to facilitate switching between speakers. You make new entries in the transcript by pressing “enter” or by using a new speaker’s shortcut key. There are also shortcut keys to move backward or forward or to pause the audio. You can slow down or speed up the listening speed. And then, once you have transcribed or annotated your stuff, you can easily jump to different points in the interview in both the audio and the text. Overall it just streamlines the whole process and makes it much quicker, more intuitive and more manageable. I don’t dread transcription as much as I used to. It’s even kind of fun now.

There are other features I haven’t taken advantage of yet. Apparently you can record audio files directly into Transcriva, and you can link your transcripts to online content as well. You can also export the transcripts as text files, preserving the speakers and the timestamps of your notes.

In sum, if you are working with audio files or interviews for your research, I would strongly suggest you check out Transcriva!


Visual Complexity

I’m at a PhD Lab event and the speaker, Ann Pendleton-Jullian, just mentioned this awesome-looking website, Visual Complexity.

It’s a site about different ways to visualize data. It’s about how you frame what you think is most important in the information. Think about that famous information map that links the temperature to Napoleon’s retreat from Russia.

Journal of Digital Humanities?

Whoa! So I googled “digital humanities” to find cool stuff to populate our new blog, but I was not expecting to find this! It’s the Journal of Digital Humanities, a peer-reviewed open access journal that is on its fourth issue! I think it could be a potential source of material for us to think with, with articles like “Academic History Writing and its Disconnects.” This article touches on a bunch of issues we talked about in our last meeting, like the death of books and the possibilities and limitations of OCR for historians. Here is a little excerpt:

 At the same time we are confronted by a profound intellectual challenge that addresses the very nature of the historical discipline. This transition from the ‘book’ to something new fundamentally undercuts what historians do more generally. When one starts to unpick the nature of the historical discipline it is tied up with the technologies of the printed page and the book in ways that are powerful and determining. Footnotes, post-Rankean cross referencing, and the practises of textual analysis are embedded within the technology of the book, and its library.

Digital Timepiece: a note on the header image

I was searching the internet for images that might be appropriate headers for our Digital History Working Group website, and I came across this page. It is an online book called Digital Timepiece by Nathaniel Haefner, which is organized around different ways of thinking about time and storytelling. I’m happy to change the header if someone has an image they prefer. But it is an appropriate beginning to our endeavor, no?