Topic Modeling and the Margaret Sanger Papers

Cathy Hajo, my former advisor at the Margaret Sanger Papers at NYU, recently wrote a fascinating blog post on topic modeling and its applicability to the humanities. It’s exciting to learn about other scholars’ engagement with digital technologies! Maybe we could bring some of these folks to Duke to participate in whatever Digital History Speakers series we put together next year!

With Cathy’s permission, I have re-blogged her piece:

Margaret Sanger

I recently attended the Women’s History in the Digital World conference, sponsored by Bryn Mawr College’s Albert M. Greenfield Digital Center for the History of Women’s Education. The sessions were packed with great papers and projects, many of which started the wheels turning on different ways that we might use digital research tools to better understand Sanger and her ideas.

In the very first panel I attended, Bridget Baird of Connecticut College and Cameron Blevins of Stanford University talked about topic modeling, the process of using a computer program to mine digital texts and build sets of words that frequently appear together. Their work compared the diaries of Martha Ballard and Elizabeth Drinker. The women lived about a century apart and in very different conditions, so there was an expectation that their diaries would describe very different lives. The sample comparisons shown at the panel demonstrated both similarity in word usage and contrasts that reflected differences in social class, location, and time period.

A visualization of gardening terms by month in the Ballard diary.

What topic modeling can offer a historian is an objective snapshot of the content of the collection. Rather than relying on our own readings of documents to combine them into subject categories, we look instead to the words that appear together most frequently and then label those words in ways that make sense to us. In the case of Martha Ballard, one cluster of words (birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient) clearly related to her profession as a midwife. Others, regarding gardening (see image above), fall into predictable seasonal patterns. Still other groupings of words are less easy to label, and some may not at first make any cohesive sense. Yet we can study the frequencies with which certain groups of words occur.
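To make the idea of "words that appear together" a bit more concrete, here is a minimal sketch in Python. It is not what MALLET or any real topic-modeling tool does (those fit a statistical model); it only counts how many entries each pair of words shares, which is a much simpler stand-in. The three entries are made up for illustration and are not real Ballard text, and the name `cooccurrence_counts` is my own.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus: three invented diary-style entries, not real Ballard text.
entries = [
    "birth safe infant born morn",
    "planted beans garden warm",
    "calld birth infant safe",
]


def cooccurrence_counts(docs):
    """Count how many documents each unordered word pair appears in together."""
    counts = Counter()
    for doc in docs:
        # Use a sorted set so each pair is counted once per document,
        # always in the same (alphabetical) order.
        words = sorted(set(doc.split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


pair_counts = cooccurrence_counts(entries)
top_pairs = pair_counts.most_common(3)

for pair, count in top_pairs:
    print(pair, count)
```

In this toy example the pairs that rise to the top are the ones a reader would probably label "midwifery," which is the same labeling step described above, just at a tiny scale.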

We cannot rely only on the computer-driven groups when analyzing texts. The next step is to look at the texts that contain repeating word patterns and conduct a close reading to see what we can learn about the topic. Plotting the topic over time enables us to locate trends in how important the topic was to the author, and when we compare those trends with other authors, we can investigate differences in the ways that two authors valued these topics or the different ways that they expressed themselves.

An example from the Ballard study is instructive, as Cameron Blevins discussed in his blog:

. . . topic modeling allows us a glimpse not only into Martha’s tangible world (such as weather or housework topics), but also into her abstract world. One topic in particular leaped out at me:

feel husband unwel warm feeble felt god great fatagud fatagued thro life time year dear rose famely bu good

The most descriptive label I could assign this topic would be EMOTION – a tricky and elusive concept for humans to analyze, much less computers. Yet MALLET did a largely impressive job in identifying when Ballard was discussing her emotional state. How does this topic appear over the course of the diary?


Like the housework topic, there is a broad increase over time. In this chart, the sharp changes are quite revealing. In particular, we see Martha more than double her use of EMOTION words between 1803 and 1804. What exactly was going on in her life at this time? Quite a bit. Her husband was imprisoned for debt and her son was indicted by a grand jury for fraud, causing a cascade effect on Martha’s own life – all of which Ulrich describes as “the family tumults of 1804-1805.” (285) Little wonder that Ballard increasingly invoked “God” or felt “fatagued” during this period.

Adopting topic modeling tools for the Sanger Papers’ Speeches and Articles project will be interesting as we have already spent a lot of time developing and affixing detailed subject terms to the texts in order to provide additional ways to search and display them. When you have over 600 speeches and articles, the vast majority of which discuss birth control, the trick is uncovering subtle differences between and among them. We create detailed index entries for each text in the edition, narrowing the focus in so that our readers can use the subjects to cut through the documents to find the best ones on a specific issue. Topic modeling can offer us some new groupings of documents that we might have overlooked, and it will give us the capacity to analyze Sanger’s rhetoric over time, looking for key changes.

An example might be the belief among women’s historians that Sanger abandoned her feminist rationales for birth control in the late 1910s and early 1920s as she sought support from experts in the fields of medicine, social work and eugenics. This comes from a qualitative reading of Sanger’s writings, not a strict quantitative one. If we can identify a cluster of words as “feminist,” we can then trace how frequently those words appeared in Sanger’s writings and whether the findings match our assumptions.
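As a rough sketch of what that tracing could look like, the Python below computes, for each year, the share of all words that belong to a hand-labelled cluster. Everything here is an assumption for illustration: the dated texts are invented (not Sanger's writings), the "feminist" word list is hypothetical, and the function name `cluster_share_by_year` is my own. A real project would take its clusters from a topic-modeling tool rather than a hand-picked set.

```python
from collections import defaultdict

# Hypothetical data: (year, text) pairs invented for illustration only.
documents = [
    (1914, "women freedom rebel body freedom"),
    (1914, "women liberty workers"),
    (1925, "clinic doctors health women"),
    (1925, "science population health"),
]

# Hypothetical word cluster that an editor has labelled "feminist".
cluster = {"women", "freedom", "liberty", "rebel"}


def cluster_share_by_year(docs, words):
    """Return {year: fraction of that year's tokens that belong to `words`}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in docs:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token in words)
    return {year: hits[year] / totals[year] for year in sorted(totals)}


shares = cluster_share_by_year(documents, cluster)

for year, share in shares.items():
    print(year, round(share, 2))
```

Dividing by the year's total word count matters: without it, a year in which Sanger simply wrote more would look more "feminist" than a quieter year.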

Will we find clusters of words we can describe with terms like “feminism,” “eugenics,” or “reproductive health”? What words will we find clumped with “abortion” or with “birth control”? Will we be able to trace these clusters over time to see how they change over the course of Sanger’s life? Interesting questions, and ones that we hope to be able to ask our digital edition.

Now just to find a programmer to work with!


DEVONThink Bootcamp

At our first digital tools Bootcamp, I shared some of the program’s functions and how I use it to organize my dissertation research. The audience had a few people who already used Evernote and/or DEVONThink, so we had a great exchange about what these programs are useful for as well as their limitations.

The primary advantage to DEVONThink is that you can compile, view, tag, organize and search any kind of document in one “database.” So, if you have a project with photographs of archival documents, video, articles clipped from the web, audio files, PDFs, Word documents, or anything else, it can all be imported into your database. You can also create annotations or notes that attach to files, or create a new file (a text document) within DEVONThink. Like Evernote (though in my opinion, not as seamlessly or intuitively), you can clip things from the web as you surf. DEVONThink also has some “intelligent” functions that help you find related terms and files.

After a summer research trip to archives in Alaska, I had tens of thousands of photographs and no manageable way to deal with them. I imported them all to DEVONThink, renamed the files, merged files that were of a single source, and began the process of organizing them in a way that makes sense for my project. I take notes on sources within DEVONThink, and even make citations for each source as I go.

In the discussion, we talked about the tension between being able to amass material and being able to meaningfully navigate it. People asked specific questions about how to do things, and we compared the ways we use the program. From what people shared, it seems DEVONThink is pretty adaptable. You can customize it to fit the needs of your sources, discipline, and way of writing and researching.


“An Avalanche is Coming”

This is an essay that is floating around right now in the context of Duke’s decision to join a consortium of schools that will accept credit for undergraduate online courses offered by one of the member schools. More on that soon.

The essay, called “An avalanche is coming,” seems a little inflammatory to me. One of the promotional quotes on the website is:

‘Our belief is that deep, radical and urgent transformation is required in higher education as much as it is in school systems. Our fear is that, perhaps as a result of complacency, caution or anxiety, or a combination of all three, the pace of change is too slow and the nature of change too incremental.’

I completely agree that the environment we are working in is changing, but the tone of the piece is trying to bully people. Its basic stance is that even though we haven’t spent much time thinking about the short and long term consequences, everyone should jump onboard with their agenda or else they will be wiped out. What’s the rush? What’s gonna happen if we don’t completely transform academic pedagogy immediately? Have we really been suffering so much before now? Most of the leaders in this debate are themselves products of traditional elite university education. Have they been handicapped because of it? Can we talk about the digital turn without insulting people who still value the engaged pedagogy of a liberal education?

One of the big complaints in the piece is that the student consumer is “king” now and they rule with their money. But students aren’t getting the most for their money in universities that spend a bunch on research and influential scholars, since that supposedly doesn’t influence their learning. The focus needs to be on good teaching. So instead of funding more professors so the class sizes can be smaller, they argue that we should fund fewer professors and have students all over the world learn online from a few good teachers.

Why is Duke rushing to join this online courses consortium without running it through the traditional channels of faculty governance? Are they afraid that the faculty will shut the project down? Someone suggested to me today that Duke is rushing because they want to become one of the dominant schools that can then sell its classes to smaller and poorer universities, whose junior faculty and adjuncts will be out of luck.

I’m looking forward to our meeting this week so I can hear y’all’s thoughts on these issues.

Taming the Elephant

Our workshop on Evernote and DEVONthink is going really well!  I came across this interesting article on how to use the basic elements of Evernote to suit specific goals for the program etc: “Taming the Elephant.”

“Out of the box, Evernote comes with some pretty robust syncing tools for all your note-taking needs. If you haven’t dug in to all Evernote can do, though, you might not be aware of everything on offer or just how well you can integrate Evernote into your workflow. From automation to advanced searches, we’re going to make Evernote start working harder for you.”


March Madness, Facebook data mining, and Mapping

A good friend of mine sent me this article about a creative use of Facebook data mining and visualization of basketball fan loyalties (a very appropriate project in the midst of March Madness).  Arguably, projects like this visualization of “fandom” are a good way to introduce students to the concepts of data mining and mapping, putting into motion the cogs and coils of inspiration for mapping projects related to change over time.  First step: Facebook and fandom.  Second step: Wikipedia and the historical record. Third step: independent research project.  The possibilities are endless!

Check out this image of Duke and North Carolina loyalties:

Visual Complexity

I’m at a PhD Lab event and the speaker, Ann Pendleton-Jullian, just mentioned this awesome looking website, Visual Complexity.

It’s a site about different ways to visualize data. It’s about how you frame what you think is most important in the information. Think about that famous information map that links the temperature to Napoleon’s retreat from Russia.

Miscellaneous, Tumblr

This media studies scholar at UC Irvine has posted her published work on Tumblr. I am wondering about the legality of that, what kind of agreements you have with the periodicals you get published in:

Apparently “Tumblr feminism” is a thing:

My tumblr is gpaigewelch, and my course tumblr is writing-101genderandspace

I have also been thinking about how much work it is to maintain a social media presence (let alone a compelling one that meets the demands and social codes of each platform) and how that does or does not contribute to what counts as our professional work.

Journal of Digital Humanities?

Whoa! So I googled “digital humanities” to find cool stuff to populate our new blog, but I was not expecting to find this! It’s the Journal of Digital Humanities, a peer-reviewed open access journal that is on its fourth issue! I think it could be a potential source of material for us to think with, with articles like “Academic History Writing and its Disconnects.” This article touches on a bunch of issues we talked about in our last meeting, like the death of books and the possibilities and limitations of OCR for historians. Here is a little excerpt:

 At the same time we are confronted by a profound intellectual challenge that addresses the very nature of the historical discipline. This transition from the ‘book’ to something new fundamentally undercuts what historians do more generally. When one starts to unpick the nature of the historical discipline it is tied up with the technologies of the printed page and the book in ways that are powerful and determining. Footnotes, post-Rankean cross referencing, and the practises of textual analysis are embedded within the technology of the book, and its library.