Stylometry is the quantitative study of literary style through computational text analysis. It’s based on the idea that we all have a unique, consistent, and recognizable style in our writing. This includes our vocabulary, our use of punctuation, the average length of our words and sentences, and so on.
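To make those features concrete, here is a minimal sketch of how a few of them might be measured with nothing but the Python standard library. The function name `style_features`, the naive sentence splitter, and the sample sentence are all illustrative assumptions, not part of this project's code, and a real analysis would use a proper tokenizer.

```python
import re
import string


def style_features(text):
    """Return a few simple style measurements for a text sample.

    Hypothetical helper for illustration only; the tokenization is crude.
    """
    # Treat runs of letters (and apostrophes) as words.
    words = re.findall(r"[A-Za-z']+", text)
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Count every ASCII punctuation character in the sample.
    punctuation = [ch for ch in text if ch in string.punctuation]

    # Avoid division by zero on empty input.
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)

    return {
        "vocabulary_size": len({w.lower() for w in words}),
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "avg_sentence_length": len(words) / n_sentences,
        "punctuation_per_100_words": 100 * len(punctuation) / n_words,
    }


# Made-up sample sentence, not a quotation from any of the novels.
sample = "It was dark; the moor was silent. We waited, and we listened."
features = style_features(sample)
print(features)
```

On a sample this short the numbers mean very little. Measurements like these only start to look like a fingerprint when they are computed over tens of thousands of words.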
A typical application of stylometry is authorship attribution. This is the process of identifying the author of a document, for example when investigating plagiarism or resolving disputes over the origin of a historical document.
In this Quick Success Data Science project, we’ll use Python, seaborn, and the Natural Language Toolkit (NLTK) to see if Sir Arthur Conan Doyle left behind a linguistic fingerprint in his novel, The Lost World. More specifically, we’ll use semicolons to determine whether Sir Arthur or his contemporary, H.G. Wells, is the likely author of the book.
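As a rough preview of the semicolon idea, the sketch below counts semicolons per 1,000 words and picks the known author whose rate is closest to that of the disputed text. It uses only the standard library rather than NLTK or seaborn, and the function names, author labels, and text snippets are invented placeholders, so treat it as one possible way to frame the comparison rather than the method used in this project.

```python
import re


def semicolons_per_1000_words(text):
    """Return the number of semicolons per 1,000 words (hypothetical helper)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    return 1000 * text.count(";") / len(words)


def closest_author(unknown_text, known_texts):
    """Return the author label whose semicolon rate is nearest the unknown's."""
    target = semicolons_per_1000_words(unknown_text)
    return min(
        known_texts,
        key=lambda name: abs(semicolons_per_1000_words(known_texts[name]) - target),
    )


# Invented placeholder snippets, NOT quotations from the novels.
known = {
    "author_a": "He paused; he listened; then he ran for the door.",
    "author_b": "He paused and listened. Then he ran for the door.",
}
unknown = "She waited; she watched; at last she spoke aloud."

guess = closest_author(unknown, known)
print(guess)
```

A single punctuation mark is a thin basis for attribution, so a result like this is best read as one piece of evidence and not as a verdict.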
Sir Arthur Conan Doyle (1859–1930) is best known for the Sherlock Holmes stories. H.G. Wells (1866–1946) is famous for several groundbreaking science fiction novels, such as The Invisible Man.
In 1912, Strand Magazine published The Lost World, a science fiction novel, in serialized form. Although its author is known, let’s pretend it’s in dispute and it’s our job to solve the mystery. Experts have narrowed the field down to two authors: Doyle and Wells. Wells is slightly favored because The Lost World is a work of science fiction and includes troglodytes similar to the Morlocks in his 1895 book, The Time Machine.
To solve this problem, we’ll need representative works for each author. For Doyle, we’ll use The Hound of the Baskervilles, published in 1901. For Wells, we’ll use The War of the Worlds, published in 1898.
Fortunately for us, all three novels are in the public domain and available through Project Gutenberg. For convenience, I’ve downloaded them to this Gist and stripped out the licensing information.
Authorship attribution requires the application of Natural Language Processing (NLP). NLP is a…