Blog: Unleashing Machine Learning on Literature’s Great Works
My dreams of an ML/text startup inch toward reality.
As a writer and editor who focuses largely on tech, I was instantly intrigued when OpenAI, the nonprofit ostensibly designed to prevent A.I. from being used in terrible ways, announced that it had created a “large-scale unsupervised language model” (named GPT-2) capable of generating “coherent paragraphs of text” (according to the institute’s blog posting).
Trained on a data set of 8 million web pages and built with 1.5 billion parameters, GPT-2 could supposedly achieve “state-of-the-art performance on many language modeling benchmarks.” In other words, it could effectively predict the next word in a text string.
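To make “predict the next word” a little more concrete, here is a minimal sketch in Python. To be clear, this is not how GPT-2 works internally: it is a toy model that simply counts which word most often follows another in a tiny made-up corpus, and the names `train_bigrams`, `predict_next`, and `continue_seed` are my own inventions for illustration. The shape of the task is the same, though: take some seed text, guess the next word, append it, repeat.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def continue_seed(follows, seed, max_words=5):
    """Extend a seed phrase one predicted word at a time."""
    words = seed.split()
    for _ in range(max_words):
        nxt = predict_next(follows, words[-1])
        if nxt is None:
            break  # the toy model has nothing to say about this word
        words.append(nxt)
    return " ".join(words)


# A deliberately tiny, made-up corpus (an assumption for illustration only).
corpus = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigrams(corpus)

print(predict_next(model, "the"))
print(continue_seed(model, "slept", max_words=2))
```

A model this small can only parrot word pairs it has already seen. GPT-2 plays the same guessing game, but with vastly more training text and parameters behind each guess.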
People freaked out, anticipating that this model would lead to the rise of superpowered “Fake News.” Fearing that very danger, OpenAI even declined to release the full version of the thing.
But then a brave soul named Adam King (@AdamDanielKing) set up a “medium-sized model” of GPT-2, dubbed 345M (because it uses 345 million parameters instead of 1.5 billion). “While GPT-2 was only trained to predict the next word in a text, it surprisingly learned basic competence in some tasks like translating between languages and answering questions,” he wrote. “That’s without ever being told that it would be evaluated on those tasks.”
I was further intrigued: Could someone use a model like this to generate prose for, say, a startup that creates reports? Could it even write books? For years, I’ve toyed with the idea of a company that uses A.I. and machine learning (M.L.) to churn out an endless number of romance and pulp novels — had my ship finally arrived?
I tried out the model (and you can, too!), using some of the most iconic first lines in literature as a seed. Here’s the walkthrough:
Let’s start off with a little of Jane Austen’s “Pride and Prejudice.” How does the algorithm handle itself around the manners and marriage of Britain’s Regency era? (The first sentence is Austen’s original writing; everything after is the province of A.I.)
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. Not having a sufficient number of wives would mean that one couldn’t afford to provide for a family or a property, without further expenditure. As with all human beings, a woman’s health requires that she do her own laundry. This means she has to make her own food, wash clothes, bathe and dress herself. In order to do so, they need a large supply of domestic duties. As mentioned earlier, many people don’t seem to have enough. There are also problems for unmarried parents who spend all day in the home, even taking care of the children themselves. This does add up when you consider the fact that it takes up a massive amount of money for an unmarried adult to sustain itself and to pay for his own expenses.
So what can we do? Well, first, we need to stop encouraging women to think of marriage as a stepping stone toward a nice husband. No, a partner is not a stepping stone…
Well, that was interesting: the algorithm is “smart” enough to keep chattering about domesticity, although it quickly drifts from the original sentence’s thesis into a vague mishmash about money and domestic responsibility. Austen, with her tight prose, would probably be appalled.
Now let’s jump ahead to the 20th century and see how the algorithm deals with another masterwork (and one of literary history’s most famous opening sentences): Salinger’s “The Catcher in the Rye.” (Again, the first sentence is the original; everything after is machine-generated.)
If you really want to hear about it, the first thing you’ll probably want to know is where I was born, and what my lousy childhood was like, and how my parents were occupied and all before they had me, and all that David Copperfield kind of crap, but I don’t feel like going into it, if you want to know the truth. I would hate for you to listen to it, anyway.
“What do you mean?” I blurted out.
“We don’t want your blood.” said the man that had been waiting in the back seat beside me. He said with a laugh that when my parents and I had moved to Los Angeles from our native village of Puyo in Northern Mexico back in 1974 he and his brother and I had given up our family business and decided to move to San Diego, California to try to get a piece of the action and then we had a real problem with the economy, and I had become so bad at this that my brother, the one with the gun, was willing to put me down for life. So here I was.
Machine learning does a startlingly good job of replicating Salinger’s prose style, although its “plot” predictably veers off in an odd direction. Speaking of odd directions, I wanted to see what would happen when this system was introduced to something that wasn’t really English: James Joyce’s “Finnegans Wake,” an experimental work written mostly in an idiosyncratic language. Puns and portmanteaus (i.e., blendings of words into new ones) are the order of the day here, and there’s a stream-of-consciousness aspect that the machine might really appreciate. Here we go:
riverrun, past Eve and Adam’s, from swerve of shore to bend of bay, brings us by a commodius vicus of recirculation back to Howth Castle and Environs. The whole of it, when taken with our view of the castle, was pretty much an enclosure; the castle wall, as far as it rose from the river, was formed of four tiers, and in several places of one (with a slight slope) the same. But it must not be thought that Adam’s Castle, like the other castles, is always the same. Here in the first instance Adam’s is the main building of what may be termed a tower-monument, a high wooden tower about twelve feet high, with an outer and upper portion of wood ornamented with figures of serpents. The interior is made out of two or three tiers of stilt-bedritto, each of twelve feet square and about twelve inches wide, and on this plan they can be seen rising into what may be called a long row, or row of rows, within the castle wall, rising about…
Confronted with Joyce at his most, er, “exploratory,” the system makes a hard beeline back to English and conventional prose. Safe move!
We’re a long way from the Fake News apocalypse that OpenAI darkly hinted at, although (to be fair) we’re also relying on a far weaker model here, and not a lot of seed text; for a future experiment, I’ll see what kind of output results from a bigger text input. At least for the moment, I’m fairly confident in saying that, if you want to set up a company based around text automation, you may have a few years to wait before these models become seamlessly sophisticated.
And for human writers and editors (like me!), that’s actually a good thing. We don’t have to worry about our jobs being automated just yet.