Blog: The Code of Conversation
By: Brianna Tobin
Machine learning and artificial intelligence are allowing an unprecedented understanding of human language.
Machines are often viewed as the antithesis of everything sentimental and human. The scope of a computer’s work has appeared to be limited solely to problems of lifeless computation or uninformed, mechanical number crunching. But while computers historically have not had much say in matters like therapy or communication, today’s rapid development of machine learning and artificial intelligence (AI) means that all of that is about to change.
Contributing to this revolution is Prathusha Sarma, a PhD student at the University of Wisconsin-Madison studying natural language processing and machine learning. She has developed a way to produce “domain adapted word embeddings” (as the technique is named in her paper) from text-based conversation. In essence, she has created an algorithm that can read sources like tweets or text conversations and then make judgments about a person’s attitudes or sentiments after analyzing the connotations of what that person has said. But how is a computer able to grasp the subtleties of human language?
It turns out that every word can be mapped to its own unique vector in space, and that each of these vectors carries real information about the meaning of the word it represents. Professor William Sethares from the Electrical and Computer Engineering department, who has overseen Sarma’s work, explains that when the vectors for two words end up pointing in the same direction, it can be assumed that the words share similar meanings. For instance, synonyms like “look” and “see” would likely have similar spatial orientations, whereas the vector for “man” might point perpendicularly to the one for “woman.” Traditionally, this process of word embedding has been achieved with resources like word2vec or GloVe (computer models that “learn” how to project words into space after being fed large amounts of data).
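The idea that direction encodes meaning can be made concrete with cosine similarity, which measures the angle between two vectors. Here is a minimal sketch using made-up, hypothetical 4-dimensional vectors; real embeddings from word2vec or GloVe typically have hundreds of dimensions learned from data:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors:
    near 1.0 means same direction (similar meaning),
    near 0.0 means perpendicular (unrelated meaning)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy embeddings (not from a real trained model).
look = np.array([0.9, 0.1, 0.3, 0.2])
see  = np.array([0.8, 0.2, 0.4, 0.1])
man  = np.array([0.1, 0.9, -0.2, 0.0])

print(cosine_similarity(look, see))  # high: near-synonyms
print(cosine_similarity(look, man))  # much lower: unrelated words
```

A trained model would produce these vectors automatically from raw text; the comparison step, however, is exactly this simple.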
The feats of these algorithms alone are amazing, as they allow us to draw a substantial number of quantitative conclusions about language. However, the concern with these models is that they train on words obtained mostly from extremely general sources of text like Wikipedia. This doesn’t bode well for words that take on a specialized meaning within different contexts. As Sarma explained, the word “party” could be viewed as a neutral term in everyday speech, but amongst individuals who suffer from substance abuse and addiction, it can have highly negative implications.
“We should not be fearing these big data learning techniques. We should embrace them and try to get the best out of them.”
In fact, this is the exact problem that Sarma and her group originally set out to solve. Working in conjunction with the Center for Health Enhancement Systems Studies (CHESS), the goal was to figure out how disparities between the general use of certain words and their use within the Substance Use Disorder (SUD) community could be accurately represented. Specifically, the group wanted to know what types of conversation signal that a person is at greater or lower risk of relapse. Sarma’s novel variation on the vectorization technique described above, which accounts for these kinds of distinctions, is referred to as “domain adapted” embedding. In her version of word embedding, each word gets two vectors, one trained on general text and one trained on text from the specialized domain, and the two are projected into a shared space that captures the correlation between them. You could think of this new projection as a weighted average of the two sets, the weights of which are learned by a computer so that it can eventually classify not only the data it is currently being trained on but also any new data it might be thrown in the future.
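The “weighted average” intuition can be sketched in a few lines. Note that this illustrates only the intuition: Sarma’s paper learns the combination from data, and the function name, vectors, and fixed weight below are all hypothetical:

```python
import numpy as np

def domain_adapt(generic_vec, domain_vec, alpha=0.5):
    """Blend a word's generic embedding (e.g. trained on Wikipedia)
    with its domain-specific embedding (e.g. trained on SUD forum
    posts). Here alpha is a hand-picked mixing weight; in practice
    such weights would be learned rather than fixed."""
    return alpha * generic_vec + (1 - alpha) * domain_vec

# Hypothetical embeddings for the word "party".
party_generic = np.array([0.5, 0.5, 0.0])   # neutral everyday sense
party_domain  = np.array([-0.6, 0.2, 0.7])  # negative sense in the SUD community

# Weighting the domain more heavily pulls the adapted vector
# toward the specialized, negative sense of the word.
adapted = domain_adapt(party_generic, party_domain, alpha=0.3)
print(adapted)
```

The key design point is that the adapted vector inherits information from both corpora, so a word like “party” retains its general meaning while still reflecting its specialized connotation.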
By discovering a method that can interpret words so precisely, and do so with relatively small sets of data (another challenge that arises when working with specialized domains), Sarma has opened the door to a new, improved way of processing language. While the initial purpose of her project was to help distinguish healthy individuals from those facing a potential relapse, it has since found purposes in other realms as well. According to Sarma, it was thanks to Professor Dhavan Shah from the University of Wisconsin-Madison’s School of Journalism that an application of her algorithm to measuring political polarization is currently under review. After embedding a stream of about 1,000 tweets each from Democrats and Republicans, the group was able to observe how the same word could produce vastly different vectors. The words measuring as most dissimilar were ones a person would expect to be polarizing, such as “abortion” or “immigration.” This suggests there is real validity to these comparisons; they are not just baseless quantifications. And this is just the beginning for Sarma’s work. Another group of students is even using it to measure meme virality, investigating how variation in word usage can indicate how something becomes a meme.
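A rough sketch of that polarization measurement might look like the following, with tiny made-up per-party vectors standing in for embeddings that would really be trained on each party’s tweets:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical per-party embeddings for the same words; in practice
# these would come from training separate models on each corpus.
dem = {"abortion": np.array([0.8, -0.3, 0.1]),
       "weather":  np.array([0.2, 0.9, 0.4])}
rep = {"abortion": np.array([-0.5, 0.7, 0.2]),
       "weather":  np.array([0.25, 0.85, 0.45])}

# Rank shared words by dissimilarity: a higher score means the two
# corpora use the word in more divergent ways.
scores = {w: 1 - cosine(dem[w], rep[w]) for w in dem}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, round(score, 3))
```

On real data, a word like “weather” should score near zero while genuinely polarizing words rise to the top of the ranking, which is the pattern the group observed.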
So, what does having all of this knowledge mean? Well, that’s up to us.
Right now, the main accomplishment is the ability to study trends like these in the first place. The types of questions currently being asked are along the lines of “what is characteristic of an at-risk individual?” or “how has political divisiveness evolved over time?” That being said, the information available now is only part of the story. Sethares mentions that the next immediate step might be learning to extract chains of influence as well. Then we can start to approach questions like: What leads a person to become more at risk? What really causes someone to lean more liberal or conservative? Eventually, when we move beyond this stage, this information will hopefully be able to inform real action and intervention. For example, perhaps we analyze headlines and promote moderate titles over extremely partisan ones. Or maybe we alert sponsors to check in with their sponsees whenever those sponsees engage in troubling conversation. The possibilities are wide open.
By learning how to quantify and interpret something as elusive as language, we wield a power over it like never before. Problems like addiction and partisanship, while certainly not fully resolved, seem a little less hopeless this way, as technology brings to light connections we might otherwise never have known existed. After all, when we have the numbers, we have a way to talk about an issue, something to work from. As Sarma insists, “We should not be fearing these big data learning techniques. We should embrace them and try to get the best out of them.” No matter where this field takes us, whatever direction it points us to, there’s no denying that machine learning has forever transformed the role of computers and the way we think about this world.
Read the complete paper “Domain Adapted Word Embeddings for Improved Sentiment Classification” by Prathusha K Sarma, Yingyu Liang, and William A Sethares.