Fit-for-purpose machine learning: one AI’s food is another’s poison
By Alan Reid
7th May 2021
Machine learning applications should emulate human decision-making processes but data must be as ‘fit for purpose’ as possible.
Efficiency versus humanness versus ethical accountability is the biggest balancing act in AI. The core promise of AI integration, very often fulfilled, is the performance of erstwhile human-centric tasks at hitherto impossible scale and speed; yet it is often essential that these tasks be performed in a ‘human way’. But at what cost? Let’s take a step back and ask: why is human-like decision-making, or ‘humanness’, important?
First, AI decisions need to be comprehensible and coherent in order for human users and legislators to use and regulate the technology. It is this thinking that underpins the requirements of the General Data Protection Regulation (GDPR) in relation to ‘explainability’ and algorithmic decision-making. Second, much AI is explicitly tasked with ‘doing the thing a human does, like a human would, only faster’. Somewhat inconveniently, machine learning algorithms have a tendency to arrive at decisions based on ‘noise’ (or ‘meaningless’ data) that is specific to one data set, not generalisable to others and not truly indicative of the problem at hand. A facial recognition algorithm might, for example, use the shapes of people’s shoulders or background scenery to guide its decision-making. Competent, effective design of machine learning tools should always strive to ensure algorithms are built in line with human ways of thinking, thus avoiding such noise. On the other hand, AI tools can often end up being too human, exemplifying some of humankind’s worst biases and prejudices.
Take the example of OpenAI’s groundbreaking language tool GPT-3. While GPT-3 has many innovative applications, it is best known as an automated text generator that creates news articles, advertising copy and poetry in response to user prompts, all with the stylistic flourishes of a capable human writer. It can, with little nudging, write an entire opinion piece on immigration in the style of The New York Times, for example. The results are impressive precisely because they convince us of their humanness. But if we are to consider such AI algorithms human proxies, it is essential that designers and engineers think beyond appearances, and think long and hard about the types of human values that are conferred, intentionally or unintentionally. In GPT-3’s case, for all its brilliance, its language generation is at times that of a nasty, bigoted interlocutor, creating toxic language from benign prompts. This failing is rooted in a so-called ‘internet-scale training’ method that results in ‘internet-scale biases’. ‘Internet scale’ means that GPT-3 is trained on, more or less, the entire textual content of the internet. For applications with potentially highly sensitive outputs, this ‘everything but the kitchen sink’ approach to data collection is simply not fit-for-purpose. ‘Bias in, bias out’ is the rule of thumb. Far-right comment sections, illicit forums, subreddits: everything is fair game for GPT-3 as it learns, then reproduces in writing, the most misanthropic of human sentiments. There is a reason that MIT Technology Review recently referred to GPT-3 as ‘the best and worst of AI right now’.
However, the same types of data and training techniques that can corrupt a language generator, can in turn produce a hyper-efficient toxicity detector. Toxicity detection is an established task in computing concerned with the flagging of textual content deemed hate speech (identity-based hate) or some other form of harmful content (e.g. threats of violence). Toxicity detectors put both internet-scale data and toxic data to proper use. In the same way a sniffer dog needs to learn the smell of illicit substances to detect drug smuggling, a toxicity detector needs toxic signals to flag potential harms.
Methodology, expertise, data
Here, internet-scale data usually comes in the form of so-called ‘word embeddings’, numerical encodings of words, derived from statistics about word usage on the web. This approach is based on the dictum of noted linguist John Firth: ‘you shall know a word by the company it keeps’. For example, the word ‘Norwegian’ may tend to co-occur frequently with words like ‘oil’, ‘fjord’ and ‘taxes’ across the web, while ‘Irish’ might occur with ‘green’, ‘literature’ and ‘dancing’. Word embeddings capture this information and computers use it to make decisions about language. In short, embeddings are a kind of word definition that computers understand.
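To make Firth’s idea concrete, the sketch below builds toy count-based word vectors from a four-sentence corpus and compares them with cosine similarity. The corpus and words are invented for illustration; real embeddings such as word2vec or GloVe are learned from web-scale text, but the underlying intuition, similar contexts yield similar vectors, is the same.

```python
from collections import defaultdict
from math import sqrt

# Invented toy corpus: each sentence is a 'context' for the words in it.
corpus = [
    "norwegian oil fjord taxes scenery",
    "norwegian fjord oil exports taxes",
    "irish literature green dancing music",
    "irish dancing green literature festivals",
]

# Count how often each pair of words shares a sentence.
cooc = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for w in words:
        for c in words:
            if w != c:
                cooc[w][c] += 1

vocab = sorted({w for s in corpus for w in s.split()})

def vector(word):
    """A count-based 'embedding': the word's co-occurrence row."""
    return [cooc[word][c] for c in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# 'norwegian' keeps company with 'oil'; 'dancing' keeps other company,
# so the first similarity is high and the second is zero.
print(cosine(vector("norwegian"), vector("oil")))
print(cosine(vector("norwegian"), vector("dancing")))
```

In practice the counts are replaced by dense, learned vectors, but the comparison step, measuring how much ‘company’ two words keep, works the same way.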
Toxicity detectors then need another layer of information: genuine toxic data. Because the machine learning model has to be able to cope with different gradations of toxicity, it must also be trained on such nuances, from subtly insidious text to out-and-out hatred. This means data must be carefully sourced and filtered. Crucially, the data must also be labelled. The importance of this task cannot be overstated: labelling is the most explicit signal to the AI as to what is and isn’t toxic. Sourcing, filtering and labelling are usually the task of expert annotators. The annotation team needs to be sufficiently diverse, in background and perspective, to alleviate bias to the greatest extent possible. Tools like CaliberAI’s harmful content detection suite leverage carefully labelled toxicity and bias data to create a complex toxicity benchmark that can localise problematic text, present it to the user, and help mitigate the risk of legal liability or reputational damage for publishers of all kinds. CaliberAI takes the task of data labelling (again, the most explicit signal we can give a machine learning algorithm) extremely seriously, leveraging the expertise of a diverse panel of annotators, academics and scientists from the fields of ethics, linguistics, journalism, law and computing to create the world’s most sophisticated model of toxicity. This is what fit-for-purpose data looks like, as we strive for efficient yet human and ethical machine learning.
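A minimal sketch of what graded, multi-annotator labels can look like in practice. The texts, scores and schema below are invented for illustration, not CaliberAI’s actual format; the point is that each example is rated independently by a panel and the ratings are aggregated into a ‘soft’ label that preserves gradations of toxicity.

```python
# Hypothetical labelled examples with graded toxicity scores in [0, 1].
# Each text is rated independently by several annotators; a diverse panel
# reduces the chance that one perspective's blind spots become the label.
labelled_data = [
    {"text": "Have a lovely day.",           "annotator_scores": [0.0, 0.0, 0.0]},
    {"text": "People like you never learn.", "annotator_scores": [0.3, 0.5, 0.4]},
    {"text": "You are all worthless.",       "annotator_scores": [0.9, 1.0, 0.8]},
]

def soft_label(scores):
    """Aggregate a panel's ratings into one training signal."""
    return sum(scores) / len(scores)

for example in labelled_data:
    example["toxicity"] = soft_label(example["annotator_scores"])

# Gradations survive aggregation: benign, subtly insidious, overt hatred.
print([round(e["toxicity"], 2) for e in labelled_data])
```

A model trained on such soft labels can then express ‘how toxic’ rather than a bare yes/no, which is what coping with nuance requires.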
While toxicity detection research is a vibrant, active and diligent space, it is not without its technical and methodological challenges. Although many applications, including CaliberAI’s, currently boast accuracy rates above 90%, researchers at the Universities of Oxford, Utrecht and Sheffield and at The Alan Turing Institute recently published an important paper on the need for new qualitative metrics. They highlighted the tendency of toxicity programs to ‘overflag’ content that deals with demographic categories, such as gender, sexuality or religion, especially in complex edge cases such as ‘reclaimed slurs’ and ‘denouncements’ of hate that refer to hate directly. In short, they can be oversensitive.
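To see why headline accuracy can mask this failure mode, consider a deliberately naive keyword detector evaluated on benign text that merely mentions demographic categories. The terms and sentences below are invented for illustration; real detectors are far more sophisticated, but the overflagging pattern the researchers describe is the same in kind.

```python
# Illustrative only: a naive detector that flags any sentence containing
# an identity term will 'overflag' benign mentions of those identities.
IDENTITY_TERMS = {"gay", "muslim", "women"}

def naive_detector(text: str) -> bool:
    # Flags on mere mention of an identity term, with no sense of context,
    # reclamation, or denouncement.
    return any(term in text.lower().split() for term in IDENTITY_TERMS)

# (text, is_actually_toxic) -- all four examples here are benign.
test_set = [
    ("gay rights are human rights", False),
    ("muslim communities celebrated eid", False),
    ("women lead this research team", False),
    ("the weather is lovely today", False),
]

flags = [naive_detector(text) for text, _ in test_set]
fp = sum(1 for (_, toxic), flagged in zip(test_set, flags) if flagged and not toxic)
print(f"false positives: {fp} / {len(test_set)}")
```

Three of four benign sentences are flagged, which is exactly why qualitative, category-aware metrics are needed alongside raw accuracy.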
But as Evelyn Douek, an affiliate at the Berkman Klein Center for Internet & Society, recently opined in an interview with WIRED magazine, in a world of information chaos, erring on “the side of more false positives” is often necessary, as the “social cost of not using the machines at all” can be too high.