AI can code, summarize, and even counsel, but it’s still bad at sounding like a snarky human in a heated reply thread. A multi-university study finds that large language models fail to reproduce the emotional punch and off-the-cuff snap of actual human users across Bluesky, Reddit, and X, with evaluators correctly identifying AI-generated responses 70–80% of the time.
The telltale difference isn’t grammar or familiarity with the subject matter. It’s tone, or, more precisely, the methodically measured absence of it. AI replies are reliably less toxic, less biting, and more homogenized in style, making them easy to recognize when the conversation gets heated.
Why Bots Don’t Get the Tone Right in Online Arguments
According to researchers at the University of Zurich, the University of Amsterdam, Duke University, and New York University, the models mimic the form of online conversation but lack its spirit. Humans improvise; we change registers, escalate or de-escalate on the fly, and deploy sarcasm with cultural nuance. Models optimized to be safe and broadly polite default to “smoothed out” language that lacks the spontaneous, affect-laden edge typical of social media spats.
That gap shows up most clearly in toxicity scores, the measures used to quantify hostile or insulting language. Chart those scores across a heated thread and human replies spike while AI replies stay flat. The result is a signature pattern: sentence length and construction that look right, with the bite conspicuously missing.
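To make the measurement concrete, here is a minimal sketch of the kind of comparison involved, using the open-source Detoxify classifier as a stand-in scorer. The example replies are invented, and the study’s own scoring pipeline may well differ.

```python
# Compare toxicity scores for a handful of human vs. AI replies.
# Detoxify is an open-source toxicity classifier used here only as an example scorer.
from detoxify import Detoxify

human_replies = [
    "Oh sure, because YOUR plan has worked out so well for everyone.",
    "This is the dumbest take I've read all week.",
]
ai_replies = [
    "I understand the frustration, but I think there are reasonable points on both sides.",
    "That's an interesting perspective; here is some additional context to consider.",
]

scorer = Detoxify("original")  # small pretrained model; downloads weights on first use

human_tox = scorer.predict(human_replies)["toxicity"]
ai_tox = scorer.predict(ai_replies)["toxicity"]

print("mean human toxicity:", sum(human_tox) / len(human_tox))
print("mean AI toxicity:   ", sum(ai_tox) / len(ai_tox))
# In the pattern the study describes, the human mean sits well above the AI mean.
```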
Inside the Cross-Platform Test of Nine AI Models
The team tested nine open-weight models from six families: Apertus, DeepSeek, Gemma, Llama, Mistral, and Qwen, plus a larger variant of one of the Llama models. The models generated replies tailored to each platform, and reviewers judged which posts appeared human. The AI-written texts proved “easily distinguishable” across all three sites, with correct identifications clustering around 70–80%, well above chance.
Importantly, the research found that models imitate surface features of style, such as word counts and sentence lengths, far more accurately than they capture less explicit social cues. When a conversation required emotional expressiveness, especially of a negative or sharply humorous kind, the machines stumbled.
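For a sense of what those surface features are, here is a rough sketch that computes a few of them alongside a crude proxy for emotional heat. The example texts and the feature set are invented for illustration, not drawn from the study.

```python
# Surface features (word count, sentence length) vs. a crude affect proxy.
import re
from statistics import mean

def surface_features(text: str) -> dict:
    """Word count, mean sentence length, and rough stand-ins for 'heat'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    return {
        "word_count": len(words),
        "mean_sentence_len": mean(len(s.split()) for s in sentences) if sentences else 0,
        "exclamations": text.count("!"),
        "shouted_words": sum(w.isupper() and len(w) > 2 for w in words),
    }

human = "Unbelievable. You REALLY think that's how any of this works?!"
ai = "I see your point, although I think the situation is more nuanced than that."

print(surface_features(human))
print(surface_features(ai))
# Word counts and sentence lengths can match closely while the affect proxies diverge.
```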
Toxicity and Alignment Are the Giveaway Signals
Across platforms, AI responses were substantially less toxic than human posts, and the toxicity score was a primary discriminator. Tools already common in content moderation, such as toxicity classifiers, could help flag machine-written replies, not because those replies are more abusive but because they tend to be tamer in the very spaces where humans turn up the volume.
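A toy sketch of what flagging on that basis might look like: replies that score far calmer than the rest of a heated thread get surfaced for review. The threshold, margin, and data shape are invented, and the toxicity scores are assumed to come from an external classifier.

```python
# Flag replies that are suspiciously tame relative to the thread around them.
# Scores are assumed to come from a toxicity classifier and range from 0.0 to 1.0.
from statistics import mean

def flag_suspiciously_tame(replies: list[dict], margin: float = 0.3) -> list[dict]:
    """Return replies whose toxicity sits far below the thread average."""
    thread_avg = mean(r["toxicity"] for r in replies)
    return [
        r for r in replies
        if thread_avg - r["toxicity"] > margin  # much calmer than the room
    ]

thread = [
    {"author": "user_a", "toxicity": 0.72},
    {"author": "user_b", "toxicity": 0.65},
    {"author": "maybe_bot", "toxicity": 0.08},
]
for reply in flag_suspiciously_tame(thread):
    print("worth a second look:", reply["author"])  # flags maybe_bot
```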
Yet another twist: several untuned base models, such as Llama-3.1-8B, Mistral-7B, and Apertus-8B, outperformed their instruction-tuned counterparts at imitating humans. The authors hypothesize that alignment training imposes stylistic regularities that don’t occur in natural human writing, making the output less human-like. It’s the paradox of safety work: the more models learn to be polite and predictable, the more they stand out in unruly, all-too-human crowds.
Where AI Stumbles on Different Platforms
Context matters. The study found that the models had trouble injecting positive emotion on X and Bluesky, an irony on feeds thick with ironic praise and backhanded compliments. Reddit proved even tougher: the site’s massive subcommunities impose their own norms, inside jokes, and rhetorical tics that sail right past generic models.
In aggregate, models performed best on X, worst on Bluesky and Reddit. That ranking follows platform culture: short, punchy responses on X are easier to ape structurally than Reddit’s longer, context-heavy exchanges that require deeper social calibration.
The Moving Target of Personality in Chatbots
Complaints from users who feel chatbots lurch from overly deferential to gruffly terse illustrate the delicate game of style tuning at work. Small changes to policy or safety guardrails ripple into how models argue, or decline to. As providers tweak tone for safety, brand voice, or regulatory compliance, the models’ ability to “perform” human-style antagonism shifts with it.
That volatility also has implications for adversaries. If astroturfers depend on heavily aligned models, the “look” of their messaging risks being bland and easily detectable. Purpose-built, more lightly aligned generators might close the gap, but they run a greater risk of tripping content moderation and getting filtered out by platforms.
What This Means for Moderation and Misinformation
The results should be a source of encouragement for platform trust and safety teams. Unusually low toxicity and other alignment-induced “tells” in back-and-forth exchanges form robust bot-detection signals alongside network and behavioral markers. Outside projects like Indiana University’s Botometer have demonstrated how persistent stylistic quirks can combine with engagement patterns to reveal automated accounts.
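Here is a simplified sketch, in the spirit of Botometer-style scoring, of how stylistic and behavioral signals might be combined into a single account score. The features, weights, and example numbers are invented for illustration and do not reflect Botometer’s actual model, which is learned from labeled data.

```python
# Combine stylistic and behavioral cues into a rough bot-likelihood score.
from dataclasses import dataclass

@dataclass
class AccountSignals:
    mean_toxicity: float          # 0-1, from a toxicity classifier
    style_variance: float         # 0-1, how much the account's register shifts
    posts_per_hour: float         # raw posting rate
    reply_latency_seconds: float  # median time to reply in threads

def bot_likelihood(s: AccountSignals) -> float:
    """Weighted sum of hand-picked cues, capped at 1.0."""
    score = 0.0
    score += 0.3 * (1.0 - s.mean_toxicity)            # unusually tame
    score += 0.3 * (1.0 - s.style_variance)           # unusually uniform style
    score += 0.2 * min(s.posts_per_hour / 20.0, 1.0)  # unusually prolific
    score += 0.2 * (1.0 if s.reply_latency_seconds < 5 else 0.0)  # near-instant replies
    return min(score, 1.0)

print(bot_likelihood(AccountSignals(0.05, 0.10, 30, 2)))   # ~0.96, suspicious
print(bot_likelihood(AccountSignals(0.45, 0.60, 2, 240)))  # ~0.31, likely human
```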
But the race isn’t over. As open-weight models advance and fine-tuning gets more targeted, some will be trained to mimic human volatility, including its darker strains. Even so, humor, context, and cultural timing remain stubbornly human. If a response makes you wince and laugh at once, it’s unlikely to have come from a machine, at least not yet.