Harsh Singhal: The AI Engineer Who Taught Machines To Understand Hate Speech In 20 Languages

Harsh Singhal built KooBERT, a groundbreaking multilingual transformer that detects hate speech and toxicity across 20+ languages, transforming content moderation, safety, and personalization on India’s Koo platform and beyond.

Harsh Singhal

In June 2022, Global Witness and Foxglove submitted 20 test advertisements to Facebook ahead of Kenya's national elections. The ads contained explicit hate speech drawn from real-life examples, calling for ethnic violence, rape, and beheadings. Facebook approved them. All 20, in both Swahili and English, passed through the platform's automated moderation systems without being flagged.

It was the third time Global Witness and Foxglove had run similar tests on Facebook, following earlier investigations in Myanmar and Ethiopia that produced comparable results. The pattern across all three was consistent: a platform with billions of users and significant moderation resources continued to fail in non-English linguistic environments where the cultural context, dialectal variation, and script complexity of user content fell outside what its automated systems had been built to handle.

A 2026 Tech Policy Press analysis noted that despite years of industry attention to the problem, the multilingual AI gap had largely been rebranded rather than resolved, with expanded language coverage masking the fact that most AI systems still lacked genuine governance capability across the world's linguistic range. Harsh Singhal spent two years building a more serious answer to that problem, at scale, in India, on a platform where the linguistic complexity was among the highest any social network had ever tried to govern.

What Made India Different

Koo launched in 2020 as a multilingual social platform built to serve Indian users in their own languages, reaching approximately 60 million users by late 2022. That growth put immediate pressure on a content moderation infrastructure that, like most in the world, had been built for English. Research on Indian social media has consistently shown that code-mixing, the blending of multiple languages within a single post, is the dominant mode of online communication for hundreds of millions of users across the country. A Hindi speaker on Koo might write in Devanagari script, in romanized transliteration, in a blend of Hindi and English within the same sentence, or in any combination of the three. Indian languages also follow subject-object-verb structures, which differ from the subject-verb-object order that English-trained models rely on to parse meaning and detect hostile intent.

Standard content moderation classifiers, trained on English text corpora, were functionally blind to most of that. Deploying them on Koo's user base would have produced a system that missed genuine hate speech while flagging benign content, with no reliable way to distinguish between the two across ten languages simultaneously.
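To make the code-mixing problem concrete, here is a minimal Python sketch that labels each word of a post by the script it is written in. It is purely illustrative and is not drawn from KooBERT or from Koo's systems; the helper names token_script and script_profile and the sample sentence are assumptions made for this example. It relies only on the Unicode character names exposed by Python's standard unicodedata module.

```python
import unicodedata
from collections import Counter


def token_script(token: str) -> str:
    """Label one token by the script of its letters (hypothetical helper)."""
    scripts = set()
    for ch in token:
        # Consider only letters (L*) and combining marks (M*), such as
        # Devanagari vowel signs; skip digits and punctuation.
        if unicodedata.category(ch)[0] not in "LM":
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("DEVANAGARI"):
            scripts.add("devanagari")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "none"
    return scripts.pop() if len(scripts) == 1 else "mixed"


def script_profile(text: str) -> Counter:
    """Count whitespace-separated tokens per script label."""
    return Counter(token_script(tok) for tok in text.split())


# Assumed sample post: Hindi in Devanagari, an English word,
# and a romanized Hindi word ("thi") in the same sentence.
post = "यह movie बहुत अच्छी thi"
profile = script_profile(post)
print(profile)
```

Even this toy check shows the difficulty: the script tells you that "movie" and "thi" are both written in Latin characters, but not that one is English and the other is romanized Hindi. Distinguishing those cases requires a model trained on such text, which is the gap the article describes.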

"The systems that existed had never been built for how people communicate online in India," Singhal said. "You could not fine-tune your way out of it. The starting point was wrong."

Building KooBERT

Singhal joined Koo as Senior Director and Head of Machine Learning in 2021, taking over a team of three engineers, and scaled that team to twenty over his tenure, creating a multidisciplinary group spanning data science, machine learning engineering, and MLOps. His most consequential technical contribution was leading the development of KooBERT, an open-source multilingual transformer model built specifically for Indian-language content. General multilingual models existed at the time, but they had been trained on formal text and performed poorly on the code-mixed, transliterated, script-variable content that characterized Koo's users. KooBERT was engineered to handle those patterns directly, covering more than 20 languages and serving as the foundation for both moderation and content recommendation across the platform.

Mayank Bidawatka, Co-founder of Koo, described the significance of Singhal's contribution. "His technical vision and leadership not only advanced the state of multilingual AI and content safety but also left a lasting legacy in India's digital transformation," Bidawatka said, "demonstrating how responsible AI can empower local communities while setting new standards for scalable, ethical technology in social networking."

Alongside KooBERT, Singhal led the early adoption of Meta's LLaMA models, fine-tuned for multilingual toxicity detection, making Koo one of the first social platforms globally to deploy fine-tuned large language models for real-time safety applications. Deploying LLMs for real-time moderation at social media latency, across ten languages simultaneously, required infrastructure that did not exist off the shelf, and building it meant accepting operational overhead that most teams were unwilling to take on at that stage. "Fine-tuned LLMs for real-time content moderation was well ahead of where the industry consensus was at that point," Singhal said. "The inference latency requirements were tight, the operational overhead was significant, and a lot of smart people thought the complexity outweighed the benefit. We looked at what the alternatives could actually do in our language environment and concluded we needed something better."

Beyond Moderation

The AI systems Singhal's team built at Koo did more than remove harmful content. The same multilingual language understanding that powered moderation also powered discovery, and under his leadership the team built Semantic Search, Multilingual Topics, Feed Ranking, Content Recommendations, People You May Know, and Trending Tags across all supported languages, personalization capabilities that most platforms had never attempted at this scale in Indian languages. Press coverage at the time of the Topics launch across 10 Indian languages cited Singhal directly, and a subsequent Business World report covered the feature as evidence that AI-powered multilingual personalization could be built to work in production for vernacular audiences at real scale.

The platform's multilingual safety capabilities also attracted recognition beyond India. Content moderation work extended to Portuguese-language content in Brazil, where the platform had a growing user base, adding another layer of cross-linguistic complexity to systems already operating across ten Indian languages.

A Technical Legacy That Outlasts the Platform

Koo shut down in July 2024, following the resolution of the regulatory disputes that had originally accelerated its growth. KooBERT remains open-source. The methodologies Singhal's team developed for multilingual content understanding, combining transformer architectures with code-mixing awareness, cross-script normalization, and fine-tuned LLMs for real-time safety, advanced the technical state of the art in a domain where most of the industry had accepted English-centric tools as the default. In a country with over 750 million internet users communicating across dozens of languages, building AI systems capable of understanding what people are actually saying was among the most consequential engineering problems in the Indian technology sector, and Singhal's work at Koo stands as one of the most thorough attempts to solve it properly.

The above information does not belong to Outlook India, and Outlook India was not involved in the creation of this article.
