Powerful AI, Dying Data
On an apparently ordinary morning at Gadjah Mada University (UGM) in Yogyakarta, a small piece of history unfolded. It was not due to student demonstrations shaking the campus gates, nor a seminar filled with jargon and cold coffee.
That morning, a husband-and-wife duo, Prof Edi Winarko and Prof Tutik Dwi Wahyuningsih, stood side by side and together reached the academic pinnacle: becoming professors. In a country often more preoccupied with chasing titles than pursuing quality, the moment felt like a pleasant anomaly, and a subtle prod.
However, more important than the ceremony itself, the togas, or the heartfelt thank-you speeches was one idea that, if followed to its conclusion, should make us slightly uneasy: the power of artificial intelligence is determined not primarily by the cleverness of the machine, but by the quality of the data we feed it.
In his scientific oration titled “High-Quality Data, Empowered AI: The Importance of a Data-Centric Approach in Applying Artificial Intelligence in the Real World”, Prof Edi Winarko explained a significant shift in the AI world, while giving us a nudge about our position in this realm of imitated intellect.
Until now, AI development has been driven largely by a model-centric approach: improving algorithms, increasing architectural complexity, and expanding parameter counts. From Convolutional Neural Networks (CNNs) to Transformers, each generation competes to be smarter than the last.
However, according to him, the main issue in practice is often not the model, but the data. The same model can produce vastly different performance if trained on different data.
Therefore, a data-centric approach becomes crucial. Here, data must be cleaned, standardised, properly labelled, and continuously improved in quality. AI is likened to a racing car. The engine may be sophisticated, but without quality fuel, it will only spin in place.
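The claim that the same model behaves very differently on different data can be made concrete with a toy sketch. The data and the deliberately simple 1-nearest-neighbour "model" below are purely illustrative, not from the oration: the identical model is trained once on clean labels and once on partly corrupted labels, then evaluated on the same test set.

```python
def nearest_neighbour_predict(train, x):
    """1-NN classifier: return the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(train, test_set):
    """Fraction of test points the classifier labels correctly."""
    correct = sum(nearest_neighbour_predict(train, x) == y for x, y in test_set)
    return correct / len(test_set)

# Toy 1-D dataset: values below 5 are class "A", values above are "B".
clean = [(1, "A"), (2, "A"), (3, "A"), (7, "B"), (8, "B"), (9, "B")]
# The same points, but two labels corrupted — a common real-world defect.
noisy = [(1, "B"), (2, "A"), (3, "A"), (7, "B"), (8, "A"), (9, "B")]
test_set = [(0, "A"), (4, "A"), (6, "B"), (10, "B")]

print(accuracy(clean, test_set))  # 1.0
print(accuracy(noisy, test_set))  # 0.75
```

The model code never changes; only the labels do. That is the data-centric argument in miniature: cleaning two bad labels buys more accuracy than any tweak to this classifier could.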
The problem is, we are like a nation proud of buying a racing car, but filling it with roadside petrol. Owning a Mercedes, but filling it with adulterated fuel from a street corner. Even more tragic, we are busy polishing the body while letting the engine cough and sputter.
The world today is indeed intoxicated by models. A model, in this context, is the result of learning from data: not the data itself, but the patterns absorbed from it. If the data is the books read, then the model is what remains in the head after reading them all.
In very simple terms, an AI model is like an “artificial brain” that learns from experience. Imagine a small child. He does not yet know what a cat is. Then he sees many cat pictures, hears people say “this is a cat”, and gradually he can recognise a cat without being taught any formula.
That learning process produces “understanding” in his head. That is the analogy with the model. Names like OpenAI, Google, Meta, Microsoft, and Alibaba are mentioned with full admiration because they produce powerful models through training and fine-tuning processes on data.
We talk about GPT, Gemini, LLaMA, DeepSeek, Qwen, Gemma, and various other large models as if they are the pinnacle of human civilisation. We discuss fine-tuning, inference, and latency as if reading a holy book of technology. But we forget one simple thing: everything lives from data.
Then, amid the greatness of those giants, we wonder why Indonesia seems to have no data. Look, for example, at world-class text-to-speech (TTS): few systems are truly fluent in Indonesian, because their models lack adequate Indonesian-language datasets.
At this point, our irony becomes somewhat amusing, yet sad. We argue over choosing the best model, when the data is not ready. We debate which GPT is the most advanced, but our own language corpus is a mess. We want AI voices to sound “very Indonesian”, but we have never seriously collected Indonesian voices.
It is like wanting to make world-class rendang, but borrowing the meat, importing the rice, and getting the spices from Google.
TTS models such as Google’s WaveNet, Microsoft’s VALL-E, and Meta’s Voicebox, along with the voice systems of OpenAI and ElevenLabs, handle Indonesian half-heartedly. Sometimes it is supported, but sounds stiff. Sometimes it is not supported at all.
If it does speak, it sounds like a foreign tourist who has only learned to say “ngopi” for three days. It can still be understood, but the Indonesian flavour feels “foreign”. This is not because they cannot, but because we do not give them data to learn from.
Let us look further. Indonesia’s contribution to AI research is still relatively small in the global landscape. In various international reports, the number of scientific publications from Indonesia in AI is below one percent of the world total. It is not just about quantity, but also quality and impact.
Then we ask: where is the problem?
Do we lack data? It does not seem so. We have millions of documents: news articles, literary works, classical texts, and religious sermons scattered across various platforms. We have hundreds of regional languages with extraordinary expressive wealth. We have lively and dynamic daily conversations.
The problem is that none of this becomes a dataset; it just becomes a pile. And what little data exists is often of poor quality. Quality is not a matter of quantity alone: data must be clean of errors, consistent in format, clear in context, and well curated.
Data from Wikipedia can be useful, but not enough. Data from classical texts can be very valuable, but needs annotation. Conversational data can enrich models, but must be sorted. All that requires long work: collecting, cleaning, labelling, evaluating, and iteratively improving.
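That iterative work of collecting, cleaning, labelling, and evaluating can be sketched as a minimal curation pass. Everything here is hypothetical, including the field names and toy records; it only illustrates the kind of drudgery the oration describes: stripping markup, collapsing duplicates, and sending incomplete records back for fixing.

```python
import re

def clean_text(text):
    """Normalise whitespace and strip leftover HTML markup."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop stray HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

def curate(raw_records):
    """One pass of cleaning, de-duplication, and label validation."""
    seen, curated, rejected = set(), [], []
    for record in raw_records:
        text = clean_text(record.get("text", ""))
        label = record.get("label")
        if not text or label is None:   # incomplete → send back for annotation
            rejected.append(record)
        elif text not in seen:          # drop exact duplicates after cleaning
            seen.add(text)
            curated.append({"text": text, "label": label})
    return curated, rejected

raw = [
    {"text": "  Saya suka <b>kopi</b> ", "label": "positif"},
    {"text": "Saya suka kopi", "label": "positif"},  # duplicate once cleaned
    {"text": "Film itu buruk", "label": None},       # missing label
]
curated, rejected = curate(raw)
print(len(curated), len(rejected))  # 1 1
```

In a real pipeline this loop runs repeatedly: the rejected records go back to annotators, the cleaning rules grow as new defects appear, and the curated set is re-evaluated after every pass.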
Large companies like OpenAI, Google, Meta, Microsoft, and Alibaba take this very seriously.