Saturday 3 August 2024

While the world is still trying to understand “Machine Learning” (the ML in AI), the industry has already started “Machine Unlearning” (MUL)! Why?

Dear Friends

During childhood, I could not pronounce certain words in my mother tongue. Correcting them was a herculean task for my parents and teachers. My Telugu teacher used to say: pronounce it right the first time, else it becomes an unerasable language tattoo! Since it was only language, we could still adjust and compromise, as it hurt no one and did no great damage.

What if AI makes this kind of mistake with false information, with a bias, with a hallucination? To date, AI has been busy learning, learning, and learning.

Villalobos et al.'s (2024) findings indicate that if current LLM development trends continue, LLMs could use nearly all available public human text data for training between 2026 and 2032, or slightly earlier if models are overtrained. Llama 3 was pre-trained on over 15 trillion tokens collected from publicly available sources. The total stock of English text could be around 40 to 90 trillion tokens; if all languages are combined, it might reach 100-200 trillion tokens. (A token corresponds to roughly 0.75 words, i.e., about 1.33 tokens per word.) Researchers estimate that ChatGPT needs to “drink” a 500 ml bottle of water for a simple conversation of roughly 20-50 questions and answers. Can you imagine the training and usage costs of each question we type?
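To make those magnitudes concrete, here is a back-of-the-envelope sketch in Python; every figure in it is one of the rough approximations quoted above, not a measured value:

```python
# Back-of-the-envelope token/word arithmetic, using the figures quoted above
# and the rough rule of thumb of ~0.75 words per token.
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens: float) -> float:
    """Approximate word count for a given token count."""
    return tokens * WORDS_PER_TOKEN

llama3_tokens = 15e12      # Llama 3 pre-training corpus: ~15 trillion tokens
all_text_tokens = 200e12   # upper estimate for all public text, all languages

print(f"Llama 3 corpus  ~ {tokens_to_words(llama3_tokens):.1e} words")
print(f"All public text ~ {tokens_to_words(all_text_tokens):.1e} words")
print(f"Llama 3 alone used ~{llama3_tokens / all_text_tokens:.0%} of that upper bound")
```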

Having said that, the total number of words a human hears by the age of 20 is approximately 100-200 million (educatingsilicon.com). That maximum human learning is minuscule compared to what these LLMs consume. With this massive learning rate, if AI misunderstands a particular issue, unlearning becomes costly!
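A tiny sketch of that comparison, assuming a midpoint of ~150 million words heard by age 20 and the same rough token-to-word ratio as above:

```python
# How many human childhoods (to age 20) of listening would equal
# Llama 3's pre-training corpus? Rough figures from the text above.
HUMAN_WORDS_BY_20 = 150e6    # midpoint of the 100-200 million estimate
LLAMA3_WORDS = 15e12 * 0.75  # ~15T tokens at ~0.75 words per token

print(f"Llama 3 read the equivalent of ~{LLAMA3_WORDS / HUMAN_WORDS_BY_20:,.0f} "
      "human childhoods of language exposure")
```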

What is Machine Unlearning? It refers to removing the influence of specific training data points from an already-trained machine-learning model, eliminating those data points' effects for the sake of fairness and accuracy.
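For intuition, here is a minimal sketch of the naive “exact unlearning” baseline on a toy model: retrain from scratch on everything except the points to be forgotten. (The model, data, and library choice are illustrative assumptions; research methods such as SISA sharding or gradient-based approximations exist precisely because this baseline is too expensive at scale.)

```python
# Naive "exact unlearning" baseline: drop the forget-set from the training
# data and retrain from scratch. Guaranteed to remove the points' influence,
# but at the full cost of retraining.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))           # toy training data
y = (X[:, 0] > 0).astype(int)             # toy labels

model = LogisticRegression().fit(X, y)    # the original trained model

forget_idx = np.arange(50)                # points whose influence must be removed
keep = np.setdiff1d(np.arange(len(X)), forget_idx)

# Exact unlearning: retrain on everything except the forget-set.
unlearned = LogisticRegression().fit(X[keep], y[keep])
```

At this toy scale the retrain is instant; at GPT-4 scale, it is the hundred-million-dollar problem discussed next.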

Let's make some assumptions. It is estimated that training GPT-4 cost about $100 million. If we want a model to unlearn specific content (say, a confidential dataset) by retraining from scratch, roughly the same training cost applies again, with the same energy requirement and sizable carbon emissions. Hence training and untraining are both costly.
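Under those assumptions, the arithmetic is brutal; a toy cost model follows (the $100 million figure and the one-retrain-per-request premise are the post's rough assumptions, not real pricing):

```python
# Toy cost model under the post's assumptions: if the only way to truly
# "forget" data is a full retrain, every unlearning request costs about
# one training run.
TRAIN_COST_USD = 100e6  # rough estimate quoted above for training GPT-4

def naive_unlearning_cost(requests: int) -> float:
    """Total cost if each unlearning request triggers a full retrain."""
    return requests * TRAIN_COST_USD

print(f"10 unlearning requests ~ ${naive_unlearning_cost(10):,.0f}")  # $1,000,000,000
```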

While discussing these algorithms, technologies, training costs, and water footprints, nobody talks about Human Unlearning Algorithms for all our mistakes, biases, misconceptions, and so on! To some extent, Spiritual Teachings act as unlearning algorithms in humans, but then again, if an unauthorized teacher teaches, it becomes unwanted training, yet another circular puzzle waiting to be solved!

Still, the world never stops learning, even while unlearning!

Ravi Saripalle
