A few notes on how AI systems learn

Learning in Artificial Intelligence (AI) refers to the ability of AI systems to acquire, process, and use knowledge or skills through experience, study, or being taught. This process enables AI to adapt to new circumstances, improve performance, and make predictions or decisions based on data. There are several types of learning in AI, such as supervised learning, unsupervised learning, and reinforcement learning. In all of them, data are crucial: data can be considered the cornerstone of how AI systems learn and improve.
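
To make these terms concrete, here is a minimal supervised-learning sketch in Python (using scikit-learn; the toy dataset and numbers are invented purely for illustration). The model 'learns' only in the sense that it fits its parameters to labelled examples, which is exactly why data matter so much:

```python
# A minimal supervised-learning sketch: the model improves by fitting
# its parameters to labelled examples (data). The toy dataset below is
# invented here purely for illustration.
from sklearn.linear_model import LogisticRegression

X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]]  # inputs (features)
y = [0, 1, 0, 1]                                      # labels (the "teaching")

model = LogisticRegression()
model.fit(X, y)                      # "learning" = estimating parameters from data
print(model.predict([[0.15, 0.1]]))  # generalising to a new, unseen input -> [0]
```

Unsupervised learning would drop the labels `y` and look for structure in `X` alone; reinforcement learning would replace the labels with a reward signal received through trial and error.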

Recent advancements in large language models (LLMs) and so-called generative AI are phenomenal. However, as James Somers argued in this interesting article in The New Yorker, these AI systems are 'threatening their own best sources of data' and, sooner rather than later, will be forced to search for new kinds of data on which to be trained.

The key point is that LLMs learn from huge data repositories of human writing, mainly the Web. So, LLMs and AI systems become 'intelligent' thanks to 'the artifacts of our intelligence.' For example, Wikipedia was ChatGPT's most important dataset. What happens when users no longer visit Wikipedia and prefer getting information from ChatGPT (even though the latter was trained on Wikipedia's repository)? As Somers put it, an LLM's 'goal is to ingest the Web so comprehensively that it might as well not exist. The question is whether this approach is sustainable' given that LLMs will need 'new reservoirs of knowledge' (i.e., data) to keep getting more intelligent. Where will this knowledge (again, i.e., data) come from?

First, data could come from us, the users. Soon we could start injecting 'our most private documents' into these AI systems. However, this has a clear limit: the AI systems would still depend on human-generated knowledge.

Second, it is possible that a major change will happen when AI systems generate knowledge for themselves. One possible way that this could happen is through so-called synthetic data. In other words, an LLM will generate information and another model will ingest it, as in the sketch below. However, the propensity for hallucination (i.e., making stuff up) would be huge.
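
As a purely illustrative sketch of that loop (the `Teacher` and `Student` classes below are hypothetical stand-ins, not any real training API):

```python
# A minimal, purely illustrative sketch of the synthetic-data loop:
# one model generates text, a second model trains on it. The classes
# are hypothetical stand-ins, not a real LLM training API.

class Teacher:
    def generate(self, prompt: str) -> str:
        # Stand-in for an LLM completion; real output may contain
        # hallucinations that nobody verifies.
        return f"Answer to: {prompt}"

class Student:
    def __init__(self):
        self.corpus: list[str] = []

    def train_on(self, texts: list[str]) -> None:
        # Stand-in for a training step: the student ingests whatever
        # the teacher produced, errors included.
        self.corpus.extend(texts)

teacher, student = Teacher(), Student()
prompts = ["What is Wikipedia?", "Who wrote Hamlet?"]
for _ in range(3):  # each round compounds any errors in the synthetic text
    student.train_on([teacher.generate(p) for p in prompts])

print(len(student.corpus))  # -> 6 synthetic documents, no human in the loop
```

The point of the sketch is in the loop comment: nothing checks the synthetic text against reality, which is why errors would compound round after round.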

Third, to avoid the hallucination problem, AI systems would need to express confidence about a certain type of information, cite sources, and possibly develop 'a rudimentary kind of self-knowledge.' An AI system would be 'curious' enough to search for reliable information, hypothetically even contacting experts. This is mind-blowing.
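
Here is a hypothetical sketch of what 'expressing confidence and citing sources' might look like at its simplest (the function, the threshold, and the example data below are all invented for illustration, not a real system's API):

```python
# A hypothetical sketch of confidence-gated answering: respond only when
# confidence clears a threshold AND a source can be cited; otherwise
# abstain. All names and data here are invented for illustration.
from typing import Optional

def answer(query: str,
           candidates: list[tuple[str, float, Optional[str]]],
           threshold: float = 0.8) -> str:
    # Each candidate is (answer_text, model_confidence, source_or_None).
    text, confidence, source = max(candidates, key=lambda c: c[1])
    if confidence >= threshold and source is not None:
        return f"{query} -> {text} (confidence {confidence:.2f}, source: {source})"
    # A rudimentary kind of "self-knowledge": admit uncertainty rather than guess.
    return "Not confident enough to answer; a reliable source (or an expert) would be needed."

print(answer("Who founded Wikipedia?",
             [("Jimmy Wales and Larry Sanger", 0.92, "en.wikipedia.org"),
              ("Jimmy Wales alone", 0.40, None)]))
```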

As Somers explains: ‘[s]uch a system would be like Stack Overflow, Wikipedia, and Reddit combined—except that, instead of knowledge getting deposited into the public square, it would accumulate privately, in the mind of an ever-growing genius. Observing the Web collapse this way into a single gigantic chatbot would be a little like watching a galaxy spiral into a black hole.’

On the data-information-knowledge-wisdom hierarchy, see this HBR article.

On the politics of synthetic data, see this article by Jacobsen.
