
Large language models are trained to solve one problem: predict the next word given some history of Internet text (the context window).
Therefore, any attempt to understand the capabilities and limitations of these models should start from this next word prediction task.
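To make the objective concrete, here is a minimal sketch of the next word prediction task, using GPT-2 through the Hugging Face transformers library purely as an illustration: given a context, the model produces a probability distribution over the next token, and the pretraining loss is simply cross-entropy against the token that actually came next.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "The capital of France is"
input_ids = tokenizer(context, return_tensors="pt").input_ids

with torch.no_grad():
    # At every position the model outputs a distribution over the vocabulary
    # for the next token; this is the entire task it was trained on.
    logits = model(input_ids).logits            # (1, seq_len, vocab_size)
    # The pretraining loss is just cross-entropy on these predictions
    # (labels are the inputs shifted by one position).
    loss = model(input_ids, labels=input_ids).loss

# Distribution over the next word, given the context window.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(i)!r}: {p.item():.3f}")
print(f"next-word cross-entropy: {loss.item():.3f}")
```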
Abilities like reasoning, arithmetic, performance on human evaluation datasets, law exams, etc. are a side effect of being trained on this objective; the models are never explicitly trained to solve these kinds of problems. That is why LLMs in their current form are not trustworthy.
Focusing only on these proxy tasks makes it very hard to pinpoint why LLMs fail when they do.
The authors of this paper take a teleological approach to understanding LLMs, i.e., they focus on the problem the models were trained to solve.
This approach reframes all the proxy tasks on which LLM performance is generally reported:
Reasoning ==> Statistical next word prediction model applied to reasoning.
Maths ==> Statistical next word prediction model applied to mathematics.
Law Exam ==> Statistical next word prediction model applied to a law exam.
etc.
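Viewed this way, answering a math or reasoning question is just the same next-token loop run over a different prompt. A rough sketch (again with GPT-2 as a stand-in; production models add sampling strategies, chat formatting, and fine-tuning, but the underlying mechanism is unchanged):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Q: What is 7 + 5?\nA:"
ids = tokenizer(prompt, return_tensors="pt").input_ids

for _ in range(10):                          # generate up to 10 tokens
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()         # greedy: most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))
```

Nothing inside the loop knows whether the prompt is arithmetic, law, or poetry; the "task" exists only in how we read the output.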
Since the alignment between next word prediction and reasoning is not 100%, models hallucinate and occasionally fail even at simple reasoning tasks.
Assuming we have not achieved AGI, the autoregressive objective of LLMs lets us make a few hypotheses about the kinds of tasks where LLMs will do well and where they will fail:
1. Sensitivity to task frequency
LLMs will perform better on tasks that are frequent in the training data than on tasks that are rare, even…