Imagine this: Bob is an NLP student who started developing methods for a task X (e.g., relation extraction, sentiment classification) four years ago. (Un)fortunately, language models can now do X with decent accuracy out of the box, making Bob's past research less useful. Bob chose to pivot, and he now works on improving LLMs on task X through clever prompting. Bob finishes his project in a few months, only to find that a large industry player has proposed a general-purpose prompting method that outperforms his.
Bob begins to regret his choices. If he had known that LLMs plus general-purpose methods could easily crush his benchmarks, he might have chosen his research directions differently. To minimize regret, Bob needs to plan ahead; to plan ahead, he needs to predict what future AI systems can do and how his actions could contribute to the field.
I will present two arguments about predicting the future in order to plan research. To make these arguments interesting, I will illustrate them with concrete, plausible stories from NLP research.
1) We can plan our research to make it more impactful if we can accurately predict the future. If we do not actively plan ahead, we might miss important research opportunities.
2) We can predict the future better with systematic approaches than with intuitions alone. If we rely only on intuitions, we might mistakenly reject unconventional predictions about future AI systems that turn out to be correct.
Finally, I will discuss caveats when applying these arguments to research planning.
Planning Our Research to Make it Impactful
I will present three stories in which planning ahead can make research more impactful. They involve: a) collecting training data, b) developing more useful libraries, and c) building evaluation benchmarks.
a) Collecting Instruction-Tuning Data in 2019, not 2021
google/flan-t5 is a powerful instruction-tuned model, released in 10/2022 and popularly used[1] in NLP research. Could it have been developed earlier to benefit the broader community and achieve larger impact? Let's look at the three components needed to build flan-t5: (1) the idea of instruction tuning, (2) the T5 model, and (3) the SuperNaturalInstruction dataset.
Idea: in 02/2019, the GPT-2 paper was published, suggesting that instruction-tuning is plausible.
Model: in 10/2019, T5 was open-sourced.
Data: from 04/2021 to 04/2022, AI2 collected the SuperNaturalInstruction dataset.
Data collection lagged behind the idea and the model by about two years! If someone had foreseen this, they could have started collecting the dataset in 02/2019 and published flan-t5-xxl in mid-2020.
b) Developing Highly Optimized Libraries to Fine-tune Language Models before the LLaMa Release
At the end of 02/2023, the LLaMa model with 65 billion parameters was released. Since LLaMa robustly outperforms most previous pretrained models but requires a significant amount of GPU memory to fine-tune, it is broadly valuable to build libraries that can fine-tune these models with minimal GPU memory (e.g., 48GB). Indeed, within three months of the initial LLaMa release, we started to see a wide range of papers and GitHub repos on parameter-efficient or quantized fine-tuning that allow more users to customize these models for their downstream applications.
However, we could have worked more aggressively on general efficient fine-tuning methods before the demand for fine-tuning LLaMa on smaller GPUs emerged, for two reasons: 1) if an efficient fine-tuning library had been released two months earlier, many more users could have benefited from it, and 2) starting early gives people time to spot issues, fix bugs, and ensure that the library is robust.[2]
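To make this concrete, here is a minimal sketch of the kind of recipe such libraries enable today, combining 4-bit quantization with LoRA adapters via the Hugging Face transformers, peft, and bitsandbytes stack. The model name and hyperparameters are illustrative placeholders rather than recommendations:

```python
# Sketch of parameter-efficient, quantized fine-tuning (QLoRA-style).
# Assumes transformers, peft, bitsandbytes, and accelerate are installed;
# the model id and hyperparameters below are placeholders, not a prescription.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "huggyllama/llama-7b"  # placeholder; any causal LM works

# Load the frozen base model in 4-bit precision so it fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach small trainable LoRA adapters; only these weights are updated.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a tiny fraction of the full model
```

The point is not this specific recipe, but that the underlying building blocks (adapters, quantization) existed well before 02/2023 and could have been packaged into a robust library ahead of the demand.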
c) Building MATH and MMLU in 2020
Impactful benchmarks can usually tell different systems apart and remain uncrushed by SOTA models for a long period. My personal favorite examples are the MMLU and MATH benchmarks[3]: the former contains college- and professional-level questions, and the latter contains questions from mathematical competitions. Released in 09/2020 and 03/2021 respectively, they are used in a wide range of SOTA LLM papers, such as Minerva and Chinchilla.
Like many other NLP researchers, I thought Dan Hendrycks was being unreasonable when I heard he was building these benchmarks: even after GPT-3's release, very few people imagined LLMs solving complicated reasoning problems, and GPT-3 only got 5% accuracy on MATH! However, at larger scales, LLMs can indeed perform a wide range of mathematical reasoning; hence, MATH and MMLU have become important benchmarks over the past two years.
Predicting the Future with Systematic Approaches
One might be skeptical about whether we can predict qualitative properties of future AI systems at all, given the inherent uncertainty about the future. However, we can still do better than assuming that the future will look like the present. Here are a few approaches that I commonly use:
a) Paying attention to successful toy examples. Even though many demonstrations of what LLMs can do (e.g., code/poem generation, learning from explanations) are toy, or even cherry-picked and unscientific, they are sometimes good predictors of what future systems can do.
As a concrete example, the few-shot learning capability of language models was already reported in the GPT-2 paper (Table 17), but the results were not strong enough for it to become the dominant paradigm. Most NLP researchers did not take it seriously as a new learning paradigm until GPT-3 came out.
b) Polling opinions from people with different backgrounds (the Delphi method). Our beliefs are heavily biased by the small circle of people we interact with most frequently, so it is useful to poll opinions from people with different strengths and weaknesses. Here is my very rough impression of academic vs. industry researchers.[4]
Industry researchers working on cutting-edge language models:
Strengths: they have much more information about what current SOTA systems can do. Even if they cannot directly tell you about the current system because it is a business secret, you can often infer their beliefs from their research tastes.
Weaknesses: they might have an incentive to overclaim[5] what their systems can do, tend to be less rigorous about their claims, and hold relatively homogeneous beliefs about future AI systems within one organization.
Academic researchers:
Strengths: they are generally more rigorous and open about the limitations of current systems.
Weaknesses: they know less about what current SOTA systems can do, since there is a delay between obtaining preliminary research results and publishing the paper; they also tend to overstate the difficulty of the task they are currently working on, because otherwise their research would not be considered novel and impactful.
As a concrete example, since I am still a Ph.D. student mostly surrounded by academic researchers, I actively sought out opinions from industry researchers (e.g., at OpenAI and Anthropic). My earlier research benefited significantly from the following perspectives pioneered by industry labs: a) AI systems will be general-purpose and applied broadly, and b) they can support humans on difficult tasks where even experts might not be able to determine the ground truth reliably.
c) Drawing analogies from related research areas. Finally, lessons from related areas can help us predict what will happen to NLP systems. For example, reward gaming is a well-known phenomenon in the reinforcement learning literature: undesirable consequences emerge when the objective we ask AI systems to optimize diverges from what we truly want, especially as AI systems become more capable. Before 2021, few academic NLP researchers realized this could be a problem, since pre-trained models were not even capable enough for RL to outperform imitation learning. However, this analogy has started to manifest itself in RLHF training, where we optimize AI systems to earn more annotator upvotes as a proxy for user preference. As a result, language models might write in preferable styles without producing factually correct information, which earns more annotator upvotes even though it diverges from what humans actually prefer (e.g., being factually correct).
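As a deliberately contrived toy (not any real RLHF pipeline), the sketch below hill-climbs a hypothetical proxy reward that values confident-sounding style alongside factuality; the trade-off between the two is hard-coded purely for illustration, and the point is only that optimizing the proxy can quietly degrade the true objective:

```python
# Toy illustration of reward gaming (Goodhart's law). All numbers and the
# style/factuality trade-off are made up for illustration.
import random

random.seed(0)

def true_reward(factuality):
    # What we actually want: factually correct answers.
    return factuality

def proxy_reward(style, factuality):
    # What annotators tend to upvote: factuality, but also confident style.
    return 0.4 * factuality + 0.6 * style

# Start from a mediocre "policy" described by two scalar traits.
style, factuality = 0.2, 0.2
for step in range(200):
    # Propose a small random tweak and keep it if the *proxy* improves.
    new_style = min(1.0, max(0.0, style + random.uniform(-0.05, 0.05)))
    # Assumed trade-off: pushing style up erodes careful, factual writing.
    new_fact = min(1.0, max(0.0, factuality - 0.5 * (new_style - style)
                            + random.uniform(-0.02, 0.02)))
    if proxy_reward(new_style, new_fact) > proxy_reward(style, factuality):
        style, factuality = new_style, new_fact

# The proxy keeps improving while the true reward (factuality) drops.
print(f"style={style:.2f}, true reward={true_reward(factuality):.2f}")
```

In practice the proxy is a learned reward model and the divergence is far subtler, but the qualitative failure mode is the same one the RL literature has long warned about.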
To conclude, these prediction approaches have led me to think seriously about futuristic concepts such as "general-purpose/agent-like systems", concepts that I would otherwise have dismissed if I had relied only on intuitions drawn from existing systems.
Caveats
There are a few common misconceptions about how predictions about the future should influence one’s research, so I will list some caveats here:
Immediate Impact vs. Future Impact. Being able to predict the future does not mean that you have to optimize for future impact; there is also value in making current methods immediately useful.
Individual Actions vs. Macro Trends. The prediction that a problem will be solved soon does not mean you should not work on it. In fact, if you already have expertise in this problem, you are the best person to work on it.
Curiosity-Driven vs. Impact-Driven. Research does not need to be driven by impact! Many great researchers are driven by intellectual curiosity and the pleasure of tinkering. Some curiosity-driven research does end up being impactful: when Bernhard Riemann developed what we now call Riemannian geometry, he did not know how it would shape the theory of relativity half a century later. Maybe some of our research will inspire important discoveries in another field, in a random corner of the world, two centuries later. Who knows.
Conclusion
In this blog post I argued that it is important to predict and plan for the future, and that we can use systematic methods to predict the future better. However, our predictions about the future are inherently subjective, and people will naturally disagree; furthermore, since a prediction attracts more attention if it is unconventional, this creates an incentive to make impressive rather than calibrated predictions. In the next blog post, I will talk about how to communicate our predictions to the broader community in an effective and responsible way.
[1] 600K downloads in the month of 06/2023.
[2] FWIW, a lot of people had long been working on efficient training before LLaMa; however, they did not push the envelope further to fine-tune LLaMa 65B on a single GPU. My conjecture is that the previous model, T5-11B, was relatively small and not good enough for many commercial applications; as a result, there was much less demand for fine-tuning models at the ~60B scale.
[3] This is a controversial stance, and I might be biased because I have talked to Dan personally. OTOH, even if we do not consider them good benchmarks, we still need to plan ahead to propose alternatives.
[4] This is an inaccurate dichotomy, because many professors also have industry affiliations.
[5] This might not be a fair characterization, but I think it is justified to be skeptical due to conflicts of interest.