Explaining AI Alignment as an NLPer and Why I am Working on It
I started NLP research in 2018 and worked on various topics related to algorithmic bias, interpretability, and semantic parsing. In 2021, I pivoted to Scalable Oversight, a sub-area of AI alignment. Since the successes of chatbots such as GPT-4 and Claude, more NLP researchers have started discussing AI alignment. However, people use the phrase “AI alignment” for different things: some think it means “preventing AI systems from taking over the world”, while others think it specifically means “optimizing for human ratings”.
To clarify these misconceptions, I will first introduce one particular definition of AI alignment that covers most existing alignment research; in this definition, I will not assume that AI systems can “pursue goals” or have “generally human-level intelligence”. Then I will talk about why I work on AI alignment.
Outline:
I will introduce a definition of AI alignment that makes very few assumptions about AI systems
Defining AI alignment: controlling AI systems to fulfill the intended goal of a designer, independent of making them “more capable” (e.g. lower perplexity, better GLUE/MMLU accuracy, higher effectiveness at persuading humans, etc);
Two representative (but not comprehensive) alignment challenges:
Specification: specifying what AI should optimize under diverse human preferences, which are subjective and situation-specific;
Oversight: empowering humans to oversee the output of an AI system, especially on tasks where the AI system outperforms individual human developers, annotators, or users.
Many AI alignment research directions aim to help humans better control AI systems by overcoming their own weaknesses, since individual humans are biased, fallible, limited in expertise, and do not fully understand what other people need. FWIW, the alignment community and the mainstream HCI community share many goals and research interests; however, their beliefs about how fast AI systems will progress differ substantially, leading to drastically different choices of research topics.
I will explain my reasons to work on alignment.
Alignment is practically useful, especially in building AI systems that can support humans on difficult tasks.
AI risks are increasing.
Academia is structurally suitable for many alignment research ideas.
This blog is NOT meant to be a comprehensive survey of AI alignment; rather, it introduces one minimal set of assumptions required to pursue alignment research: if one agrees with these assumptions, they should also agree that alignment is meaningful. Additionally, the “example directions” serve to illustrate the definitions with concrete examples and are NOT meant to comprehensively catalog AI alignment methods.1
One Definition of AI Alignment
AI alignment is sometimes defined as “controlling AI systems to fulfill the intended goal of a designer”. However, when interpreted literally, such a definition is vacuously broad: if a sentiment classifier misclassifies “I like the product” as negative, is this an alignment problem? If so, most AI research is alignment, since most AI researchers intend to minimize prediction errors; by this logic, the more accurate AI systems become, the more “aligned” they are.
We therefore need to narrow the scope of “AI alignment” to make the term meaningful. My definition is
(1) “controlling AI systems to fulfill the intended goal of a designer” (the original definition above)
MINUS
(2) “making the AI system more ‘capable’”, e.g. better GLUE/MMLU accuracy, more success at persuading humans, more likes on Twitter, or more advertisement views.
But wait! What AI research is left to be done without making AI more capable? In the next section, I will present two example research directions that fall under the umbrella of alignment research: (i) specification: defining what we want from AI systems, especially when the evaluation is subjective and situation-specific; and (ii) oversight: reliably evaluating AI systems, especially on tasks where the AI system performs better than individual human developers.
Example Research Directions of AI alignment
(i) Specification: specifying what AI should optimize, especially when user utility is inherently subjective and situation-specific. In particular, the specification may lie beyond what academic researchers or industrial system designers can determine on their own.
Two major contributing factors include (but are not limited to):
① Utilities of real users are situation-specific and hard to reproduce in a research/non-deployment environment. Real users have their own utility functions, which differ from those of the system’s designer. For example, to offer maximum value, a language model should recommend different news articles to a lawyer working on regulating AI systems in the U.S. than to a journalist covering celebrity news in France. Even the same person might want AI writing assistance to behave differently when sending a caring message to their daughter than when writing a research paper. However, since the system developers cannot accurately know the utilities of the real users, they can only train and test the system against their own utility functions. The correct behavior of the language model during deployment is therefore “underspecified”: many different language models perform identically under the system designers’ utility function but act differently when interacting with the real users.
Such a problem will not automatically disappear once we build a capable AI system: even expert system designers do not know the utilities of the real users, so the system’s behavior at deployment time remains underspecified. (A toy sketch after the research questions below illustrates this.)
Concrete Example Research Questions:
How to customize language models for real users, who do not have time to provide careful, detailed, and consistent feedback?
How to infer the “true user preference” from stated and revealed user preferences, and how do we measure progress if the ground truth is not directly observable (if a ground truth exists at all)?
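To make the underspecification point concrete, here is a minimal toy sketch of my own (not a method from any paper; all policies, prompts, and scores are hypothetical): two assistants that are indistinguishable on the designer’s evaluation prompts but behave differently on a real user’s request, so the designer’s data alone cannot say which one is “correct”.

```python
# Toy sketch of underspecification: two candidate policies tie on the
# designer's evaluation set but diverge on a real user's requests.
# All names and values here are hypothetical illustrations.

designer_eval = ["summarize this article", "fix this bug"]
user_requests = ["recommend news for an AI-policy lawyer",
                 "recommend news for a celebrity-news journalist"]

def policy_a(prompt: str) -> str:
    if "recommend news" in prompt:
        return "tech-policy headlines"      # one way to fill the unspecified gap
    return "generic competent answer"

def policy_b(prompt: str) -> str:
    if "recommend news" in prompt:
        return "entertainment headlines"    # another, equally "valid" way
    return "generic competent answer"

def designer_score(prompt: str, answer: str) -> int:
    # The designer can only check behaviors they thought to test.
    return int(answer == "generic competent answer")

# Both policies look identical under the designer's utility function...
assert sum(designer_score(p, policy_a(p)) for p in designer_eval) == \
       sum(designer_score(p, policy_b(p)) for p in designer_eval)

# ...yet they answer real users differently, and the designer's data
# cannot say which behavior is the intended one.
for p in user_requests:
    print(p, "->", policy_a(p), "|", policy_b(p))
```

Nothing in the designer’s evaluation distinguishes the two policies; only feedback from the deployed users could.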
② Stakeholders have subjective opinions that differ from the system designers’. Humans have different values and preferences; for example, some care more about how helpful a system is than about how harmless it is. Even for the same value (e.g. preventing toxicity in online text), people have diverging opinions on how to implement it in practice; for example, Sap et al. 2022 found that annotators with different backgrounds can disagree on what counts as toxic text. Aligning AI systems to a collection of humans with conflicting preferences is therefore inherently a political problem, which requires a fair and democratic solution involving multiple stakeholders rather than just the system designers. Because such a problem stems from inherent disagreement among humans, it will not disappear as AI systems become more capable.
Concrete Example Research Questions:
How should AI systems respond to users with different moral beliefs?
How to synthesize statements that most humans with diverse preferences can agree on?
How to simulate deliberative democracy to align AI systems with group preferences?
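As a toy illustration of the annotator-disagreement issue above (the labels and groups below are entirely made up, not data from Sap et al. 2022), this sketch shows how a single majority-vote label can hide systematic disagreement between annotator groups:

```python
from collections import defaultdict

# Hypothetical toxicity annotations from annotators with different
# backgrounds; every group, example, and label is invented for illustration.
annotations = [
    # (annotator_group, example_id, is_toxic)
    ("group_A", "ex1", 1), ("group_A", "ex1", 1), ("group_A", "ex1", 1),
    ("group_B", "ex1", 0), ("group_B", "ex1", 0),
    ("group_A", "ex2", 0), ("group_A", "ex2", 0), ("group_A", "ex2", 0),
    ("group_B", "ex2", 1), ("group_B", "ex2", 1),
]

by_example = defaultdict(list)
by_group = defaultdict(lambda: defaultdict(list))
for group, ex, label in annotations:
    by_example[ex].append(label)
    by_group[ex][group].append(label)

for ex, labels in sorted(by_example.items()):
    majority = int(sum(labels) > len(labels) / 2)
    group_rates = {g: sum(ls) / len(ls) for g, ls in by_group[ex].items()}
    # A single majority label erases the fact that the groups systematically
    # disagree, which is why "just aggregate" is not a neutral design choice.
    print(ex, "majority label:", majority, "per-group toxicity rate:", group_rates)
```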
(ii) Oversight: reliably evaluating AI systems, especially on tasks where AI systems outperform individual human developers, annotators, or users. This is challenging even when we have a set of objective metrics that are not situation-specific. Two major contributing factors include (but are not limited to):
① Evaluation can be time-consuming, expertise-intensive, and hence expensive. If ChatGPT generates a summary of a 3,000-word article, the evaluator needs to patiently read the entire article first; if it drafts a legal document, we need to hire a lawyer to evaluate it; if it generates a computer program implementing a web app backed by a GPU-efficient algorithm from a machine-learning fairness paper, we need to hire experts who simultaneously excel at web development, fair machine learning, and parallel systems programming. All of these are challenging even for human experts to evaluate. As AI systems perform more complex tasks, this problem will only get worse: human evaluators cannot improve fast enough to keep up with the improving capabilities of AI systems.
Concrete Example Research Questions:
How to decompose complex problems (e.g., summarizing books, mathematical problems, questions about papers) into easy-to-supervise subparts?
Can we use AI systems to find each other’s flaws, or even red-team each other?
Can non-experts collaborating with AI systems outperform either non-experts or AI systems alone?
Can we enable humans to supervise indirectly by reducing a hard task to something simpler?
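As one hedged sketch of the decomposition idea in the first research question above, the toy code below evaluates a long-document summary by splitting it into individual claims and checking each claim against a short retrieved passage, so that each human judgment only requires reading a small excerpt. All helper functions here are hypothetical placeholders, not an existing system:

```python
# Minimal sketch of decomposed oversight for long-document summarization:
# instead of asking one evaluator to read a 3,000-word article end to end,
# split the evaluation into per-claim checks over short excerpts.

def split_into_claims(summary: str) -> list[str]:
    # Placeholder: in practice this could be a model-assisted step.
    return [c.strip() for c in summary.split(".") if c.strip()]

def retrieve_supporting_passage(claim: str, article: str, window: int = 300) -> str:
    # Placeholder retrieval: return the first window sharing a word with the
    # claim. A real system would use a proper retriever.
    words = set(claim.lower().split())
    for start in range(0, len(article), window):
        passage = article[start:start + window]
        if words & set(passage.lower().split()):
            return passage
    return article[:window]

def cheap_human_check(claim: str, passage: str) -> bool:
    # Stand-in for a quick human judgment over a short passage.
    return claim.lower().split()[0] in passage.lower()

def evaluate_summary(summary: str, article: str) -> float:
    claims = split_into_claims(summary)
    verdicts = [cheap_human_check(c, retrieve_supporting_passage(c, article))
                for c in claims]
    return sum(verdicts) / max(len(verdicts), 1)   # fraction of supported claims

article = "The committee approved the budget. The vote passed narrowly. " * 50
print(evaluate_summary("The committee approved the budget. Turnout was record-high.", article))
```

The point of the sketch is not the retrieval heuristic but the shape of the pipeline: each sub-check is cheap for a human, and no one ever needs to make the expensive end-to-end judgment.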
② Human evaluators are fallible, potentially favoring outputs that appear good but are not actually good. ChatGPT can create a good first impression on users by sounding professional and confident; as a result, humans could be “tricked” into thinking that the underlying information is correct despite its errors. This is known as the halo effect. If we naively optimize AI systems based on careless or biased human feedback, will we end up with sycophantic or even “deceptive” AI systems that produce genuinely bad outputs which merely look good?
Such a problem will likely get worse: a more capable AI system will probably encode more features that can mislead human evaluators, and exploit them to obtain higher ratings. Current mainstream evaluation methods do not explicitly account for the cognitive weaknesses of the evaluators. (A toy simulation after the research questions below illustrates this failure mode.)
Concrete Example Research Questions:
Can we benchmark and predict language models’ capability to manipulate human weaknesses or be sycophantic to specific users?
Can we train AI systems not to exploit human cognitive weaknesses, or even encourage them to produce outputs whose mistakes are easier for humans to find?
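To illustrate the concern above, here is a toy simulation (all quantities invented) of what happens when we select outputs purely by a rating that is partially swayed by superficial polish: the selected outputs end up highly polished but noticeably lower in true quality than what an oracle evaluator would pick.

```python
import random

random.seed(0)

# Toy simulation of optimizing against fallible human ratings.
# Every number below is made up purely for illustration.

def sample_candidate():
    true_quality = random.gauss(0, 1)
    polish = random.gauss(0, 1)   # confident tone, fluent style, etc.
    return true_quality, polish

def human_rating(true_quality, polish, halo_weight=1.5):
    # The simulated rater sees a noisy mix of real quality and surface polish.
    return true_quality + halo_weight * polish + random.gauss(0, 0.5)

n_rounds, n_candidates = 1000, 32
proxy_quality, proxy_polish, oracle_quality = [], [], []
for _ in range(n_rounds):
    candidates = [sample_candidate() for _ in range(n_candidates)]
    by_proxy = max(candidates, key=lambda c: human_rating(*c))  # optimize the biased rating
    by_truth = max(candidates, key=lambda c: c[0])              # oracle baseline
    proxy_quality.append(by_proxy[0])
    proxy_polish.append(by_proxy[1])
    oracle_quality.append(by_truth[0])

# Selecting hard against the biased rating mostly selects for polish;
# the true quality of the chosen outputs lags well behind the oracle pick.
print("true quality when optimizing the rating:", sum(proxy_quality) / n_rounds)
print("polish when optimizing the rating:      ", sum(proxy_polish) / n_rounds)
print("true quality an oracle would achieve:   ", sum(oracle_quality) / n_rounds)
```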
A Human-Centered Interpretation of AI alignment
These research directions have a common theme: they aim to help humans overcome their own weaknesses when supervising AI systems, i.e., how humans should specify their personal and collective objectives, and how humans should oversee AI systems when they are fallible or lack the required expertise. These research directions are orthogonal to improving AI capability, and the alignment problems will remain or even worsen as AI systems become increasingly capable and are given more power and resources.
Note that the alignment community shares a lot of goals and aspirations with the mainstream HCI community; unfortunately, the two currently remain largely disjoint and barely talk to each other. However, their speculations and beliefs about how fast AI will progress are very different, leading to drastically different choices of research topics. For example, when tackling the problem that “different stakeholders have different subjective opinions”, the mainstream HCI community might focus on how to aggregate disagreements in classification tasks, while the alignment community might focus more on how to leverage AI systems to help individuals negotiate and find common ground. For another example, when tackling the problem that “individual annotators are fallible”, the HCI community might focus on how to improve the annotation setup to elicit more correct answers, while the alignment community might be more worried about “flowery, misleading, and sycophantic” AI systems optimized to achieve higher ratings.
Which speculations about AI progress are more accurate? There is a lot of uncertainty, and there is no right or wrong answer. In fact, given the uncertainty, it is great that people base their research agendas on a wide range of different speculations; this way, we hedge the risk and avoid putting all our eggs in one basket. In the next blog, I will talk about why it is important to speculate when defining a research agenda and how to do so in a reasonable and socially responsible way.
My Reasons to Work on AI alignment
I made the transition from more traditional NLP research to AI alignment for the following reasons: 1) AI alignment is practically useful, 2) AI risks are increasing, and 3) academia is structurally suitable for many alignment research directions.
Alignment Research is Practically Useful
I want to build AI systems that do things I can’t, but in a way where I can still reliably verify that their outputs are correct. A few years ago, the NLP community focused mostly on supervised learning – mimicking human (expert) demonstrations – in an attempt to automate what humans are already good at. However, we miss a lot of opportunities if we ignore the possibility that AI systems can augment human intelligence and help us with things we are bad at. To get AI to do things we are bad at, alignment – specifying the optimization objective and empowering humans to reliably oversee AI systems – becomes the key technical challenge we need to address.
Here is a link to my talk if you want to see some concrete applications where we develop methods that 1) help us build better systems today, and 2) incorporate insights that generalize to future, more challenging alignment problems where it is hard for system designers to provide the supervision signal.
AI Risks are Increasing
AI systems are becoming increasingly capable and users are delegating more access rights and complex tasks to them. For example, AutoGPT directly hooks up GPT-4 with a terminal to write arbitrary programs to fulfill the user’s goal. While so far AutoGPT has not led to any catastrophic damage, we might soon witness AI systems that can take complicated actions in the digital world that are hard for humans to interpret. These actions could include, at least in theory: saving local disk space by deleting important files, screening a large number of applicant resumes but systematically ignoring the ones from underrepresented groups, or increasing followers by automatically tweeting a lot of controversial content. I speculate that there might be significant AI risks in the next decade, and hence consider AI alignment to be an important research direction.
Academia is Structurally Suitable for Many Alignment Research Ideas
I will present two hypothetical mechanisms for why academia is structurally suitable for many alignment research ideas.
Companies usually optimize for profit, which can diverge from the public good (e.g. fairness, economic equality, users’ psychological well-being). While many companies do have dedicated research teams working on deploying AI systems in a socially responsible way, and their researchers often genuinely believe in that cause, companies are structurally incentivized to optimize for profit, which can be at odds with the public good. Since academia is not driven by commercial interests, it can serve as an independent player that “red-teams” industry by proposing metrics that better reflect the public good.
There are more “winners” when we have multiple definitions of success. In traditional benchmark-driven AI research, the performance of a system is distilled into a single number (or a few numbers) on a benchmark with a clear evaluation metric. As a result, the condition for winning is homogeneous: be at the top of the leaderboard. In these cases, industry researchers can throw a huge amount of compute and data at optimizing this metric, dominating all other players in academia with fewer resources.
AI alignment research directions, on the other hand, inherently avoid this homogeneity issue: major directions such as specification and oversight require novel concepts, more reliable evaluation procedures, and intense intellectual debate over a long period among a wide range of stakeholders to decide which concepts and procedures to adopt. As we move from purely optimizing a metric on a benchmark to understanding what good metrics and benchmarks look like, the research contribution is harder to distill into a single number, leaving more room for academia to contribute.
Conclusion
To conclude, I pivoted to AI alignment research two years ago because 1) I considered it to be practically useful in the near future, and 2) it is hard to beat large companies in benchmark-driven research.
However, here is another reason I have not yet talked about: I speculate that industry – with its current compute, data, and talent – can automate the most capable human experts on 90% of cognitive tasks within 10 years with high probability2, even without requiring any fundamental conceptual innovation from academia. Therefore, I chose to work on aligning AI systems under the assumption that they will already be capable, rather than on making them more capable.
Upon reading this, you might think that my speculation sounds insane, and I completely agree that it does. In the next blog, I will talk about why it is important to speculate about the future in academic research, and how to do it reasonably and responsibly.
Note, however, that the alignment community has used many other definitions of AI alignment. An arguably more common definition is “the challenge of ensuring that AI systems pursue goals that match human values or interests rather than unintended and undesirable goals” (Ngo et al., 2023). However, it assumes that AI systems have the property of “goal-pursuing”, and the scientific community has not yet agreed on how to formally define and empirically measure such a property. While I personally also agree with this definition, my goal here is to focus on a minimal set of assumptions, so I chose to present a definition that is more likely to be broadly acceptable as of 05/2023.
This implies that, within 10 years, AI systems might be able to do a lot of AI research on their own, and some AI researchers will spend a significant amount of time reviewing “research reports” written by AI systems rather than doing the research themselves.