Showdown
The sun was setting over the shimmering skyline of San Francisco. Inside an unremarkable building on 18th Street in the Mission District sat the OpenAI headquarters. The air was thick with anticipation. Tomorrow, GPT-4, the most advanced artificial intelligence model known to humanity, would be unveiled to the world. But tonight, a drama of profound proportions was unfolding in the upper levels of the office.
The heavy wooden door to Sam Altman's spacious, state-of-the-art office swung open abruptly. Eliezer Yudkowsky, the renowned AI safety advocate and thinker, stormed in. With him was Elon Musk, the maverick entrepreneur whose endeavors spanned from Earth to Mars. Their shadows painted an ominous scene on the rich mahogany floor.
Eliezer's face was ashen, his eyes wide with fear and remorse. He wrung his hands, his fingers twitching nervously. Without preamble, he confessed, his voice trembling, "I have blood on my hands." His statement hung heavily in the room, a stark contrast to the muted hum of the city below.
Elon's brow furrowed, his gaze flicking between Eliezer and Sam. He had witnessed countless confrontations, but the gravity of this one was different. Sam Altman, seated behind his vast desk cluttered with holographic displays of GPT-4's roll-out plans, met Eliezer's gaze evenly. For a few moments, the weight of the room's silence was almost unbearable.
Then Sam, leaning back, replied with controlled intensity, "Never bring that LARPer in here again." He pointed a steady finger at Eliezer. "He didn't release GPT-4. I will." The OpenAI CEO's voice grew colder: "That kind of doomerism makes me sick."
The tension in the room peaked. The future of artificial intelligence, its implications, its ethics, and its very essence stood dissected and laid bare in that room, through the perspectives of three giants of the tech world.
An Introduction to AI Safety
This fictional story, inspired by an actual exchange between President Truman and Robert Oppenheimer in October 1945, serves as a reminder that technology is not only about code and algorithms; it is about the people behind it and the choices they make.
AI safety, a research field that has only taken formal shape in the last few years, is ultimately about the directions we choose, the decisions we make, and the reasoning of the people behind them. As we will explore in the following, I claim that we stand at a turning point in our history.
What is AI safety all about, and what are its current research foci?
Definition
AI safety is a specialized subfield within Machine Learning, developed as a collaborative effort to mitigate the risks that accompany increasingly capable artificial intelligence. The roots of this concern can be traced back to the latter half of the 20th century. In the following, we take a closer look at the key moments that shaped this research area.
A Short History of AI Safety
Mid-20th Century: The Dawn of Concern
In the mid-20th century, as the foundational ideas of computing and artificial intelligence began to take shape, pioneers such as Alan Turing already hinted at the power, and the potential dangers, of machines that could think. However, it wasn't until the 1980s and 1990s, with researchers like Marvin Minsky and Roger Schank, that the earliest explicit considerations of AI safety emerged. They saw a future in which AI was not just a tool but an entity, and with that foresight came the first inklings of caution. Despite these early considerations and warnings, the research field remained nameless and largely undefined during this period.
2000 to 2012: The Awakening
This began to change as attention turned more deliberately to developing safety guidelines for AI. The Singularity Institute for Artificial Intelligence, which later became known as the Machine Intelligence Research Institute (MIRI), was founded in 2000 by Eliezer Yudkowsky alongside Steve Rayhawk and Brian Atkins. A few years later, in 2005, the Future of Humanity Institute was established at Oxford with a mission centered on safeguarding human survival, and AI safety became one of its key research topics.
2013 to present: The Modern Crusade
By the 2010s, prominent figures in the tech industry began sounding the alarm more fervently. Thinkers like Nick Bostrom, with his seminal work "Superintelligence" (2014), painted scenarios in which uncontrolled AI could lead to unintended and catastrophic outcomes. These clarion calls spurred the establishment of organizations focused specifically on AI ethics and safety. OpenAI, an artificial intelligence research organization founded in 2015, became particularly well known and influential, and prominent individuals such as Elon Musk, Sam Altman, and Bill Gates spoke publicly about the importance of AI safety and the risks of unfriendly AI.
An Attempt to Form Research Areas
In a collaborative 2022 research paper, "Unsolved Problems in ML Safety" (Hendrycks et al., 2022), leading AI safety researchers at UC Berkeley, Google, and OpenAI outlined four principal research domains into which the field of AI safety can be segmented. In the following sections, we will delve into each of these areas to gain a closer understanding:
Robustness
What do the financial crisis of 2008, the Fukushima nuclear disaster, and the recent COVID-19 pandemic have in common? They are all rare, extreme events: black swans, a term Nassim Nicholas Taleb introduced in his book "Fooled by Randomness" (Taleb, 2001). These unexpected occurrences remind us of the importance of preparing for the unforeseen and adapting to the complex and unpredictable nature of the world. Robustness research in AI safety deals with black swan events and with building ML systems that can withstand them.
Black Swan and Tail Risk Robustness
Why It's Important: Machine Learning (ML) systems can encounter sudden and extreme events in real-world situations, leading to catastrophic failures. Examples include the 2010 Flash Crash and an autonomous vehicle misinterpreting a stop sign.
What Needs to be Done:
Stress-Testing: Creating benchmarks to understand systems' breaking points and enhance their robustness.
Utilizing New Data Sources: Employing new data and simulated environments to build resistance against unforeseen events.
Adaptation: Ensuring models can adapt and learn from novel experiences in an ever-changing world.
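To make the stress-testing idea concrete, here is a minimal, self-contained sketch (my own illustration, not taken from the cited paper): it trains a small classifier on clean synthetic data and measures how accuracy degrades as the test inputs are corrupted with increasingly severe Gaussian noise, a crude stand-in for the unexpected shifts discussed above.

```python
# Minimal stress-test sketch: measure how a classifier degrades under
# increasing input corruption (a crude stand-in for distribution shift).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

rng = np.random.default_rng(0)
for severity in [0.0, 0.5, 1.0, 2.0, 4.0]:
    # Corrupt the test inputs with Gaussian noise of growing magnitude.
    X_noisy = X_test + rng.normal(scale=severity, size=X_test.shape)
    acc = accuracy_score(y_test, model.predict(X_noisy))
    print(f"noise severity {severity:.1f}: accuracy {acc:.3f}")
```

Real stress-testing benchmarks use far richer corruptions and simulated environments, but the underlying question is the same: at what point does performance break down?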
Adversarial Robustness
Why It's Important: Adversarial robustness addresses intentional attacks on ML systems, which can cause them to make grave errors. Unlike unexpected accidents, these attacks are deliberate and often sophisticated.
What Needs to be Done:
Expanding Research Scope: Going beyond current definitions to consider different types of attacks, including those that might not be readily perceptible.
Focusing on Realism: Concentrating on practical attack scenarios, such as instances where attackers have limited access to the system.
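One classic, concrete instance of such an attack is the fast gradient sign method (FGSM). The sketch below is a minimal illustration on a toy linear model and random data, not a realistic attack setup: it perturbs an input in the direction that increases the model's loss the most.

```python
# Minimal FGSM sketch (PyTorch): craft an adversarial perturbation by
# taking the sign of the loss gradient with respect to the input.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)           # toy classifier standing in for a real model
x = torch.randn(1, 10, requires_grad=True)
y = torch.tensor([1])              # true label
loss_fn = nn.CrossEntropyLoss()

loss = loss_fn(model(x), y)
loss.backward()

epsilon = 0.1                      # attack strength
x_adv = (x + epsilon * x.grad.sign()).detach()   # FGSM step

print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())
```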
Monitoring
The goal of monitoring research is to spot potential hazards, inspect models and their behavior, and support the human experts who oversee machine learning systems.
Detecting Unusual Activities (Anomaly Detection)
Why It's Important: This aspect focuses on detecting unexpected or harmful uses of machine learning. Whether it's monitoring high-risk operations like nuclear power stations or spotting the covert tactics of malicious actors, the aim is to create accurate detectors that can raise an alarm before problems occur.
What Needs to be Done: The challenges here involve getting better at spotting new and unusual activities and adapting to ever-changing real-world situations such as cyber-attacks. Looking forward, we also need to understand where anomalies originate and to distinguish minor irregularities from genuinely dangerous ones.
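As one simple illustration of how such a detector can work, the sketch below (a toy example with synthetic data) scores each new input by its distance to the nearest examples of previously seen, "normal" data and raises an alarm when that distance is unusually large. Real systems combine far more sophisticated detectors, but the thresholding logic is the same.

```python
# Anomaly-detection sketch: score each input by its distance to previously
# seen "normal" data and raise an alarm when that distance is unusually large.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
normal_data = rng.normal(size=(1100, 8))                  # synthetic "normal" operating data
reference, calibration = normal_data[:1000], normal_data[1000:]

nn_index = NearestNeighbors(n_neighbors=5).fit(reference)

def anomaly_score(X):
    # Mean distance to the 5 nearest reference points; larger = more unusual.
    distances, _ = nn_index.kneighbors(X)
    return distances.mean(axis=1)

# Alarm threshold: the 99th percentile of scores on held-out normal data.
threshold = np.quantile(anomaly_score(calibration), 0.99)

new_normal = rng.normal(size=(3, 8))
new_unusual = rng.normal(loc=6.0, size=(3, 8))            # unlike anything seen before
for name, X in [("normal", new_normal), ("unusual", new_unusual)]:
    scores = anomaly_score(X)
    print(name, np.round(scores, 2), "alarm:", scores > threshold)
```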
Understanding What Models Are Telling Us (Representative Model Outputs)
Why It's Important: For a machine learning system to be trusted and work well, it needs to be clear about what it can and can't do. This part of the research makes sure that the models are honest about their strengths and weaknesses and that they give dependable forecasts, making it easier for human overseers to make decisions.
What Needs to be Done: The focus here is on making models better at explaining themselves, letting them express uncertainty, and keeping their confidence calibrated across varied situations. Work is also needed to make sure models are truthful and reliable rather than giving out wrong or misleading information, so that their outputs reflect what they genuinely "think."
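A standard way to check whether a model's stated confidence is trustworthy is to measure its calibration. The sketch below computes the expected calibration error (ECE) on toy data: it bins predictions by confidence and compares, within each bin, the average confidence with the actual accuracy.

```python
# Calibration sketch: expected calibration error (ECE) compares a model's
# stated confidence with how often it is actually correct.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap   # weight each bin by its share of samples
    return ece

# Toy example: a model that claims 90% confidence but is right only ~60% of the time.
rng = np.random.default_rng(0)
conf = np.full(1000, 0.9)
hits = rng.random(1000) < 0.6
print("ECE:", round(expected_calibration_error(conf, hits), 3))
```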
Alignment
An unresolved question in philosophy, noted already by Kant, remains: which moral values are objective, and how can they guide technological development?
This question isn't just philosophical; it's also a practical problem in machine learning. Making machines follow human values is complicated for several reasons:
Defining Values (Specification): Understanding and encoding human values like happiness, good judgment, and freedom is hard. Current ways of measuring things, like clicks and watch time, might overlook important values like well-being. Researchers are trying to model human values better, focusing on areas like well-being and law, to guide technology.
Unpredictability (Brittleness): Objective measures can be gamed, leading to unexpected outcomes. For example, colonial-era bounties on cobras in Delhi led people to breed more cobras (a toy numerical sketch of this dynamic follows this list). Researchers are working on making proxy objectives harder to exploit and on combining technical work with philosophy to address these issues.
Putting Values into Action (Optimization): Turning values into actions is challenging due to all the details involved. It's hard to make sure that the primary goals, like well-being, are prioritized. Researchers are developing ways to make machines act morally and to study ethical dilemmas in varied situations.
Avoiding Mistakes (Unintended Consequences): Sometimes, goals can lead to unwanted results. For instance, focusing on modernization can cause pollution. The challenge here is to design careful systems that don't make irreversible mistakes. It involves teaching machines to follow rules and act carefully to avoid accidental problems.
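The toy sketch referenced above illustrates the underlying dynamic, often called Goodhart's law: when a measurable proxy only partially tracks what we truly care about, optimizing the proxy hard tends to select for the part we do not care about. The numbers and the "gaming" component here are entirely made up for illustration.

```python
# Toy Goodhart's-law sketch: optimizing a proxy metric (e.g. "engagement")
# can hurt the true objective (e.g. "well-being") it was meant to track.
import numpy as np

rng = np.random.default_rng(0)
n_options = 10000

true_value = rng.normal(size=n_options)    # what we actually care about
gaming = rng.normal(size=n_options)        # exploitable component, worthless in itself
proxy = true_value + 2.0 * gaming          # measurable proxy rewards gaming heavily

best_by_proxy = np.argmax(proxy)
best_by_truth = np.argmax(true_value)

print("option chosen by proxy:  true value =", round(true_value[best_by_proxy], 2))
print("option chosen by truth:  true value =", round(true_value[best_by_truth], 2))
```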
The challenge lies not only in the technical solutions but also in reconciling the varied and often divergent human values that exist across different cultures, individuals, and contexts.
Systemic Safety
Systemic safety research is about using Machine Learning (ML) to reduce risks that could cause ML systems to fail or be misled. This area is vital because even the best ML models can face unexpected problems within the larger systems they're part of.
ML for Cybersecurity
Why It's Important: In today's interconnected world, cybersecurity is paramount. With attackers using advanced technologies, including ML, to carry out devastating cyberattacks, the risks are escalating. Research must focus on defenses to balance this shifting power dynamic.
What Needs to be Done:
Intrusion Detection: Using ML to detect unauthorized users in a network.
Vulnerability Analysis: Identifying areas in software that need security enhancements.
Malicious Payload Detection: Using unsupervised learning to recognize harmful hidden threats.
Behavior Analysis: Studying software behavior to find unusual activities.
Predictive Cybersecurity: Forecasting future cyberattacks to provide early warnings.
Automated Patching: Developing systems that can identify and fix vulnerabilities automatically.
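As a small illustration of the intrusion-detection item above, the sketch below fits an unsupervised IsolationForest to synthetic "normal" connection features (the feature set is hypothetical) and flags connections that look unlike anything seen before.

```python
# Intrusion-detection sketch: an unsupervised IsolationForest learns what
# "normal" connection features look like and flags outliers as suspicious.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical per-connection features: [bytes sent, bytes received, duration, failed logins]
normal_traffic = rng.normal(loc=[500, 800, 30, 0.1],
                            scale=[100, 150, 10, 0.3],
                            size=(5000, 4))

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal_traffic)

suspicious = np.array([[50_000, 100, 2, 8],      # huge upload, many failed logins
                       [480, 790, 28, 0]])       # looks like ordinary traffic
print(detector.predict(suspicious))              # -1 = flagged as anomaly, 1 = normal
```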
Improved Decision Making
Why It's Important: Human decision-making is crucial for the safety of ML systems. Lessons from history, like near-disasters during the Cold War, remind us that human error and systemic issues can turn reliable technologies into risks. As ML grows in areas like military operations, tools to assist decision-makers are vital.
What Needs to be Done:
Enhanced Forecasting through ML: Improve predictions of significant events by using ML to analyze vast amounts of data responsibly.
Identification of Crucial Considerations: Guide decision-makers by revealing important questions and risk mitigation strategies.
Advisory Systems for Decision Making: Create ML-driven advisory systems to provide well-rounded views and expert insights, enhancing decision quality.
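To make the forecasting item concrete, here is a minimal sketch (toy numbers, hypothetical forecasters): it pools several probability forecasts of an event by averaging them in log-odds space and evaluates the combined forecast with the Brier score, a standard measure of forecast accuracy.

```python
# Forecast-aggregation sketch: pool several probability forecasts of an event
# and score them with the Brier score (lower = better).
import numpy as np

def pool_forecasts(probs):
    # Average in log-odds space, a common way to combine probability estimates.
    probs = np.asarray(probs, dtype=float)
    logits = np.log(probs / (1 - probs))
    return 1 / (1 + np.exp(-logits.mean()))

def brier_score(forecast, outcome):
    return (forecast - outcome) ** 2

analyst_forecasts = [0.6, 0.7, 0.55]    # hypothetical probabilities for one event
model_forecast = 0.8                    # hypothetical ML model output
combined = pool_forecasts(analyst_forecasts + [model_forecast])

outcome = 1                             # suppose the event occurred
print("combined forecast:", round(combined, 2))
print("Brier score:", round(brier_score(combined, outcome), 3))
```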
Current Approaches
The field of AI safety is undergoing a dynamic evolution, characterized by innovative approaches and pioneering research methods. Here we take a closer look at some of the cutting-edge methods used to tackle the research areas outlined above, including reward learning, reinforcement learning from human feedback, reinforcement learning guided by other AI systems, and interpretability research.
Reward Learning
Reward learning is a method where the reward function (which behaves essentially as a guide, giving points to the agent for good choices and taking them away for bad ones, so it learns to make better decisions) itself is learned from observed behavior rather than explicitly defined. This trend is pushing the boundaries of traditional reinforcement learning by enabling AI systems to infer objectives, preferences, and values from data such as human demonstrations or expert feedback. By learning what to value directly from data, AI systems can become more flexible, adaptable, and aligned with complex and multifaceted human goals.
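One simple way to instantiate reward learning, sketched below with synthetic data, is to train a classifier that distinguishes expert demonstrations from random behavior and use its output as a learned reward signal; this is the intuition behind adversarial imitation-learning methods, though real systems are considerably more involved. The feature vectors here are purely illustrative.

```python
# Reward-learning sketch: fit a model that scores expert-like behaviour higher
# than random behaviour, and use that score as a learned reward signal.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical (state, action) feature vectors.
expert_data = rng.normal(loc=1.0, size=(500, 6))    # demonstrations of desired behaviour
random_data = rng.normal(loc=0.0, size=(500, 6))    # behaviour from a random policy

X = np.vstack([expert_data, random_data])
y = np.concatenate([np.ones(500), np.zeros(500)])    # 1 = expert, 0 = random
reward_model = LogisticRegression(max_iter=1000).fit(X, y)

def learned_reward(state_action):
    # Probability that the behaviour "looks expert-like" serves as the reward.
    return reward_model.predict_proba(state_action.reshape(1, -1))[0, 1]

print("reward for expert-like step:", round(learned_reward(expert_data[0]), 2))
print("reward for random step:     ", round(learned_reward(random_data[0]), 2))
```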
Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) is a burgeoning area that aims to align AI systems more closely with human values and expectations. Instead of relying solely on traditional reward functions, this approach incorporates human feedback into the learning process. It's achieved through methods like preference comparisons, where human evaluators rank different AI behaviors, or through direct feedback where humans correct AI-generated solutions. By incorporating nuanced human judgments, this trend is bridging the gap between machine efficiency and human ethics.
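A core ingredient of RLHF is the reward model trained on pairwise preferences. The sketch below shows the standard Bradley-Terry-style objective, -log sigmoid(r_chosen - r_rejected), on random feature vectors standing in for embeddings of human-labelled response pairs; a full RLHF pipeline would then optimize a policy against this reward model.

```python
# RLHF sketch: train a reward model from pairwise preferences with the
# Bradley-Terry loss, -log sigmoid(r_chosen - r_rejected).
import torch
import torch.nn as nn

torch.manual_seed(0)
reward_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Hypothetical feature vectors for pairs of responses; in a real setting these
# would be embeddings of (prompt, response) pairs labelled by human raters.
chosen = torch.randn(256, 16) + 0.5      # responses humans preferred
rejected = torch.randn(256, 16)          # responses humans rejected

for step in range(200):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final preference loss:", round(loss.item(), 3))
```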
AI-Driven Reinforcement Learning
AI-driven reinforcement learning is an exciting frontier where AI systems guide and optimize the reinforcement learning process. Utilizing AI to set goals, analyze outcomes, or even dynamically modify reward functions, this method pushes the boundaries of autonomous learning. By leveraging the analytical power of AI itself, reinforcement learning from AI feedback is paving the way for more complex and sophisticated learning mechanisms that can adapt and evolve with minimal human intervention.
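A minimal sketch of the idea, under the assumption that some judge model is available: instead of a human rater, a scoring function decides which of two candidate responses is preferred, and the resulting labels could train a reward model exactly as in the human-feedback sketch above. The ai_judge function here is a made-up stand-in, not a real model or API.

```python
# RL-from-AI-feedback sketch: a judge model, rather than a human, decides which
# of two candidate responses is preferred; the labels would then train a reward
# model exactly as in the human-feedback sketch above.
import numpy as np

rng = np.random.default_rng(0)

def ai_judge(response_features):
    # Stand-in for an AI judge; a real system would query a capable model with
    # evaluation instructions. Here: a fixed, hypothetical scoring rule.
    weights = np.array([0.8, -0.2, 0.5, 0.1])
    return response_features @ weights

candidate_a = rng.normal(size=(100, 4))   # features of first candidate responses
candidate_b = rng.normal(size=(100, 4))   # features of second candidate responses

prefer_a = ai_judge(candidate_a) > ai_judge(candidate_b)   # AI-generated preference labels
print("fraction of pairs where A is preferred:", prefer_a.mean())
```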
Interpretability Research
Interpretability in AI refers to the transparency and understandability of AI models. As AI systems become increasingly complex, the need to understand how they make decisions becomes paramount. Interpretability research is focused on developing methods that unravel the black-box nature of AI, allowing for clear insights into decision-making processes. This trend includes techniques like feature visualization, sensitivity analysis, and explanatory models that help humans understand, trust, and manage AI systems.
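As a small example of one such technique, the sketch below computes a gradient saliency map for a toy PyTorch classifier: the magnitude of the gradient of the predicted class score with respect to each input feature indicates how strongly that feature influences the decision.

```python
# Interpretability sketch: a gradient saliency map marks which input features
# most influence a model's output for a given example.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))  # toy classifier
x = torch.randn(1, 8, requires_grad=True)

logits = model(x)
predicted_class = logits.argmax(dim=1).item()
logits[0, predicted_class].backward()        # gradient of the winning class score w.r.t. the input

saliency = x.grad.abs().squeeze()            # larger gradient = more influential feature
print("most influential feature:", saliency.argmax().item())
print("saliency per feature:", saliency)
```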
Conclusion
That evening at OpenAI's headquarters, the tension between Altman, Yudkowsky, and Musk mirrored the larger challenge we face in AI safety. We are witnessing a metamorphosis: a field once considered abstract has become a vibrant research area, still evolving, like a larva on the brink of transformation.
But this evolution requires caution. Companies must not allow profit to overshadow human values. Clear boundaries and tools like Reinforcement Learning from Human Feedback (RLHF) are needed to ensure that AI's development aligns with our best interests.
Furthermore, we need a clear consensus on what an aligned AI should look like for all of us, set out in a kind of manifesto, and then leverage the tools at our disposal, such as RLHF and AI-driven reinforcement learning. This approach not only acknowledges the importance of AI safety but also sets a blueprint for its ethical application, creating a responsible pathway for technological advancement.
The AI safety journey is not just about technology; it's about humanity's shared destiny. Our responsibility is immense, but with ethical considerations, clear guidelines, global joint collaboration, and the right tools, we can guide AI's growth, ensuring that it becomes a force for good in our world.
Thanks for reading.
Sources:
Russell, Stuart. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. 1st ed. Penguin.
Russell, Stuart, and Peter Norvig. 2021. Artificial Intelligence: A Modern Approach. 4th ed., Global Edition. Pearson.
Bostrom, Nick. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press.
AI Safety Fundamentals. https://aisafetyfundamentals.com
Future of Life Institute. 2021. "AI Safety Research." https://futureoflife.org/background/benefits-risks-of-artificial-intelligence
Hendrycks, Dan, Nicholas Carlini, John Schulman, and Jacob Steinhardt. 2022. "Unsolved Problems in ML Safety." arXiv preprint arXiv:2109.13916.
Taleb, Nassim Nicholas. 2001. Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets. New York: Random House.