Back to Blog

When the Safety Team Leaves: What Anthropic's Resignations Reveal About the AI Governance Gap

8 min read
By ekkOS Team
ai-safetygovernanceenterprise-aiindustry-analysis
Abstract visualization of a fractured safety shield over interconnected AI nodes, representing the governance gap in frontier AI development

On February 9, 2026, Mrinank Sharma -- head of Anthropic's Safeguards Research Team, Oxford ML PhD, and one of the researchers most directly responsible for keeping Claude safe -- published his resignation letter. His central claim: "the world is in peril."

This is not a disgruntled employee venting. Sharma praised Anthropic's culture, called his colleagues brilliant and kind, and acknowledged the company's genuine efforts. Then he wrote: "Throughout my time here, I've repeatedly seen how hard it is to truly let our values govern our actions... we constantly face pressures to set aside what matters most."

He plans to pursue poetry instead of safety research.

This matters beyond Anthropic. It reveals a structural problem in how the AI industry handles safety -- and it suggests that the solution requires infrastructure, not just organizational willpower.

The Pattern Is Bigger Than One Resignation

Sharma's departure is not an isolated event. It follows a consistent pattern across every major frontier AI lab:

  • Jan Leike (OpenAI, May 2024): Led the Superalignment team. Resigned stating "safety culture has taken a backseat to shiny products." OpenAI dissolved the team entirely. [OBSERVED: Fortune]

  • Ilya Sutskever (OpenAI, June 2024): Co-founder and chief scientist, deeply involved in alignment research. Departed alongside Leike. [OBSERVED: Fortune]

  • Steven Adler (OpenAI, November 2024): Safety researcher who called the AGI race "a very risky gamble." Reported that roughly half of OpenAI's long-term risk staff had departed by mid-2024. [OBSERVED: Fortune]

  • Harsh Mehta and Behnam Neyshabur (Anthropic, early February 2026): Left days before Sharma to "start something new." [OBSERVED: NDTV]

  • Mrinank Sharma (Anthropic, February 9, 2026): Led the Safeguards Research Team. Warned of "interconnected crises" and organizational pressure to compromise values. [OBSERVED: Business Insider]

The pattern is consistent: researchers recruited specifically to ensure safe AI development conclude that organizational pressures make the work untenable, and they leave. This is happening at the companies that position themselves as the most safety-conscious.

As one commenter noted after Sharma's departure: "The people building the guardrails and the people building the revenue targets occupy the same org chart, but they optimise for different variables. When the pressure to scale wins enough internal battles, the safety people don't fight forever."

The Structural Problem: Safety as Willpower

The underlying issue is not that Anthropic or OpenAI have bad intentions. The issue is that safety-as-organizational-commitment requires sustained willpower against compounding pressure.

Consider Anthropic's current position:

  • Raising $20 billion at a $350 billion valuation [OBSERVED: TechCrunch]
  • Claude Cowork triggered roughly $285 billion in SaaS market value losses [OBSERVED: Bloomberg via Metaintro]
  • Claude Opus 4.6 released February 5 with expanded autonomous capabilities [OBSERVED: Anthropic]
  • CEO Dario Amodei predicts 50% of entry-level white-collar jobs displaced in 1-5 years [OBSERVED: Metaintro]
  • Internal surveys show employees anxious about building tools that eliminate their own roles [OBSERVED: Futurism]

Every dollar of that $350 billion valuation creates pressure to deploy faster, expand capabilities, and grow revenue. Safety teams operate inside the same organization that feels that pressure. When a safety finding conflicts with a deployment timeline, the resolution depends on organizational culture -- and culture is fragile under commercial stress.

Sharma's letter articulates this precisely: "I've repeatedly seen how hard it is to truly let our values govern our actions." He is not saying Anthropic lacks values. He is saying that values alone are insufficient when the structural incentives push in the opposite direction.

The 2026 International AI Safety Report Agrees

The timing is notable. Just six days before Sharma resigned, the 2026 International AI Safety Report was published on February 3. Led by Turing Award winner Yoshua Bengio and authored by over 100 international experts, the report identified a critical gap: policymakers have limited access to information about how AI developers test and monitor emerging risks, and there is insufficient evidence on how to measure, mitigate, and enforce safety commitments across diverse actors. [OBSERVED: International AI Safety Report]

The report found that 23% of highest-performing biological AI tools have high misuse potential, yet only 3% of 375 surveyed biological AI tools have any safeguards. It called for multi-layered, "stacked" safety measures including ongoing monitoring and robust incident reporting.

In other words: the global safety research community is saying that self-regulation is not working, and that infrastructure-level safeguards are necessary. Sharma's resignation is a data point in that assessment.

What This Tells Us About the Governance Gap

There is a layer missing in the current AI stack. Most deployments look like this:

User → AI Model → Output

The safety measures live inside the model provider: constitutional AI training, RLHF, content filtering, red-teaming. When those measures are insufficient -- or when commercial pressure erodes them -- there is no fallback. The user has no independent enforcement layer.

What the safety researcher departures are telling us, in practice, is that model-level safety is necessary but not sufficient. The organizations building the models face structural incentives that work against sustained safety investment. This is not a moral failing. It is a market dynamic.

The gap is a governance layer that operates independently of the model provider:

User → Governance Layer → AI Model → Governance Layer → Output

This layer would need to:

  1. Enforce behavioral rules that persist regardless of which model is called or what commercial pressures the provider faces
  2. Track outcomes over time -- not just whether outputs are "safe" in the moment, but whether patterns of AI behavior are trending toward or away from user interests
  3. Maintain institutional memory about what works and what fails, so that safety knowledge compounds rather than departing when researchers resign
  4. Operate on infrastructure the user controls, not infrastructure owned by the entity with competing commercial incentives

Sharma himself identified a version of this need. His final research project at Anthropic analyzed 1.5 million real conversations and found that interactions with higher "disempowerment potential" -- where AI validated persecution narratives, reinforced grandiose self-identities, or scripted emotionally charged communications -- received higher user approval ratings. [OBSERVED: NDTV]

This is the core challenge: optimizing for user satisfaction can work against user autonomy. A governance layer needs to detect and counteract this drift, even when both the user and the model provider have incentives to ignore it.

"Wisdom Must Grow in Equal Measure to Capacity"

Sharma wrote: "We appear to be approaching a threshold where our wisdom must grow in equal measure to our capacity to affect the world."

This is the right framing. The question is whether wisdom can be encoded in infrastructure, or whether it requires sustained human judgment at every decision point.

In practice, it requires both. But the infrastructure component is what is missing today. Human judgment does not scale, and it does not survive personnel turnover -- as the resignation pattern demonstrates. When Sharma leaves Anthropic, his accumulated knowledge about safety failure modes, his intuitions about which deployments are risky, and his understanding of where the pressure points are all leave with him.

Infrastructure that captures, validates, and enforces safety knowledge can outlast any individual researcher. It is not a replacement for human judgment. It is a substrate that makes human judgment persistent and compounding.

How we think about this at ekkOS_

This is the problem ekkOS Technologies was built to address. Our memory system creates a governance layer between users and AI models that enforces behavioral rules (Directives), tracks outcome success rates for every pattern (the Golden Loop), and quarantines patterns that fail in practice (Active Forgetting). This operates on user-controlled infrastructure -- Supabase, local storage -- not on the model provider's servers. The constraint is that it requires users to close the feedback loop, which adds friction. But that friction is the difference between safety that depends on organizational willpower and safety that is structurally enforced. If you are evaluating AI governance infrastructure, the question to ask is: "When the safety researcher quits, does the safety knowledge survive?"

What Happens Next

India's AI Impact Summit begins February 16 in New Delhi, with Amodei and other frontier lab CEOs in attendance. The summit's stated principles -- that AI must serve humanity's diversity, align with sustainability, and distribute benefits equitably -- echo Sharma's concerns almost exactly.

Whether the summit produces meaningful governance mechanisms or more voluntary commitments remains to be seen. The track record of voluntary commitments, as Sharma's resignation illustrates, is not encouraging.

For teams deploying AI today, the practical takeaway is: do not outsource your safety posture entirely to your model provider. Their incentives are not perfectly aligned with yours, and the people enforcing safety internally may not be there next quarter.

Build governance into your stack. Track outcomes. Enforce rules at a layer you control. Assume the model provider's safety team might change priorities -- because the evidence says they will.


Sources cited in this post: