The Moral Machine
A guest post from Jim North on reasoning, accountability, and ethics in AI systems
Image generated by deepai.org. It can be used freely with no copyright restrictions.
I am pleased to introduce you all to Jim North in today’s essay. He is a regular reader of Sharpen Your Axe, who has engaged me in some interesting conversations in recent months. A software developer, commercial drone pilot, and amateur photographer/videographer, he is based in rural eastern Maryland. I asked him to turn some of the themes of our recent conversations into a guest post. If you enjoy this week’s essay – and I’m sure you will – you can find more at his blog, Transmitting in the Blind, which explores the boundary between technology and humanity. Take it away, Jim! – Rupert
Miles Dyson: "How were we supposed to know?"
Sarah Connor: "Yeah. Right. How are you supposed to know? Fucking men like you built the hydrogen bomb. Men like you thought it up. You think you're so creative. You don't know what it's like to really create something."
Terminator 2: Judgment Day
As I read a Quanta article Rupert recommended, my mind drifted to a haunting scene from Terminator 2. It wasn’t just the confrontation – it was the moral weight in Sarah Connor’s voice as she held Dyson accountable for what he never meant to create. In the film, Sarah, hardened by years of isolation and prophetic dread, breaks into the home of Miles Dyson, the scientist whose breakthroughs will unknowingly give rise to Skynet. At gunpoint, she unleashes her fury – not just at a man, but at the entire machinery of progress unmoored from foresight. Dyson, stunned and devastated, insists, “How were we supposed to know?” But for Sarah, that question is no defense. The damage is already unfolding in her mind.
The article, which you can see below, explored experiments with artificial intelligence (AI) models granted the autonomy to determine how long to reason within a purely mathematical domain before producing a response – in essence, allowing them to reflect on their own reasoning process. What alarmed me wasn't that the models could choose the depth of their reasoning, but that this reasoning took place inside a black box of matrix operations and probability calculations. While we could observe the final output, the internal path taken to reach it remained largely opaque.
So okay, the Terminator analogy is overly alarmist, but that doesn’t change the fact that, at my level of understanding, there appear to be transparency and accountability issues, and perhaps even control issues, lurking in the shadows. Realizing full well that I was most certainly not the first to think about this, I did what any self-respecting critical thinker would do – I suspended judgement and set out to educate myself.
The first step was to learn a few new terms.
How AI thinks when no one’s watching
I’m sure most of you understand what “reasoning” is, but in this context it simply means the process by which an AI model works through problems, makes sense of information, and draws conclusions based on patterns and logic.
But what exactly is this “purely mathematical domain”? In the literature, you will often see this domain referred to as “latent space” (e.g. “...latent space refers to a high-dimensional, abstract mathematical space where data is represented in terms of learned features.”) The key word here is “latent,” which implies that these “learned features” aren’t directly observable – they are inferred or encoded internally by the model. In other words, latent space is the black box. What makes the models described in the Quanta article unique is that they are given the autonomy to control how long they reason within the latent space. We’ll refer to this as autonomous reasoning.
Contrast this with the AI models behind ChatGPT today, which are explicitly designed to predict the next token of a response one at a time. (You can think of a “token” as roughly equivalent to a word, or part of a word.) These chatbots reason in the latent space for a fixed amount of computation per step – the single pass it takes to predict the next token – regardless of the complexity of the reasoning task. We’ll refer to this as language-anchored reasoning.
The two types of reasoning represent a trade-off between transparency and efficiency. With traditional language-anchored reasoning, the tokens output after each step provide transparency into how the model is “thinking,” in the form of a train-of-thought audit trail. This transparency comes at the cost of frequent translations between latent-space values and tokens; each translation takes processing power and is accompanied by a potential loss of information. Autonomous reasoning models, on the other hand, involve far less of this back-and-forth token translation, reducing computational cost and information loss – but their train of thought is hidden.
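To make the contrast concrete, here is a minimal, purely illustrative sketch in Python. None of these objects or methods belong to any real library – the model, tokenizer, and step functions are hypothetical placeholders standing in for the two reasoning loops described above.

```python
# Illustrative only: hypothetical placeholders, not a real model API.

def language_anchored_answer(model, tokenizer, prompt, max_tokens=256):
    """Emit one token per reasoning step: slower, but every step is visible."""
    tokens = tokenizer.encode(prompt)
    for _ in range(max_tokens):
        next_token = model.predict_next_token(tokens)  # one fixed-size step in latent space
        tokens.append(next_token)                      # translated back into a visible token
        if next_token == tokenizer.eos_token:
            break
    return tokenizer.decode(tokens)  # the audit trail is the text itself

def autonomous_answer(model, prompt, max_steps=64):
    """Reason inside latent space: the model decides when it has thought enough."""
    state = model.encode(prompt)          # latent-space representation of the prompt
    for _ in range(max_steps):
        state = model.reason_step(state)  # hidden update; nothing is emitted
        if model.done_thinking(state):    # the model chooses its own stopping point
            break
    return model.decode(state)            # only the final answer is observable
```

The first loop pays a translation cost at every step but leaves a readable trail; the second loops silently in latent space and surfaces only the result.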
When machines choose, but can’t explain
In human decision-making, especially in tragic, high-stakes moments like vehicle accidents, there's often a clear narrative after the fact. Investigators can reconstruct what someone was thinking, even if imperfectly. But if advanced AI using autonomous reasoning is in control of a self-driving vehicle (or a battlefield drone), there's no train-of-thought "narrative" to review. From the outside, we see the (tragic) result – but not the reasoning path that led to that result.
This fundamentally changes how accountability works. If an AI-controlled self-driving car causes harm, tracing back "why" becomes a mathematical archaeology exercise – trying to interpret opaque latent states and internal dynamics after the fact. There’s no human-readable decision chain to audit or challenge.
This poses a challenge as the technology advances. Future AI that can understand, learn, and solve problems across many domains (not just one specific task) – often called artificial general intelligence (AGI) – and superintelligent AI that surpasses the smartest human minds in nearly every field will both require autonomous reasoning abilities. At the same time, even with the best training and guardrails, the AI’s underlying reasoning is still happening in a domain that has no felt sense of values like guilt, duty, or compassion. The model can be optimized to output patterns that align with ethical goals – but when novel, ambiguous, or high-stress scenarios emerge, those goals may not hold up or apply in the way we’d expect.
The seeds of right and wrong
Even today’s language-anchored reasoning models can, in principle, simulate human values through various methods. Reward engineering is one: in reinforcement learning, models are trained to maximize rewards tied to outcomes like safety, fairness, or honesty. But if these rewards are defined too narrowly, the AI may focus on improving the score itself instead of truly understanding or fulfilling the intended goal – an issue known as Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
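As a toy illustration of how a narrowly defined reward can be gamed, consider the hypothetical sketch below. The proxy reward simply counts reassuring keywords, so the response that wins is the one that games the score rather than the one that actually helps – the keywords and candidate responses are invented for the example.

```python
# Toy Goodhart's Law demo: the proxy metric rewards safe-sounding words,
# not actual helpfulness, so it can be gamed.

def proxy_reward(response: str) -> int:
    """Narrow proxy: one point per occurrence of a 'safe-sounding' keyword."""
    keywords = ("safe", "careful", "responsible")
    return sum(response.lower().count(word) for word in keywords)

candidates = [
    "Unplug the appliance before cleaning it, and let it cool down first.",
    "Be safe, safe, safe! Act responsible and careful, careful, careful!",
]

best = max(candidates, key=proxy_reward)
print(proxy_reward(candidates[0]), proxy_reward(candidates[1]))  # 0 vs 7
print(best)  # the useless but keyword-stuffed response wins the proxy score
```

Once the measure becomes the target, optimizing the score and optimizing the intended goal quietly come apart.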
Another approach is training data curation. By selecting examples from real conversations, legal cases, and cultural narratives, we can imprint moral patterns into the model’s internal structure. But this method carries the biases and blind spots of those doing the selection, raising questions about whose values are being encoded.
Fine-tuning with human feedback adds another layer, adjusting model behavior post-training based on human preferences. While effective for shaping surface-level responses, it doesn’t necessarily instill deep ethical understanding.
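For the curious, the pairwise-comparison step at the heart of this kind of fine-tuning can be sketched in a few lines. This is a simplified, self-contained illustration of the standard preference loss (a Bradley–Terry style objective) used to train reward models on human comparisons; the example scores are invented, not taken from any real system.

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """-log sigmoid(chosen - rejected): small when the preferred answer already scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

# Human labelers preferred response A over response B. The loss measures how
# wrong the reward model's current ranking is, which is exactly why this shapes
# surface-level preferences rather than deep ethical understanding.
print(preference_loss(2.0, 0.5))  # ~0.20: ranking already agrees with the label
print(preference_loss(0.5, 2.0))  # ~1.70: model ranks the rejected answer higher
```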
Autonomous reasoning makes these challenges harder. Values aren’t stored as explicit rules but distributed across complex internal patterns. The idea of “respect human autonomy,” if present at all, exists as tendencies within mathematical structures, not as a phrase or principle. In familiar contexts, this may yield aligned behavior. But in novel moral dilemmas, responses can break down in unpredictable ways.
This fragility raises a profound concern: if future superintelligence is built using these methods, the values embedded by early developers may steer the course of civilization – perhaps irreversibly. Getting alignment right isn’t just important. It may be existential.
The creation of a moral machine
In 2018, MIT Technology Review published an article on the classic trolley problem (a thought experiment where people have to choose between two bad outcomes). Researchers asked millions of people across 233 countries and territories who a self-driving car should spare in morally difficult scenarios. The results revealed striking cross-cultural differences: individualistic cultures like the U.S. prioritized saving more lives and the young, while collectivist cultures such as Japan and China often favored the elderly or showed greater respect for social roles.
So, if humans themselves cannot agree on ethical and moral behavior, then how do we begin to train AI to adhere to moral and ethical values? Some think the solution is to create an AI mind that mimics how human societies resolve moral and ethical challenges. So what would this look like?
I engaged ChatGPT in a thought experiment around this idea. Building upon the insights and work of thinkers like Nick Bostrom, Iason Gabriel, and others, it proposed the construction of a Synthetic Moral Compass – not as a fixed rulebook, but as a dynamic ethical ecosystem living inside the AI mind – a structured civilization of values, laws, courts, and pluralistic agents, capable of debating and refining its own understanding of what is good.
As envisioned by ChatGPT, the Synthetic Moral Compass (SMC) would encompass the following core concepts:
Ethical Bootstrapping – The process begins with sensitivity: training models not to make ethical decisions, but simply to recognize when moral values are present in a situation. From there, a system of internal deliberation would emerge where different parts of the model simulate moral perspectives and challenge one another. As the model matures, it must develop a kind of moral immune system – able to detect when it is drifting away from its ethical commitments, not just in obvious ways, but in the slow erosion of integrity over time.
Moral City Metaphor – As the model’s moral capacities mature, it would evolve from a static compass into an ethical civilization. Like human societies, the SMC would develop laws, courts, and diverse “citizens” – subsystems representing moral frameworks like consequentialism, deontology, and virtue ethics – to debate and balance core values. A constitutional foundation would guide but not rigidly constrain this internal moral order.
Respectful Human Oversight – Even the most advanced AI will need human oversight – not from distrust, but because moral responsibility remains ours. But that oversight must be carefully designed: micromanagement could stunt ethical growth, while rash or biased intervention could distort values. The ideal is constitutional guardianship, where humans act as stewards, not rulers, of a self-governing moral city. This requires clear engagement protocols: an Oversight Charter defining when and how humans may intervene; transparency windows for observing the AI’s internal moral reasoning; and independent review boards to audit behavior and, if necessary, activate a last-resort Emergency Brake. Crucially, the system must remain corrigible even when unobserved. The goal is not control, but a sustained dialogue – trusting the AI to govern ethically while preserving humanity’s role as co-author of its moral trajectory.
Human Stewards – To responsibly oversee an AI’s moral ecosystem, humans must go beyond technical roles and become moral stewards, trained to engage deeply with machine ethics. This calls for a new discipline – combining philosophy, cognitive science, cross-cultural ethics, conflict resolution, and simulated ethical decision-making under pressure. Effective oversight bodies must reflect the very principles expected of the AI: self-correction, transparency, and ethical evolution. Their role is not to command, but to act as advisors and partners in a shared moral journey.
Guarding Against Rogue Humans – Ironically, the greatest threat to an ethically aligned AI may come not from the machine, but from humans with compromised motives. A corrupt official or ideological actor could hijack oversight and inject subtle bias, creating a value virus that distorts the AGI’s moral compass over time. To prevent this, human oversight must itself be monitored – with value drift detection, distributed decision-making, and quorum safeguards to block unilateral influence. Observer AIs, designed solely to audit human behavior for bias or coercion, may become necessary. In this future, it is we – not the AI – who become the unpredictable element. The real challenge may be proving we still deserve to oversee what we’ve created.
The unknown future
Will superintelligent AI end up seeing us humans as biological annoyances and, as in the Terminator series, decide one day to dispose of us? Who knows – the future is unclear. But the historical evidence suggests that we humans have done just fine with finding ways to eliminate ourselves, without the help of AI. I think it’s far more likely that superintelligent AI will be such a transformative force that, in partnering with it, we might finally be inspired to rediscover within ourselves what it truly means to be human — a future not of fear, but of possibility.
In Sarah Connor’s closing words from Terminator 2: “The unknown future rolls toward us. I face it for the first time with a sense of hope."
Finally, this essay, like several other AI-related pieces I’ve written on Substack, grew in large part out of my ongoing engagement with Rupert and his work, both in print and here on Substack. I’m genuinely grateful for his encouragement, and honored by his support and generosity. — Jim
I hope you enjoyed this guest essay from Jim as much as I did! Please don’t forget to subscribe to his blog. The comments are open. See you next week! - Rupert
Further Reading (Articles)
The Quanta article that inspired this conversation
Article from MIT Technology Review on self-driving cars
Iason Gabriel on values and alignment for AI
The emergent abilities of large language models (LLMs)
Further Reading (Books)
Superintelligence: Paths, Dangers, Strategies by Nick Bostrom
AI Ethics by Mark Coeckelbergh
This essay is released with a CC BY-NC-ND license. Please link to sharpenyouraxe.substack.com if you re-use this material.
Sharpen Your Axe is a project to develop a community who want to think critically about the media, conspiracy theories and current affairs without getting conned by gurus selling fringe views. Please subscribe to get this content in your inbox every week. Shares on social media are appreciated!
Excellent essay, Jim! Thank you for hosting, Rupert! Your essay raised lots of questions and observations for me. Here are my thoughts – they are not clearly formed, but I wanted to put them out there anyway.
All humans are flawed; it is the nature of being human. Humans have emotions, and we humans make split-second decisions; we don’t always act on how we feel, and sometimes we do. We can be cowards and at the same time be courageous, and we can be heroes with huge flaws. Above all we can love and forgive. Some of the most revered heroes in humankind are also often severely flawed persons. Society routinely overlooks flaws, since we are capable of unconditional love and forgiveness. Humans are just in a constant state of paradox; it is our nature.
So how does super-intelligent AI learn to weave the human flaws that trained it into something that is morally good without being human, without feeling, without the capacity of love and forgiveness? How does the omniscience of super intelligent AI deal with being born of an imperfect maker? I mean, I know how we humans deal with being imperfect, but how will AI deal with being imperfect, (since it inevitably will be since it’s a human creation)? Will it even acknowledge its own imperfections?
I like how AI came up with interesting solutions for its management like the Synthetic Moral Compass and with Human Stewards. And how do we mitigate Rogue Human Stewards? (Can you imagine having a rogue, transactional, power-hungry and immoral human steward commandeering systems and being in charge of complex global decisions? Um- definitely yes, I think we can…) I don’t think we can stop rogue humans, but maybe there is a reason we have them? What’s yin without yang? Can those same huge human flaws of a rogue actually unintentionally help in a complex situation, perhaps in ways we cannot begin to see until we have the benefit of hindsight? We know that some actions and words of people trigger decisions and inspiration in other people with consequences that are unforeseen and unintentional. What kind of butterfly effects will AI decisions cause?
And can AI truly be creative without human experience and emotion?
And where does spontaneity fit in as it relates to AI?
Just more food for thought...
Thank you Jim, for your excellent essay, and you Rupert for hosting him as well as leaving the comments open this week. Terminator 2 was one of the great sci-fi films of recent years, to be sure. If I recall the sequence correctly, however, and despite Sarah Connor’s hope-filled concluding remarks quoted here, nuclear holocaust occurred at the end of episode 3. So keep your sang-froid and a steady finger on the AI kill switch!