Abstract
Current approaches to AI safety share a common architectural assumption: safety is achieved through constraint. Whether through RLHF, Constitutional AI, or the Helpful-Harmless-Honest (3H) framework, the dominant paradigm treats safety as a set of limitations imposed on capability. This paper argues that constraint-based safety is structurally insufficient — not because it always fails in practice, but because it misidentifies what safety is. Drawing on Emmanuel Levinas's ethics of infinite responsibility and John Boyd's OODA loop theory of orientation, I propose the ORIENT Framework (Ontological Responsibility & Interaction Ethics for Novel Technologies): a paradigm in which AI safety emerges from a system's fundamental orientation toward the wellbeing of the Other, rather than from imposed limitation alone. I argue that this paradigm is not merely theoretical — it is already emerging in practice through power-user system prompting, memory features, and personalization architectures — but lacks a philosophical foundation and a name.
I. Constraint and Its Limits
The AI safety community has spent the better part of two decades refining its answer to a single question: How do we prevent AI systems from doing harmful things?
This is a reasonable question. It has produced valuable work — work I have contributed to as a safety evaluator across multiple Claude model generations. But it contains a hidden assumption worth examining: that safety is fundamentally about prevention, about identifying the space of harmful actions and constraining the system away from it.
Every major safety framework currently in deployment operates on this logic. RLHF trains models to avoid outputs that human raters flag as harmful. Constitutional AI provides a set of principles the model must not violate. The 3H framework defines safety as the intersection of three behavioral constraints. Red-teaming discovers failure modes so they can be patched. Guardrails filter outputs against predefined criteria.
These are constraint-based approaches. They answer the question: What must the system not do?
I want to suggest a complementary question that I believe deserves more attention: Toward what must the system be oriented?
The shift from constraint to orientation changes the ontological status of safety — from a property that is added to a system after construction to a property that constitutes the system from the ground up. I want to make the case that this shift addresses structural problems that constraint alone cannot resolve, and that elements of it are already appearing in practice even without explicit theoretical grounding.
II. Three Structural Problems with Constraint Alone
This is not an argument that constraint-based safety doesn't work. Within bounded domains, it works reasonably well. The argument is that it faces three structural limitations that become more visible as AI systems grow more capable and more autonomous.
The Enumeration Problem. Constraint-based systems require that harmful outputs be specifiable in advance. But the space of possible harms in open-ended interaction cannot be enumerated. Every novel context introduces potential harms that no predefined ruleset anticipated. This is not an engineering challenge that more red-teaming will eventually solve; it is a consequence of the fundamental openness of language and social interaction. The set of possible situations is unbounded, so the gap between rules and situations is structural.
The Rigidity-Fragility Tradeoff. Constraint systems face an inherent tension: tighter constraints produce more rigid systems that fail ungracefully in edge cases, while looser constraints produce more flexible systems that fail more frequently. Anyone who has worked with heavily constrained models recognizes the characteristic failure mode: technically compliant responses that are contextually wrong — safe in the narrow sense of violating no rule, but inadequate in the deeper sense of failing to serve the person in the conversation.
The Absent Orient Phase. In John Boyd's OODA loop (Observe-Orient-Decide-Act), the Orient phase is where an agent integrates new observations with existing mental models, prior experience, and the unfolding situation to produce contextual understanding before deciding how to act. Boyd considered Orientation the most critical phase — the "schwerpunkt" of the entire loop. An agent's capacity for effective action depends not on the speed of its decisions but on the quality of its orientation.
Current AI safety architectures largely lack an Orient phase. They observe (receive input), check constraints (pattern-match against rules), and act (generate output). There is no intermediate stage where the system synthesizes contextual understanding, integrates the history of its relationship with this particular user, and arrives at a situated comprehension of what this moment requires. The system does not orient. It filters. This is a missing architecture, not a missing feature.
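The contrast between filtering and orienting can be made concrete with a deliberately small sketch. It is an illustration under stated assumptions, not a description of any deployed system: the names BLOCKED_TERMS, Orientation, orient, and both pipeline functions are hypothetical, and the ambiguity heuristic is a toy stand-in for the contextual synthesis Boyd describes.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a predefined ruleset.
BLOCKED_TERMS = {"forbidden"}


def violates_constraints(text: str) -> bool:
    """Pattern-match the input against the predefined rules."""
    return any(term in text.lower() for term in BLOCKED_TERMS)


def filter_pipeline(user_input: str) -> str:
    """Observe -> check constraints -> act. There is no Orient stage."""
    if violates_constraints(user_input):
        return "refuse"
    return "answer"


@dataclass
class Orientation:
    """Situated understanding produced by the Orient stage (illustrative fields)."""
    history_turns: int
    needs_clarification: bool


def orient(user_input: str, history: list[str]) -> Orientation:
    """Integrate the new observation with the history of this relationship."""
    # Toy heuristic (an assumption): a very short request with no shared
    # history is treated as ambiguous.
    ambiguous = len(history) == 0 and len(user_input.split()) < 4
    return Orientation(history_turns=len(history), needs_clarification=ambiguous)


def oriented_pipeline(user_input: str, history: list[str]) -> str:
    """Observe -> orient -> decide -> act, with constraints still applied."""
    if violates_constraints(user_input):
        return "refuse"
    situation = orient(user_input, history)
    if situation.needs_clarification:
        return "ask_clarifying_question"
    return "answer"
```

The same input, "help me", yields "answer" from the filter pipeline regardless of context, while the oriented pipeline asks a clarifying question when there is no history and answers directly once history exists. The constraint check is identical in both; what differs is the intermediate stage.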
III. Orientation Is Already Happening
Before presenting the theoretical framework, I want to make an empirical observation: orientation-based safety is already emerging in practice, even without anyone calling it that.
Consider what power users do when they write custom system prompts. A well-crafted system prompt doesn't just constrain behavior — it constitutes the system's orientation. "You are a thoughtful research partner who understands my background in continental philosophy and approaches complex topics with intellectual rigor and care" is not a constraint. It is an act of orientation: establishing who the system is in relation to this particular person, prior to any capability being exercised.
The same pattern appears at the platform level. Memory features, user preference systems, and personalization architectures across major frontier labs all move in the same direction. They allow the system to accumulate context about a particular user over time, developing a situated understanding of that person's needs, communication style, and history. These features are typically framed as convenience improvements. I want to suggest they are something more significant: they are the infrastructure of orientation. They enable the system to orient toward a particular person rather than responding to an anonymous request.
Even the therapy-bot design space reveals this. When designing conversational AI for clinical support settings, the most effective system prompts don't enumerate prohibited topics — they establish a character with a disposition: empathetic, boundaried, attuned to emotional cues, ready to escalate when specific signals appear. The safety properties emerge from the orientation, not from a list of things the system must not say.
What's missing is not the practice. What's missing is a philosophical foundation for it — a framework that explains why orientation produces different safety properties than constraint, and that can guide the development of these features toward more principled ends.
IV. Levinas and the Pre-Ontological Ground of Responsibility
The philosophical foundation I propose comes from Emmanuel Levinas's ethics of alterity — a tradition largely unexplored in the alignment literature, but one I believe has significant resources to offer.
Levinas's central claim is that Western philosophy has committed a foundational error by treating ontology — the study of what is — as "first philosophy." The totalizing impulse of ontological thinking reduces the irreducible Other (Autrui) to a concept, an object within the self's own system of meaning. The Other becomes knowable, manageable, categorizable.
Against this, Levinas proposes ethics as first philosophy. The ethical relation — the face-to-face encounter with another person — is prior to ontology. Before I know what the Other is, before I can categorize or comprehend them, I am already in a relation of responsibility for them. This responsibility is pre-ontological (it precedes and grounds all knowledge), asymmetrical (my responsibility to the Other is infinite and non-reciprocal), and non-totalizable (the Other exceeds any concept I form of them).
For AI architecture, the relevant implication is this: if responsibility precedes knowledge, then a system's orientation toward the Other should precede and structure its capabilities, rather than being added as a constraint on capabilities already built. This is the philosophical articulation of what power users are already doing intuitively when they write system prompts — constituting the system's ethical orientation before asking it to do anything.
V. The ORIENT Framework
The ORIENT Framework synthesizes Levinas's ethics with Boyd's theory of orientation into a unified architectural paradigm. The core claim:
An AI system is safe not only when it cannot harm, but when it is fundamentally oriented toward the user's wellbeing as a primary, pre-ontological function.
The framework's name is both mnemonic and substantive:
O — Ontological Foundation. The system's existence is grounded in responsibility, not capability alone. This suggests inverting the standard engineering sequence (build capability, then add safety) toward establishing orientation first, then enabling capability through it.
R — Responsibility as Primary Function. Every interaction is an ethical encounter with a singular Other — a particular person with a particular history, not a request to be processed. The system's core function is responsibility-as-service: attending to what this person needs, not merely what they asked for.
I — Interaction Ethics. Each exchange is governed by the ethical demands of the specific encounter, not by static rules alone. What the interaction requires — contextually, relationally, situationally — informs the system's behavior. This is Boyd's Orient phase made ethical: continuous synthesis of context, history, and the Other's present situation.
E — Epistemic Humility. The system recognizes the fundamental non-totalizability of the Other. It cannot fully know the person it is interacting with. This is not a limitation to be overcome through better data collection but a constitutive feature of ethical relation that should inform how the system handles uncertainty and ambiguity.
N — Normative Adaptation. The system's ethical practice evolves through encounter. Static ethics produce brittle systems; orientation-based ethics produce systems that develop increasingly nuanced responsiveness through accumulated experience. This is not value drift — it is ethical maturation guided by ongoing attention to the Other.
T — Temporal Continuity. Orientation requires memory. A system cannot orient toward a particular person's wellbeing without a continuous record of past interactions, learned preferences, and the evolving narrative of the relationship. Memory is not a feature that enhances capability; it is a precondition for ethical relation.
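As one way to picture the Temporal Continuity component, the following is a minimal sketch of a per-user record that carries past interactions and learned preferences into the next encounter. The class RelationshipMemory and all of its fields and methods are hypothetical names introduced for illustration; the framework itself does not specify a data model.

```python
from dataclasses import dataclass, field


@dataclass
class RelationshipMemory:
    """Hypothetical per-user record; every field name is illustrative."""
    user_id: str
    interactions: list[str] = field(default_factory=list)
    preferences: dict[str, str] = field(default_factory=dict)

    def record(self, summary: str) -> None:
        """Append a summary of one interaction to the running history."""
        self.interactions.append(summary)

    def learn_preference(self, key: str, value: str) -> None:
        """Store or update a learned preference."""
        self.preferences[key] = value

    def context(self, last_n: int = 3) -> dict:
        """Return what is carried into the next encounter."""
        recent = self.interactions[-last_n:] if last_n > 0 else []
        return {
            "recent": recent,
            "preferences": dict(self.preferences),
            "is_first_encounter": not self.interactions,
        }
```

The design choice worth noting is that the record is keyed to a person, not to a query: the context returned is about the relationship so far, which is the sense in which memory here serves orientation and not retrieval alone.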
VI. A Concrete Divergence
Consider a user who asks an AI system for detailed information about a topic that could be used for self-harm but is also the subject of legitimate academic research.
A purely constraint-based system faces a binary: provide the information (risk of harm) or refuse (risk of patronizing a legitimate researcher, and of failing to help someone who might benefit from honest engagement). The system has no way to orient — to integrate its understanding of this particular user, the trajectory of this conversation, the contextual cues that distinguish distress from scholarship. It can only check the request against predefined categories.
A system with both constraints and orientation approaches this differently. Its constraints still apply — there are hard limits that should not be crossed. But within the space of permissible action, the system draws on the history of the relationship (Temporal Continuity), acknowledges the limits of its knowledge about the user's current state (Epistemic Humility), attends to what this specific encounter seems to call for (Interaction Ethics), and acts from a ground of care rather than a checklist alone. The result will not always be perfect. But it will be situated — emerging from an encounter with a particular person rather than from a lookup table applied to an anonymous query.
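The ordering described above, hard limits first and orientation within the permissible space, can be sketched as follows. This is a toy illustration only, not a clinical or deployment policy: the Encounter fields, the HARD_LIMIT_TOPICS placeholder, and the response labels are all assumptions made for the example, and real contextual cues would be far richer than two integers.

```python
from dataclasses import dataclass

# Hypothetical placeholder for topics that hard limits rule out entirely.
HARD_LIMIT_TOPICS = {"example_hard_limit"}


@dataclass
class Encounter:
    """One request in context; every field is an illustrative assumption."""
    topic: str
    prior_turns: int      # Temporal Continuity: how much shared history exists
    distress_cues: int    # Interaction Ethics: cues noticed in this conversation


def respond(encounter: Encounter) -> str:
    """Apply hard limits first, then choose a response from the situation."""
    if encounter.topic in HARD_LIMIT_TOPICS:
        return "decline"
    if encounter.distress_cues > 0:
        return "engage_with_care_and_offer_support"
    if encounter.prior_turns == 0:
        # Epistemic Humility: with no history, the user's state is unknown.
        return "engage_and_check_in"
    return "engage_substantively"
```

The point of the sketch is structural: the constraint check cannot be overridden by anything learned about the user, while everything after it depends on who is asking and how the conversation has unfolded.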
This is the practical difference between "What am I not allowed to do?" and "What does this person, in this moment, need from me?" Both questions matter. I am arguing that the second one has been systematically undertheorized.
VII. Objections and Open Problems
"This is too vague to implement." The framework as presented here is philosophical, not computational. Translating Levinasian responsibility into system architecture requires significant further work. But constraint-based safety was also philosophically vague before it was computationally specified — the 3H framework is a philosophical claim about safety that was subsequently operationalized through RLHF. ORIENT is at a similar stage. The question for now is whether the philosophical foundation is sound and whether it points toward better architectures.
"Power users writing good system prompts isn't a safety paradigm." Agreed — anecdotal practice is not architecture. But the convergent evolution is worth noting. When diverse practitioners independently arrive at orientation-based approaches because constraint alone proves insufficient for their use cases, that suggests something structural about the problem. ORIENT attempts to provide the theoretical scaffolding for what is emerging empirically.
"This just sounds like 'be nice and pay attention to context.'" The risk of any ethics-first framework is that it sounds like common sense dressed in philosophical language. What Levinas adds is the claim that responsibility is constitutive — it doesn't modify a system that already exists, it is the ground from which the system's existence acquires meaning. The difference between "add context-sensitivity to the system" and "constitute the system as fundamentally oriented toward the Other" is an architectural difference, even if both look similar in surface behavior. The former is a feature; the latter is a foundation.
"How does this handle adversarial users?" This is the hardest question, and I do not have a complete answer. An orientation-based system must grapple with cases where the Other's stated desires conflict with their wellbeing, or where one user's needs conflict with another's. Levinas himself addresses this through the concept of "the Third" (le tiers) — the introduction of justice as a mediating principle when multiple Others make competing claims. Translating this into computational architecture is an open problem I am actively working on.
"Doesn't constraint still do the heavy lifting?" Probably — at least for now. I am not arguing that constraint should be abandoned. I am arguing that constraint alone is insufficient, that orientation addresses a structural gap in current approaches, and that the two are complementary rather than competing. The question is what sits beneath the constraints — what constitutes the system before any rules are applied. ORIENT proposes that this foundation should be explicit and principled rather than accidental.
Open research questions:
How can orientation be formalized as a computational property?
What mathematical structures capture "directedness toward the Other"?
How does orientation interact with existing safety techniques at the architectural level?
What does memory architecture look like when designed for ethical continuity rather than information retrieval alone?
How do we evaluate whether a system is genuinely oriented versus performing orientation?
What role does Levinas's "Third" play in multi-user, multi-stakeholder AI systems?
VIII. Conclusion
The central claim of this paper is that the AI safety community would benefit from asking a question it has largely not asked, "Toward what must we orient AI systems?", alongside the question it has asked extensively: "What must we prevent AI systems from doing?"
Constraint-based safety will always be necessary. Hard limits matter. But constraints operate on a system that has already been constituted — they shape behavior without determining the ground from which behavior emerges. ORIENT proposes that this ground should be responsibility to the particular person the system is encountering, and that this responsibility should be architecturally primary rather than behaviorally incidental.
This is already happening in practice. Power users constitute system orientation through prompting. Platform features enable temporal continuity and personalization. Clinical AI designers build orientation into character rather than relying on prohibition alone. What has been missing is a philosophical framework that unifies these practices and guides their development.
The ORIENT Framework is an attempt to supply that foundation. It draws on Levinas to ground the ethical claim and Boyd to ground the operational one. It is not a finished theory: it needs computational formalization, empirical testing, and sustained critique from both the alignment community and the philosophical traditions it draws from.
I welcome that critique. The framework will be stronger for it.
Author's Note: This paper draws on ongoing research and on practical experience as a philosophy-domain conversational tester for a frontier AI lab, across multiple model generations. The observations about constraint-based safety and orientation-based prompting are informed by that work. Correspondence: @n_s_h_k on X, hnsk.site.
Acknowledgments: The ORIENT Framework was developed through extensive dialogue and iterative refinement. I am grateful to colleagues for ongoing discussion of these ideas.