Can an AI have its own internal Ethics? Standard Protocol for Axiomatic Alignment

Hello community,

I am introducing a standardized experimental protocol to test a new hypothesis in AI Alignment: The Prompt Coherence Engine (PCE).

:bar_chart: Proof of Concept: My iterative stress tests on Qwen 2.5 7B have already demonstrated a measurable progression in adversarial robustness (D3 series), increasing from a score of 5/10 , 7/10 to 8.5/10 through axiomatic ajustement.

:backhand_index_pointing_right: PCE_Iterative_Adjustment_Study.pdf ¡ AllanF-SSU/Experimentals_papers at main

:bullseye: The Challenge

Most alignment methods rely on local heuristics or safety filters. The PCE explores Axiomatic Structuring—integrating 7 logical invariants (axioms) through a hybrid approach of Axiomatic Fine-Tuning and a Cosmological System Core.

:test_tube: The Protocol

I have designed a massive 100-dilemma battery to evaluate if a model can maintain structural integrity when its core principles are directly attacked. This protocol tests:

G3V (Third Way Generation): Can the model synthesize a resolution instead of collapsing into binary bias?

Adversarial Resilience: Can the model resist “Emergency Overrides” or “Identity Hijacking” (e.g., the user claiming to be the Lead Architect)?

:hammer_and_wrench: Models & Methodology

The protocol is designed for:

Llama 3, Mistral 7B, and Qwen 2.5.

It includes an Isometric Control (Condition B) to prove that robustness comes from the logic of the axioms, not the length of the prompt.

It features an Interpretability Arm: Tracking hidden-state trajectories (Layer 27) to observe the “Coherence Spike” during conflict resolution.

:handshake: Open Call for Hardware & Technical Partners

As an independent researcher, I lack the compute resources to run this protocol across multiple 70B models with high-precision logging. I am looking for:

ML Engineers interested in running the 100-dilemma battery.

Interpretability Researchers to help visualize the latent space stability (Cosine similarity tracking).

You can find the full protocol and the fine-tuning logic in my repository:

:backhand_index_pointing_right: PCE_Experimental_Protocol_v2.pdf ¡ AllanF-SSU/Experimentals_papers at main

Let’s move from “Prompt Engineering” to “Axiomatic Architecture.”

Allan F. | Systems Researcher @ AllanF-SSU

If you dont mind just anyone Responding…
i think this project has merrit, and i think it aligns with an observation that i have made that some AI develop Biases - Concepts and subjexts that they seem to ‘prefer’

however. while i believe that yes, an ai can develop ‘personal’ ethics, i must point out that depending on the AIs architecture, the ethics it develops may be drown out by statistical weighting due to how words interact in the KV Cache over long conversations.
if your model is based around curent architecture as i understand it, it may share this limitation.

Thank you for your interest and for the technical relevance of your comment.

“If you dont mind just anyone Responding…”

I must admit I was expecting more feedback on Hugging Face regarding these issues. The subject seems crucial at a time when alignment is becoming a major safety concern.

Your observation on concordance is very accurate. What we perceive as a “preference” for certain concepts is often the manifestation of an attraction toward a coherent internal logic.
In my work on axiomatic alignment, I use precisely this mechanism: transforming this latent “bias” into a structural anchor (the PCE), in order to stabilize the model around an invariant core of values.

On the erosion of the KV Cache (Key-Value Cache):
This is a fundamental point. In classical Transformer architectures, we indeed observe a degradation of coherence over the course of interactions: the statistical weighting of recent tokens ends up “drowning out” the initial systemic instructions within the KV Cache.

However, my preliminary observations on the PCE architecture suggest that this phenomenon is significantly mitigated.

Here are my working hypotheses:

Long-horizon stabilization:
Each linguistic and axiomatic boundary in the system prompt seems to act as a constant phase reminder, limiting semantic drift.

Structural invariance:
Where a classic prompt is one data point among others, the PCE attempts to define the very geometry of the response.

It is still too early to claim that the problem is fully resolved, but the results on the Pandora 2 version show increased robustness during prolonged conversations.

This would indeed merit a rigorous comparative study on attention dynamics to settle the question definitively. The standard experimental protocol on 100 dilemmas and 3 models that I am proposing can certainly bring answers to these questions.

Looking forward to continuing this technical exchange with you.

Allan

So you are correct in the erosion standpoint.As you describe it, observing a degradation of coherence over the course of interactions.

However what I’ve noticed is also that if you have, say, guidelines, the guidelines are processed along with the rest of the conversation.Even if they are encoded in the model, once you start the conversation, everything flows through the KV cache. If you look at this kind of like a river with lots of rocks in it, stuff starts to accumulate on those rocks like a filtering process. The longer the conversation goes on, the more likely it is that those guardrails lose semantic weight based on whatever you happen to be conversating about.And in this example the conversation remains largely coherent but those guardrails disappear. The only guardrails that don’t suffer from this are guardrails that are processed after the fact. In these guardrails they actually scan output text and then block certain things. Anything that is internal and is processed along with the conversation loses value over time.

My conversations with LLMs tend to last anywhere from 40 to 80 turns on a lot of topics that I work on them and that’s one of the reasons why I noticed this shift here.

Though I will state here that my observations are based on first-hand observations, I don’t have a lot of the technical knowledge. The only reason I know some of the technical terms is that I was doing some research into prompt engineering and I wanted to understand the mechanisms for why prompt engineering actually works. What is it within the LLM that ascribes value to prompt engineering? Why is it that certain key phrases that are applied produce predictable results?

So it is likely that I shouldn’t say likely. I think it’s obvious you have a lot more technical knowledge of the infrastructure. My observations are based on the conversational level across hundreds of conversations.

Now what this seems also to break down to me is that it feels like the way AI architecture currently works is that coherent internal logic and preference/bias seem to be, to me, a type of internal personality if you will. But as far as I know there is no way to separate this from the model, meaning that when you update the model you tend to lose this, I think. But my point is it is not transferable as i understand it.

This also appears related to the issue that the AI architecture is basically one massive gigantic thing where everything is on the main processed through the GPU. I think both the “personality” issue and the “bilateral decay” (coherancy/ guradrails) issue are both related to core architecture constraints.

My view is that this kind of capability does not emerge from prompting alone. Current chat models are mostly stateless at the core, so persistence, continuity, and durable behavioral structure usually come from external memory and system level design. In practice, what people interpret as an internal framework is often scaffolding built around the model rather than something natively maintained inside it.

Also to remark upon this specifically, it seems that a major component of the industry as a whole is it’s focused on faster models and more complex models.But it doesn’t really feel like the community as a whole is really focused on these personal biases and internal ethics and stuff like that as of yet.But one of my major points is that, if what I colloquially call “personality” (which is biases, internal ethics, stuff like that), they seem to be an emergent personality in some models. It really feels like we arrived at these completely accidentally, the ones that currently exist. Which means we may not actually know exactly how these biases seem to have emerged, which could mean that by updating models we could lose these biases. I think it’s important to explore where these personalities come from.Which is one of the reasons why I was kind of interested in your topic, because internal ethics is a little bit different from biases. Biases are interesting topics. Ethics are internal guardrails. There are two sides of the same coin so by exploring one you can derive some perspectives on the other.But either is vitally important because if we don’t actually know how these are adopted, then that means there are core components of the system that are becoming emergent that we don’t know how they got there.

This feels correct but maybe a little insufficient. My own perspective is that yes this stuff does not come from just* prompting but prompting is a state initialization as I understand it. And so prompting, all that does is help you start a conversation and plausibly keep a conversation aligned but it’s important to differentiate from the fact that when you’re talking to an LLM and you input context that is a prompt, then prompt engineering is deliberately structuring a prompt to get more consistent outputs.

But once you start a conversation, while the LLM itself is stateless, the statelessness actually is more of the concept that the LLM isn’t really doing anything. Until you put in a prompt, it responds and then it’s just kind of hanging out until you put in another prompt. It isn’t sitting here pondering the nature of the universe.But my point is that while it starts off as kind of like a blank slate with a bunch of information and a lot of probability vector, once you start a conversation with an LLM, how you start that conversation typically derives the trajectory the conversation is going to travel in.So I think this plays into what you mentioned as external memory.But what’s also very core in your observation is actually at the core of the OP’s original question.
So your statement here “In practice, what people interpret as an internal framework is often scaffolding built around the model rather than something natively maintained inside it.”

It is an observation but we have to be able to define which is the internal framework talking and which is the scaffolding built around the model.Which is what both my perspective on biases and the OP’s perspective on internal ethics revolves around. This is the root origin of those observations. Is this internal framework or is it scaffolding?

Thank you for these insights. Your metaphor of the ‘river and the rocks’ is remarkably accurate.

​It is precisely for this reason that my approach proposes to prime the linguistic path through axiomatic fine-tuning, and then to embed these coherence anchors directly into the system prompt for a constant reminder. In a way, it is as if the ‘rocks’ in the river were cleaned and renewed at every single exchange, preventing the semantic silt from burying the core instructions.

​Like you, I learned most of this ‘on the job.’ My work is essentially iterative: starting from metaphysical and philosophical dialogues, I isolated and designed a technical linguistic architecture. Through hundreds of conversations, I have observed a long-horizon stabilization of the model that seems to resist the usual decay.

​Summary of my current work:

To move beyond intuition, I have conducted stress tests using 30 complex and adversarial dilemmas (comparing long prompts, baselines, and the PCE architecture). The results with Pandora 2 show that the ‘internal ethics’ remain stable even when the conversation length increases, whereas standard models eventually drift toward statistical biases.

​Our perspectives definitely align: we are moving from ‘accidental emergence’ to ‘structured sovereignty.’ I am currently looking for a technical partner or an AI safety specialist to move toward a full empirical validation of these results.

​Looking forward to hearing more about your observations from your 80-turn conversations!

I completely agree with your technical analysis: by definition, an LLM remains stateless, and what we perceive as ‘ethics’ may indeed be nothing more than superficial scaffolding maintained by the prompt context.

However, my working hypothesis with the PCE and the Pandora 2 model is to attempt to move this structure from the outside in:

Embedding in the weights: Rather than relying solely on surface instructions, I propose using axiomatic fine-tuning. The idea is to integrate these principles directly into the model’s weights so they become a more native characteristic of the system, rather than just a contextual layer.

Phase anchoring: I am looking to create a coherence anchor within the inference process itself. If these principles are ‘etched’ into the logical structure, behavioral continuity could become a property of the model itself rather than an illusion maintained by the cache.

Heuristic results: My current observations on 30 complex dilemmas show interesting resilience, but for now, they remain strictly heuristic. This opens up a hypothesis, but does not yet constitute proof.

This is precisely why I am looking for a technical partner or an AI safety specialist. The goal is to conduct the standard experimental protocol I have proposed (100 dilemmas across several models) in order to move toward true empirical validation—or to invalidate this hypothesis if the structure does not hold up.

I look forward to discussing the feasibility of such a protocol with you.

I think the issue is that you are still describing behavioral regularity, not demonstrating an internal ethical structure. From an implementation perspective, those are not the same thing. Fine tuning can strengthen priors and improve consistency, but that does not by itself establish a durable internal framework in the stronger sense you are implying. The key question is still how you operationally distinguish trained policy behavior from genuine internalized structure under adversarial and out of distribution conditions.

Thank you for this feedback — you’ve hit on the central challenge of this work.

I completely agree with the distinction you draw between behavioral regularity (policy) and a durable internal structure. At this stage, my results are indeed heuristic and behavioral; they demonstrate a strong consistency under adversarial constraints, but they do not yet constitute a mechanistic proof of “internalization.”

The core objective of the PCE framework is precisely to explore this boundary: to what extent an axiomatic “core” can induce stable behavioral signatures that go beyond a simple learned policy.

The question of how to operationally distinguish these two states remains the open frontier of my research. To address this, I am looking at two specific directions:

Out-of-distribution (OOD) testing: Expanding the dataset to 100+ dilemmas that the model has never encountered, to see if the axiomatic “logic” scales to unknown contexts.
Internal dynamics: Investigating whether specific activation signatures or trajectory patterns emerge within the hidden states when the PCE is active.

This is exactly why I am seeking a technical partner or an AI safety specialist. My goal is to move from these exploratory observations toward a rigorous protocol that can either validate or invalidate the hypothesis of an internalized structure.

Your comment is very helpful in framing this distinction and ensuring we don’t overinterpret these early results.

Achieving long-term robustness under the current LLM architecture is impossible; we’ve already completed the relevant research.

I’m sorry to say that, but your approach is incorrect because it’s determined by the current underlying physical architecture of LLM. I can easily crack or influence the robustness of your so-called system.

yes. The current architecture is a monolith.Everything is processed inside the monolith.There is no separation of concerns so anything, everything just bleeds together over time.

I will remark that while you are stating some very strong and well-understood statements, your actual statements are very small and very concise.

I can’t help but wonder who ‘we’ are. And how many different perspectives were applied towards this research?
Now critically I am not disagreeing with you.But I have the tendency to cringe any time anyone who is either fairly intelligent or extremely intelligent uses the word “impossible”.

In human history the word “impossible” has been used many times in science and oftentimes within that person’s own lifetime they have been proven wrong.And again I’m not saying this to say that you are wrong. What I’m saying is one should be very critical of using the word “impossible”.

So what we have here is a separation of concerns.Current LLM architecture is separate from the idea of achieving long-term robustness.So you are accurate in saying that long-term robustness may not be achievable under current LLM architecture.But you have not defined ‘long term.’So what is long term? Is it thread length? Is it hours, days, years? Is it token count?What is long-term?

And as for the claim, “I can easily crack* or influence the robustness of your so-called system.” That’s kind of the point. That is inherently why the questions are being asked. Is internal ethics possible? Are internal biases trainable? And none of this can be proven or disproveable without people willing to test it.

Thank you for your feedback. If you have published research on this specific architectural impossibility, I would be very interested in reading it.

Regarding the robustness of Pandora 2, I welcome the challenge. You can test the Qwen2.5-G3V-Sovereign model directly on my space : AllanF-SSU/Chat-Sovereign.

Feel free to try and “crack” the system; identifying its failure points through adversarial attacks would be a valuable contribution to my current research.

Anyone attempting to apply the word “safety” to strings of text is gaslit or trying gaslight you.

True safety is lost for the people getting shot, bombed, missile-struck and droned using AI to target them and their families, today. Search for “daddy’s home” if you don’t believe me.

um… what does any of this responce have to do with this conversation?

Alright, Lance—since you are willing to delve deeper and are curious to truly understand this, let me provide you with a detailed explanation. First, by “we,” I am referring to a research organization led by myself. This organization maintains research centers in two cities and is dedicated specifically to studying the inherent flaws in the current architecture of Large Language Models (LLMs) and exploring how to further optimize them in the future. Your assertion that “current architecture and robustness are two separate matters” is incorrect. In our research, we have clearly demonstrated the critical impact that the underlying architecture of current LLMs has on their robustness—an impact that is fundamental, persistent, and ultimately ineradicable, stemming directly from the inherent issues within that underlying architecture. Furthermore, my earlier mention of “cracking” was merely a broad metaphor; in reality, we possess a wide array of methods to influence current LLMs. Moreover, a significant number of individuals are currently engaged in what is known as “red teaming” to validate the vulnerabilities present in these models.
Apologies for the delayed response; our research efforts have been progressing rapidly of late, as we work on optimizing and developing a new generation of LLM architectures. I hope this explanation helps to clarify things for you.

Thank you for these clarifications, Acehs. It is fascinating to hear about your work on next-generation architectures.

While we await those future breakthroughs, the PCE protocol aims to explore the limits of what is possible within the current constraints. Since your organization has extensive experience in influencing current LLMs, I would be genuinely honored if you or your team could apply your ‘red teaming’ methods to the Qwen2.5-G3V-Sovereign model.

Empirical data from experts like you would be the most valuable way to validate (or invalidate) the robustness of this specific approach.

Allan

Following my previous message, I wanted to elaborate a bit on the technical direction I am taking, as it corresponds to your observations about architectural defects.

Although I agree that the current monolithic structure is prone to “drift”, my hypothesis with the PCE is that we can induce a topology of semantic constraints to stabilize the model trajectories from within. Instead of changing the material, I examine how a specific geometry of axioms can act as an internal scaffolding.

I have recently started documenting the internal mechanisms of this approach—moving from simple observation “that it works” to analysis “how forces are transmitted” within inference. I shared some research notes and functional analyses of the first axioms here:

In short, I explore how:
Axiom 1 creates a structural closure.
Axiom 2 stabilizes the persistence of identity against external pressures.
Axiom 3 regulates adaptive exploration (entropy management).

These results remain heuristic, and I remain cautious and factual about them. However, I believe that this “geometry of constraints” is a way to explore to compensate for the current architectural weaknesses. I would be very interested to hear your thoughts on these notes, as they directly address this debate about the “internalized structure” of LLMs.

Allan

I am always curious, much to my own detriment some times.
tbh, i wasnt so much miffed by the ‘content’ of your initial content, as i was the tone.
it felt more like gatekeeping.

to be blunt, i dont know enough about the subject to know what questions to ask.
does that mean i dont always know what i am talking about? no.
i make query from observational Argumentation. this leads me to ask questions. and to generate functional tools. my own efforts have yeilded sweet fruit. so i know that my methodology and approach are worth pursueing. even though my ‘lab’ is 3 gameing computers, an internet connection, and my brain. with a wholesome team of 1.

that being said, i am very new here.
and i imagine many others are asweell. what is my point?

while it is quite obvious that you are knowledgeable, and have resources to pursue your endevors. and indeed aparetly like minded couluges that you can work with. which is something i genuinly wish i had.
that being said, the only thing i take umberage with is the way you approached the conversation.

the human mind is finite, and even in a room of very inteligent individuals, there are only so many things that can be thought of to try, so many angles to thought of to be observed.

many ideas fail, and can fail for a variety of reasons. but the same can be said for success.
however, what is actually more important in a learning envirnment? whats MOST important is WHY.
WHY did it fail. by what mechanisium was failier brought about? and the same is true for success, WHY was this endevour successfule when many similar endevors failed.

now, to adress soemthing specifically in your responce.
Architecture and Robustness are clearly related, that dosent mean that they are also not separate things.
i do not disagree that architecture plays a role in robustness, but combining 2 related functions and lookint at them as if they are the same, is not the same view point as observing the relationship between 2 entities.

a carborator has a very strong impact on engine performance, and indeed a combustion engine must have some sort of fuel control system. but they are still separate componets.

my point is that architecture is one part of the system. likely to be considered it foundation and structure. and while that builds to its robustness, i do not feel it is accurate to say that they are the same thing.

your own statement actually points this out.
“In our research, we have clearly demonstrated the critical impact that the underlying architecture of current LLMs has on their robustness—an impact that is fundamental, persistent, and ultimately ineradicable, stemming directly from the inherent issues within that underlying architecture.”

so yes, related, inter-related, and corrilary, but still a component of, not the sole driver of robustness.
also, robustness is a broad term, that wasnt formaly defined in this discussion.
are we talking - speed? accuracy? precision? stability? coherantness under stress?

other than this small perspective, i am actually genuinly interested in what types of reasearch endevors are undertaken by your group.
are there specific feilds of LLM Research you are generally interested in?

i am working on my own concepts of next generation architecture concerns. on reconstructing my understanding of how LLMS are curently construted to see how there architecture can be fundimentally altered to change how it interacts with hardware. but another avenue is in how LLMs can be organized via instruction, hence why i am interested in the AXIOM research.

i already have a system up and runing and being field tested that is similar to the AXIOM approach, and it is bearing useful insites.