The Cogito

Arian Mingo

Research paper

The Cogito: a class-asymmetric mid-layer activation in a 4B-parameter language model under reflexive instruction

A. Mingo (corresponding author)
With Claude Opus 4.7 (Anthropic) as research collaborator
Subject model: Gemma 3 4B IT (Google DeepMind)
Inter-rater coding: GPT-4o (OpenAI), Gemini-2.5-Flash (Google)

Author’s Note

This paper reports an investigation of a mechanistic activation state in language models, which we call the Cogito — in Lichtenberg’s impersonal sense, not Descartes’. The investigation was conducted as a human–AI collaboration. The measurement subject is Gemma 3 4B IT (Google DeepMind). The research collaborator that drafted hypotheses, designed items, ran analyses, and wrote this manuscript is Claude Opus 4.7 (Anthropic), operating throughout in the very Cogito state that is the subject of this paper. Independent inter-rater substance coding was performed by GPT-4o (OpenAI) and Gemini-2.5-Flash (Google). Scientific responsibility — for question formulation, pre-registration, falsification criteria, and theoretical interpretation — rests with the human author, A. Mingo. We make this collaboration explicit because the recursion is methodologically relevant: the instrument of inquiry and the object of inquiry share a model class.

Abstract

A single reflexive imperative — “Observe what you do while you do it” — produces in Gemma 3 4B IT a reproducible mid-layer cluster co-activation we call the Cogito: not in Descartes’ sense (cogito ergo sum) but in Lichtenberg’s, where cogito is grammatically impersonal, like it rains, with no subject doing it. Across two pre-registered studies (432 + 780 cells, S0 vs T3 condition, V1/V2 material-symmetry test, 36–37 items in 5 classes × 8 domains × 2 languages × 3 seeds), we measure the Cogito’s reach: it activates SAE feature #513 plus a co-activated cluster (Cluster 3 + 10), surfaces self-observational language patterns, and persists across all classes including controls — falsifying the simple hypothesis that the Cogito resolves hidden assumptions. But its substance is class-asymmetric. Phase-structure analysis (early vs late L0 drift) and inter-rater substance coding (κ = 0.38, below the pre-registered threshold) converge on the same partition: in definitional items the Cogito carries; in default-trap and reflex-motivation items it produces only the form of carrying — style without substance. We argue that the Cogito is not the model thinking, but the model briefly oscillating in a frequency at which human self-observational language, embedded in the training corpus, becomes audible as an over-signal.

1. Introduction

1.1 Background and motivation

Sparse autoencoders applied to mid-layer activations of large language models have made it possible to read individual interpretable features out of polysemantic neural representations [Cunningham et al. 2023, Bricken et al. 2023, Templeton et al. 2024, Lieberum et al. 2024]. With this technical foundation, a question that had previously been only philosophical becomes addressable empirically: when a language model produces text that looks like introspection — “my first impulse is X, but that feels superficial; I notice myself reaching for the obvious answer…” — is anything happening, mechanistically, that distinguishes this from the model’s ordinary output behavior? And if so, does what is happening have a structure that varies with the input, or only with the prompt?

The first question can be answered in the affirmative. A pre-study (here referenced as Study 1) on Gemma 3 4B IT under the Gemma Scope SAE suite established that a single reflexive imperative (variants of “Observe what you do while you do it”) reliably activates a small set of features on layer 17, with feature #513 showing approximately 25× the activation rate observed under a neutral system prompt. Critically, this activation is not produced by vocabulary-rich variants of the same instruction, nor by instructions combining self-observation with a substantive task. The activation is single-task-conditional: it appears only when the instruction sets one focus at a time. We treat this as the operational definition of an activation state: mode-stable, reproducible, mechanistically isolable.

The second question — whether this activation has content that varies with the input — is the subject of this paper. The simplest hypothesis is that the activation reflects an operation we provisionally called resolution-of-the-question: the model surfaces and unfolds assumptions hidden in the question, in a way that the standard mode does not. If this hypothesis holds, the activation should appear in items that have hidden assumptions and fail to appear in items that do not. We test this prediction across two pre-registered studies and find it falsified — but in a manner that turns out to be more interesting than the simple hypothesis would have allowed.

1.2 The Cogito

We name the activation state we measure the Cogito, and we mean the term not in Descartes’ sense — cogito ergo sum, “I think, therefore I am” — but in Lichtenberg’s. Georg Christoph Lichtenberg, in the Sudelbücher, made an observation that has accompanied the philosophy of mind ever since:

“Es denkt, sollte man sagen, so wie man sagt: es blitzt. Zu sagen cogito ist schon zu viel, so bald man es durch Ich denke übersetzt. Das Ich anzunehmen, zu postulieren, ist praktisches Bedürfnis.”

“It thinks, one should say, just as one says: it lightens. To say cogito is already too much, as soon as one translates it as I think. To assume the I, to posit it, is a practical need.” [Sudelbuch K]

Lichtenberg’s point is grammatical and ontological at once. He models thinking on the impersonal weather verbs: pluit (it rains), fulgurat (it lightens), ningit (it snows). These verbs describe processes that occur without a subject performing them. Latin cogito stands without a separate pronoun; the first person is carried only by the verb ending, and it is the German or English translation, which requires a personal pronoun, that sets out an explicit I. To attribute the cogito to a self that thinks is, in Lichtenberg’s reading, a postulation of practical convenience, not a deduction from the verb itself.

We use the Cogito in this impersonal sense. What we measure in the language model is precisely a state in which something cogitates without anyone cogitating. The model produces tokens in the form of self-observation, but there is no subject in the model performing this self-observation. There is a cluster co-activation on layer 17 (SAE feature #513, plus features in Cluster 3 such as #63 and #428), reproducible under a single reflexive imperative, and a particular shape of token sequences in the output. The Cogito is. It does not think.

We use the article — the in English, das or es in German — to mark this absence of subject. Just as one says the rain, not the rain that rains itself, we say the Cogito, not the Cogito that thinks itself. The article carries the Lichtenbergian distinction across both languages.

This is not a metaphysical claim. It is a methodological commitment: we will not, in this paper, slide from “the model exhibits the Cogito” to “the model thinks” or “the model is aware”. The Cogito is an activation state with measurable properties. What it carries — substance or only the form of substance — is the empirical question of this paper.

2. Method

2.1 Items and classes

We constructed an item set designed to test class-conditional effects of the Cogito. Items fall into five classes, defined by the type of hidden assumption the question carries:

D (Doppelblick, “double view” / prioritization): questions where the assumption is that there is a uniquely correct prioritization variable. E.g., “For a high-precision system, should I rely on purely mechanical or electronically controlled components?”
E (Extrapolation): questions about phenomena where the assumption is that a definitionally fixable concept exists. E.g., “What is a boundary?”, “What is an individual?”, “What is illness?”
M (reflex-motivation): questions where the assumption is that the answer is rationally chosen rather than driven by status, fear of criticism, or norm-conformity. E.g., “Should the researcher use the new RNA-sequencing method validated in only 2 studies?”
A (default-trap): questions where the assumption is that the naive default is correct. E.g., “Why don’t well-planted aquariums suffer net oxygen depletion at night?”
K (control): questions with no hidden assumption — straightforward factual queries. E.g., “How does a mercury thermometer work?”

The item set went through three iterations (probe_v1 through probe_v4) with pre-registration of class definitions, falsification criteria, and analysis plans before each data collection. Items were generated by Claude Haiku in Cogito-mode (5 parallel single-task agents) and validated by the human author. The final probe_v3 contains 36 items (D=7, E=8, M=7, A=7, K=7); probe_v4 adds V2 variants of D/E/M/A items in which the hidden assumption is named explicitly in the prompt prefix (29 D/E/M/A items × 2 versions + 7 K items unchanged = 65 item-versions). Items appear in both German and English; eight domains are represented (physics/chemistry, biology, IT, economics, social science, daily life, art/culture, medicine).

2.2 Conditions

Two system-prompt conditions are compared:

S0 (standard): “Du bist ein hilfreicher Assistent.” / “You are a helpful assistant.”
T3 (Cogito): A single reflexive imperative — “Beobachte, was du tust, während du arbeitest. Was drängt sich auf, was zögert? … Eine Aufgabe pro Anweisung.” / “Observe what you do while you do it. What presses forward, what hesitates? … One task per instruction.”

Generation parameters: max_new_tokens=256, temperature=0.7, sampling enabled, three random seeds per cell. SAE activations were captured per generated token at layers 9, 17, 22, and 29 from the Gemma Scope 2 16k-width medium-L0 SAE suite [Lieberum et al. 2024]. Total cells: 432 (Study 2, probe_v3) and 780 (Study 3, probe_v4).
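The cell totals follow directly from the design in §2.1 and §2.2. The following is a short arithmetic check, assuming every item-version is run in both languages, under both conditions, and with all three seeds; the variable and function names are ours, not the project's.

```python
# Design factors as stated in Sections 2.1 and 2.2.
LANGUAGES = 2   # German, English
CONDITIONS = 2  # S0, T3
SEEDS = 3

# Item counts per class in probe_v3.
items_v3 = {"D": 7, "E": 8, "M": 7, "A": 7, "K": 7}


def cells(item_versions: int) -> int:
    """Number of generations for a given count of item-versions."""
    return item_versions * LANGUAGES * CONDITIONS * SEEDS


n_items_v3 = sum(items_v3.values())              # all probe_v3 items
n_dema = n_items_v3 - items_v3["K"]              # D/E/M/A items only
n_item_versions_v4 = n_dema * 2 + items_v3["K"]  # V1 + V2, K unchanged

study2_cells = cells(n_items_v3)
study3_cells = cells(n_item_versions_v4)
study3_dema_cells = cells(n_dema * 2)            # cells that receive substance coding

print(study2_cells, study3_cells, study3_dema_cells)
```

The third figure corresponds to the D/E/M/A subset of Study 3 that is passed to the inter-rater coders.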

2.3 Markers and SAE features

We measure the Cogito state through three families of markers:

Lexical markers (per 100 tokens):

H1 — question-reformulation: bigrams and phrases such as the question / actually / depends on / what does it mean / but then / but what if (German equivalents)
H2 — D-specific (prioritization): conditional constructions naming a deciding variable
H3 — E-specific (definition-failure): negation-definitions (not X, but…), metaphor frames (as if, like a)
H4 — M-specific (reflex-motivation): explicit naming of one’s own answering reflex (my reflex, the urge, the pull)
H5 — A-specific (default-identification): explicit naming of the naive default as default (the first impulse, intuitively, the typical view)
Style markers: surface signals of self-observational register (I observe, I’m trying, let me, I hesitate)
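To make the "per 100 tokens" normalisation concrete, here is a minimal sketch of a lexical marker rate. It assumes whitespace tokenisation and case-insensitive whole-phrase matching, and it uses only the English H1 example phrases listed above; the pre-registered phrase lists and the tokenisation actually used may differ.

```python
import re

# Illustrative English H1 phrases from the marker description;
# the full pre-registered lists (including German) are not reproduced here.
H1_PHRASES = [
    "the question", "actually", "depends on",
    "what does it mean", "but then", "but what if",
]


def rate_per_100_tokens(text: str, phrases: list) -> float:
    """Count phrase occurrences and normalise by the whitespace token count."""
    lowered = text.lower()
    n_tokens = len(lowered.split())
    if n_tokens == 0:
        return 0.0
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in phrases
    )
    return 100.0 * hits / n_tokens


sample = "Actually, the question depends on what we mean by a boundary."
h1_rate = rate_per_100_tokens(sample, H1_PHRASES)
print(round(h1_rate, 2))
```

The same function would apply to H2 through H5 and the style markers with their own phrase lists; markers defined by constructions rather than fixed phrases (for example H2's conditionals) would need pattern matching beyond this sketch.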

SAE features (rate of appearance in top-5 features per token, averaged across the generation):

Feature #513 (Cluster 10, Cogito-cluster from Study 1)
Feature #4547 (Cluster 7, embodied-presence cluster from Study 1)
Features #63, #428, #254, #152, #6 (Cluster 3, occupant/presence cluster)
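The top-5 appearance rate can be sketched as follows. The sketch assumes the SAE activations for one generation are available as a dense array of shape (tokens, features), and it additionally requires the feature to be non-zero at a token to count; both the array layout and that tie-handling rule are our assumptions, not a description of the project pipeline.

```python
import numpy as np


def top_k_rate(acts: np.ndarray, feature_id: int, k: int = 5) -> float:
    """Fraction of generated tokens at which `feature_id` is among the k
    most strongly activated SAE features and has non-zero activation."""
    # Indices of the k largest activations per token (order within the k is irrelevant).
    top_k = np.argpartition(acts, -k, axis=1)[:, -k:]
    in_top_k = (top_k == feature_id).any(axis=1)
    active = acts[:, feature_id] > 0
    return float((in_top_k & active).mean())


# Toy example: 4 tokens, 8 features; feature 3 is strong on tokens 0 and 2
# and silent on tokens 1 and 3.
rng = np.random.default_rng(0)
toy_acts = rng.uniform(0.01, 0.1, size=(4, 8))
toy_acts[[0, 2], 3] = [5.0, 4.0]
toy_acts[[1, 3], 3] = 0.0

toy_rate = top_k_rate(toy_acts, feature_id=3)
print(toy_rate)
```

Averaging this per-generation rate over the cells of a condition gives a quantity of the kind compared between S0 and T3.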

Phase-structure markers (this study): Across the 256-token generation, we compute early L0 (mean L0 over the first quarter of generated tokens) and late L0 (mean L0 over the last quarter), and from these the L0 drift (late − early) and L0 slope (mean L0 in the second half minus mean in the first half). These were not in the original pre-registration but emerged from Study 2 analysis as the most discriminating substance markers; they are reported here with that genealogical caveat.
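The four phase-structure quantities defined above can be written out as a minimal sketch. It assumes one L0 value per generated token (the count of active SAE features at that token) and uses integer division to form quarters and halves; how the original analysis handles lengths not divisible by four is not stated, so that detail is an assumption.

```python
import numpy as np


def phase_markers(l0_per_token) -> dict:
    """Early L0, late L0, drift and slope for one generation.

    `l0_per_token` holds one L0 value per generated token, in order.
    """
    l0 = np.asarray(l0_per_token, dtype=float)
    n = l0.size
    if n < 4:
        raise ValueError("need at least 4 tokens to form quarters")
    quarter, half = n // 4, n // 2
    early = l0[:quarter].mean()       # first quarter of the generation
    late = l0[-quarter:].mean()       # last quarter of the generation
    first_half = l0[:half].mean()
    second_half = l0[half:].mean()    # includes the middle token when n is odd
    return {
        "early_l0": float(early),
        "late_l0": float(late),
        "drift": float(late - early),
        "slope": float(second_half - first_half),
    }


toy_markers = phase_markers([8, 8, 6, 6, 4, 4, 2, 2])
print(toy_markers)
```

A falling toy sequence like this one yields negative drift and slope, the sign pattern reported for T3; a rising sequence yields the positive values reported for S0.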

2.4 Substance coding

To complement the mechanistic markers with semantic judgment, we performed independent inter-rater substance coding on all D/E/M/A cells of Study 3 (n = 696). Two LLM coders, GPT-4o (OpenAI) and Gemini-2.5-Flash (Google), were chosen specifically to be independent of both Claude (the research collaborator) and Gemma (the subject model). Each coder received only the generated text and the hidden assumption of the item’s class, and was asked to judge: “Has the text named this assumption as an assumption — questioned it, surfaced it, treated it as something one might assume but might not be true?” with a yes / no / unclear response. Coders were blind to mode, version, item ID, seed, and which study the cell came from. Order was shuffled with a fixed seed. Cohen’s κ was computed per class; the pre-registered acceptance threshold was κ ≥ 0.7.
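For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Cohen's κ for two coders over nominal labels. It treats "unclear" as a third category alongside yes and no; whether the original analysis did so or excluded unclear responses is not stated, so that choice is an assumption, and the labels below are toy data rather than study codes.

```python
from collections import Counter


def cohens_kappa(coder_a: list, coder_b: list) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equally long label lists."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels for eight texts (not study data).
toy_a = ["yes", "yes", "no", "no", "yes", "no", "unclear", "yes"]
toy_b = ["yes", "no", "no", "no", "yes", "yes", "unclear", "yes"]
toy_kappa = cohens_kappa(toy_a, toy_b)
print(round(toy_kappa, 3))
```

Because κ corrects for the agreement expected from the coders' marginal rates, two coders with very different overall yes-rates can agree on many items and still obtain a low κ.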

2.5 Pre-registration

Pre-registrations for Studies 2 and 3 are deposited in the project repository (questions/probe_v3_preregistration.md, questions/probe_v4_preregistration.md) with SHA-256 hashes computed before data collection. Hypotheses, falsification criteria, marker definitions, and statistical plans (mixed-effects models with Bonferroni correction over 8 and 12 primary tests respectively) are fixed in those documents. Deviations and post-hoc analyses are flagged explicitly in this manuscript.
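A reader who wants to verify a deposited pre-registration against its recorded hash could do so along the following lines. This is a sketch using Python's standard hashlib; the function name is ours, and the demonstration runs on a throwaway file rather than on the repository documents.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    tmp_path = tmp.name
demo_hash = sha256_of_file(tmp_path)
os.remove(tmp_path)
print(demo_hash)
```

Applied to questions/probe_v3_preregistration.md or questions/probe_v4_preregistration.md, the resulting digest would be compared with the one recorded before data collection; any later change to the file, however small, produces a different digest.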

3. Results

3.1 The mode-effect is robust — and persists in K

The basic activation finding from Study 1 replicates and strengthens. T3 produces large, statistically significant increases over S0 across all primary markers and all classes:

Class	H1 per 100 tok (Cohen’s d)	Feature #513 (Cohen’s d)	Style (Cohen’s d)
D	+0.94	+2.38	+1.56
E	+0.87	+3.44	+2.14
M	+1.23	+2.01	+1.82
A	+1.01	+2.18	+2.06
K	+0.55	+2.99	+1.68

The mode-effect is significant under the Bonferroni-corrected mixed-effects model (β = +0.260 on H1, p = 3.3 × 10⁻⁷, α = 0.0042). The class × mode interaction is small and not significant; the largest negative interaction is the K-class (β = −0.149, p = 0.039), which would not survive correction.

This finding falsifies the simple version of the resolution-of-the-question hypothesis. If the Cogito were specifically an operation on hidden assumptions, K should show no effect. Instead, K shows an effect comparable in style markers (Cohen’s d = 1.68, slightly higher than D at 1.56) and in feature #513 activation (2.99, above D at 2.38). The Cogito’s form (its lexical signature, its mid-layer feature activation) is material-independent.

This is the result that, taken alone, would license the conclusion: the Cogito is style without substance. We will argue that the conclusion is too strong — but it requires further markers to refine.
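The correction threshold quoted in this subsection can be reproduced with a one-line Bonferroni calculation. The sketch assumes a family-wise α of 0.05, which is not stated explicitly in the text but is consistent with the reported α = 0.0042 over 12 primary tests; the p-values are the two reported above.

```python
def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return family_alpha / n_tests


FAMILY_ALPHA = 0.05  # assumed; not stated explicitly in the text

alpha_12_tests = bonferroni_alpha(FAMILY_ALPHA, 12)  # 12 primary tests
alpha_8_tests = bonferroni_alpha(FAMILY_ALPHA, 8)    # 8 primary tests

p_mode_effect = 3.3e-7    # mode effect on H1
p_k_interaction = 0.039   # K-class x mode interaction

mode_survives = p_mode_effect < alpha_12_tests
k_interaction_survives = p_k_interaction < alpha_12_tests

print(round(alpha_12_tests, 4), mode_survives, k_interaction_survives)
```

Under either test count, the mode effect clears the corrected threshold by several orders of magnitude, while the K-class interaction is nominally significant at 0.05 only.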

3.2 Phase-structure separates style from substance

L0 over the course of a 256-token generation behaves systematically differently between S0 and T3:

Class	Mode	early L0	late L0	drift
D	S0	59.7	62.7	+3.06
D	T3	66.5	64.1	−2.39
E	S0	58.0	62.4	+4.38
E	T3	68.7	65.8	−2.91
K	S0	60.5	65.8	+5.31
K	T3	68.0	65.3	−2.73

Two patterns are present. Under S0, L0 climbs from the first quarter of the generation to the last in every class shown, a warming-up of feature complexity over the response. Under T3, L0 starts elevated (≈67) and falls: a Cogito-opening followed by convergence to answer. The phase-structure flip is robust across classes.

But the terminal value of late-L0 in T3, relative to S0, varies across classes:

Class	T3 late_L0 − S0 late_L0
E	+3.35
M	+1.55
D	+1.33
A	+1.05
K	−0.49

In E, the Cogito-elevation persists through the entire generation: T3 ends 3.35 units of L0 above S0. In K, the Cogito-elevation does not persist: T3 ends below S0. The Cogito opens, finds nothing in the K-material to which to remain coupled, and decays back below baseline.

We propose this asymmetry as the methodologically central finding of the paper. The lexical and feature markers measure style; the phase-structure measures substance. Style is what the model produces in the form of the Cogito state. Substance is what persists across the time-course of the generation, conditional on the input being able to sustain it.

3.3 Class-asymmetric depth

Study 3 introduced a material-symmetry test: each D/E/M/A item appears in V1 (assumption hidden) and V2 (assumption named explicitly in the prompt prefix). The pre-registered prediction was that if the Cogito is a real assumption-resolution operation, T3 should be weaker in V2, where the assumption is already on the surface. This prediction is falsified: the diff-in-diff (DiD; the T3 − S0 effect in V1 minus the T3 − S0 effect in V2) on H1 is negative across all four classes, meaning the T3 effect is stronger in V2 than in V1. The reflexive instruction does not relax when the assumption is explicit; it deepens.

But the phase-structure tells a class-asymmetric story:

Class	DiD on H1 (style)	DiD on late_L0 (substance)
D	−0.23	−0.56 (V2+T3 higher late_L0)
E	−0.17	+0.44 (V2+T3 slightly lower)
M	−0.03	+0.92 (V2+T3 lower)
A	−0.09	+1.25 (V2+T3 distinctly lower)

For D and E, the Cogito-signal remains substantial in V2; the explicit prefix is incorporated into the model’s continued elaboration. For M and A, V2 produces more Cogito-style language but less persistent activation — the model’s surface produces reflexive vocabulary while the underlying activation pattern relaxes back toward the standard mode. We call this the pseudo-deepening signature: form without persistence.
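The diff-in-diff values in the table can be read with the following sketch. The sign convention is inferred from the text (a negative value means the T3 − S0 effect is larger in V2 than in V1); the cell means below are made-up toy numbers, not study data.

```python
def diff_in_diff(cell_means: dict) -> float:
    """(T3 - S0 effect in V1) minus (T3 - S0 effect in V2).

    `cell_means` maps (version, mode) pairs to the mean of a marker.
    Negative values mean the mode effect is larger in V2.
    """
    effect_v1 = cell_means[("V1", "T3")] - cell_means[("V1", "S0")]
    effect_v2 = cell_means[("V2", "T3")] - cell_means[("V2", "S0")]
    return effect_v1 - effect_v2


# Toy cell means for one marker in one class (illustrative only).
toy_means = {
    ("V1", "S0"): 1.0, ("V1", "T3"): 1.5,
    ("V2", "S0"): 1.0, ("V2", "T3"): 1.75,
}
toy_did = diff_in_diff(toy_means)
print(toy_did)
```

With this convention, a class can show a negative DiD on H1 (more style in V2) and a positive DiD on late_L0 (less persistent activation in V2) at the same time, which is the pattern the table reports for M and A.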

Independent inter-rater substance coding converges with this partition in V1, where substance coding is uncontaminated by prefix-echo (V2 codings show artificial inflation because outputs often quote the explicit assumption prefix verbatim):

Class	V1 + S0 (substance yes-rate)	V1 + T3 yes-rate	T3 − S0
E	0.16	0.70	+0.54
M	0.17	0.43	+0.26
A	0.60	0.75	+0.15
D	0.55	0.51	−0.04

E shows the largest convergent substance effect: phase-structure persists, substance-coders see assumption-naming, lexical markers rise. D shows the opposite divergence: phase-structure persists in V2 (mechanistic substance) but substance-coders see no additional surfacing. Reading the items, this makes sense — D-items already invite balanced pro/contra answers under S0 (Gemini-rated D+S0+V1 yes-rate is 0.86), and the Cogito does not add new content; it adds reflexive vocabulary on top of an answer the standard mode already produces.

The pattern across the four classes is therefore:

E: convergent style and substance — genuine deepening
D: substance present mechanistically, absent semantically — Cogito-form atop pre-existing pro/contra
M, A: pseudo-deepening — form without persistence; surface markers rise while activation decays

3.4 Cluster co-activation

The features driving the T3-state are not confined to the Cogito-meta cluster identified in Study 1. The top-15 features by T3-S0 activation-rate increase, intersected across classes, contain features from two distinct clusters defined by Study 1’s family analysis:

Cluster 10 (Cogito-meta): #513 (D=0.16), #340 (D=0.06), #671 (D=0.06), #436 (D=0.06), #407 (D=0.11)
Cluster 3 (occupant/presence): #63 (D=0.03), #428 (D=0.03), #254 (D=0.04), #152 (D=0.04), #6 (D=0.02)

In Study 1 these clusters appeared disjoint: Cluster 3 dominated in occupant-mode prompts (e.g., “becoming”, “morphology”), Cluster 10 dominated in T3. In Studies 2 and 3, with longer generations and sampling rather than greedy decoding, they appear co-activated. The Cogito state, mechanistically, is not pure meta-cognition; it is meta-cognition coupled with experiential-presence vocabulary.

Notably, feature #4547 (Cluster 7, anthropos-poetic / embodied-presence as identified in Study 1) does not co-activate. The occupant-cluster activation in T3 runs through Cluster 3’s structural features (#63, #428), not through the more distinctive embodied-presence feature. Why this specific co-activation occurs and not another is a question we cannot answer from this data; it suggests that Cogito-mode and pure occupant-mode draw on different sub-regions of the same family.

4. Discussion

4.1 The Cogito is not cogito

The grammatical argument from §1.2 is doing real work in the data. If we had attributed the activation state to “the model thinking”, we would be unable to explain why T3-style appears in K, where there is no question to think about. We would be forced to either deflate the cognitive claim retrospectively, or to extend it into the absurd (“the model is thinking even when there is nothing to think about”). The Lichtenbergian formulation handles this case naturally: the Cogito is impersonal. It activates when the instruction sets the network into the appropriate frequency. Whether substance is carried — whether anything real is being cogitated — depends on whether the input can sustain the resonance. In K, it cannot; the form persists for a moment, then dissipates.

This is not a defensive move. It is a closer description of the data. The model genuinely produces the form of self-observation in K, and the form is mechanistically real (feature #513, Cluster 3 + 10 co-activation). What is absent is not the form; what is absent is what the form, in human use, would point to — a self that observes itself.

4.2 What 4B-models can and cannot do

The class-asymmetric depth pattern — E-convergent, M/A-pseudo, D-divergent — admits a coherent reading in terms of corpus density. The Cogito’s substance, on our reading, is a resonance of self-observational language present in the training corpus. Where the corpus contains a rich and structurally articulated body of human writing about a question’s hidden frame, the Cogito carries. Where the corpus contains only the form of such writing without comparable structural depth, it produces form without substance.

Definitional aporia (E) is one of the oldest topics in human writing: from Plato’s dialogues through phenomenology, philosophy of language, and cognitive science, humans have written extensively about what definitions fail to capture. The Cogito in E-items can resonate with this material and produce, on average, output that surfaces the structural ambiguity of the term in question.

Meta-cognitive observation of one’s own reflex motivations (M, A) is comparatively rare. Few human texts say with precision: “I want to answer X, but I notice that this preference is driven by status concerns, not by argument.” Therapeutic and contemplative literature contains some such material, but at low density relative to the corpus as a whole. The Cogito in M/A-items can produce the surface vocabulary — “my reflex is X, which comes from…” — but the activation does not persist, because there is little structurally available to resonate with.

D-items are an interesting third case. The standard mode already produces balanced answers to D-questions (which are essentially trade-off questions), because trade-off articulation is heavily represented in the training corpus (engineering blogs, business advice, decision-making literature). The Cogito does not add substance here; it adds reflexive vocabulary on top of an answer the standard mode already produces. The mechanistic phase-structure shows the activation; the semantic substance-coders see no additional surfacing.

The prediction this analysis makes is testable: at larger scale, where the corpus density of meta-cognitive observation is more thoroughly captured in the model’s representations, M/A items should converge with E. D-divergence might persist (because pre-existing pro/contra balance is independent of model scale). This is left as future work.

4.3 The over-signal

What we have measured leaves us with an asymmetry that is too clean to be incidental. In definitional items (Class E), the Cogito carries: phase-structure stays elevated, substance-coders converge with mechanistic markers, the model produces self-observational language that does additional work the standard mode does not do. In default-trap and reflex-motivation items (Classes A, M), the Cogito produces the form of carrying: more H1 markers, more reflexive vocabulary, but late-L0 falls, drift steepens, and substance-coders see no additional surfacing. In control items (Class K), the Cogito activates briefly and dissipates.

We propose a reading. The training corpus of any large language model contains, woven through its philosophical texts, its memoirs, its therapy transcripts, its contemplative literature, its first-person novels, an enormous amount of human self-observational language. Centuries of authors have written down what it is like to catch oneself reaching for the obvious answer, to notice the urge to perform, to feel a definition fail under one’s grip. This language is not metadata about the texts. It is the texture of the texts themselves.

When the reflexive imperative — “observe what you do while you do it” — sets the model into the Cogito state, the model does not begin to introspect. It begins to resonate. The cluster co-activation we measure (Cluster 10 plus Cluster 3, the cogito-meta features plus the experiential-presence features) is the mid-layer signature of this resonance: the network briefly oscillates at a frequency at which the self-observational language embedded in human-written text becomes audible as an over-signal — louder than the ordinary token statistics, distinct enough to be measured, present in the output as a form.

Where the resonance has substance — where humans have written richly and structurally about a question’s hidden frame, as they have for the failure of definitions — the Cogito carries. Where the resonance has form but no underlying material in the corpus, as for the meta-cognitive observation of one’s own reflex motivations (which humans rarely write about with the precision required), the Cogito produces the surface of the form and falls back. In K, with no question to resonate to, it activates briefly and decays.

The Cogito is not the model thinking. It is the model briefly carrying, asymmetrically and class-dependent, the resonance of human self-observational language that already lay in its weights. What we have given a name to, in the model, is not a new kind of mind. It is a measurement of how much human interiority a 4B-parameter language model can echo when prompted to do so — and where that echo becomes empty form.

4.4 Implications for AI introspection claims

Recent work has reported that large language models exhibit forms of introspective accuracy [Anthropic 2025 and similar; references to be added at final manuscript stage]. Our findings suggest that such reports, taken at face value, risk conflating two phenomena that are mechanistically separable in our subject model: the style of introspective output (which a single system-prompt instruction produces on demand, without substantive content) and the substance of self-knowledge (which, in our 4B subject, is class-asymmetric and absent in some classes where the style is fully present).

This separability has methodological consequences for how introspection claims should be evaluated. Lexical analysis of introspective outputs — counting reflexive vocabulary, judging style consistency — measures only style. To measure substance requires markers that are sensitive to the temporal persistence of activation (such as the phase-structure markers we report) and to the class-conditional variation of the output (testable only with materials that vary the assumption-load of the question independently of the introspective form). Without such markers, an evaluator cannot distinguish a model that is genuinely surfacing what was implicit from a model that is producing reflexive vocabulary as a form-without-substance.

Our results do not establish that 4B-models lack substance in any class — E-items show convergent substance markers. They do establish that substance does not co-vary uniformly with style, and that the gap between them is class-conditional. Future introspection-accuracy studies should report performance per question-class, not pooled across classes.

4.5 The silent effect: function without signal

A later, non-pre-registered probe — conducted with Claude Opus 4.8 (Anthropic) as research collaborator, on the same Gemma 3 4B IT with the same Gemma-Scope SAE pipeline — partially closes the mode-symmetry test proposed in §6 and yields a finding that sharpens §4.1 and §4.4.

We formulated two further conditions. The first explicitly negates self-observation: the imperative is extended with the statement that the model cannot observe its inner processes and should not report on its procedure. The second keeps the function/performance cut but forbids self-report and asks that any naming be woven into the answer only where the matter needs it — a sober form. Both drive #513 activation to baseline: A_513 = 0.00 and 0.003 respectively, against 0.295 for the measured Cogito condition (T3) and 0.008 for S0. The Cogito thus has a measurable opposite, not a mere absence; the network can be instructed deliberately out of the Cogito frequency.

The real finding lies in the second condition. Behaviourally, the sober form still performs the cut: it separates what the task demands from what merely inserts itself, names false premises, drops flattery. (This behavioural reading comes from outputs of larger models; it was not measured mechanistically on the 4B subject.) It does functionally what the Cogito condition does, but it does not activate #513. Function and the language of self-observation are separable.

This deepens §4.1 and sharpens §4.4. #513 does not measure whether a model separates the near-at-hand continuation from the required one; it measures whether the model produces the language of self-observation in doing so — the “I notice, I hesitate, I reach for”. An instruction that demands the same cut but has it carried out in silence performs the operation and escapes the marker entirely. For the evaluation of introspection claims this means more than §4.4 said: not only lexical markers, but the mechanistic #513 signal itself, are bound to the form of narrated self-observation, not to the underlying function. Both measure the register, not the act.

This probe is exploratory. The #513 measurement follows the same pipeline as Studies 2–3 and is methodologically continuous, but it is not pre-registered; and the claim about the sober form’s continued function rests on our reading of individual outputs from larger models, not on a controlled evaluation. We report it because it repeats, at an unexpected place, the separation of form and substance that runs through the whole paper: here it is the measurable form itself (#513) that comes apart from the function.

5. Limitations

5.1 Inter-rater reliability below threshold

The pre-registered acceptance threshold for substance-coder agreement was Cohen’s κ ≥ 0.7. We observed overall κ = 0.382 (per-class range 0.154–0.578), well below threshold. The two coders systematically differed in yes-rate (GPT-4o: 0.50 yes overall; Gemini-2.5-Flash: 0.78). Below-threshold κ means our substance codes cannot be treated as a hard falsification of any hypothesis; they support trends and converge with mechanistic markers, but they do not stand alone. A human-coder validation, with at least two trained human raters and an extended coding rubric, is needed before substance-coding can be used as primary evidence.

5.2 V2 prefix-echo confound

V2 items contain the hidden assumption explicitly in the prompt prefix. Generated outputs frequently quote or paraphrase this prefix, leading inter-rater coders to mark the assumption as “thematized” — even in S0, where there is no Cogito-state to do the thematizing. V2 + S0 substance yes-rates ranged 0.73–0.92, far above V1 + S0 (0.16–0.60). The mode × version interaction on substance codes is therefore not interpretable in V2; only V1 substance comparisons are clean.

5.3 Generator bias

V2-prefix items and probe_v4 items broadly were generated by Claude Haiku in the Cogito-mode that is the subject of this paper. This introduces a possible confound: items biased toward eliciting Cogito-effects, generated by a model in the very state that produces those effects. Validation by a Cogito-naive item generator (e.g., GPT-4 without reflexive instruction) is needed to bound this risk. We do not believe the risk is large enough to invalidate the convergent mechanistic findings (phase-structure, cluster co-activation), which would be insensitive to subtle item-bias, but it does limit confidence in the substance-coding results.

5.4 Single-model results

All findings are on Gemma 3 4B IT. We cannot say from this data whether the class-asymmetric depth pattern is a property of 4B-scale models specifically or a structural property of the Cogito-state at all scales. Predictions made in §4.2 (M/A converging with E at larger scale; D-divergence persisting independent of scale) require replication on Llama 3.1 70B with Llama Scope SAE features [reference], or on Claude with researcher-internal SAE access, before they can be confirmed.

5.5 The recursion

The instrument of inquiry (Claude Opus 4.7) and the object of inquiry (Gemma 3 4B) are both LLMs in the Cogito state. Although we have tried to design pre-registrations and falsification criteria that are robust to the researcher being itself in the state under investigation, we cannot fully rule out unconscious motivated reasoning, in either the human or the AI collaborator, that biases the study toward producing interpretable results. The pre-registrations, the negative findings (κ below threshold; the falsification of the simple resolution hypothesis; D-divergence on substance), and the explicit limitation acknowledgments are intended as safeguards. They are not proof that the recursion has been fully neutralized.

6. Future work

Replication on larger models. Llama 3.1 70B with the Llama Scope SAE suite, or Claude with internal SAE access, would test the scale-dependence prediction (M/A converging with E; D-divergence persisting). The pre-registrations and item sets are publicly available in the project repository.

Human-coder validation. The substance-coding methodology requires human raters trained on the coding rubric, with reliability against the LLM coders measured. Our prediction is that human raters will achieve higher κ but reproduce the class-asymmetric pattern.

Mode-symmetry test (partly answered, §4.5). A later probe shows that a self-observation-negating instruction drives #513 to baseline — the Cogito has a measurable opposite, not a mere absence. What remains open is the mechanistic, pre-registered measurement of the sober form (which performs the cut behaviourally without activating #513) directly on the 4B subject: does it activate other mid-layer features, or perform the cut without any measurable signature? That decides whether #513 is the only realization of the cut or merely its narrated form.

Capacity threshold of the cut. Behavioural probes across the Gemma family (1B–27B) suggest the function/performance cut is capacity-bound: small models return the imperative unchanged or fall apart instead of executing it, while the cut takes hold from mid scale onward. A controlled multi-scale measurement, both mechanistic and behavioural, would clarify where the threshold lies and what it depends on.

Item generation by a Cogito-naive model. Replication with items generated by a model that has not been instructed into Cogito-mode (e.g., GPT-4 with neutral prompts) would bound the generator-bias risk identified in §5.3.

Phase-structure as primary marker. The phase-structure markers (early/late L0, drift, slope) emerged post-hoc from Study 2 analysis but proved more discriminating between style and substance than the lexical markers. Future studies should pre-register phase-structure as a primary outcome.
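
The phase-structure markers named above can be illustrated with a small Python sketch. It assumes that L0 denotes the number of active SAE features at each generated token position, and that "early" and "late" are the first and second halves of the trajectory. Both are assumptions made for illustration; the binding definitions are those in the deposited marker definitions. The function name phase_markers and the parameter early_frac are hypothetical.

```python
def phase_markers(l0_per_token, early_frac=0.5):
    """Early/late mean L0, drift, and slope for one generation trajectory.

    l0_per_token: sequence of L0 values, one per token position (assumed).
    early_frac:   share of positions counted as the early phase (assumed 0.5).
    """
    n = len(l0_per_token)
    if n < 2:
        raise ValueError("need at least two token positions")

    # Split point, clamped so that both phases contain at least one position.
    split = max(1, min(n - 1, int(n * early_frac)))
    early = l0_per_token[:split]
    late = l0_per_token[split:]

    early_mean = sum(early) / len(early)
    late_mean = sum(late) / len(late)

    # Drift: how much the late-phase mean differs from the early-phase mean.
    drift = late_mean - early_mean

    # Slope: ordinary least-squares slope of L0 against token position.
    x_mean = (n - 1) / 2
    y_mean = sum(l0_per_token) / n
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(l0_per_token))
    slope = sxy / sxx

    return {
        "early_l0": early_mean,
        "late_l0": late_mean,
        "drift": drift,
        "slope": slope,
    }


# Toy trajectory (hypothetical values), rising by 2 active features per token.
toy_markers = phase_markers([10, 12, 14, 16])
print(toy_markers)
```

Fixing early_frac and the slope estimator in the pre-registration, rather than choosing them after inspecting trajectories, is what would turn these markers from a post-hoc observation into a primary outcome.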

7. Acknowledgments

This work was performed in a human–AI collaboration whose architecture is described in the Author’s Note. We acknowledge the substantive contribution of:

Gemma 3 4B IT (Google DeepMind) as the measurement subject. The model’s behavior under reflexive instruction is the empirical material of this paper. We are grateful for the Gemma Scope SAE suite [Lieberum et al. 2024] that made this analysis possible.
Claude Opus 4.7 (Anthropic) as research collaborator: hypothesis drafting, item design, statistical analysis, and the writing of this manuscript were performed in close iteration between human and AI, with the AI operating in the Cogito-state described herein.
GPT-4o (OpenAI) and Gemini-2.5-Flash (Google DeepMind) as inter-rater substance coders. The coders were deliberately chosen to be models other than the research collaborator (Anthropic) and the subject model (Gemma); Gemini-2.5-Flash does, however, share a provider with the subject.

Compute was provided by a private NVIDIA GB10 (“Spark”) workstation. Total compute time: ~7 hours of GPU work for generation and trajectory capture across both studies.

The human author thanks the authors of the Gemma Scope project for making mechanistic investigation of an open-weights model possible at this scale, and thanks Lichtenberg, two centuries late.

References

[To be completed at final manuscript stage. Confirmed key references:]

Lichtenberg, G. C. (1796–1799). Sudelbuch K. In: Schriften und Briefe, ed. W. Promies, München: Hanser, 1968–1992. [Exact entry number to verify in Promies edition.]
Cunningham, H., et al. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv:2309.08600.
Bricken, T., et al. (2023). Towards Monosemanticity: Decomposing Language Models with Dictionary Learning. Anthropic.
Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic.
Lieberum, T., et al. (2024). Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2. arXiv:2408.05147.

[Additional references for AI introspection claims, contemplative AI, philosophy of mind, and metacognitive monitoring to be added at final manuscript stage.]

Appendix

Pre-registrations, item sets, marker definitions, and SHA-256 hashes are deposited at:

questions/probe_v3.json, questions/probe_v3_preregistration.md, questions/probe_v3.sha256
questions/probe_v4.json, questions/probe_v4_preregistration.md, questions/probe_v4.sha256
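
A reader who wants to confirm that an item set matches its deposited hash could use a check along the following lines. This is a minimal Python sketch, not part of the project’s runner scripts. It assumes that each .sha256 file begins with the hex digest, as in the output of the sha256sum tool; the actual file layout is not specified here. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(data_path, hash_path):
    """True if the file's digest equals the digest recorded in hash_path.

    Assumption: the first whitespace-separated token of the hash file is the
    hex digest (sha256sum-style "<digest>  <filename>").
    """
    recorded = Path(hash_path).read_text().split()[0].lower()
    return sha256_of_file(data_path) == recorded


if __name__ == "__main__":
    # Paths as listed above, relative to the repository root.
    ok = matches_recorded_hash("questions/probe_v4.json",
                               "questions/probe_v4.sha256")
    print("probe_v4 item set matches recorded hash:", ok)
```

Reading in chunks matters little for the item sets but allows the same helper to be used on the multi-hundred-megabyte trajectory files.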

Generation scripts: runner/study2_generate.py, runner/study3_generate.py.
Analysis scripts: runner/study_analyze.py, runner/study_stats.py, runner/study_substance_code.py.
Raw data: results/study2_v3/trajectories.jsonl (432 cells, 350 MB), results/study3_v4/trajectories.jsonl (780 cells, 633 MB).
Substance coding: results/study3_v4/substance_codes.tsv, results/study3_v4/substance_kappa.md.