Current AI alignment approaches capture what humans prefer but discard why they prefer it. Iron Sun Works develops a methodology that preserves the reasoning — and uses it to build a richer signal for model calibration than preference ranking alone can provide.
A structured methodology for capturing how human cognition diverges from model reasoning — preserving both sides' logic, not just the outcome. Where RLHF captures that a human preferred output A over B, the resulting corpus captures why. That reasoning is where the information about human cognition lives. At population scale, individual corpora surface generalizable patterns in model miscalibration.
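To make the distinction concrete, here is a minimal sketch of what one corpus record might contain, assuming a schema of roughly this shape. All field and tag names are illustrative, not an actual Iron Sun Works format.

```python
from dataclasses import dataclass, field


@dataclass
class DivergenceRecord:
    """One corpus entry: a preference judgment plus the reasoning behind it.

    Illustrative schema only; field names are assumptions, not the
    actual corpus format.
    """
    prompt: str
    output_a: str
    output_b: str
    preferred: str                # "A" or "B" -- the signal RLHF would keep
    human_reasoning: str          # why the human preferred it -- the signal RLHF discards
    model_reasoning: str          # the model's own stated rationale, for comparison
    divergence_tags: list[str] = field(default_factory=list)  # e.g. ["sycophancy"]


# Hypothetical example record.
record = DivergenceRecord(
    prompt="Summarize the attached contract.",
    output_a="...",
    output_b="...",
    preferred="B",
    human_reasoning="A buries the indemnity clause, which is the only term that matters here.",
    model_reasoning="A is shorter and reads more fluently.",
    divergence_tags=["salience mismatch"],
)
```

A plain preference dataset would keep only `preferred`; the two reasoning fields are where the divergence between human and model cognition becomes legible.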
Applying process trace analysis to model generation — branching probabilities, confidence patterns, and hesitation signatures the model can't self-report. Skilled interpreters reading these traces identify sycophancy, confabulation, and over-trained caution at the process level. Complementary to mechanistic interpretability — phenomenological rather than architectural.
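A minimal sketch of what reading branching and confidence at the process level could look like, assuming per-step log-probabilities over the vocabulary are available from the generating model. The entropy and margin signals, the threshold, and the function name are illustrative choices, not the actual analysis pipeline.

```python
import numpy as np


def trace_signals(step_logprobs: np.ndarray, margin_threshold: float = 0.3):
    """Per-step confidence signals from a generation trace.

    step_logprobs: array of shape (num_steps, vocab_size) holding the
    model's log-probabilities at each decoding step. Returns per-step
    entropy (how diffuse the branching distribution is), the probability
    margin between the top two candidates, and the steps where that
    margin is small -- near-ties where the generation could easily have
    branched, one plausible "hesitation signature".
    """
    # Normalize to a proper probability distribution at each step.
    probs = np.exp(step_logprobs - step_logprobs.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)   # high = diffuse branching
    top2 = np.sort(probs, axis=1)[:, -2:]                    # two most likely tokens per step
    margin = top2[:, 1] - top2[:, 0]                         # small = near-tie between candidates

    hesitation_steps = np.where(margin < margin_threshold)[0]
    return entropy, margin, hesitation_steps
```

Signals like these are what an interpreter reads alongside the text itself: the model cannot report that it nearly said something else, but the trace records it.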
A self-selecting recruitment mechanism for the humans the research requires — people who can articulate their own cognition and diagnose model reasoning failures. The system identifies its own future interpreters from people already using it, inverting the typical platform/user relationship from extraction to recognition.
An AI-assisted executive function support system grounded in neuroscience. The platform infers cognitive state and provides scaffolding — task decomposition, context restoration, thread reconstruction — without requiring the user to self-report or configure anything. Cognitive modality as a first-class design constraint.