When a machine-learning model knows something it shouldn't: interpretability in stellar spectroscopy
This post summarises the study. The full write-up can be found here, and the code associated with these results will be posted soon!
For now, these results exist only in this blog post, but I plan to develop them into a publication (hopefully soon!). If you plan to redo this analysis or use these results, please acknowledge this work or contact me!
Overview
Machine-learning models are everywhere in astronomy now, and for good reason: they are fast, flexible, and remarkably good at squeezing information out of large datasets. But there is a question we rarely ask them, and perhaps should more often: how are you doing this?
This post is about an experiment I ran with my Lux model that turned into a worked example of one of the most important open problems in AI: distinguishing correlation from causation.
The setup
Stars give off light. When you split that light through a prism (or a spectrograph), you get a spectrum — a pattern of dark absorption lines at specific wavelengths, each corresponding to a chemical element absorbing photons in a star's outer atmosphere. These lines are the bar codes that let astronomers read the chemical composition of stars.
The Lux model is a generative, multi-output, latent-variable model I developed for working with stellar spectra. One of its key capabilities is label transfer: given stellar spectra from one instrument (say, one operating in the near-infrared), it can map properties measured from a completely different instrument (say, one operating in the optical) onto those spectra, by training on the ~50,000 stars observed by both.
For this study, I trained Lux on near-infrared spectra from the APOGEE survey (spanning 1.5–1.7 µm) and asked it to predict stellar labels measured optically by the GALAH survey. Those labels included temperature, surface gravity, iron abundance, alpha-element abundance — and europium.
Here is the problem: europium has no known absorption lines in the near-infrared wavelength regime APOGEE observes. Physically, the model should not be able to infer it.
And yet — it does. Very well, in fact, achieving an RMSE comparable to GALAH's own measurement uncertainty. So the question becomes: how?
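To make "comparable to GALAH's own measurement uncertainty" concrete, here is a minimal sketch of that comparison on synthetic numbers. The arrays, the 0.10 dex scatter, and the helper `rmse` are illustrative assumptions, not values or code from this study.

```python
import numpy as np

def rmse(predicted, reference):
    """Root-mean-square error between two label arrays."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))

# Synthetic stand-ins (assumed): "reference" plays the role of the optical
# [Eu/Fe] labels, "predicted" the model output, "sigma" the reported
# per-star measurement uncertainty.
rng = np.random.default_rng(0)
reference = rng.normal(0.1, 0.15, size=5000)
sigma = np.full(reference.shape, 0.10)
predicted = reference + rng.normal(0.0, 0.10, size=reference.shape)

model_rmse = rmse(predicted, reference)
typical_sigma = float(np.median(sigma))
ratio = model_rmse / typical_sigma  # close to 1: error is at the label noise floor
print(f"RMSE = {model_rmse:.3f} dex, median uncertainty = {typical_sigma:.3f} dex")
```

A ratio near one says the predictions are about as good as the labels themselves allow, which is exactly why the result is suspicious for an element with no lines in the band.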
The XAI (explainable AI) toolkit
This is precisely the kind of question that the field of mechanistic interpretability tries to answer for large language models: not "does the model perform well?" but "how does it achieve that performance, and is it doing so for the right reasons?"
The worry is shortcut learning: a model achieves high accuracy not by learning a genuine causal mechanism, but by exploiting correlations in the training data that happen to yield good predictions and that may break down when those correlations no longer hold.
I applied four interpretability techniques to diagnose this.
1. Counterfactual residual analysis
The first test: find a star that is highly enriched in europium, find two "doppelganger" stars with near-identical chemical compositions but normal europium, and compare their near-infrared spectra directly.
If the model is learning a real europium signal, there should be some spectral feature distinguishing the europium-rich star that the model is picking up on. If there is nothing to see, the model has no physical information to draw from.
The result: no detectable spectral difference. The residuals between the europium-rich star and its doppelgangers show only noise and emission-line artefacts — nothing that tracks europium abundance.
Takeaway: the data gives the model nothing causal to latch onto.
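The doppelganger test can be sketched roughly as follows. This is a toy version under stated assumptions: the function names, the label ordering, the 0.1 dex definition of "normal" europium, and the synthetic spectra (built to contain no europium information) are all hypothetical and not taken from the Lux code.

```python
import numpy as np

def find_doppelgangers(labels, target_idx, match_cols, candidates, n=2):
    """Indices of the n candidate stars closest to the target in the matched labels.

    Labels are standardised first so that Teff in K does not swamp abundances in dex.
    """
    X = labels[:, match_cols]
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    dist = np.linalg.norm(Z - Z[target_idx], axis=1)
    dist[~candidates] = np.inf   # only stars flagged as candidates may match
    dist[target_idx] = np.inf    # never match the star to itself
    return np.argsort(dist)[:n]

def residual_significance(flux, flux_err, target_idx, twin_idx):
    """Per-pixel (target - twin) residual in units of the combined noise."""
    diff = flux[target_idx] - flux[twin_idx]
    noise = np.sqrt(flux_err[target_idx] ** 2 + flux_err[twin_idx] ** 2)
    return diff / noise

rng = np.random.default_rng(1)
n_stars, n_pix = 2000, 300
labels = np.column_stack([
    rng.normal(4500, 200, n_stars),  # Teff [K]
    rng.normal(2.0, 0.4, n_stars),   # log g
    rng.normal(-0.3, 0.3, n_stars),  # [Fe/H]
    rng.normal(0.1, 0.1, n_stars),   # [alpha/Fe]
    rng.normal(0.1, 0.2, n_stars),   # [Eu/Fe]
])
# Toy spectra that depend on the first four labels only.
Z4 = (labels[:, :4] - labels[:, :4].mean(axis=0)) / labels[:, :4].std(axis=0)
response = rng.normal(0.0, 0.01, size=(4, n_pix))
flux_err = np.full((n_stars, n_pix), 0.005)
flux = 1.0 + Z4 @ response + rng.normal(0.0, 0.005, size=(n_stars, n_pix))

target = int(np.argmax(labels[:, 4]))  # the most europium-rich star
normal_eu = np.abs(labels[:, 4] - np.median(labels[:, 4])) < 0.1
twins = find_doppelgangers(labels, target, [0, 1, 2, 3], normal_eu)
for t in twins:
    chi = residual_significance(flux, flux_err, target, t)
    print(int(t), f"fraction of pixels beyond 3 sigma: {np.mean(np.abs(chi) > 3):.3f}")
```

In real data the interesting question is whether any pixels stand out coherently in both residuals; here, by construction, none should.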
2. Saliency mapping via Jacobian attribution
Next, I computed the Jacobian of the Lux model: specifically, the partial derivatives linking each wavelength pixel to the europium label, which measure how sensitive one is to small changes in the other. This is the stellar-spectroscopy equivalent of gradient attribution in neural networks: which input features matter most for a given prediction?
The europium signal in the derivatives was sparse and localised to a handful of wavelength spikes. Crucially, when I cross-matched those spike locations against a database of known r-process atomic lines, the agreement was minimal. The one notable peak (near 15,241 Å) aligns with a cesium line — an s-process element, not an r-process one — and coincides with an emission artefact in the europium-rich star's spectrum.
Takeaway: the model is not attending to physically motivated europium lines. It is picking up noise and correlations.
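As a rough illustration of the saliency step, the sketch below takes a finite-difference Jacobian of a toy generative model (flux as a function of labels) and checks the strongest europium-sensitive pixels against a line list. Everything here is an assumption for illustration: `toy_model`, the two-label setup, the 0.5 Å tolerance, and the made-up line wavelengths. The real analysis would differentiate the trained Lux model and use a real atomic line database.

```python
import numpy as np

def jacobian_fd(model, labels, eps=1e-4):
    """Central finite-difference Jacobian d flux[pixel] / d label[k].

    Returns an array of shape (n_pixels, n_labels).
    """
    labels = np.asarray(labels, dtype=float)
    cols = []
    for k in range(labels.size):
        step = np.zeros_like(labels)
        step[k] = eps
        cols.append((model(labels + step) - model(labels - step)) / (2 * eps))
    return np.stack(cols, axis=1)

def match_peaks(wavelengths, sensitivity, line_list, n_peaks=5, tol=0.5):
    """For the n strongest |sensitivity| pixels, report whether a listed line lies within tol."""
    top = np.argsort(np.abs(sensitivity))[::-1][:n_peaks]
    line_list = np.asarray(line_list, dtype=float)
    return [
        (float(wavelengths[i]), bool(np.any(np.abs(line_list - wavelengths[i]) <= tol)))
        for i in top
    ]

wave = np.linspace(15100.0, 15400.0, 601)  # Angstrom, 0.5 A pixels

def toy_model(labels):
    """Toy spectrum: one alpha-sensitive line and one spurious europium-sensitive spike."""
    alpha, eu = labels
    line = lambda centre, width: np.exp(-0.5 * ((wave - centre) / width) ** 2)
    return 1.0 - 0.2 * alpha * line(15200.0, 1.0) - 0.05 * eu * line(15241.0, 1.0)

J = jacobian_fd(toy_model, [0.1, 0.3])
eu_sensitivity = J[:, 1]
hypothetical_r_process_lines = [15150.0, 15330.0]  # made-up wavelengths
matches = match_peaks(wave, eu_sensitivity, hypothetical_r_process_lines, n_peaks=3)
for wavelength, has_line in matches:
    print(f"{wavelength:.1f} A  known line nearby: {has_line}")
```

With an autodiff framework the finite differences would be replaced by an exact Jacobian, but the cross-matching logic stays the same.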
3. Latent variable probing
Lux encodes information into a 21-dimensional latent space before projecting back to labels and spectra. I visualised how each latent dimension correlates with each label, then quantified this with Mutual Information Gap (MIG) scores — a measure of how cleanly each label is represented by a single latent dimension rather than entangled across many.
The MIG scores told a clear story: europium had the lowest score of all five labels (0.02), meaning its representation in the latent space was highly entangled. Specifically, the latent dimensions that most encoded europium were the same dimensions that encoded alpha-element abundances and iron. The model was not building a distinct representation for europium — it was piggybacking on existing representations of correlated labels.
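For readers unfamiliar with MIG, here is a minimal sketch of one common way to estimate it: discretise each latent and label, estimate mutual information from histograms, and take the gap between the two most informative latents, normalised by the label entropy. The binning choice, the function names, and the toy three-dimensional latent space are assumptions for illustration and may differ from the estimator used in this study.

```python
import numpy as np

def discretise(x, bins=20):
    """Map a continuous array to integer bin indices (equal-width bins)."""
    edges = np.histogram_bin_edges(x, bins=bins)
    return np.digitize(x, edges[1:-1])

def entropy(codes):
    """Shannon entropy (in nats) of integer codes."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_information(a, b):
    """I(a; b) = H(a) + H(b) - H(a, b) for integer-coded arrays."""
    joint = a.astype(np.int64) * (int(b.max()) + 1) + b
    return entropy(a) + entropy(b) - entropy(joint)

def mig_per_label(latents, labels, bins=20):
    """Mutual Information Gap for each label column."""
    z_codes = [discretise(latents[:, j], bins) for j in range(latents.shape[1])]
    scores = []
    for k in range(labels.shape[1]):
        v = discretise(labels[:, k], bins)
        mi = np.sort([mutual_information(z, v) for z in z_codes])[::-1]
        scores.append((mi[0] - mi[1]) / entropy(v))  # gap between top two latents
    return np.array(scores)

# Toy latent space: one label owned by a single latent, one smeared over all three.
rng = np.random.default_rng(2)
n = 20000
z = rng.normal(size=(n, 3))
clean = z[:, 0] + rng.normal(0.0, 0.1, n)
entangled = z.sum(axis=1) / np.sqrt(3) + rng.normal(0.0, 0.1, n)
mig = mig_per_label(z, np.column_stack([clean, entangled]))
print("MIG (clean, entangled):", np.round(mig, 3))
```

A label carried by one latent scores high; a label spread evenly across several latents scores near zero, which is the pattern described for europium.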
A regression probe (training a linear model and an MLP from latents to labels) confirmed that europium had the weakest R² of all labels (~0.71), with essentially no difference between the linear and MLP probes — ruling out a hidden non-linearity. The model simply had less genuine information about europium.
Takeaway: the model's latent space doesn't meaningfully distinguish europium from the alpha elements it correlates with.
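The linear half of the regression probe can be sketched in a few lines. This is an assumed, simplified version: an ordinary least-squares fit from latents to one label, scored by held-out R². The MLP probe is omitted, and the synthetic latents and labels below are illustrative only.

```python
import numpy as np

def linear_probe_r2(latents, label, train_frac=0.8, seed=0):
    """Fit label ~ latents @ w + b on a training split and return held-out R^2."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(label))
    n_train = int(train_frac * len(label))
    train, test = order[:n_train], order[n_train:]
    A = np.column_stack([latents, np.ones(len(label))])  # intercept column
    coef, *_ = np.linalg.lstsq(A[train], label[train], rcond=None)
    pred = A[test] @ coef
    ss_res = np.sum((label[test] - pred) ** 2)
    ss_tot = np.sum((label[test] - label[test].mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic 21-dimensional latents with one well-encoded and one weakly encoded label.
rng = np.random.default_rng(5)
latents = rng.normal(size=(5000, 21))
well_encoded = latents[:, 1] + rng.normal(0.0, 0.1, 5000)
weakly_encoded = 0.5 * (latents[:, 3] + latents[:, 4]) + rng.normal(0.0, 0.5, 5000)

r2_well = linear_probe_r2(latents, well_encoded)
r2_weak = linear_probe_r2(latents, weakly_encoded)
print(f"R^2 well encoded: {r2_well:.2f}, weakly encoded: {r2_weak:.2f}")
```

If a non-linear probe scores no better than this linear one, the missing variance is not hidden in a non-linear encoding; it is simply absent from the latents.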
4. SHAP analysis
Finally, I computed SHAP (SHapley Additive exPlanations) values across all latent dimensions for each label. SHAP breaks down a model's prediction to quantify each feature's contribution.
For temperature, a single latent (z₂) dominated. For iron, it was z₄. For europium, the SHAP scores were spread almost evenly across z₁–z₆ — the same dimensions encoding the other labels — with no single latent standing out. This is the fingerprint of a label being predicted indirectly, as a by-product of other correlations.
Takeaway: the model is predicting europium via its correlation with alpha-element abundances, not by finding hidden europium information in the spectra.
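To show what "spread evenly" versus "dominated by one latent" looks like, the sketch below computes exact Shapley values by enumerating coalitions for two toy prediction heads over four latents. It does not use the SHAP library, and the heads, weights, and background sample are hypothetical; exact enumeration is only practical for a handful of features.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values of f at x; absent features are averaged over background rows."""
    d = x.size

    def value(subset):
        mixed = background.copy()
        idx = list(subset)
        if idx:
            mixed[:, idx] = x[idx]  # features in the coalition take the values of x
        return float(np.mean(f(mixed)))

    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

rng = np.random.default_rng(4)
background = rng.normal(size=(200, 4))
samples = rng.normal(size=(50, 4))
w_teff = np.array([0.0, 2.0, 0.0, 0.0])  # toy head driven by a single latent
w_eu = np.array([0.3, 0.3, 0.3, 0.3])    # toy head spread across all latents
teff_head = lambda Z: Z @ w_teff
eu_head = lambda Z: Z @ w_eu

def importance_share(f):
    """Mean |Shapley value| per latent, normalised to sum to one."""
    phi = np.array([shapley_values(f, x, background) for x in samples])
    mean_abs = np.abs(phi).mean(axis=0)
    return mean_abs / mean_abs.sum()

teff_share = importance_share(teff_head)
eu_share = importance_share(eu_head)
print("Teff-like head:", np.round(teff_share, 2))
print("Eu-like head:  ", np.round(eu_share, 2))
```

The first head puts all of its attribution on one latent, while the second shares it roughly equally, mirroring the contrast between temperature and europium described above.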
What is actually happening
Europium is an r-process element: it forms through rapid neutron capture, primarily in neutron star mergers and core-collapse supernovae. Alpha elements are also produced in core-collapse supernovae. In most stellar populations, particularly red giant branch stars like those in this sample, europium and alpha abundances are strongly correlated because they share a nucleosynthetic history.
The Lux model has learned this correlation. When asked to predict europium, it effectively asks: "what are the alpha-element and iron abundances of this star?" and maps from those. In the training data, that works. The RMSE looks good. But the model is not inferring europium from the spectra — it is inferring europium from the other labels it has learned from the same spectra.
This is shortcut learning in a scientifically motivated, interpretable setting.
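The mechanism is easy to reproduce in a toy simulation. In the sketch below, spectra depend only on [Fe/H] and [alpha/Fe], yet a linear regression from flux to [Eu/Fe] scores well whenever europium tracks alpha, and fails once the two are decoupled. The linear regressor stands in for Lux purely for illustration, and all numbers (slopes, scatters, pixel count) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_pix = 50
response = rng.normal(0.0, 0.05, size=(2, n_pix))  # flux response to [Fe/H], [alpha/Fe]

def simulate(n, coupled):
    """Toy stars whose spectra depend on [Fe/H] and [alpha/Fe] but never on [Eu/Fe]."""
    feh = rng.normal(-0.3, 0.3, n)
    alpha = rng.normal(0.1, 0.1, n)
    if coupled:
        eu = 1.5 * alpha + rng.normal(0.0, 0.03, n)  # europium tracks alpha
    else:
        eu = rng.normal(0.15, 0.15, n)               # europium independent of alpha
    flux = 1.0 + np.column_stack([feh, alpha]) @ response
    flux += rng.normal(0.0, 0.002, flux.shape)
    return flux, eu

def fit_linear(flux, eu):
    """Least-squares map from flux (plus intercept) to the europium label."""
    A = np.column_stack([flux, np.ones(len(eu))])
    coef, *_ = np.linalg.lstsq(A, eu, rcond=None)
    return coef

def predict(coef, flux):
    return np.column_stack([flux, np.ones(len(flux))]) @ coef

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

train_flux, train_eu = simulate(5000, coupled=True)
coef = fit_linear(train_flux, train_eu)

test_flux, test_eu = simulate(2000, coupled=True)
odd_flux, odd_eu = simulate(2000, coupled=False)
rmse_in = rmse(predict(coef, test_flux), test_eu)
rmse_out = rmse(predict(coef, odd_flux), odd_eu)
print(f"RMSE, correlation intact: {rmse_in:.3f} dex")
print(f"RMSE, correlation broken: {rmse_out:.3f} dex")
```

The in-distribution error looks excellent even though the flux carries no europium signal at all, which is why a good test-set RMSE alone cannot distinguish a shortcut from a real measurement.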
Why this matters beyond astronomy
The same question I asked of Lux — is this prediction grounded in genuine understanding or statistical correlation? — is the question mechanistic interpretability researchers ask of large language models every day.
Whether a language model "knows" a fact because it has encoded robust causal structure, or because it has learned to pattern-match on co-occurrence statistics, is one of the central unsolved problems in AI safety and interpretability. The tools are different (circuits and attention heads rather than Jacobians and latent probes), but the epistemics are identical: good performance on a test set does not establish that a model has learned the right thing.
In stellar spectroscopy, a failure to generalise means incorrect element abundances when the europium–alpha correlation breaks down — as it does in chemically peculiar stars or certain dwarf galaxies. In higher-stakes AI applications, the consequences scale accordingly.
The lesson is not that machine-learning models are untrustworthy. It is that accuracy is not the same as understanding, and that probing how a model achieves its predictions is not merely an academic exercise. It is the only way to know whether to trust a generalisation.
What next
It would be interesting to revisit this experiment with a larger sample of known r-process enriched stars, where europium and alpha abundances may be decoupled, and to extend the interpretability analysis to more than one element. That is work for the future, but the toolkit demonstrated here is directly portable to that setting.
If you want to dig into the details, the Lux paper is here and the code is on GitHub. The notebook associated with this analysis is available on request.