AI for Science at the Royal Society: Six Big Questions
Physics, interpretability, foundation models, AI scientists, the trust problem, and real-world use
On 31 March 2026, over 260 researchers, practitioners, and government partners gathered at the Royal Society in London for the AI for Science conference that we organised at the Alan Turing Institute. The day featured talks spanning astrophysics, materials science, climate modelling, and fusion energy, as well as lightning talks and posters from early career researchers. A brief overview of the event, including a short video with thoughts from the keynote speakers, is available here. The event closed with a panel discussion that I had the pleasure of chairing.
The panel brought together five researchers working at the frontier of AI for science: Anna Scaife (University of Manchester), an expert in AI for data-intensive astrophysics; Aron Walsh (Imperial College London / Cusp AI), a specialist in AI-driven materials design; Scott Hosking (Alan Turing Institute), who leads work on AI for weather and climate; Lorenzo Zanisi (UK Atomic Energy Authority), working on AI approaches to fusion energy; and Miles Cranmer (University of Cambridge), focused on foundational AI models and the extrapolation problem. Despite working across very different scientific domains, the panellists converged on a remarkably consistent set of concerns and opportunities.
The discussion was held under the Chatham House Rule, so I won't attribute specific remarks below. Instead, what follows is a synthesis of the key themes that were discussed.
The six themes explored during the panel discussion: physics, interpretability, foundation models, AI scientists, trustworthiness, and operationalisation.
Should we let the data speak for itself?
One of the most persistent debates in AI for science is whether to bake physics into models or let the data do the talking. The panel explored both sides with nuance.
On one hand, purely data-driven approaches have delivered real gains. In climate science, for instance, models trained directly on observational data have improved accuracy without any explicit physics. When you're working within the distribution of your training data, letting the data speak can be remarkably effective.
But extrapolation is a different story. Without physical constraints, nothing guides a model's predictions outside the range of its training data. Encoding symmetries and physical laws into model architectures narrows the space of functions a model can learn, helping it generalise to regimes it has never seen. This matters enormously for scientific discovery, where the interesting questions almost always lie beyond the boundaries of existing data.
The overhead is real, since physics-enhanced models can be slower to train and run. But the consensus was that for discovery and extrapolation, the tradeoff is often worth it.
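To make the symmetry argument concrete, here is a minimal sketch (my own toy illustration, not something presented at the panel). It assumes a made-up rotationally symmetric target, exp(-r²), and compares two least-squares models: a generic quadratic in the raw coordinates, and one that only sees the rotation-invariant quantity r². Both are trained on points from one angular wedge and tested on a wedge they never saw. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(xy):
    """Toy 'physics': a quantity that depends only on the radius."""
    return np.exp(-np.sum(xy**2, axis=1))

def sample(n, angle_lo, angle_hi):
    """Draw points with radius in [0, 1.5] and angle inside the given wedge."""
    r = rng.uniform(0.0, 1.5, n)
    theta = rng.uniform(angle_lo, angle_hi, n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

def generic_features(xy):
    """Unconstrained quadratic model in the raw coordinates."""
    x, y = xy[:, 0], xy[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])

def invariant_features(xy):
    """Model that can only see s = x^2 + y^2, so it is rotation-invariant by construction."""
    s = np.sum(xy**2, axis=1)
    return np.column_stack([np.ones_like(s), s, s**2])

def fit_and_score(features, train, test):
    """Least-squares fit on the training wedge, mean squared error on the test wedge."""
    w, *_ = np.linalg.lstsq(features(train), target(train), rcond=None)
    pred = features(test) @ w
    return float(np.mean((pred - target(test)) ** 2))

train = sample(500, 0.0, np.pi / 2)        # first quadrant only
test = sample(500, np.pi, 1.5 * np.pi)     # third quadrant, never seen in training

mse_generic = fit_and_score(generic_features, train, test)
mse_invariant = fit_and_score(invariant_features, train, test)
print(f"generic model MSE on unseen wedge:   {mse_generic:.4f}")
print(f"invariant model MSE on unseen wedge: {mse_invariant:.4f}")
```

The invariant model has fewer parameters and cannot express anything that breaks the symmetry, which is exactly why its fit carries over to the unseen angles. Real physics-informed architectures are far more involved, but the mechanism is the same.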
The discussion also surfaced an important practical angle: many scientific domains are sitting on legacy simulation codes — sometimes decades old — that represent enormous investments of physical knowledge. Agentic AI tools could help rewrite and modernise these codebases, making them more scalable and eventually integrable with AI pipelines. The physics is already encoded; it just needs to be unlocked.
Interpretability as science, not just engineering
A theme that ran through the entire day, and crystallised in the panel, was the relationship between interpretability and scientific understanding.
The point was made sharply: science is fundamentally about humans understanding the universe. If we can't interpret what a model is doing, we're doing engineering, not science. There's nothing wrong with engineering, of course, but the distinction matters when the goal is discovery rather than prediction.
Beyond the philosophical point, there's a practical one. When you understand what a model has learned, you can help it generalise. If you know why a model works within its training distribution, you have a better chance of guiding it beyond that distribution. Without interpretability, you're flying blind when it matters most.
An intriguing parallel was drawn with AI alignment. In the language model world, we train for alignment — making models behave in ways that accord with human values — but this can come at the cost of raw performance. Aligning scientific AI models with human understanding might impose a similar tradeoff: models that are more interpretable might be slightly less accurate. But the panel's view was that this may be a price worth exploring, because the alternative — powerful models we can't interrogate — limits the kind of science we can do.
Foundation models: better priors, not perfect ones
Foundation models are sweeping through the sciences, and the panel didn't shy away from the hard questions about what happens when pretrained biases propagate into downstream tasks.
The concern is real: pretraining imposes a prior on everything downstream. Systematic biases and shifts in variance have been observed in practice, particularly in fields like astrophysics where statistical rigour is paramount. If the foundation model carries a systematic error — say, a bias over a particular region of the sky or ocean — that bias gets baked into every fine-tuned model built on top of it.
But a counterpoint was raised. An untrained model, one with random weights, also encodes a prior, and it is both a strong prior and a bad one. Foundation models, for all their imperfections, typically provide a better starting point than random initialisation. The question isn't whether to use pretrained models, but how to understand and mitigate the biases they introduce.
Several directions were discussed. Validation pipelines need to be rigorous and domain-specific. Foundation models for science need to be more probabilistic, providing calibrated uncertainty estimates rather than point predictions. And more work is needed on understanding exactly how pretraining biases manifest in downstream tasks. There was genuine optimism that these are solvable problems.
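As a rough illustration of what "calibrated" means in practice, the sketch below checks whether a model's stated 90% intervals actually contain the truth about 90% of the time. It is my own toy example on synthetic data, assuming Gaussian predictive distributions; the function name `empirical_coverage` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def empirical_coverage(y_true, pred_mean, pred_std, level=0.9):
    """Fraction of observations falling inside the central `level` Gaussian interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # e.g. about 1.645 for a 90% interval
    inside = np.abs(y_true - pred_mean) <= z * pred_std
    return float(np.mean(inside))

rng = np.random.default_rng(0)
n = 20000
pred_mean = rng.normal(size=n)                    # the model's point predictions
y = pred_mean + rng.normal(scale=1.0, size=n)     # truth scatters around them with std 1

# A model that reports the correct spread versus one that reports half of it.
calibrated = empirical_coverage(y, pred_mean, np.full(n, 1.0))
overconfident = empirical_coverage(y, pred_mean, np.full(n, 0.5))
print(f"calibrated model:    {calibrated:.3f} of points inside the 90% interval")
print(f"overconfident model: {overconfident:.3f} of points inside the 90% interval")
```

A domain-specific validation pipeline would run this kind of check per region, regime, or variable rather than globally, since a model can be well calibrated on average and badly calibrated exactly where it matters.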
On the question of transferability — whether a foundation model trained on, say, weather dynamics could be useful for automotive aerodynamics — the panel was pragmatic. Fine-tuning will always be required. The metadata matters enormously: two systems might share the same underlying partial differential equations but differ in boundary conditions, resolution, and mesh geometry. The promise of foundation models isn't that they'll work out of the box across domains, but that they'll reduce the number of simulations needed to get started on a new problem.
The rise of the AI scientist
Perhaps the most animated discussion concerned the emergence of AI as an active participant in the scientific process — not just a tool for analysis, but an agent that can design experiments, generate hypotheses, and navigate literature.
The excitement was palpable. Agentic AI is already transforming how scientists write code, process data, and manage workflows. The near-term vision — AI systems that can perform sequences of tasks, summarise results, and iterate — is essentially here. The harder question is what comes next: AI systems that can define problems, not just solve them.
This raises a fundamental question about trust. Would you trust an AI system to autonomously run an experimental campaign in your lab to discover a new material? The reasoning capabilities of current models are improving rapidly, but the panel was cautious about full autonomy. The consensus leaned toward "co-scientists" rather than autonomous agents: AI systems that dramatically expand the range of literature a researcher can engage with, suggest connections across disciplines, and accelerate exploration, while humans remain responsible for the scientific judgement calls.
One of the most thought-provoking exchanges concerned what this means for the structure of academia. There was honest acknowledgement that early career researchers — PhD students and postdocs — are anxious about being displaced. The panel's view was more nuanced: PhD students are likely to become more productive, not obsolete, as AI amplifies their capabilities. If anything, the risk may be greater for mid-career roles, creating a potential gap between highly productive junior researchers and senior scientists who set strategic direction.
The question of credentialing AI-assisted discoveries also came up. If an AI co-scientist plays a significant role in a discovery, how do we attribute credit? How do we audit the decision process? The point was made that documenting the AI's reasoning chain is essential — not just for attribution, but because without it, there's no way to distinguish genuine discovery from hallucination.
Trustworthiness: the make-or-break challenge
If there was a single thread that wove through every topic, it was trustworthiness. The panel returned to it again and again, from different angles.
The technical challenge is substantial. Uncertainty quantification for AI predictions remains poorly understood. There is no single, standard method for obtaining calibrated confidence estimates from AI models. Until we solve this, there's a fundamental gap in the trustworthiness of AI for science. In fields like fusion energy, where AI models inform decisions about multi-billion-pound reactor designs, you can't simply say "the AI told us to do that." You need to know how wrong the model might be, and that uncertainty needs to be calibrated.
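One family of methods that is often used for this is split conformal prediction, which wraps any point predictor in an interval with a coverage guarantee on held-out data, under the assumption that calibration and test data are exchangeable. The sketch below is a minimal version on synthetic data, not a method the panel endorsed; the surrogate `model` and the data-generating function are made up for illustration.

```python
import numpy as np

def conformal_halfwidth(cal_residuals, alpha=0.1):
    """Half-width q such that prediction +/- q covers about (1 - alpha) of new points."""
    scores = np.sort(np.abs(cal_residuals))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample corrected rank
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    return float(scores[k - 1])

rng = np.random.default_rng(0)

def simulate(n):
    """Hypothetical ground truth: a noisy sine curve."""
    x = rng.uniform(0.0, 3.0, n)
    return x, np.sin(x) + rng.normal(scale=0.2, size=n)

def model(x):
    """Deliberately imperfect surrogate: a truncated Taylor series of sin(x)."""
    return x - x**3 / 6.0

# Calibrate on held-out data the model was not fitted to.
x_cal, y_cal = simulate(1000)
q = conformal_halfwidth(y_cal - model(x_cal), alpha=0.1)

# Check coverage on fresh data.
x_test, y_test = simulate(5000)
coverage = float(np.mean(np.abs(y_test - model(x_test)) <= q))
print(f"interval half-width: {q:.3f}, empirical coverage: {coverage:.3f}")
```

Note the limitation: the guarantee is marginal and holds only when new data resemble the calibration data. It says nothing about extrapolation, which is precisely the regime where scientific applications need it most.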
A practical framework emerged from the discussion. Several panellists advocated for a "hero run" approach: use AI surrogate models for cheap, broad exploration of the design space, then go back to high-fidelity physics simulations for the specific design points that matter most. The AI doesn't replace the physics — it makes the physics tractable by narrowing the search space. But critically, the fallback to the base physical model must always be available.
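The workflow can be sketched in a few lines. This is a toy version under stated assumptions: `high_fidelity` is a stand-in for an expensive physics code, the surrogate is a simple polynomial fit rather than a neural network, and the design space is one-dimensional. All names are hypothetical.

```python
import numpy as np

def high_fidelity(x):
    """Stand-in for an expensive physics simulation (lower is better)."""
    return (x - 0.62) ** 2 + 0.05 * np.sin(25.0 * x)

def fit_surrogate(xs, ys, degree=2):
    """Cheap surrogate: a low-order polynomial fitted to a few expensive runs."""
    coeffs = np.polyfit(xs, ys, degree)
    return lambda x: np.polyval(coeffs, x)

# 1. Spend a small budget of expensive runs to train the surrogate.
train_x = np.linspace(0.0, 1.0, 8)
surrogate = fit_surrogate(train_x, high_fidelity(train_x))

# 2. Use the surrogate for cheap, broad exploration of the design space.
candidates = np.linspace(0.0, 1.0, 201)
shortlist = candidates[np.argsort(surrogate(candidates))[:5]]

# 3. "Hero runs": re-evaluate only the shortlist with the full physics model,
#    and base the final decision on those results, not on the surrogate.
hero_results = high_fidelity(shortlist)
best_design = float(shortlist[np.argmin(hero_results)])

print(f"expensive evaluations used: {len(train_x) + len(shortlist)} of {len(candidates)} candidates")
print(f"selected design: x = {best_design:.3f}")
```

The key design choice is in step 3: the surrogate only decides where to look, while the number that goes into the final decision always comes from the physics model.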
Real-world use and impact
The operationalisation challenge is equally important. Real-world deployment of AI for science often stumbles on practicalities that researchers don't encounter in the lab. Models developed with abundant GPU clusters may not run on the edge computing devices available in the field. Weather models validated on European data may not integrate smoothly into existing forecasting workflows in West Africa. Co-design with end users from day one was emphasised as essential — not just for building trust, but for building systems that actually work in context.
An interesting link was drawn between AI safety and AI for science. While these communities have developed largely independently, there's growing recognition that the tools of AI safety — steering, interpretability, evaluation frameworks — have direct applications in scientific AI. One example cited was recent work using activation steering techniques from the safety literature to control the behaviour of physics simulations. The science community doesn't need to reinvent the wheel.
Looking forward
The panel closed with each member sharing their view on the future of AI for science. The mood was overwhelmingly positive, tempered by healthy realism.
The breadth of what's now possible was a recurring note. AI is uniting scientists across disciplines in ways that weren't feasible before, creating shared methodologies and a common language between fields as diverse as astrophysics and materials science. The ability to rapidly test ideas and get feedback is transforming what a PhD looks like. Open-source tools developed for one domain are being picked up and extended by others.
But there were cautions too. One panellist voiced concern that the contribution of AI to science is sometimes overstated, partly driven by the PR machines of large AI labs. When the narrative gets ahead of the reality, it can cause governments to misallocate funding, assuming capabilities that don't yet exist. And the compute constraints facing academic researchers are real, though they also force a kind of creativity that can lead to insights that wouldn't emerge in environments where compute is unlimited. The UK's national research computing resources, including Isambard-AI and Dawn, were noted as important assets for keeping academic AI for science competitive.
One of the key learnings from the original Scientific Revolution is that progress accelerates dramatically when we close the feedback loops between fields. AI and science are approaching that kind of tipping point. Not just AI feeding into science, but science feeding back into AI — informing architectures, training methodologies, and evaluation frameworks. We're not quite there yet, but we're on the cusp. Events like this one are part of closing that loop.
Ideas my own. Writing a collaborative effort with AI.


