Reasoning & foundation models
Our work spans foundational machine learning, AI for science, agentic systems, and the trustworthy deployment of intelligent applications.
Four interlocking pillars that reflect where we believe AI can have the largest near-term impact.
Advancing the capabilities of large models — particularly in long-horizon reasoning, planning, and tool use — and the evaluation methods that make those advances measurable.
Building AI agents that can take real-world actions reliably: orchestration, memory, verification, and safety guardrails for production-grade autonomous workflows.
Applied AI for biology, chemistry, materials, climate, and medicine — partnering with research groups to accelerate the experiments that move human knowledge forward.
Interpretability, robustness, alignment, and the deployment science required to make AI systems safer, more transparent, and more accountable in practice.
A sample of what we've been working on. (Placeholder entries — update with your real outputs.)
A study of decomposition strategies and verification methods that improve agent reliability on multi-step scientific workflows.
Read paper →
A benchmark for evaluating LLMs and agents on realistic scientific tasks, from literature triage to experiment design.
View on GitHub →
How structured retrieval over chemical knowledge bases improves both factual accuracy and exploratory reasoning in domain models.
Read paper →
A working document of the production patterns, evaluation gates, and human-in-the-loop designs we use across deployments.
Read report →
We partner with academic labs, scientific institutions, and industry teams on long-horizon research problems. If you have a problem you think we'd find interesting, we'd love to hear about it.
Get in touch →