poolside's journey to AGI

Pengming Wang, Founding Engineer @ poolside
Nikolay Zinov, Founding Engineer @ poolside
Eiso Kant, CTO & Co-founder @ poolside
Jason Warner, CEO & Co-founder @ poolside

When we founded poolside in April 2023, the narrative in the industry was that all we needed to reach AGI was to scale up language modelling. And while we agreed with the importance of scaling compute and the effectiveness of next-token prediction, we believed that Reinforcement Learning (RL) would become the most important scaling axis. Research from the last two years has increasingly shown this to be true.

In this post we’ll expand on the core beliefs we’ve held since our founding, beliefs in which our conviction has only grown as our research has progressed:

  • Scaling compute will continue to be the key to reaching AGI and beyond.
  • Human language is special: it holds the key to unlocking generalization and reasoning; and language modelling is far from done.
  • RL is the most important scaling axis because it offers both a way to learn from new experiences and a method to reverse-engineer humanity’s knowledge.

The internet is a collection of human understanding, experience and thoughts compressed into language. Language is an efficient form of communication, but when information is compressed this way, we lose access to the human thinking and real-world inputs that preceded the final product.

Models are incredibly hungry for data to learn from, and the web poses two limitations. First, most of the web represents only the output product (the code, the knowledge, etc.), not the latent inputs and reasoning behind it. Second, its supply of challenging, high-quality material is finite. This scarcity is why Ilya Sutskever and others call web data “the fossil fuel of AI.”1

Enter Reinforcement Learning. RL offers a way both to overcome the finite nature of the web and to decompress the web into the thought processes that created it.

Learning from new experiences at scale

Reinforcement learning is learning from experience—trial-and-error that keeps generating fresh data.

Software engineering is a proxy for general intelligence, and it provides a rich setting for RL with verifiable rewards that we know how to scale effectively. If the web is our fossil fuel, data from interactions with the real world is our renewable energy.

When we founded poolside in April 2023, we built it around the conviction that reinforcement learning would become the most important scaling axis for model intelligence and capabilities, and we chose coding as the first major capability we wanted to achieve. Software development demands broad knowledge, deep planning and intricate reasoning, yet every experiment can run easily on a computer without physical interactions with the real world. Code supplies a huge variety of tasks with objective, automated feedback—compilation results, unit-test passes, performance benchmarks—so models constantly know when they’re right or wrong and can keep producing the data to learn from.
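
To make that feedback concrete, here is a minimal sketch of a verifiable reward for a candidate code change; the function name, the partial-credit values and the choice of pytest are our own assumptions for illustration, not a description of poolside’s internal system.

```python
import subprocess

def verifiable_reward(repo_dir: str, timeout: int = 300) -> float:
    """Score one candidate change using only objective signals.

    Illustrative sketch: 0.0 if the code does not even byte-compile,
    partial credit if it compiles but the tests fail, 1.0 if the suite passes.
    """
    # Cheap static check: does every file in the repo at least compile?
    build = subprocess.run(
        ["python", "-m", "compileall", "-q", repo_dir],
        capture_output=True, timeout=timeout,
    )
    if build.returncode != 0:
        return 0.0

    # Functional check: run the project's own test suite.
    tests = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir, capture_output=True, timeout=timeout,
    )
    return 1.0 if tests.returncode == 0 else 0.3
```

Because the signal comes entirely from compilers and test runners, it can be computed automatically for every rollout, which is what makes this kind of reward cheap to scale.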

Internally, we already orchestrate millions of coding environments seeded from a vast corpus of open‑source software spanning Python, Rust, Java, Go and other languages. Each sandbox reproduces the full build‑and‑test loop, giving agents a safe playground to refactor, debug and extend code while receiving immediate, unambiguous signals from compilers, linters and test suites. This has been our work for the last two years around Reinforcement Learning from Code Execution Feedback.
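
As a rough picture of what a single sandbox episode could look like, below is a toy, gym-style environment; the class name, the unified-diff action format and the binary reward are assumptions made for this sketch rather than poolside’s API.

```python
import shutil
import subprocess
import tempfile

class CodeExecutionEnv:
    """Toy sandbox: each episode works on a throwaway copy of a repository,
    and the only feedback is what the build-and-test loop reports."""

    def __init__(self, repo_snapshot: str):
        self.repo_snapshot = repo_snapshot
        self.workdir = None

    def reset(self) -> str:
        # Fresh scratch copy so the agent can refactor, debug and break things safely.
        self.workdir = tempfile.mkdtemp(prefix="rlcef-")
        shutil.copytree(self.repo_snapshot, self.workdir, dirs_exist_ok=True)
        return self._run_tests().stdout

    def step(self, patch: str):
        # Apply the agent's proposed unified diff, then rerun the tests.
        subprocess.run(["git", "apply", "-"], input=patch, text=True,
                       cwd=self.workdir, capture_output=True)
        result = self._run_tests()
        reward = 1.0 if result.returncode == 0 else 0.0  # unambiguous signal
        done = result.returncode == 0
        return result.stdout, reward, done, {}

    def _run_tests(self):
        return subprocess.run(["python", "-m", "pytest", "-q"],
                              cwd=self.workdir, capture_output=True, text=True)
```

A training loop would then sample patches from the model, call step, and use the returned reward and test output to update the policy.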

Over time these virtual gyms will be complemented by an ever‑larger fleet of agents operating in live settings: production servers, CI pipelines, autonomous sandboxes and other environments.

Scaling to millions of agents in real‑world environments increases the volume and variety of interaction data by orders of magnitude. Each agent contributes compile logs, test results, runtime traces and user‑level performance metrics: clear signals that let the system iterate faster and push toward superhuman engineering capabilities. The bigger this feedback loop grows, the faster models improve.

Decompressing humanity’s knowledge

Even though agents will soon emit more tokens than all human authors combined, sheer volume is not the same as informational density. A graduate-level physics text compresses centuries of discovery and months of an author’s reasoning into a slim stack of pages; the average synthetic trace, by contrast, is a verbose record of every exploratory branch.

This means vast insights remain locked within human‑generated data. Continuing the energy analogy: if simply consuming human data for next-token prediction is like burning fossil fuels, then systematically extracting its hidden potential is akin to harnessing nuclear power.

We need to improve generalization to learn more from the same data. Our view is that a good way to achieve this is to pour more compute into RL exploration, enabling the model to try different thought processes that explain data—essentially, the model learns by thinking. Doing this in the most generalized and abstract way without narrow priors is, in our opinion, the most promising frontier of AGI research. To learn more about our thinking and progress on this, you’ll have to join our team 😉.
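
As a purely generic illustration of that “learning by thinking” pattern (and not a description of our approach), the idea can be sketched as follows; sample_trace is a stub standing in for a real language model, and every name here is ours, chosen only for the example.

```python
import random
from dataclasses import dataclass

@dataclass
class Trace:
    thought: str  # the sampled reasoning process
    answer: str   # the conclusion it arrives at

def sample_trace(prompt: str) -> Trace:
    # Stub standing in for a language model decoding a chain of thought;
    # it guesses randomly only so that the sketch runs end to end.
    guess = random.choice(["17", "19", "23"])
    return Trace(thought=f"reasoning about {prompt!r} -> {guess}", answer=guess)

def explore_explanations(prompt: str, observed: str, n_samples: int = 8):
    """Sample several candidate thought processes for the same piece of data
    and score each one by whether its conclusion matches what was observed;
    a policy-gradient-style update would then reinforce the high-reward traces."""
    traces = [sample_trace(prompt) for _ in range(n_samples)]
    rewards = [1.0 if t.answer == observed else 0.0 for t in traces]
    return list(zip(traces, rewards))
```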

Modalities beyond language

Intelligence is not confined to language: it also spans vision, spatial understanding and physical interaction. We hold the view that once a system can understand and reason about the world in language, achieving (super)human capabilities in the other modalities becomes dramatically easier.

Natural language is the densest abstraction layer we possess. By compressing high‑dimensional sensory streams into a symbolic, low‑bandwidth channel, it forces a learner to build rich internal models of objects, causality and context. With such models in place, extending competence to new modalities is largely a matter of translating between representations: linking images to captions or action plans to motor commands, rather than reinventing perception from scratch.

Since language is inherently compositional, its syntax and discourse structures train a model to manipulate hierarchies: variable binding, recursion and long‑range dependency tracking come for free. Those very skills transfer to spatial reasoning—describing 3‑D scenes, decomposing kinematics or planning multi‑step tasks—without needing additional data at internet scale.

The second advantage of focusing on language is data efficiency. Web‑scale text corpora coupled with self‑supervised objectives allow us to pre‑train colossal models without costly human annotation. When those models are later trained together with images, video or robotics demonstrations, they already contain a rich prior about the world, slashing the sample complexity of multimodal learning.

Building toward AGI

Building toward AGI isn’t about dumping ever-larger piles of text into ever-larger neural networks; it’s about decompressing all human experiential learning and thinking from our existing finite data, introducing agent experiential learning to tap a never-ending pool of data, and methodically and precisely applying compute. This is all in the service of creating the pathway to AGI:

  • Reinforcement Learning-driven exploration provides diverse trajectories from new experiences of trillions of real-world interactions.
  • Reinforcement Learning also offers a path to deeper learning from the same complex and compressed web corpus, pressuring the model to generalize, transforming raw FLOPs into meaningful searches over hypotheses.
  • Scale amplifies this loop: more compute and more agents mean more experiments, richer feedback, and faster convergence.

We are convinced that RL holds the key to the next phase of model capabilities. Our job at poolside is to design and run this energy system:

  • The fusion reactor: extracting the remaining energy from the data that already exists and turning it into progress.
  • The wind turbine: using RL to harvest the energy of new, fresh data generated through learning and exploration.

We are maximizing these energy sources, setting abstract and general objectives, harvesting rich interaction traces, and keeping the learning cycle spinning until the model’s reasoning surpasses our own.