Building Superintelligence with Zyphra’s Chief AI Strategy Officer

Erik Norden, former architect of Apple's Neural Engine and now chief AI strategy officer at Zyphra, shares insights on the evolution of AI hardware, optimizing KV cache for agent swarms, and why reinforcement learning is critical for reaching superintelligence.

Speakers:

  • Val Bercovici - Chief AI Officer, WEKA
  • Erik Norden - Chief AI Strategy Officer, Zyphra

Below is a transcript of the conversation, which has been lightly edited for clarity.

Transcript

00:00

From Hardware Architecture to AI Algorithms: A Career Evolution

Val Bercovici: You’ve got a really interesting story you were just telling me about, a career path from hardware to algorithms and software. Why don’t you run us through that a bit?

Erik Norden: Sure. So my background is computer architecture. For 10 years I was designing and defining CPUs for Infineon and MIPS. Then in 2011, I joined NVIDIA to work on the Tegra mobile chips. From there I moved more and more toward computer vision and AI acceleration. From NVIDIA I went to Apple, where we invented the Neural Engine, a scalable inference engine that is in every Apple product. From there I moved to the data center side at Intel, working on large clusters, and then to the Google TPU.

As an architect, I always have to look at the end-to-end picture and the overall problem. It’s not only the hardware I have to deliver; there’s also the software part. It’s important to understand the algorithm in order to define the accelerator. Since hardware takes 3 to 4 years to develop, I had to try to look into the future. I usually worked with the smartest people on the algorithm side to understand where things were heading so we could define the hardware. This inspired me to dive deeper into the algorithms and actually switch sides. That’s why I joined Zyphra, an AI company created to build toward superintelligence. We pre-train models from scratch with custom model architectures beyond the transformer and custom data, and we also create enterprise agents.

Val: That’s a pretty rich set of topics.

01:52

Parallel Computing in AI: From CPUs to Massive GPU Clusters

Val: Let’s maybe start with some of the more interesting questions that come to mind. There have been a lot of shifts, obviously, in hardware and software. If you were to summarize, what would have been the biggest shifts from when you started working on these systems to the level of parallelism we’re seeing right now in the hardware and how that impacts the algorithms?

Erik: I remember when I did an internship in the 1990s on self-driving cars, and I asked why we didn’t use neural nets instead of classical AI for it. But the ’90s were way too early for the whole problem. Everything was one CPU, very slow. You could maybe add four CPUs, but it didn’t help that much. Nowadays, of course, things are very different. NVIDIA, for example, invested early in parallelism, and other companies did as well. This helped a lot with the AlexNet breakthrough in 2012, where several things came together: we had the GPUs, we had large annotated datasets like ImageNet, and we had the algorithms. Everything was in place so that the breakthrough could actually happen, and people saw the value of computer vision in 2012.

Then later on, similar things happened with large language models. The datasets became bigger, with large web crawls. The transformer architecture was there. Somebody scaled it up, and suddenly you had large language models. So this was another breakthrough, and without parallelism it would not have happened that easily.

03:27

Scaling Laws and Data Quality to Build Better Foundation Models

Val: I think people underappreciate the complexity and sophistication of algorithms in this massively parallel world. You mentioned you’re working on superintelligence and agents. What have been some of the interesting lessons learned you can talk about in this world, particularly from the simplistic GPT-2, GPT-3 type of attention world, transformer world?

Erik: It’s amazing how well LLMs work and how much they actually improve with scaling. Then of course there are improvements in the model architecture and the training methods, and improvements in the data quality for pre-training are also very important. Before, people just used huge web crawls, pretty much random data from the web. Now they have started organizing it to see what’s relevant and what is not so good to use. Some horror movies, maybe not useful. Horror books, you might not want to use.

And then of course there’s deduplication, removing duplicate data. So having a good data pipeline is key, and doing it in an efficient way, because otherwise you can spend an army of people on sorting data. Then of course there’s also the algorithm perspective; things have changed. And then there’s the next level of integrating this into applications, creating agents and then swarms of agents. That’s next-level. So a lot of things are happening.
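Deduplication at web-crawl scale can start with a simple exact-match pass. A minimal sketch in Python (illustrative only, not Zyphra’s actual pipeline; the function name is hypothetical, and real pipelines add near-duplicate detection such as MinHash on top of this):

```python
import hashlib

def dedup_documents(docs):
    """Drop exact-duplicate documents using a content hash.

    Whitespace is normalized first so trivially reformatted
    copies of the same text collide on the same hash.
    """
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["the cat sat", "the  cat sat", "a different doc"]
print(dedup_documents(docs))  # → ['the cat sat', 'a different doc']
```

Exact dedup like this is cheap (one hash per document, a set lookup), which is why it is typically the first pass before the more expensive fuzzy-matching stages.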

04:50

Agent Swarms and KV Cache Optimization: Managing Memory at Scale

Val: I’m glad you brought up that term because one of my favorite expressions is there’s no such thing as an agent in this world. It’s either a swarm or no agent at all. There are so many parallel subtasks that get created by agents. So that leads me to one of my favorite topics, which is agents put real stress on KV cache and that kind of memory. What are some of the lessons you’ve learned in terms of working continuously in a KV cache saturation situation?

Erik: I think it’s super important to optimize the systems constantly. In the industry, the whole world is optimizing, and not only the hardware, not only the compute chip or the system. You also have to optimize the runtime and the KV cache handling to make sure it’s done efficiently. You can disaggregate inference, running prefill and decode on different nodes. You can compress the KV cache. There are many possible optimizations. One thing we are looking into at Zyphra is a better form of attention with less-than-quadratic behavior. There are various ways to optimize the whole system. Without optimizations, it’s very costly; just waiting for the next hardware generation is not sufficient. We have to look at every aspect in order to get the cost per token down.
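To see why agent swarms put such stress on the KV cache, it helps to count bytes. A back-of-the-envelope sketch (the model dimensions below are hypothetical, roughly 7B-class, and assume standard multi-head attention stored in fp16 with no grouped-query attention or compression):

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    """Memory for one sequence's KV cache.

    Leading factor of 2: one K and one V tensor per layer.
    dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

# Hypothetical 7B-class model: 32 layers, 32 heads of dim 128.
per_seq = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=8192)
print(per_seq / 2**30)  # → 4.0 GiB for a single 8K-token context

# A swarm of 16 concurrent agents, each with its own context,
# needs 16x that before any prefix sharing or compression.
print(16 * per_seq / 2**30)  # → 64.0 GiB
```

Once many long-context agents run in parallel, the cache rather than the weights dominates memory, which is why the techniques mentioned above, such as KV compression and disaggregated prefill/decode, matter so much for cost.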

Val: Those optimizations are not obvious, right? Some of them work, some of them don’t. Some of them have too much of a quality to latency tradeoff.

Erik: It almost always requires experimentation. Without experimentation, it’s hard to make it work.

06:25

Reinforcement Learning as the Path to Superintelligence

Val: We’re almost wrapped up here on time. Where do you sit on the camp of reinforcement learning? Is it the vital next step toward ASI in your view, or is it just another incremental scaling law that we’re finding useful right now?

Erik: Absolutely. Super important. Reinforcement learning is very critical for getting to the next level. Of course, we are now beyond pre-training, into post-training. Inference with reasoning is critical, and then of course reinforcement learning is very important for getting high-quality results.
