Building Superintelligence with Zyphra’s Chief AI Strategy Officer

Erik Norden, former architect of Apple's Neural Engine and now chief AI strategy officer at Zyphra, shares insights on the evolution of AI hardware, optimizing KV cache for agent swarms, and why reinforcement learning is critical for reaching superintelligence.

Speakers:

  • Val Bercovici - Chief AI Officer, WEKA
  • Erik Norden - Chief AI Strategy Officer, Zyphra

Below is a transcript of the conversation, which has been lightly edited for clarity.

Transcript

00:00

From Hardware Architecture to AI Algorithms: A Career Evolution

Val Bercovici: You’ve got a really interesting story you were just telling me about, a career path from hardware to algorithms and software. Why don’t you run us through that a bit?

Erik Norden: Sure. So my background is computer architecture. For 10 years I was designing and defining CPUs for Infineon and MIPS. Then in 2011, I joined NVIDIA to work on the Tegra mobile chips. From there I moved more and more toward computer vision and AI acceleration. From NVIDIA I went to Apple, where we invented the Neural Engine, a scalable inference engine that is in every Apple product. From there I moved to the data center side at Intel, working on large clusters, and then to the Google TPU.

As an architect, I always have to look at the end-to-end picture and the overall problem. It’s not only the hardware I have to deliver; there’s also the software part. It’s important to understand the algorithm in order to define the accelerator. Since hardware takes 3 to 4 years to develop, I had to try to look into the future. I usually worked with the smartest people on the algorithm side to understand where things were heading so we could define the hardware. This inspired me to dive deeper into the algorithms and actually switch sides. That’s why I joined Zyphra, an AI company created to build toward superintelligence. We pre-train models from scratch with custom model architectures beyond the transformer and custom data, and we also create enterprise agents.

Val: That’s a pretty rich set of topics.

01:52

Parallel Computing in AI: From CPUs to Massive GPU Clusters

Val: Let’s maybe start with some of the more interesting questions that come to mind. There have been a lot of shifts, obviously, in hardware and software. If you were to summarize, what would have been the biggest shifts from when you started working on these systems to the level of parallelism we’re seeing right now in the hardware and how that impacts the algorithms?

Erik: I remember when I did an internship in the 1990s on self-driving cars, and I asked why we didn’t use neural nets instead of classical AI for it. But the ’90s were way too early for the whole problem. Everything was one CPU, very slow. You could maybe add four CPUs, but it didn’t help that much. Nowadays, of course, things are very different. NVIDIA, for example, invested early in parallelism, and other companies did as well. This helped a lot with the AlexNet breakthrough in 2012, where several things came together: we had the GPUs, we had large annotated datasets like ImageNet, and we had the algorithms. Everything was in place so that the breakthrough could actually happen, and people saw the value of computer vision in 2012.

Then later on, similar things happened with large language models. The datasets became bigger, with large web crawls. The transformer architecture was there. Somebody scaled it up, and suddenly you had large language models. So this was another breakthrough, and without parallelism it would not have happened that easily.

03:27

Scaling Laws and Data Quality to Build Better Foundation Models

Val: I think people underappreciate the complexity and sophistication of algorithms in this massively parallel world. You mentioned you’re working on superintelligence and agents. What have been some of the interesting lessons learned you can talk about in this world, particularly from the simplistic GPT-2, GPT-3 type of attention world, transformer world?

Erik: It’s amazing how well LLMs work and how much they actually improve with scaling. Then of course there are improvements in the model architecture and the training methods, and improvements in the data quality for pre-training are also very important. Before, people just used huge web crawls, pretty much random data from the web. Now they have started organizing it to see what’s relevant and what is not so good to use. Some horror movies, maybe not useful. Horror books, you might not want to use.

And then of course there’s deduplication, removing duplicate data. So having a good data pipeline is key, and doing it in an efficient way, because otherwise you can spend an army of people on sorting data. Then of course there’s also the algorithm perspective; things have changed. And then there’s the next level of integrating this into applications, creating agents and then swarms of agents. That’s next-level. So a lot of things are happening.
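Deduplication at web-crawl scale can start with a simple exact-match pass. A minimal sketch in Python (illustrative only, not Zyphra’s actual pipeline; the function name is hypothetical, and real pipelines add near-duplicate detection such as MinHash on top of this):

```python
import hashlib

def dedup_documents(docs):
    """Drop exact-duplicate documents using a content hash.

    Whitespace is normalized first so trivially reformatted
    copies of the same text collide on the same hash.
    """
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["the cat sat", "the  cat sat", "a different doc"]
print(dedup_documents(docs))  # → ['the cat sat', 'a different doc']
```

Exact dedup like this is cheap (one hash per document, a set lookup), which is why it is typically the first pass before the more expensive fuzzy-matching stages.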

04:50

Agent Swarms and KV Cache Optimization: Managing Memory at Scale

Val: I’m glad you brought up that term because one of my favorite expressions is there’s no such thing as an agent in this world. It’s either a swarm or no agent at all. There are so many parallel subtasks that get created by agents. So that leads me to one of my favorite topics, which is agents put real stress on KV cache and that kind of memory. What are some of the lessons you’ve learned in terms of working continuously in a KV cache saturation situation?

Erik: I think it’s super important to optimize the systems constantly. In the industry, the whole world is optimizing, and not only the hardware, not only the compute chip or the system. You also have to optimize the runtime and the KV cache handling to make sure it’s done efficiently. You can disaggregate inference, running prefill and decode on different nodes. You can compress the KV cache. There are many possible optimizations. One thing we are looking into at Zyphra is a better form of attention with less-than-quadratic behavior. There are various ways to optimize the whole system. Without optimizations, it’s very costly; just waiting for the next hardware generation is not sufficient. We have to look at every aspect in order to get the cost per token down.
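To see why agent swarms put such stress on the KV cache, it helps to count bytes. A back-of-the-envelope sketch (the model dimensions below are hypothetical, roughly 7B-class, and assume standard multi-head attention stored in fp16 with no grouped-query attention or compression):

```python
def kv_cache_bytes(layers, heads, head_dim, seq_len, dtype_bytes=2):
    """Memory for one sequence's KV cache.

    Leading factor of 2: one K and one V tensor per layer.
    dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * layers * heads * head_dim * seq_len * dtype_bytes

# Hypothetical 7B-class model: 32 layers, 32 heads of dim 128.
per_seq = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=8192)
print(per_seq / 2**30)  # → 4.0 GiB for a single 8K-token context

# A swarm of 16 concurrent agents, each with its own context,
# needs 16x that before any prefix sharing or compression.
print(16 * per_seq / 2**30)  # → 64.0 GiB
```

Once many long-context agents run in parallel, the cache rather than the weights dominates memory, which is why the techniques mentioned above, such as KV compression and disaggregated prefill/decode, matter so much for cost.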

Val: Those optimizations are not obvious, right? Some of them work, some of them don’t. Some of them have too much of a quality to latency tradeoff.

Erik: It almost always requires experimentation. Without experimentation, it’s hard to make it work.

06:25

Reinforcement Learning as the Path to Superintelligence

Val: We’re almost wrapped up here on time. Where do you sit on the camp of reinforcement learning? Is it the vital next step toward ASI in your view, or is it just another incremental scaling law that we’re finding useful right now?

Erik: Absolutely. Super important. Reinforcement learning is very critical for getting to the next level. Of course, we are now beyond pre-training, into post-training. Inference with reasoning is critical, and then of course reinforcement learning is very important for getting high-quality results.
