
The Future of Frontier Models And What They Will (And Won’t) Do Next

WEKA convened a panel of industry experts at World Summit AI Amsterdam to discuss what's accelerating frontier AI — and what's quietly holding it back.

Speakers:

  • Lauren Vaccarello, Chief Marketing Officer at WEKA
  • Georgia Channing, AI for Science Team Lead at Hugging Face
  • Marzieh Fadaee, Head of Cohere Labs
  • Val Bercovici, Chief AI Officer at WEKA


Below is a transcript of the conversation, which has been lightly edited for clarity.

Transcript

Setting the Stage: What’s Next for Frontier Models

Lauren Vaccarello: Today we are going to be talking about frontier models: What’s next for the most advanced AI systems. We are thrilled to have an incredibly esteemed panel with us today.

Immediately to my left is Georgia Channing. She is the machine learning for science lead at Hugging Face. She works on enabling scientific discovery with AI and building tools for scientists in an open-source community. She’s been building in biotech, fusion engineering and materials discovery with Hugging Face. She has her Ph.D. in computer science from the University of Oxford, where she worked on multi-agent methods and distributed training.

To her left, we have Marzieh Fadaee. She’s the head of Cohere Labs, where she leads research on fundamental problems in artificial intelligence. Her work spans multilingual language models, data-efficient learning, model evaluation, and trustworthy AI, with a focus on building systems that are robust, inclusive, and globally impactful. She holds her Ph.D. from the University of Amsterdam, where she conducted foundational research on neural machine translation.

And last but not least, we have Val Bercovici. Val is the chief AI officer at WEKA, where he helps AI builders advance their enterprise and agentic AI research and innovation. He has extensive experience in the infrastructure industry. He’s been the CTO at NetApp and SolidFire. He also co-created the Cloud Native Computing Foundation, which is the home of Kubernetes. Incredible group of panelists with us today.

What Emerging AI Capabilities Are Most Exciting Right Now?

Lauren: First up, I want to ask each of you: What emerging capabilities in frontier models excite you the most? Georgia, do you want to kick us off?

Georgia Channing: Sure. As you mentioned, I have a bit of a science flavor, and I’m going to bring that here as well. The long context that we’re now able to achieve means we can consume so much more scientific knowledge than we ever could before. Say you want to synthesize a molecule. You need to go through hundreds of papers to find the information on how to synthesize it, and that’s something we’re only just getting to. It’s really cool.

Marzieh Fadaee: I think AI for science is a super interesting area that I’m very excited to see. A lot of recent work — at Hugging Face and also at new startups — is tackling problems that impact humanity in a more direct way, like carbon capture and the environment. I definitely look forward to seeing where the capabilities of these models go. I’m also very curious about capabilities we haven’t thought of and use cases we haven’t really looked at so far that we will have in a few years. When we look back at five years ago and what this technology could achieve, some things we do today were not really an option. So that realm — what is still too hard to even think about — is very exciting to me.

Val Bercovici: For me, the seasons of AI this year have been fascinating. We started the year transitioning from a lot of non-reasoning pretrained models to reasoning models in the spring. Then it was like the summer of coding agents, with agents becoming really real, as I mentioned in the keynote a couple hours ago. What’s most exciting to me right now are the advances in reinforcement learning, the real fusion of training and inference in reinforcement learning loops and episodes. If you spend time with a lot of frontier lab people, they’re seeing the direct path between reinforcement learning and AGI. We don’t know if that’s really going to happen, and there might be a couple of exits along the way, but there’s real material progress toward AGI.

New AI Use Cases That Weren’t Possible a Year Ago

Lauren: As each of you has said, how advanced we’ve gotten even in the last year, in the last 18 months, is completely incredible. Are there any other new applications or use cases you’re seeing that are possible today that weren’t even possible a year ago? Marzieh, what do you think?

Marzieh: I think many math and code capabilities we see today — the more standardized ways of evaluating through games or math competitions like the Olympiads — that is one area I think has been a little bit surprising. But there are also things that maybe we got used to very quickly. We now expect the models to have long conversations, context switch, talk about completely different topics and areas, and the models are doing quite well there. This is related to long context, but also just really building all-around general models that are capable of knowing what is important: what you have to use from long memory, what you have to retrieve from short memory. This is something we can do for the most part now, which is very exciting, and I’m hoping it will also unlock new use cases for this technology.

What Is Holding Frontier AI Models Back?

Lauren: Slightly different question for you, Georgia. As we’re thinking about all this possibility, what do you think is holding frontier models back right now?

Georgia: I think about it more in the sense of what is holding us back from achievement, and I think that’s different from what is holding frontier models back. As Marzieh was talking about, a lot of the industry is focused on making really general models. But actually, most of the time you don’t need models that know a ton about art history and also how to code, particularly for business use cases. You have really limited tasks that you want to work on. I think the challenge of making models compact enough to be cost-efficient, but still able to deliver on those specific tasks, is what is going to hold back advancements in AI. And then from the science perspective, I think it’s really just that we haven’t figured out how to format the data. We have a ton of data, but we haven’t figured out how to serve it to models yet. There are actually a lot of different problems going on in different areas, and that’s what’s holding back the realization of what we want AGI to be.

The Real Cost of Running AI: Inference, Tokens, and Infrastructure

Lauren: You said something I want to double-click on, this idea of cost. We all know running these models is extremely costly from a business perspective and from an end-user perspective. Val, I know you work with some of the most innovative AI companies on the planet. What are you seeing as some of their challenges around cost?

Val: It’s very much like the Uber scenario. If you remember taking Uber about 10 years ago, it was very cheap. We found out it was highly subsidized. That’s why they were able to disrupt a lot of taxi industries everywhere. Today, we’re all afraid of surge pricing. We’re kind of in the surge pricing era of AI inference right now, where the more you start to use particular agents, the more tokens they consume and you run out. Instead of just being able to pay for more, we’re such a capacity-constrained industry that you literally hit your rate limit or you’re throttled and you have to pause artificially because of supply constraints.

We see it particularly because in order to scale, you’ve got to scale in a very homogeneous way — where if you need more memory, you’ve got to add more GPUs just to get the memory and underutilize the GPUs. If you need more GPUs, you’re basically tagging along stranded memory for the ride. We haven’t disaggregated the hardware the way we’re, fortunately, disaggregating the software from a prefill-decode perspective.
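Val’s point about homogeneous scaling can be made concrete with some back-of-envelope arithmetic. Everything below, the per-GPU memory, the workload’s KV-cache footprint and its compute need, is an illustrative assumption rather than a measurement:

```python
# Illustrating "stranded" resources: on a GPU, memory and compute come
# bundled, so provisioning for one over-buys the other. The hardware and
# workload numbers here are rough, illustrative assumptions.
import math

GPU_MEM_GB = 80  # HBM per GPU (roughly H100-class)

def gpus_for_kv_cache(kv_cache_gb: float) -> int:
    """GPUs you must buy just to hold the KV cache in memory."""
    return math.ceil(kv_cache_gb / GPU_MEM_GB)

# A long-context workload that needs 400 GB of KV cache but only about
# two GPUs' worth of decode compute:
bought = gpus_for_kv_cache(400)   # 5 GPUs, purchased for memory alone
stranded = bought - 2             # 3 GPUs' worth of compute sits idle
print(bought, stranded)           # 5 3
```

The hypothetical workload here is memory-bound, so three of the five purchased GPUs contribute compute that nobody uses, which is exactly the coupling that disaggregated hardware would break.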

Lauren: So even if I wanted more advancement, even if I wanted to run more large or small models, I might reach a point where the fundamental infrastructure is just going to block me from being able to work on carbon capture or scientific development?

Val: And I can clear up a misconception about small models too. Everyone thinks small models are automatically cheaper, automatically going to save you money. Not really. You end up using more tokens. The more efficient these small models get, Jevons’ paradox kicks in, and you end up consuming more tokens precisely because the small model is more efficient. So you’re back to the same problem, though these are the problems we want to have. We’re advancing the science, we have more capability, and we’ve got to figure out how to afford it all.
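A quick, hypothetical calculation shows how Jevons’ paradox can erase the savings. The prices and token counts below are made up purely for illustration:

```python
# If per-token price drops 10x but trace length and usage grow 10x,
# total spend per task is unchanged. All numbers are illustrative
# assumptions, not real model pricing.

def task_cost(price_per_mtok: float, tokens_per_task: int) -> float:
    """Dollars to complete one task."""
    return price_per_mtok * tokens_per_task / 1_000_000

# Hypothetical large model: expensive tokens, concise output.
large = task_cost(price_per_mtok=10.0, tokens_per_task=2_000)
# Hypothetical small model: 10x cheaper tokens, but longer traces and
# more retries once cheapness invites heavier use.
small = task_cost(price_per_mtok=1.0, tokens_per_task=20_000)

print(f"large: ${large:.3f}  small: ${small:.3f}")  # same cost per task
```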

Small Models vs. Large Models: Misconceptions and Trade-Offs

Lauren: Going to small models, when would you want a small model vs. a large model or a custom model? Marzieh, are there any other misconceptions with small models?

Marzieh: There was a period of time when everyone was racing to scale. The scaling laws were great, and the more parameters and the more data we threw at these models, the better they became. But more recently, we are revisiting how the learning is actually happening. It’s very clear that scale helps: with the right setup and more parameters, you have more capacity to learn different capabilities. But it’s not always necessary, depending on whether you want a specialized model in an area or even a more general model. We now see multilingual models that are quite good at a smaller size.

So it really boils down to where we do this optimization: whether it’s the easy part of increasing the size of everything, or, in my opinion, the harder part of optimizing at a smaller scale, whether that’s the quality of your data, what data is useful to learn from, or the learning and optimization algorithms you use. All of these have been demonstrated in the last year or two: with much smaller models, you can reach the capabilities of models 10 or 100 times bigger from a year before.

One particular project we worked on recently was training a language model specialized in code. The general idea is that we should have really high-quality code data to train these models — code that passes all unit tests, so code that is right. What we actually saw was that if we relaxed the passing threshold — so the code really didn’t have to be perfect, it was fine if it still failed a few unit tests — that was actually more useful for the model to learn from. It helped the model generalize to unseen cases and also allowed harder problems to get in. So there are still a lot of interesting open questions in the data space where you’ll be able to train a smaller model that can have the capability of a bigger model that has seen a lot of noisy internet data at a larger scale.

I also think it’s important for the developer community, because smaller models are much easier and more accessible for everyone to use.
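The relaxed-threshold filtering Marzieh describes can be sketched in a few lines. The sample format and threshold values here are assumptions for illustration, not Cohere Labs’ actual pipeline:

```python
# Keep training samples whose unit-test pass rate meets a threshold.
# With a strict threshold (1.0), only perfect solutions survive; relaxing
# it lets imperfect solutions to harder problems into the training set.

def filter_training_code(samples, pass_threshold=1.0):
    """Keep samples whose pass rate is at least pass_threshold."""
    return [s for s in samples if s["passed"] / s["total"] >= pass_threshold]

pool = [
    {"id": "easy_1", "passed": 5, "total": 5},  # perfect, easy problem
    {"id": "hard_1", "passed": 4, "total": 5},  # imperfect, hard problem
    {"id": "hard_2", "passed": 2, "total": 5},  # mostly wrong
]

strict = filter_training_code(pool, pass_threshold=1.0)
relaxed = filter_training_code(pool, pass_threshold=0.8)
print(len(strict), len(relaxed))  # 1 2
```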

Can You Use ChatGPT to Cure Cancer? Custom Models for Scientific Discovery

Lauren: Georgia, I know we’ve talked about not just large and small models, but custom models. I use ChatGPT. I used it to plan my trip to Chile, and it planned an incredible trip to South America for me. Can I just use ChatGPT to cure cancer or do some scientific development?

Georgia: It’s a little bit tougher than that, for a couple of reasons. One thing that’s actually really interesting is that even in that example — planning a trip to Chile — unlike with coding, where there’s a verifiable reward, there’s not a clear answer to what a good trip to Chile is. Maybe you looked at some other sites, so you can evaluate that. But for evaluating cures to cancer, the verification pipeline might be 500 million dollars and six months of work. That’s a really significant jump; it’s not in the same class of problems.

And beyond that, fundamentally most science data is really high-dimensional. We almost all work with transformers — sequence-to-sequence models — which work really well for text and even for proteins, where you also have a sequence of amino acids. But when you’re thinking about cancer data, you’re often talking about whole-slide images: Probably an image that’s a gigabyte by itself for a very small sample of your skin, where every single pixel has a multidimensional embedding of the genes that are in that cell, whether or not they’re cancerous, and other information like that. How do you tokenize that? I don’t think anybody knows the answer to that question.

Lauren: Where would you like to see custom model development go in the scientific community?

Georgia: Honestly, I think the barrier there — though I was just highlighting a technical question — is not actually a technical one. I think it’s mostly a social problem. It has been really difficult for people from the machine learning community and from the domain sciences community to get together and really collaborate. A lot of that is because in the machine learning community we’re often asking, “What do we optimize? What’s my loss?” And when you go talk to a material scientist who’s interested in a particular property and says it would be cool if we understood this — those two things are not inherently compatible at all. So lots of stuff ends up not being done. I think the fundamental issue we have right now is a social one, rather than a technical one. There’s so much more we can do with transformers that we have not been able to do for reasons that have nothing to do with the technology.

Why Multilingual AI Models Matter for a More Inclusive World

Lauren: Marzieh, this reminds me of what we talked about with building more open AI and building more diversity in AI, even with incorporating more languages. I know you’re doing a lot of work at Cohere on how you bring in multilingual support. Where do you see the importance of that? How do you see that advancing models?

Marzieh: The importance of working on multiple languages, I don’t think we need to justify it in a room like this. We all come from different backgrounds, speaking different languages. I think everyone here speaks at least two languages.

And everyone can agree that when you speak different languages, there are concepts and specific nuances to each language that you might not be able to literally transfer and translate between them. The diversity of each language and how each of them captures the human experience is also where this technology started from. If you look at the history of the transformer, and before that attention models, they were all really trying to use neural networks in the field of machine translation, because it’s very challenging to capture meaning in a language and also transfer meaning in a language.

There are practical challenges: The data for many of these languages and how represented that data is, versus how much it’s a translation of existing English data. And then there are societal challenges of how humans would interact with these multilingual models in their own languages: from the safety side, from the bias side, and how requirements and sensitivities may be different in different languages. At the end of the day, it is also a multi-objective optimization problem. There are multiple things you want to learn at the same time.

A big part of what we’ve done at Cohere Labs has been on multilingual models and, even more importantly I would say, multilingual evaluation. To your earlier question about what is holding this technology back: good evaluation is something we definitely need.

How Should We Evaluate AI Models? The Problem with Benchmarks

Lauren: How do you think the role of good evaluation fits in?

Georgia: You have no idea if you’re making progress unless you have good evals. And evals are kind of unsexy, so not that many people want to work on them, but they’re actually the building blocks of all progress. There’s a great site for people who want to check it out called Artificial Analysis, which has some really interesting evals including IQ points versus cost for different models. There is no progress without good evals.

Lauren: I think you bring up a really good point about IQ points vs. cost and how we are measuring the efficacy or success of models today. I don’t think it always makes the most sense to judge on how many tokens you used and what the quality was. I think there needs to be a fundamental rethink and evolution of how we measure success here.

Marzieh: What we actually see nowadays is a disconnect between the benchmarks that we have and the real-world capabilities of the models — the “vibe checking” of the model. The disconnect partially comes from how the evals become saturated or contaminated very quickly. We have been working on creating benchmarks in our research lab, but more recently I feel like that might not be the way to go, because very quickly everyone can train their models on that benchmark, intentionally or unintentionally. Each benchmark might capture one specific specialty and expertise, but this overall sense of which model is doing better, what it feels like, is not really captured. I think we should rethink the framework of evaluation: Should it be just one score or a leaderboard? And what are other ways, now that this technology is so good that it’s catching up with every benchmark right away, that we can test it?

Lauren: I couldn’t agree with you more. In some ways, we don’t even know how much some of this costs, and how can you understand if it’s good or bad without even understanding costs? And how would you run a business without knowing what success looks like, what your customer needs, and what the fundamental costs are? We are not even on day one of AI if we think about this. Val, I know you work with so many businesses right now thinking about the success and efficacy of AI. What are you seeing? How do you think we should be looking at success and evaluating the frontier models that we’re using?

Val: It’s such an interesting topic. Artificial Analysis is finally starting to benchmark multi-turn, longer-context conversations. Even their data from the past 12 to 18 months is outstanding — not just on quality and the IQ portion, but they’ve added cost. They measure the amount of reasoning tokens each model uses if it’s a reasoning model. A lot of the traditional benchmarks we relied on are really saturated right now. I’m also a big fan of the ARC Prize and the ARC-AGI benchmarks. ARC-AGI-1 was cool because it proved you could literally reason out of distribution, outside of the training dataset, but at pretty enormous cost. People were guessing: Was it 16 hours to complete some of these? Was it over $1 million to saturate ARC-AGI-1? ARC-AGI-2 is largely unsaturated right now because they’re factoring in efficiency as a core part of the result. It’s important to have quality and intelligence, but if it’s not practical, it’s just not going to get used.

Some interesting techniques are coming out now. NVIDIA is leading the way, saying our general-purpose GPUs are amazing at training and surprisingly good at inference, but inference at scale is now so important that they preannounced the Rubin CPX processor, which handles just the prefill part of inference. That shows you that you really do need to disaggregate prefill and decode. You do need to really optimize your infrastructure to be able to afford all of the ambitious, aspirational goals we have for AI at scale.

Lauren: I think a lot about this idea of commercially viable AI. To the point you made earlier about the early days of Uber being subsidized — this is the advent of where AI is going to go. For everybody here, what fundamental trade-offs are we navigating to get to this idea of commercially viable, but also viable for society? What are the trade-offs we have to navigate from model size, efficiency, cost, capacity, success?

Val: There’s a classic trade-off between throughput and latency for users. You can have a private jet to get a few people somewhere really, really fast. If you put people on a bus, it’s way more efficient, but that bus is going to get to the destination way, way slower. We still struggle with that right now: delivering really low latency, but for an affordable, broad community of users, and ideally in the same inference batch. So that’s one of the fundamental trade-offs we still make.

Marzieh: I would add to that, just trying things out to understand them a little bit better first. Reasoning models are super popular and very impressive, but there has been more recent research on how much of the reasoning trace is actually useful, and whether the model actually needs it to get to the answer. There have been papers showing that if you essentially randomly remove the bottom half of the reasoning trace, the model still does great. How much of it is for the benefit of the human who is looking at it, to interpret the steps and how the model got to the answer? And how much of it is actually needed to get to the answer? These reasoning traces are just long inference-time compute that we are spending. One thing I think is important is that, with every new way we find to use these models even a little bit better, we should do that in a systematic, scientific way: really studying what it is about this particular way of training or using these models that is helping, and how we can do it more efficiently and more effectively.
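The ablation Marzieh describes can be sketched roughly like this. The trace representation and the random-removal policy are simplifying assumptions standing in for what the papers actually do:

```python
# Randomly drop half of a reasoning trace's steps (keeping their order)
# before re-scoring the model. If accuracy holds, much of the trace was
# for human interpretability rather than for reaching the answer.
import random

def truncate_trace(steps, keep_fraction=0.5, seed=0):
    """Keep a random keep_fraction of steps, preserving their order."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(steps) * keep_fraction))
    kept = sorted(rng.sample(range(len(steps)), n_keep))
    return [steps[i] for i in kept]

trace = [f"step {i}" for i in range(8)]
short = truncate_trace(trace)
print(len(short))  # 4
```

The experiment would then compare answer accuracy with `trace` versus `short` in the prompt; the truncation itself is the only part sketched here.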

Georgia: I think part of the reason people were motivated to produce reasoning traces was so that you could do step-wise corrections, particularly in math and coding. That makes a lot of sense when you’re training a model, but it’s not clear that you need it at inference. Could you have it in training in a way that would cut it at inference?

Marzieh: Exactly.

Georgia: And I think from a business perspective, it’s important for people to think about when they need AI. There have been papers about how AI can lead to a lot of work inefficiency. Before you go and ask ChatGPT to do the presentation you don’t want to make, it’s worthwhile to think about whether that’s a good thing to have AI doing for you. That would also benefit the environment. I think there’s also a place for small models here — particularly if you have an agent in your email. Banal responses to very common work interactions do not need GPT-4 or GPT-5. You could probably have a Qwen half-billion-parameter model do that for you just as well. As prices rise for models that are actually really expensive to run, it’s going to become much more about matching the correct model to the correct task.
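At its simplest, matching the model to the task is a routing table. The task categories and model names below are placeholders, not real products or APIs:

```python
# A toy router: send banal, templated requests to a small local model
# and reserve the frontier model for everything else. Names are
# hypothetical placeholders.

ROUTES = {
    "email_ack": "small-0.5b",       # e.g. a half-billion-parameter model
    "meeting_reply": "small-0.5b",
    "research_summary": "frontier-large",
    "code_review": "frontier-large",
}

def pick_model(task_type: str) -> str:
    # Default to the large model only when the task can't be classified.
    return ROUTES.get(task_type, "frontier-large")

print(pick_model("email_ack"))       # small-0.5b
print(pick_model("novel_analysis"))  # frontier-large
```

Real routers classify the request with a cheap model first, but the economics are the same: every request that resolves on the small side avoids frontier-model pricing.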

AI Energy Consumption: Making Artificial Intelligence Sustainable

Lauren: You mentioned the energy side of this, how power-hungry AI is. Val, what are you seeing in terms of power consumption? Are there ways we can make AI not just commercially viable, but also sustainable?

Val: It’s crazy when you actually do the math. Just putting two PDFs into a medium-sized prompt — whether it’s a ChatGPT session or the first turn of hundreds of turns in an agent session — the prefill alone consumes the entire energy usage of a household for a day. More than 20 kWh just to start a chat session, or certainly an agent session. And because we run out of memory so quickly in a parallel, concurrent, multi-sub-agent task environment, we’re re-prefilling again after minutes. Every few minutes per agent subtask, that GPU is redundantly re-prefilling the early part of the context over and over again. We’re running these AI factories as if it were before the Model T moment, before we had assembly lines. We have to get way more efficient in our token pipelines so we’re not wasting energy before we actually consume it for productive, valuable things.
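Val’s household comparison checks out at order-of-magnitude level (a typical household uses roughly 20 to 30 kWh per day). The GPU count, power draw, and duration below are illustrative assumptions chosen to land in that range, not measurements of any real deployment:

```python
# Energy of one large prefill, plus the cost of redundantly re-prefilling
# the same context across agent subtasks. All inputs are assumptions.

def prefill_energy_kwh(num_gpus: int, gpu_power_kw: float,
                       seconds: float) -> float:
    """kWh = GPUs x kW per GPU x hours."""
    return num_gpus * gpu_power_kw * seconds / 3600

one_prefill = prefill_energy_kwh(num_gpus=100, gpu_power_kw=1.2,
                                 seconds=600)
print(f"{one_prefill:.1f} kWh")  # 20.0 kWh, about a household-day

# 8 sub-agents each re-prefilling the shared context 5 extra times:
wasted = one_prefill * 8 * 5
print(f"{wasted:.0f} kWh of redundant prefill")  # 800 kWh
```

The multiplier in the last step is the point: prefix caching or prefill/decode disaggregation attacks the redundant term, not the unavoidable first prefill.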

Lauren: Georgia, how are you thinking about the energy consumption challenge?

Georgia: At Hugging Face, we really focus on small models — that’s where we really invest and put our energy. Part of that is also to enable a much broader community to use AI. I think otherwise, hopefully with AI for science, we discover great methods for carbon capture. And maybe we should also use AI less (which is not a very good tagline for the World Summit AI).

Lauren: But it is, to what we said earlier, about using AI smartly. Once you know that uploading a couple of PDFs is the equivalent of the energy to run your house for a day, do I need to ask ChatGPT how to get to the World Summit AI? Or can I just do a super-quick Google search? Or actually interact with a person and ask them? I know this is about the next step for the most advanced AI systems, but human interaction might be a good thing from time to time.

The Case for Open Source AI: Transparency, Access, and Innovation

Lauren: Marzieh, I’d love to talk about open source and why we should look at open source AI models.

Marzieh: This technology has been built on open research. What open source enables is everyone building on top of each other, everyone learning from each other’s mistakes and successes and adding to that. It’s also a really great way to create transparency when you share the details of your work, whether it’s your models or publishing papers on your methods. Transparency also helps you reproduce, replicate, and check: When something is open, it’s less likely to be designed for the benefit of a small group. At the end of the day, it just helps advance this technology. Hugging Face is a great example of really advocating for open datasets and open models. We have partnered with them a lot over the last couple of years. We’ve released our Aya models and the Command models on Hugging Face with open weights, and we have seen how that actually helps people pick this up and build something we wouldn’t even have predicted would be a use case or follow-up for a particular project.

Lauren: I like a lot of what you’re talking about and doing at Cohere, where it is almost AI for good. This is why you have multilingual support, this is why you lean into open source: because truly democratized AI isn’t split between the haves and the have-nots. How can you make this more accessible for everyone so we can do more good in the world?

Marzieh: I think that’s an ideal scenario, if we end up actually improving everyone’s life with this technology. There have been a few times in human history when that has happened. The Internet is a good example; it really elevated connection globally, and a lot of positive things came out of that. With this technology, we can make it accessible for everyone and improve some parts of their lives or their work, hopefully without also destroying the planet.

Val: Building on the open source theme, I can’t imagine this industry without open source. Just to be pedantic, there’s a lot of controversy in the open source AI community that most of the models are not fully open. The weights are open and some of the recipes are open, but a lot of the datasets aren’t necessarily open. We need to do a much better job of encouraging — creating financial and social incentives — to share the data as well as the models, in certain fields particularly.

DeepSeek is an example: early on, it actually educated some of the big commercial labs on techniques like KV cache offloading, and it maintains an open infrastructure index on GitHub. They publish so prolifically and have contributed so much. One of the ways the audience can benefit is engaging: even if you just consume the great papers that are published alongside an open model, you can look at the theory, look at the practice, and play with it yourself on your own local PC with a Llama-type server. Or if you really want to contribute, there’s what’s become probably the most popular community in the open source AI world, the vLLM community under the Linux Foundation. It’s a very popular inference server, and there’s a lot of innovation happening there around reinforcement learning, inference at scale, and improving training. The closed labs publish a lot, but the open source model labs are by far the best place to learn — not just the science, but the engineering and the application of AI.

The Future of AI: Predictions for One Year, Three Years, and Beyond

Lauren: Frontier models will change what’s possible. They will reshape economies. They’re already reshaping our lives, industries, and knowledge work. What do you think the future of AI is going to look like in one year, three years? Who should go first?

Val: I’ll start because it’s kind of a fun, slightly controversial take. I’m old enough to remember when the Internet first started, around Y2K or even before that. It was a big thing in Silicon Valley to go pitch VCs and say, “I’m an Internet startup.” And it meant something 20, 25 years ago. If you describe your company as an Internet company today, it’s meaningless — what does that even mean? I think at the pace of progress and acceleration in AI, within three years I predict if you say you’re an AI company or an AI startup, it’ll be meaningless. Like, are you applying it for healthcare? Are you applying it for security, for multilingual use cases, for science, or just for entertainment or social media? You’re going to have to basically assume AI is everywhere, that it will be at some level affordable, or at least there’ll be market pricing that lets you choose the right models and the right infrastructure properly, and that you’re actually solving real problems — whether they’re business problems, healthcare problems, or even geopolitical problems. The ubiquity of AI, I think, will happen faster than we can predict today.

Marzieh: For me, next year is easier to predict. I’m very excited about coordination and collaboration. I think we are now past the first stage of developing models that are really good, and now with multi-agent scenarios and how these models can coordinate and collaborate together to solve even more complex problems — that is something we’ve seen happening now, and there are a lot of interesting questions and problems to study for the next year or two. And I like to think that maybe in five years or longer in the future, we will have this technology in places we never thought it was going to be in a more positive way. Things that are so out of reach right now. These innovations, you can build them incrementally over the years, but at some point you also have to explore without any objective, to not just have one objective you want to optimize for. I like to think that in a few years, this exploration will land us in a place we cannot really think of right now.

Georgia: I think something we haven’t talked about much — and something more concrete — is the hopefully significant decrease in manual labor that people will be doing. I think that’s AI, but it’s also robotics. Huge headway is being made there, but so far the AI reasoning community has been pretty different from the AI for robotics community, which has been — you may have seen the videos — very focused on folding clothes. It’s a very difficult task, absolutely no hate. But I’m really optimistic that in three years, something like 50% less manual labor will be done by humans. Something like that.

Lauren: I love all those answers. I think about the work each of your companies is doing and how it is building for the future. There’s the work you’re doing on the science side, Georgia, where there very well could be a world where we are curing cancer at an accelerated rate because of AI. The work, Marzieh, you’re doing with Cohere, bringing in more diversity from multiple languages, is not only going to make our AI smarter; it’s going to learn from having different perspectives. And Val, the work you’re doing at WEKA to increase token throughput — I know you’ve recently done a benchmark where you achieved more than 4x token throughput on the exact same infrastructure — is going to improve our costs and energy consumption. If we can make this cost-effective and less power-hungry, we can build the viability to actually make the world a more diverse and inclusive place, to cure cancer, to give us all a better future.

Rapid Fire: What Is the End Game for Artificial Intelligence?

Lauren: Quick rapid-fire final question. Less than 30 seconds each. It’s going to be my hard question that none of you know in advance.

What’s the end game for AI?

Georgia: Hopefully a world where people are happy. I think it’s really easy to get lost in the idea of what progress is without any clear goal. So, bring back values. That’s my takeaway.

Marzieh: Improving human life. Yeah. I like that.

Val: To Georgia’s point earlier, the end game is something we’re not imagining yet, something we’re not seeing yet. Going back 20, 25 years, if you were to say you’re going to get into a stranger’s car and let this stranger drive you somewhere, or rent your couch to a stranger and share your bathroom, you would have been considered completely crazy. And that’s what the Internet really enabled: these really creative use cases. We haven’t seen that kind of creativity yet. What we have is skeuomorphic, an old Apple term for doing something old, just better. The end game is about doing brand-new things that you haven’t imagined yet.

Lauren: I love it. Thank you so much for joining our panel. Thank you everyone for being here. Please thank our panelists.
