VIDEO

LinkedIn’s AI Infrastructure Secrets for 1.2B Users

Animesh Singh of LinkedIn's AI Platform & Infrastructure team sits down with WEKA Chief AI Officer Val Bercovici and shares strategies for AI efficiency at scale. From breaking down team silos to treating GPUs as pets instead of cattle, LinkedIn cost-effectively serves hundreds of millions of users.

Speakers:

  • Val Bercovici - Chief AI Officer, WEKA
  • Animesh Singh - AI Platform & Infrastructure - LinkedIn

Below is a transcript of the conversation, which has been lightly edited for clarity.

Transcript

00:00

AI Infrastructure Efficiency from GPUs to Data Layer Optimization

Val Bercovici: So Animesh, we’re catching you literally right before you’re about to go on stage and give a talk, which is a topic I’m really interested in. So take it over and tell us what you’re about to talk about.

Animesh Singh: Thank you. The session I’m giving is on AI infra efficiency. My role in LinkedIn is I run the GPU fleet powering LinkedIn’s AI platform and infrastructure, and a lot of the infrastructure and platform components.

Essentially, what I’m going to go through in the talk is why efficiency plays a big role. I think we all know AI is expensive. It takes a lot to put up a scalable AI infrastructure, both for continuous pre-training and fine-tuning, and then the post-training plus inference landscape. And it is getting more and more expensive as the models get smarter and architecturally more complex; they are more compute-hungry.

So in the session, I go through the different layers of the infrastructure, starting from the GPUs to the network, going up to the compute layer: advanced gang scheduling, global scheduling, network-aware scheduling, workload-aware scheduling; how you can insert efficiency elements into all of that; and what we have done at LinkedIn around surfacing GPU utilization information across all the different layers.

And then you start going up to the software layers, which is PyTorch, TensorFlow, the LLM frameworks, and what elements you can bring there to further optimize for efficiency. And even going beyond the frameworks, what can you do on the data side? Because AI/ML is essentially running on top of data, consuming a lot of data. What can you bring there for more efficiency? So it’s a top-down, efficiency-focused talk.

02:05

Balancing Recommendation Models and Large Language Models at Scale

Val: Spoken again like someone that validates a lot of what I know from my other friends at LinkedIn. I would say LinkedIn is very high on the maturity curve and the maturity scale for machine learning and AI. And a couple of really interesting topics I’d love to dive into, but let’s start at maybe one of the basic ones.

Your maturity has let you realize that you don’t really run one model or just a few models. You have some very customized and tailored models you run maybe at a higher volume. You have other models that might be more traditional, large language models, and so forth. So how does that mix play out? How do you determine the mix of models? And generally, without getting too specific, what kind of models are you running in production right now?

Animesh: At its heart, LinkedIn is a social media platform. The traditional function of social media platforms, the way they have been powered, has been with recommendation and ranking models. When you go to LinkedIn, the feed you see, the people recommendations you get, the job recommendations you get, the search you drive—all those capabilities are powered by AI/ML models, which have traditionally been recommendation and ranking models, and that still remains a huge volume.

Now, after the advent of ChatGPT and generative AI, we have been launching a lot of generative AI capabilities as well: profile summarization, recruiter email summarization, LinkedIn Learning agents. The first agent we actually launched which was totally agentic in nature, LinkedIn Hiring Assistant or LIHA, is essentially an agent for recruiters to go and source candidates based on criteria the recruiters define. It works in the background, finds relevant candidates, ranks them, and goes through the whole lifecycle of working with the recruiters to get to the right candidates. So that has been launched. We are seeing more and more uptick in LLM-powered use cases; job search is getting more LLM-powered. But still, the dominant mechanism remains the recommendation and ranking models: DCNv2, TransAct, generative recommender models.

Val: And if you were to rank them in a mix, which would you consider small vs. larger language models?

Animesh: The world is divided into two areas. When you are training, doing continuous pre-training, or even doing post-training and RL fine-tuning, the size of the models is pretty big, at times going beyond 100 billion parameters. Those are obviously large language models.

When you have to bring those models to inference, that same size wouldn’t work, because LinkedIn has around 1.2 billion registered users on the platform. At any given point in time, hundreds of millions of users are actually active. Running inference at that particular scale with these large models is ROI-negative. So you have to bring down the size.

And typically, one of the tenets we have been following is: yes, on training, go as large as you want. LinkedIn has a lot of data; there is the LinkedIn knowledge graph of all the people and all the companies. But when we come to inference, you use various techniques like distillation, pruning, and quantization to further compress the model size. The sweet spot we have found is around 7 to 8 billion parameter models.

Val: Active parameters?

Animesh: Yeah. They are proving very effective at scale.

Val: Very helpful.
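
The serving math behind that sweet spot can be sketched with back-of-envelope arithmetic. The function below is a hypothetical illustration, not LinkedIn's actual sizing tool; it only shows why a distilled, quantized ~8B model is far cheaper to serve than the 100B+ models used in training.

```python
# Bytes per parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate GPU memory (GB) needed for model weights alone."""
    # params_billions * 1e9 params * bytes/param / 1e9 bytes per GB
    return float(params_billions * BYTES_PER_PARAM[precision])

# A 100B-parameter model in fp16 needs ~200 GB for weights alone
# (multiple GPUs per replica), while an 8B int8 model fits in ~8 GB.
print(weight_memory_gb(100, "fp16"))  # 200.0
print(weight_memory_gb(8, "int8"))    # 8.0
```

Multiplied across hundreds of millions of active users, that 25x difference in weight memory per replica is roughly the gap between ROI-negative and ROI-positive serving.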

05:46

Breaking Organizational Silos for Cost-Effective AI Infrastructure

Val: I always see you guys as being about one or two years ahead of the bell curve. You probably always will be, but as what you’re doing today becomes the norm for Fortune 500s and Global 2000s, what would be your advice for how to figure out what to optimize, how to make it unit-economic-positive, basically gross-margin-positive? What are some of the key lessons you’ve learned on that journey of taking a functional set of models and then making them cost-effective to run and infer at scale?

Animesh: I think that’s essentially right. Optimization has to be a key focus area when you are doing AI. One thing which has worked out really well is that the silos have been broken. Typically, in the past, when you were doing ML and AI, the modeling team (or before that, the data science team) was sitting in a different building and would say, “Here is my model,” throw it over the wall, and someone else would go and serve it in production. Now there is a lot of co-design happening.

Val: Nice.

Animesh: And I think that co-design is necessary. Understanding the characteristics of the model and customizing the infra to fit that model is, typically, a process. Sometimes that process can go on for two or three months. But to do inference effectively at that particular scale, you really have to ensure these two teams are mingling. The modelers and the platform and infra folks are working in the same room over a period of time to make sure the infra is totally co-optimized, top-down. I think that is one paradigm which is going to stay, and we are seeing more of that even going down to the bottom layers.

Typically, AI infrastructure and platform sit on top of a data center, which is a different team in the org; storage, which is a different team; and compute, which is a different team. The AI infrastructure platform builds on top of all of this. But even there, it has traditionally been a very contractual model: you deliver this capability and we will build on top. We are seeing more and more co-design actually helping there, working very closely with the storage team: hey, data is my biggest bottleneck, and I need to get data to the GPUs as fast as possible so we are not keeping the GPUs idle. How can we co-design a storage layer which is more customized for the AI world? So that’s the need of the hour. Every infrastructure team is now also working with the AI infrastructure team to co-optimize and co-design.

The infrastructure and platform teams, being in the middle, have to play that role very strongly: be very adaptable, work with your users, the modeling teams designing the models, and work with the rest of the infrastructure teams on which you are building the platform, to make sure this co-design is happening and you are nimble and adaptable as you scale this out. Otherwise, projects will get delayed and fall through if we are not very proactive about it.

Val: That’s such a fundamental point. I still think that’s the exception, not the rule, in terms of how big AI organizations are structured. I still see firewalls between researchers and infrastructure operations folks. We know Cohere, for example, was one lab here domestically which did the fusion of those two teams. DeepSeek famously published papers about doing that. So it’s really a cool trend to see.

09:10

Treating GPUs as Pets Not Cattle and Building for Failure

Val: What are your thoughts on some of the things that are so different that it’s easier to unlearn what you’ve done before, in the cloud or the pre-cloud VMware world, versus the accelerated computing world, what we’ll call the GPU world? Networks are different. You said the storage is different. In my mind, storage almost has to be memory in this world. But what are some of the key things that are different now that you’ve seen your teams adapt to, as you said, and then co-design, co-integrate?

Animesh: Good question. I think the key thing, the place where it starts, is the GPU architecture itself. How are the CPUs on the GPU nodes spread out? What is the HBM memory? What role does the HBM memory play alongside the GPU SMs? Most of the co-design, when you’re actually thinking of efficiency, starts from there. And that’s a totally different paradigm than CPUs.

It’s the question of whether GPUs are pets or cattle. With CPUs, you used to treat them like a cattle farm; you were not so focused on designing around CPU characteristics. But you have to totally swing the pendulum here: these are pets.

And there is the aspect that GPUs are also failing more often than CPUs. On average, with the H100s and H200s, we are seeing around 10 percent thermal stress...

Val: Which is very high compared to the CPU.

Animesh: There are quite a few failures, because these things are running at very high velocity, churning out tons of compute. The failure rate stays high until the next generation of GPUs gets more and more mature.

So the other part: resiliency is not an afterthought. Once you have mastered the GPU architecture, assume failures. When you are running distributed training at scale across a fleet of one hundred, one thousand, two thousand GPUs, failure is bound to happen in some part of the infrastructure, whether it’s the network, storage, or something else. For these training jobs, you have to make sure resiliency is built in from the ground up: yes, failure will happen.
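
The scale effect Animesh describes, that failure is bound to happen, falls out of simple probability. The per-GPU failure rate below is illustrative, not a measured LinkedIn number:

```python
# If each GPU independently survives a job with probability (1 - p),
# the chance that at least one of N GPUs fails grows quickly with N.

def prob_any_failure(n_gpus: int, p_per_gpu: float) -> float:
    """Probability that at least one of n_gpus fails during a job."""
    return 1.0 - (1.0 - p_per_gpu) ** n_gpus

# With even a 0.1% per-GPU failure chance per job, the fleet-level
# failure probability climbs from roughly 10% at 100 GPUs to roughly
# 63% at 1,000 GPUs and roughly 86% at 2,000 GPUs.
for n in (100, 1000, 2000):
    print(n, round(prob_any_failure(n, 0.001), 3))
```

This is why resiliency has to be designed in from the start rather than bolted on: at two-thousand-GPU scale, the "rare" event is the common case.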

Val: They’re so tightly coupled, do you want to decouple that in case—well, not in case of, but when the inevitable failure happens?

Animesh: Yeah. How quickly you checkpoint, restore, recover, and reschedule the jobs in the queue. All of that time is latency, and while you are doing that, the rest of the GPUs are sitting idle waiting for this one to come back. So make sure you recover very quickly, reschedule, and get back in the mix. I think that’s a fundamental thing teams have to do much better in this new world of GPUs.
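
The checkpoint-and-resume pattern behind that answer can be sketched minimally. This is a toy illustration of the general technique, not LinkedIn's stack; the file path, step counts, and stand-in "training" work are all made up:

```python
import json
import os
import tempfile

# A restarted job should resume from the last checkpoint, not step 0,
# because every minute of re-done work is idle time for the whole fleet.
CKPT = os.path.join(tempfile.gettempdir(), "toy_ckpt.json")

def save_checkpoint(step: int, state: dict) -> None:
    # Write to a temp file and rename atomically, so a crash mid-write
    # cannot leave a corrupt checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT)

def load_checkpoint() -> tuple:
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}

def train(total_steps: int, ckpt_every: int = 10) -> int:
    step, state = load_checkpoint()   # resume, not restart
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step    # stand-in for real training work
        if step % ckpt_every == 0:
            save_checkpoint(step, state)
    return step
```

The real tuning knob is `ckpt_every`: checkpoint too often and the checkpointing itself becomes the latency; too rarely and a failure throws away more work.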

12:00

Memory Tiering Strategies and KV Cache Optimization for GPU Efficiency

Val: It sounds like we’re almost out of time, so let me leave you with one final and maybe geeky question. Where are you on the memory tiering and memory hierarchy front? You mentioned HBM is a key resource here. Obviously, for the larger models with more active parameters, we’re seeing memory tiering to CPU DRAM. Now NVIDIA and the open source community are popularizing tiering even to storage. What are your thoughts on where you are, and where you think you will be, with regards to KV cache and memory tiering?

Animesh: I think all these options have been on the table. We started with the hypothesis, for certain use cases, that we would definitely need to go from HBM to DRAM to SSD-based storage. You have to play out that whole scenario, because there’s a lot of data you’re packing into the HBM, and you cannot pack it all in. So then you bring in software optimizations around the KV cache.

So far, most of the optimizations using the KV cache in HBM have been successful. At most, we are spilling to the CPU memory on the GPU nodes. Even though on paper the initial hypothesis was that we would have to go to some sort of SSDs, we still haven’t; so far, so good. Most of the optimizations we are playing out are in the actual KV cache layer, and that’s working out. So that’s where we are.

Whether external storage will become a necessity or not, we’ll see. But I feel there is a cost: as soon as you start going to external storage, you need GPU-direct communication, because the communication patterns increase. You’re getting more real estate, but you have to worry about other problems. Our strategy has been to continue optimizing in that particular area on the GPU node. You have the CPU memory and the GPU memory, and within that, what optimizations can you continue doing around the KV cache.
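
The HBM pressure driving this whole discussion can be sized with simple arithmetic. The model shape below is a generic ~8B-class configuration chosen for illustration, not any specific LinkedIn model:

```python
# Rough KV-cache sizing: per token, a decoder stores one key and one
# value vector per layer per KV head.

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV-cache size in GB (factor of 2 for keys + values)."""
    elems = 2 * layers * kv_heads * head_dim * seq_len * batch
    return elems * bytes_per_elem / 1e9

# 32 layers, 8 KV heads, head_dim 128, 8k context, batch 32, fp16:
print(round(kv_cache_gb(32, 8, 128, 8192, 32), 1))  # 34.4
```

At roughly 34 GB for one batch, the cache alone rivals the weights of an 8B model, which is why the first line of defense is KV-cache optimization in HBM, with CPU DRAM as the spill tier and SSDs only if those run out.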

Val: Absolutely. So what you’re also saying, if I were to paraphrase, is that memory tier performance is essential, right? And especially as GPUs get faster and pre-fills get faster, you still have a high bar to meet in terms of KV cache offloading. Makes perfect sense.

I’d love to continue this, but I think you’ve got to go talk on stage and hopefully about some of this.

Animesh: It was awesome talking to you. And definitely, at some point, it will be good to go through that whole journey as we are going through these different layers.

Val: Yeah. And I’d love to learn more. Thanks for your time.

Animesh: Thank you. Thanks a lot.

Like This Discussion? There’s More!

Be sure to catch Shimon Ben-David’s keynote at AI Infra Summit for more on how to solve inference challenges with a memory-first architecture.