LinkedIn’s AI Infrastructure Secrets for 1.2B Users
Speakers:
- Val Bercovici - Chief AI Officer, WEKA
- Animesh Singh - AI Platform & Infrastructure - LinkedIn
Below is a transcript of the conversation, which has been lightly edited for clarity.
Transcript
AI Infrastructure Efficiency from GPUs to Data Layer Optimization
Val Bercovici: So Animesh, we’re catching you literally right before you’re about to go on stage and give a talk, which is a topic I’m really interested in. So take it over and tell us what you’re about to talk about.
Animesh Singh: Thank you. The session I’m giving is on AI infra efficiency. My role at LinkedIn is to run the GPU fleet powering LinkedIn’s AI platform and infrastructure, and a lot of the infrastructure and platform components.
Essentially, what I’m going to go through in the talk is why efficiency plays a big role. I think we all know AI is expensive. It takes a lot to put up a scalable AI infrastructure, both for continuous pre-training and fine-tuning, and then the post-training plus inference landscape. And it is getting more and more expensive as the models get smarter and their architectures get more complex. They are more compute hungry.
So in the session, I go through the different layers of the infrastructure, starting from the GPUs to the network, going up to the compute layer: advanced gang scheduling, schedulers, global scheduling, network-aware scheduling, workload-aware scheduling, how you can insert efficiency elements into all of that, and what we have done at LinkedIn around surfacing GPU utilization information across all the different layers.
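To make the scheduling ideas concrete, here is a minimal, hypothetical sketch of workload-aware placement (first-fit decreasing on GPU demand). This is not LinkedIn’s scheduler; every job name, node name, and number here is illustrative.

```python
def schedule_jobs(jobs, nodes):
    """Toy workload-aware placement: first-fit decreasing on GPU demand.

    `jobs` maps job name -> GPUs requested; `nodes` maps node -> free GPUs.
    Returns {job: node} for the jobs that could be placed.
    """
    placement = {}
    free = dict(nodes)
    # Place the largest jobs first so smaller jobs backfill the gaps,
    # which tends to reduce stranded (idle but unschedulable) GPUs.
    for job, need in sorted(jobs.items(), key=lambda kv: -kv[1]):
        for node, avail in free.items():
            if avail >= need:
                placement[job] = node
                free[node] = avail - need
                break
    return placement

# An 8-GPU job fills one node, a 4-GPU job fits the other,
# and a 2-GPU job waits in the queue for capacity.
```

A real scheduler would also weigh network topology (keeping a job’s GPUs on the same switch) and workload type, which is what the network-aware and workload-aware variants in the talk refer to.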
And then you start going up to the software layers, which is PyTorch, TensorFlow, LLMs, and what elements you can bring there to further optimize for efficiency. And even going beyond the frameworks, what can you do on the data side? Because AI/ML essentially runs on top of data, consuming a lot of data. What can you bring there for more efficiency? So it’s a top-down, efficiency-focused talk.
Balancing Recommendation Models and Large Language Models at Scale
Val: Spoken like someone who validates a lot of what I know from my other friends at LinkedIn. I would say LinkedIn is very high on the maturity curve and the maturity scale for machine learning and AI. There are a couple of really interesting topics I’d love to dive into, but let’s start with maybe one of the basic ones.
Your maturity has let you realize that you don’t really run one model or just a few models. You have some very customized and tailored models you run maybe at a higher volume. You have other models that might be more traditional, large language models, and so forth. So how does that mix play out? How do you determine the mix of models? And generally, without getting too specific, what kind of models are you running in production right now?
Animesh: At its heart, LinkedIn is a social media platform. The traditional function of social media platforms, the way they have been powered, has been with recommendation and ranking models. When you go to LinkedIn, the feed you see, the people recommendations you get, the job recommendations you get, the search you drive—all those capabilities are powered by AI/ML models, which have traditionally been recommendation and ranking models, and that still remains a huge volume.
Now, I think after the advent of ChatGPT and generative AI, we have been launching a lot of generative AI capabilities as well: profile summarization, recruiter email summarization, LinkedIn Learning agents. The first agent we actually launched, which was totally agentic in nature, was the LinkedIn Hiring Assistant, or LIHA, which is essentially an agent for recruiters to go and source candidates based on criteria the recruiters define. It works in the background, finds relevant candidates, ranks them, and goes through the whole lifecycle of working with the recruiters to get to the right candidates. So that has been launched. We are seeing more and more uptick in LLM-powered use cases like job search; they’re getting more LLM-powered. But still, the dominant mechanism remains the recommendation and ranking models: DCNv2, TransAct, generative recommender models.
Val: And if you were to rank them in a mix, which would you consider small vs. larger language models?
Animesh: The world is divided into two areas. When you are training or you’re doing continuous pre-training or even when you are doing post-training, RL fine-tuning, the volume of the model or the size of the models is pretty big, at times going beyond 100 billion parameters. Those are obviously large language models.
When you have to bring those models to inference, that same size wouldn’t work, because LinkedIn has around 1.2 billion registered users on the platform. At any given point in time, hundreds of millions of users are actually active. Running inference at that scale with these large models is ROI-negative. So you have to bring down the size.
And typically, one of the tenets we have been following is: yes, on training, go as large as you want. LinkedIn has a lot of data; there is the LinkedIn knowledge graph of people and companies. But when we come to inference, you use various techniques like distillation, pruning, and quantization to further compress the model size. The sweet spot we have found is around 7 to 8 billion parameter models.
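As a rough illustration of the distillation part of that recipe, here is the standard temperature-softened distillation objective in plain Python. This is a sketch of the general technique, not LinkedIn’s training code, and the function names are our own.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual correction so the soft-target gradients
    stay on the same scale as the hard-label loss.
    """
    p = softmax(teacher_logits, temperature)   # soft teacher targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

A small student trained against these soft targets (plus the usual hard-label loss) can retain much of the teacher’s behavior at a fraction of the parameter count, which is the effect being described: train large, then compress to the 7–8B sweet spot for serving.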
Val: Active parameters?
Animesh: Yeah. They are being very effective at scale.
Val: Very helpful.
Breaking Organizational Silos for Cost-Effective AI Infrastructure
Val: I always see you guys as being about one or two years ahead of the bell curve. You probably always will be, but as what you’re doing today becomes the bell curve for the Fortune 500 and Global 2000, what would be your advice for how to actually figure out what to optimize, how to make it unit-economics-positive, basically gross-margin-positive? What are some of the key lessons you’ve learned on that journey of taking a functional set of models and then making them cost-effective to infer at scale?
Animesh: I think that’s essentially right. Optimization has to be a key focus area when you are doing AI. I think one thing which has worked out really well is that the silos have been broken. Typically, in the past, when you were doing ML and AI, the modeling team (or prior to that, the data scientist team) would be sitting in a different building and would say, “Here is my model.” They would throw it over the wall for someone else to go and serve in production. Now there is a lot of co-design happening.
Val: Nice.
Animesh: And I think that co-design is necessary. Understanding the characteristics of the model and customizing the infra to fit that model is, typically, a process. Sometimes that process can go on for two months, three months. But to effectively run inference at that particular scale, you really have to ensure these two teams are mingling. The modelers and the platform and infra folks are working in the same room over a period of time to make sure the infra is totally co-optimized, top-down. I think that is one paradigm which is going to stay, and we are seeing more of that even going down to the bottom layers.
Typically, if you think of AI infrastructure and platform, it sits on a data center, which is a different team in the org; data infra and storage, which is a different team in the org; compute, which is a different team in the org. The AI infrastructure platform builds on top of all of this. But even there, it has been a very contractual model: you deliver this capability, and we will build on top. We are seeing more and more co-design actually helping there, working very closely with the storage team: hey, data is my biggest bottleneck, and I need to get data as fast as possible to the GPUs so we are not keeping the GPU idle. How can we co-design the storage layer so it is more customized for the AI world? That’s the need of the hour. Every infrastructure team is now working with the AI infrastructure team to co-optimize and co-design.
The infrastructure and platform teams, being in the middle, have to play that role very strongly: be very adaptable, work with your users (the modeling teams designing the models), and work with the rest of the infrastructure teams on which you are building the platform, to make sure this co-design is happening and you are nimble and adaptable as you’re scaling this out. Otherwise, the projects will get delayed and fall through if we are not being very proactive about it.
Val: That’s such a fundamental point. I still think that’s the exception, not the rule, in terms of how big AI organizations are structured. I still see firewalls between researchers and infrastructure operations folks. We know Cohere, for example, domestically was one lab which did the fusion of those two teams. DeepSeek famously published their papers about doing that. So it’s really a cool trend to see.
Treating GPUs as Pets Not Cattle and Building for Failure
Val: What are your thoughts around some of the things maybe that are so different that it’s easier to unlearn what you’ve done before in the cloud or the pre-cloud VMware world and the accelerated computing world as against what we’ll call the GPU world? Networks are different. You said the storage is different. In my mind, storage almost has to be memory in this world. But what are some of the key things that are different now that you’ve seen your teams adapt to, as you said, and then co-design, co-integrate?
Animesh: Good question. I think the key thing, the place where it starts, is the GPU architecture itself. How are the CPUs on the GPU nodes spread out? What is the HBM memory? What role does the HBM memory play alongside the GPU SMs? I think most of the co-design, when you’re actually thinking of efficiency, starts from there. And that’s a totally different paradigm than CPUs.
It’s the question of whether GPUs are pets or cattle. CPUs, you used to treat like a cattle farm: you were not so much focused on designing around CPU characteristics. But you have to totally shift the pendulum here; these are pets.
And there is one aspect: GPUs are also failing more often than CPUs. On average, with the H100s and H200s, we are seeing around 10 percent thermal stress
Val: Which is very high compared to the CPU.
Animesh: There are quite a few failures, because these things are running at very high velocity, churning out tons of compute. The failure rate is also high as the next generation of GPUs gets more and more mature.
So the other part: resiliency is not an afterthought. Once you have mastered the GPU architecture, assume failures. When you are running distributed training at scale across a fleet of one hundred GPUs, one thousand GPUs, two thousand GPUs, failure is bound to happen in some part of the infrastructure, whether it’s the network, the storage, or some other component, and these training jobs get interrupted. So you have to make sure resiliency is built in from the ground up: yes, failure will happen.
Val: They’re so tightly coupled, do you want to decouple that in case—well, not in case of, but when the inevitable failure happens?
Animesh: Yeah. How quickly you checkpoint, restore, recover, and reschedule the jobs in the queue matters. All of that time is latency. And while you are doing that, the next set of GPUs is sitting idle, waiting for the job to come back. So make sure you recover very quickly, reschedule, and get back in the mix. I think that’s a fundamental thing which teams have to do much better in this new world of GPUs.
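The checkpoint-restore-reschedule loop described here can be sketched in a few lines. This toy loop (all names and probabilities are hypothetical) shows why frequent checkpoints bound the work lost to any single failure to at most `ckpt_every` steps.

```python
import random

def train_with_checkpoints(total_steps, ckpt_every, fail_prob=0.1, seed=0):
    """Toy training loop with periodic checkpointing.

    On a simulated failure, restore the last checkpoint instead of
    restarting from step zero. Returns (finished_step, restarts).
    Real systems persist optimizer and model state; here the
    'checkpoint' is just the step counter.
    """
    rng = random.Random(seed)
    step, checkpoint, restarts = 0, 0, 0
    while step < total_steps:
        if rng.random() < fail_prob:   # simulated hardware/network fault
            step = checkpoint          # recover: lose at most ckpt_every steps
            restarts += 1
            continue
        step += 1                      # one successful training step
        if step % ckpt_every == 0:
            checkpoint = step          # persist state (elided here)
    return step, restarts
```

The trade-off is exactly the one in the conversation: checkpointing too often adds overhead on every step, while checkpointing too rarely means more repeated work and more idle GPUs after each failure.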
Memory Tiering Strategies and KV Cache Optimization for GPU Efficiency
Val: So it sounds like we’re almost out of time. So let me just leave you with one final and maybe geeky question. Where are you on the memory tiering, memory hierarchy front? Because you mentioned HBM is a key resource here. We’re seeing obviously for the larger models, the more active parameters, memory tiering to CPU DRAM. Now NVIDIA and the open source community are popularizing tiering even to storage. What are your thoughts on where you are, where you think you will be with regards to KV cache and memory tiering?
Animesh: I think all these options have been on the table. I mean, we started with that hypothesis for certain use cases: that we would definitely need to go from HBM to DRAM to SSD-based storage. You have to play out that whole scenario, because there’s a lot of data you’re packing into the HBM, and you cannot pack it all. So then you bring in software optimizations using the KV cache.
I think so far, we have seen that most of the optimizations using the KV cache and HBM have been successful. At most, we are defaulting to the CPU memory on the GPU nodes. Even though on paper the initial hypothesis was that we would have to go to some sort of SSDs, we still haven’t needed to; so far, so good. Most of the optimizations we are playing out are in the actual KV cache layer, and that’s working out. So that’s where we are.
Whether external storage will become a necessity or not, we’ll see. But I feel there is a cost: as soon as you start going to external storage, you need to have GPU-direct communication, because the communication patterns increase. Yes, you’re getting more real estate, but you have to worry about other problems. So our strategy has been to continue optimizing in that particular area on the GPU node. You have the CPU memory and the GPU memory, and within that, what optimizations you can continue doing around the KV cache.
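The HBM-to-DRAM spill strategy described above can be sketched as a two-tier LRU cache. This is a deliberately simplified illustration (capacities counted in entries, plain Python objects standing in for KV tensors), not a real KV cache implementation.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small 'HBM' tier spills least-recently
    used entries into a larger 'DRAM' tier instead of evicting them.
    A miss in both tiers means the prefill must be recomputed."""

    def __init__(self, hbm_capacity, dram_capacity):
        self.hbm = OrderedDict()     # hot tier (insertion order = LRU order)
        self.dram = OrderedDict()    # spill tier
        self.hbm_capacity = hbm_capacity
        self.dram_capacity = dram_capacity

    def put(self, key, value):
        self.hbm[key] = value
        self.hbm.move_to_end(key)
        while len(self.hbm) > self.hbm_capacity:
            k, v = self.hbm.popitem(last=False)   # LRU entry spills down
            self.dram[k] = v
            while len(self.dram) > self.dram_capacity:
                self.dram.popitem(last=False)     # evicted entirely

    def get(self, key):
        if key in self.hbm:
            self.hbm.move_to_end(key)             # refresh recency
            return self.hbm[key]
        if key in self.dram:                      # promote back to HBM
            value = self.dram.pop(key)
            self.put(key, value)
            return value
        return None                               # full miss: recompute
```

The extra hop on a DRAM hit is the cost being weighed here: cheaper than recomputing the prefill, but slower than an HBM hit, which is why the KV cache optimizations stay on-node before reaching for external storage.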
Val: Absolutely. So what you’re also saying, if I were to paraphrase, is that memory tier performance is essential, right? And especially as GPUs get faster and pre-fills get faster, you still have a high bar to meet in terms of KV cache offloading. Makes perfect sense.
I’d love to continue this, but I think you’ve got to go talk on stage and hopefully about some of this.
Animesh: It was awesome talking to you. And definitely, at some point, it will be good to go through that whole journey as we are going through these different layers.
Val: Yeah. And I’d love to learn more. Thanks for your time.
Animesh: Thank you. Thanks a lot.
Like This Discussion? There’s More!
Be sure to catch Shimon Ben-David’s keynote at AI Infra Summit for more on how to solve inference challenges with a memory-first architecture.