AI Forward - LLM POC to Production: Insights From
an Infrastructure Viewpoint


In this panel, the AI experts discuss the advancements and challenges in LLMs from proof-of-concept to production, focusing on infrastructure, model management, and regulatory compliance. They highlight the shift towards purpose-built models, the complexities of LLMOps, and the importance of staying updated with rapidly evolving AI technologies and regulations.

Key takeaways

  • The Evolution of AI Infrastructure: Dedicated tooling, chipsets, and new database types are emerging to support AI workloads. Because AI is evolving rapidly, some AI-specific developments may become obsolete while others see only incremental improvements.
  • Shift Towards Purpose-Built AI Models: There’s an increasing trend towards developing more purpose-made, manageable models as opposed to larger, all-encompassing models, indicating a maturation in AI use cases and methodologies.
  • Challenges in Productionizing LLMs: The difficulties in transitioning LLMs from proof-of-concept to production, focusing on reliability, robustness, and safety. Emphasis on the importance of setting clear metrics for success from the early stages.
  • Issues that need to be addressed in LLMs: The biggest issues, including hallucinations, understanding and managing latency, the absence of benchmarks for custom models, and the importance of setting appropriate guardrails, especially regarding PII.
  • Challenges and Best Practices in LLMOps: Discussion on the emerging field of LLMOps and the necessity for new tools and methodologies in this area. Insights on how LLMOps differs from legacy prediction model operations and the importance of integrating LLMOps tools with the LLM workflow.
  • Importance of Regulatory Compliance: The need to stay updated with evolving regulatory requirements, especially in data security, and the importance of collaborating closely with legal experts to ensure compliance.

Moderator: Lior Berry - Director of Engineering, Fiddler

Speakers

  • Juan Bustos - Lead Solutions Consultant, AI Center of Excellence, Google
  • Alan Ho - VP Product, AI, DataStax
  • Nirmalya De - Principal Product Manager, Conversational AI and Deep Learning, NVIDIA

Video transcript

[00:00:04] Lior Berry: All right, uh, let's start here. So, uh, welcome to, uh, this panel that we titled, uh, LLM from POC to Production Insights From Infrastructure Viewpoint. I'm Lior Berry here, director of Engineering at, uh, Fiddler AI. And, uh, uh, today in this panel, I, uh, have the pleasure of hosting, uh, a panel of experts, you know, technical leaders who drive some of the most common.

[00:00:29] Lior Berry: Platforms out there that power AI and LLMs and, uh, given that they're, you know, uh, managing those platforms, they see a lot of use cases, a lot of, you know, users coming in with different, uh, uh, needs and, and as such, they would have, be able to help us, uh, gain some interesting insights. So let me, uh, present the, uh, uh, the panelists for today.

[00:00:50] Lior Berry: Uh, so we have, uh, Juan Bustos, uh, who is the Lead Solutions Consultant at Google's AI Center of Excellence. Uh, we have Alan Ho, VP Product AI at DataStax, and we have, uh, Nirmalya De, Principal Product Manager, Conversational AI and Deep Learning at NVIDIA. So, welcome, uh, folks to the panel, and I think we can, uh, dive right into it.

[00:01:14] Lior Berry: All right. Um, so given that this session is kind of, we're, we have this like infrastructure point of view, uh, I wanted to start with a more high level questions, maybe even more future looking. Um, so we've seen over the past, you know, a couple of years and even more so in the last year, a lot of new type of infrastructure emerging with kind of AI specific, uh, capabilities in mind, you know, anywhere from the hardware, from dedicated chipsets through, uh, new types of databases.

[00:01:43] Lior Berry: And embedding vectors, all the way to whole new platforms for hosting, serving, and training models. And, uh, I'd love to get your thoughts about where do you think this is going? You know, what new things are we going to see? What's not going to stick around? What is going to sort of just see some incremental

[00:02:03] Lior Berry: improvements, and, uh, we'd love to get your take here. Uh, maybe we'll start with, uh, Juan. Uh, maybe you want to share some of your, your insights.

[00:02:12] Juan Bustos: Absolutely, and thank you so much, Lior. Uh, it's such an honor to be part of this panel. Um, so being forward looking, of course, we are seeing, uh, an increased sophistication of the capabilities of the models.

[00:02:25] Juan Bustos: Um, we're seeing some approaches to create bigger, larger models, but honestly, I do believe that we will shift from the vision that one model will rule them all to something more, uh, purpose-made, uh, models that are easier to manage. As we mature on the use cases, uh, we are developing different techniques and frameworks that allow us to, uh, really achieve, uh, the results versus, like, I don't want to call it brute forcing.

[00:02:55] Juan Bustos: But versus brute forcing, uh, so to me, it's more about maturity in terms of the frameworks and maturity about methodologies for putting together these, uh, use cases.

[00:03:06] Lior Berry: Got it. Yeah, I think, you know, we're definitely seeing things are moving, honestly, on a day by day basis. I'm still surprised to how much, you know, new stuff happens in every week, let alone, you know, um, the OpenAI kind of Dev Day, et cetera.

[00:03:21] Lior Berry: Alan, kind of, uh, maybe you can also provide some of the database angle here.

[00:03:26] Alan Ho: Yeah, I think, uh, to what Juan said, what's happening is, uh, we're really starting to see the architecture that, um, Yann LeCun had posted in his, uh, seminal paper, A Path Towards Autonomous Machine Intelligence, in 2022, whereby we're going to be having a combination of much smaller models that are doing different things, such as perception, planning, simulating the world, as well as cost analysis, um, cost function analysis.

[00:04:03] Alan Ho: And, um, we're also going to start seeing the separation of memory versus the model. Namely, the models will be more in charge of things like language understanding, whereas the memory systems, like the databases, will hold the facts. And this will mean that less of the information and the facts are going to be encoded in the weights and biases of a transformer network, and rather the facts will be explicitly fetched from databases.

[00:04:36] Lior Berry: And I'm guessing that also means, for instance, later on, we can talk about latencies. What does it mean to fetch from a database? And you know, suddenly you've got all sorts of moving parts here, right? So from an infrastructure standpoint. Uh, it now introduces new, new types of problems, so I think we'll, we'll get to talk about it later on.

[00:04:52] Lior Berry: Um, Nirmalya, I thought, you know, maybe you can also offer some of the hardware, um, kind of trends, or what you guys see coming.

[00:05:04] Nirmalya De: Yeah, thank you. And once again, it is a pleasure to be part of the panel. Um, so NVIDIA, as you know, Lior, is, of course, a hardware GPU company, but, you know, it provides a full

[00:05:18] Nirmalya De: software solutions stack. So, going forward, what we look at, as, uh, my, you know, colleagues in the panel said, is that there will be purpose-built models, there will be vectorized databases. To extend that discussion, I will say, you will see, as you mentioned, latency as one of the bigger issues; you will see low precision quantized models, and you will see the newer generation of chipsets and hardware which is enabling that.

[00:05:53] Nirmalya De: So that's a perfect segue to the NVIDIA, you know, kind of hardware you're seeing, or any hardware you are seeing, where you are finding low precision, like, you know, uh, FP8 kind of, uh, optimization. So if you take the discussion at a higher level, the two aspects of a large language model you want to see are: you want to see reduced time to train, and you want to see

[00:06:20] Nirmalya De: better inference, meaning, you know, lower latency and better throughput. So, for any solution the industry looks at, we look at, we are trying to ensure that we increase those efficiencies, such that the time to train reduces and, you know, the efficiency increases.

[00:06:42] Nirmalya De: Another trend we see is that the large language model will not remain only data center centric. We are seeing the large language model, at least the fine tuning and inferencing, coming to the edge, right? It's coming to edge computing, and we are trying to enable that; as the next frontier, we will see a lot of new generations of that kind of infrastructure come.

[00:07:10] Nirmalya De: And lastly, what I will say is from a software stack, I agree with both the panelists. So to enable lower latency, you will like to have, as I mentioned, low precision quantization, distillation, group query attention, those kinds, those kinds of techniques. So more likely you will see, um, a little bit smaller model per se, but with a lot of knowledge.

[00:07:35] Nirmalya De: for that particular task. So overall those are the trends and frontier both on the hardware side and the system side we can see.

[00:07:44] Alan Ho: I just want to also add one more point before. Um, the other, the other area is hallucinations by far is the biggest issue with large language models. And although there's new techniques like Retrieval Augmented Generation, there's now a new crop of techniques that's including reasoning, namely like Chain of Thought, Self Consistency, etc.

[00:08:06] Alan Ho: So these are actually much more expensive techniques and slower techniques, uh, which goes into the other part, which is latency as well. Like, how do we... how do we get enough, um, speed, uh, while reasoning at the same time?

[00:08:26] Lior Berry: Definitely. I mean, the models are, in a way, getting more and more complicated and, you know, now we're adding more layers or more modalities to them, and that has definitely impacts on, on the hardware, uh, and the infrastructure that's being used there.

[00:08:40] Lior Berry: So, uh, I want to kind of move on to, uh, maybe closer to the title of the session, but, um, in reality, again, you know, with ChatGPT and, and the advancements, you know, we see a lot of companies out there trying to prototype or play around with LLMs, you know, for their own use cases. There's a wide range of these, and uh, I sort of argue that, again, coming up with the POC is the easier part of the puzzle.

[00:09:06] Lior Berry: Um, the hard part is, uh, really taking this into production, and you know, how do you get this to be something that is like, reliable, robust, you know, uh, safe, in all sorts of like different, um, ways, and Uh, from your vantage point, I'd say what, what is some advice that you can give or somebody that's, say, sitting there or even listening right now to this session and panelists, in this POC state.

[00:09:37] Lior Berry: I'll add one more thing. I think I heard in a previous session that right now there's a lot of acrobatics going on. So for people taking a POC to production, it means all sorts of wiring, even at the infrastructure level. So I'd love, again, your take on this: what's the advice you'd give somebody trying to move from LLM as a POC to LLM as production grade?

[00:09:57] Lior Berry: And, uh, again, maybe Juan, if you, um, want to take this.

[00:10:00] Juan Bustos: Absolutely. So one of the biggest things that we are seeing, and, uh, as we are seeing these use cases to mature, um, We are seeing also what are the caveats or the big no nos that you can get as results, right? As Alan mentioned, hallucinations are top of mind and that's one of the largest issues that you may come across when building your POCs.

[00:10:27] Juan Bustos: I think that, and this is something that we are working with our partners and customers on, is to identify and define what are the frameworks and best practices to start iterating. One thing that is very exciting about this time, and also extremely scary, is that you get to experiment fairly quickly.

[00:10:47] Juan Bustos: Like, you can just start working against an LLM, sending the prompt, serving, that's pretty much it. That's your first iteration. But then you may start to notice like, Hey, there are some hallucinations or maybe a lot of hallucinations. And then you start adding more pieces, uh, to the, uh, equation. Like for example, you can leverage techniques such as a chain of thought.

[00:11:10] Juan Bustos: Uh, later you can add RAG, Retrieval Augmented Generation, and then you start building and defining what are the different steps. So my take always is, um, first of all, uh, Keep clear what is your use case. Uh, see when you define your use case, also define what are your metrics for success and, uh, monitor those metrics very close.

[00:11:35] Juan Bustos: And this is not only on the serving, uh, portion of it on deployment. But also when you're starting to, uh, in the initial, uh, portions of like, hey, what are your initial results? How are you measuring how you're evolving? And this is also like, you know, we transition from MLOps to LLMOps, right? Uh, so this is something that, that is ongoing.

[00:11:55] Juan Bustos: So to summarize, uh, put together a framework, we have recommendations and best practices for that, and monitor and keep your eyes very close on, uh, your results. And don't forget what you're trying to achieve with your use case.
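
Juan's iterative layering, a plain prompt first, then chain of thought, then retrieval augmented generation, can be sketched roughly as below. This is only an illustrative sketch: `call_llm` and `retrieve_chunks` are hypothetical stand-ins for whatever model endpoint and vector store a team actually uses, and the prompts are placeholders.

```python
# Illustrative sketch of the iterative layering described above. `call_llm`
# and `retrieve_chunks` are hypothetical placeholders, not a specific API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your model provider here")

def retrieve_chunks(question: str, k: int = 4) -> list[str]:
    raise NotImplementedError("wire up your vector store here")

def v1_plain(question: str) -> str:
    # Iteration 1: just send the prompt and serve the answer.
    return call_llm(question)

def v2_chain_of_thought(question: str) -> str:
    # Iteration 2: ask the model to reason step by step before answering.
    return call_llm(f"Think step by step, then answer.\n\nQuestion: {question}")

def v3_rag(question: str) -> str:
    # Iteration 3: ground the answer in retrieved context to reduce hallucinations.
    context = "\n\n".join(retrieve_chunks(question))
    return call_llm(
        "Answer using only the context below. Say you don't know if it is not covered.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Each iteration should be scored against the same success metrics before the next piece is added, which is the point Juan keeps returning to.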

[00:12:08] Lior Berry: And I'm guessing you also mean that some of the measurements and things you need to do start at the POC stage, so you don't wait until it hits production and only then start to measure, right?

[00:12:17] Juan Bustos: Exactly. This is not something that you will be, uh, organically get into production. This is something that you will promote, but also you need to, uh, define what is the scope of what is success.

[00:12:28] Lior Berry: Um. Alan?

[00:12:31] Alan Ho: Uh, yeah, I think, um, this, uh, very similar, um, evaluation frameworks are very important very early on before you go to production and constantly being able to monitor what's going in production as well.

[00:12:45] Alan Ho: I mean, that's why, um, we're working also with Fiddler AI on that latter, latter problem as well for our own LLMs. Uh, I would say that, um, one of the issues, hallucinations, is definitely probably the number one issue. Um, it requires a lot of different techniques, um, out there. And I think one of the big issues that we see is that people start hitting the latency issues when they start using

[00:13:15] Alan Ho: things like Chain of Thought, because you end up having to make multiple calls to LLMs. Some of those calls are in series, some of them are in parallel. You're making a lot of parallel calls to, um, to vector databases. So, unfortunately, the devil's in the details. So when you talk about your infrastructure, you really got to make sure that you cost out

[00:13:39] Alan Ho: what are the various latency budgets you have, and then, um, you know, try to achieve 10 times better than that for your POC. Because by the time you layer on all these complicated things, such as chain of thought, you're going to end up with a, with a slow LLM in reality. Uh, the other thing I would say too is that, you know, you do have to go try it out and just do it and go into production.
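
Alan's advice to cost out latency budgets can be made concrete with back-of-the-envelope arithmetic. The numbers below are illustrative assumptions, not measurements from any of the panelists' systems.

```python
# Hypothetical latency budget for a chain-of-thought plus RAG request.
embed_query_ms   = 50    # embed the user question
vector_search_ms = 30    # one vector database lookup
llm_call_ms      = 800   # one LLM completion
cot_steps        = 3     # serial chain-of-thought calls

total_ms = embed_query_ms + vector_search_ms + cot_steps * llm_call_ms
print(f"estimated end-to-end latency: {total_ms} ms")  # 2480 ms

# Against a 3-second product budget, a POC already sitting near 2.5 seconds
# leaves no headroom for extra reasoning layers, which is why the advice is
# to target roughly 10x better than the budget at POC time.
```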

[00:14:11] Alan Ho: Um, but make sure you go into production, uh, not flying blind. Uh, and the reason why is a lot of these things, they don't, you just can't game it out. You just can't game out human behavior in a test environment. That's what that means. You can't do it. We don't have millions of... Well, maybe if we create millions of humans and LLMs to test things out, maybe that would be possible.

[00:14:33] Alan Ho: Uh, but we just have to go to production very quickly, which means that you do want to leverage a lot of the frameworks that implement a lot of these different techniques. LangChain seems to always implement the latest paper ten minutes after it's pushed to arXiv. So, things of that sort you want to be able to leverage in order to move faster.

[00:14:54] Lior Berry: Yeah. And maybe just one quick follow up question. So latencies, how would you say, what, what is the couple of top levers if I'm hitting some latency, what should I sort of consider?

[00:15:05] Alan Ho: Uh, yeah, so I'm going to talk a little bit. Um, so the main piece of the latency are, um, your LLMs as well as your vector databases.

[00:15:15] Alan Ho: Those are probably going to be the two biggest sources of latency. And for the LLMs, um, the good thing about LLMs is they're mostly stateless. So you can scale them out. So having a scale-out architecture for your LLMs and being able to make parallel calls to your LLMs is a great way to decrease latency.
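
A minimal sketch of that scale-out idea: because completion calls are stateless, independent calls can be issued concurrently rather than serially. The `call_llm` coroutine below is a placeholder for a provider's async client.

```python
import asyncio

async def call_llm(prompt: str) -> str:
    # Placeholder for your provider's async client; sleep stands in for a call.
    await asyncio.sleep(0.8)
    return f"answer to: {prompt}"

async def answer_subquestions(subquestions: list[str]) -> list[str]:
    # Independent sub-questions fan out in parallel, so wall-clock time is
    # roughly one call rather than len(subquestions) calls.
    return await asyncio.gather(*(call_llm(q) for q in subquestions))

answers = asyncio.run(answer_subquestions([
    "Summarise the refund policy.",
    "List the supported regions.",
    "What is the uptime SLA?",
]))
print(answers)
```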

[00:15:35] Alan Ho: Same with vector databases as well. Selecting a vector database that allows you to parallelize those, uh, the, the calls quickly, uh, that, that reduces the latency significantly. Um, the other part too is that some of the different techniques, um, uh, such as the ReAct pattern, uh, this is a pattern, uh, from, from Google.

[00:15:59] Alan Ho: They're serial, and unfortunately they can't get away from that. And so, one of the techniques you can use is, uh, caching the LLM calls using a vector database. Because you're not doing key value, you can't use simple key-value pair lookups for caching LLM calls. You want semantic, uh, lookups.
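
A semantic cache along those lines might look like the sketch below, assuming a hypothetical `embed` function; a real deployment would store the embeddings in a vector database rather than an in-memory list.

```python
import math

def embed(text: str) -> list[float]:
    raise NotImplementedError("replace with your embedding model")

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    """Cache LLM answers keyed by embedding similarity, not exact strings."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, question: str) -> str | None:
        query = embed(question)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]  # cache hit: the LLM call is skipped entirely
        return None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((embed(question), answer))
```

The similarity threshold is the knob: too low and unrelated questions reuse stale answers, too high and the cache rarely hits.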

[00:16:21] Alan Ho: And uh, the other area is instruction fine tuning. So when you have many, many LLM calls that you need to do one at a time, you could actually skip a lot of those steps by leveraging the fine tuning APIs provided by OpenAI, Google, uh, Vertex AI, et cetera, et cetera. So those are some specific, uh, tips and tricks you can do.

[00:16:43] Lior Berry: Thank you. And, uh, Nirmalya, by the way, I just kind of want to call out, I saw some of the poll results, and it looks like about 50 percent of you are, uh, either already in advanced POC or even in production, and I think about 43 percent are, uh, in early exploration. So everybody's kind of doing it, and I think some of the insights that, um, Alan and Juan and I'm sure in a minute Nirmalya will share are definitely relevant.

[00:17:06] Lior Berry: Uh, so, uh, Nirmalya, I'd love to hear kind of some of your thoughts there.

[00:17:10] Nirmalya De: Yeah, I mean, I just want to build upon the latency of what Alan was saying. You know, definitely those are the good techniques. At the same time, as I was mentioning, choosing the right model. There's a fundamental thing I would... You know, in, in the LLM, it's not always the biggest model produces you the best results, right?

[00:17:35] Nirmalya De: So the choosing the right model and, and as you see, you know, the, the industry is moving so fast, right? It's like, you know, I talked about a couple of techniques we like to see like distillation and others, but I'm sure there'll be many more new, like low precision quantization. You like to always think about, you know, what is more important to you?

[00:17:56] Nirmalya De: Accuracy, latency, what you can trade off and what you can trade off a little: those are the conscious decisions you need to make, to make the right decision for you. Now, going back to what all the co-panelists said, and I 100 percent agree, the fundamental problem of LLMs is the absence of benchmarks for custom models.

[00:18:18] Nirmalya De: Trust me, gentlemen, there is no benchmark of custom models. All the benchmarks are academic. Your, your custom model has never seen a benchmark data. That's why you are creating your custom model. Isn't it? Right? All the benchmarks are either academic or something else, which was not trained with your data.

[00:18:39] Nirmalya De: So, understanding what Alan kind of has touched on: understanding the benchmark, understanding the baseline at the beginning. That's what we always tell our partners and the customers. Do not just throw a lot of data at fine tuning and spend $100,000. That's not the way to do it. Understand the benchmark, understand the baseline, collect the benchmark data.

[00:18:59] Nirmalya De: If you don't have benchmark data, try to start with the closest approximation of what is out in the world. Then you have a baseline, then you build on top of that, and then you decide what is right for you. There are lots of techniques, you know, prompt engineering, p-tuning, instruction tuning, fine tuning: what suits you, and what benchmark you want to go with.
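
That "baseline first" advice might look like the sketch below: hold out a small set of your own examples and score the stock model before spending anything on tuning. `call_llm` is a placeholder, and the token-overlap score is deliberately naive; a real harness would use task-appropriate metrics.

```python
# Minimal baseline harness over a custom benchmark. `call_llm` is a placeholder.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your model endpoint")

def naive_score(prediction: str, reference: str) -> float:
    # Toy token-overlap score; swap in exact match, ROUGE, or an LLM judge.
    pred = set(prediction.lower().split())
    ref = set(reference.lower().split())
    return len(pred & ref) / max(len(ref), 1)

benchmark = [
    {"question": "What is our refund window?", "reference": "30 days from purchase"},
    {"question": "Which regions are supported?", "reference": "US, EU and APAC"},
]

baseline = sum(
    naive_score(call_llm(ex["question"]), ex["reference"]) for ex in benchmark
) / len(benchmark)
print(f"baseline score on custom benchmark: {baseline:.2f}")
```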

[00:19:20] Nirmalya De: This is a common mistake, as I see it, and a common guidance we have. And lastly, what Alan, or Juan, was mentioning is the guardrails. There is no question: however much we take care of the data blending and data curation side of the LLM, there will need to be guardrails, because there will be hallucination.

[00:19:41] Nirmalya De: It's just the nature of how these things are built, right? And, you know, when you are dealing with 3 trillion, 6 trillion, 9 trillion tokens, you just cannot avoid that certain scenario. So providing that guardrail, at least as much as you can provide, whether it's topic guardrails, security guardrails, safety guardrails.

[00:19:59] Nirmalya De: You want to have your conversation within those guardrails. And we also see that, you know, the partners and customers can create their own guardrails. So putting your discussion within the right guardrails is going to be key. So those are the three aspects I think we'll predominantly see to take an LLM from POC to production.
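
NVIDIA ships tooling in this space (NeMo Guardrails), but without tying the idea to any particular library, the shape of a topical guardrail is a pre-check on the user message before the LLM is called. The blocked topics and keywords below are purely illustrative; production systems usually classify the topic with a small model rather than keyword matching.

```python
# Illustrative topical guardrail: refuse out-of-domain requests before they
# ever reach the LLM. Topics and keywords here are made-up examples.
BLOCKED_TOPICS = {
    "medical advice": ["diagnose", "dosage", "prescription"],
    "legal advice": ["lawsuit", "sue me", "legal advice"],
}

def topical_guardrail(user_message: str) -> str | None:
    text = user_message.lower()
    for topic, keywords in BLOCKED_TOPICS.items():
        if any(keyword in text for keyword in keywords):
            return f"I can't help with {topic}; please contact a specialist."
    return None  # None means the message may proceed to the LLM

refusal = topical_guardrail("What dosage of ibuprofen should I take?")
print(refusal or "pass the message through to the LLM")
```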

[00:20:15] Nirmalya De: Oh, last thing I want to say. And another phenomenon we see is it's not only a cloud. It's a hybrid phenomenon. I see, especially the larger companies, they want to do some of their workload in their on prem, some of the workload cloud, some of the workload hybrid. So ability to understand and solve all those workloads in their hybrid.

[00:20:37] Nirmalya De: You know, capacity, hybrid configuration is also going to be key for this LLM ecosystem.

[00:20:45] Lior Berry: Yeah. Very important points. Um, I think, you know, understanding the trade-offs, right: at the end of the day, it's trade-offs across, you know, latency, correctness, um, you know, um, safety, all of that. Uh, I did want to kind of follow up on one point, Nirmalya, that, that you brought up, which is, um, running on prem versus on the cloud. Again, uh, you know, um, say the NVIDIA platform, for instance, offers this pretty seamlessly.

[00:21:09] Lior Berry: So, somebody out there right now listening, you know, in enterprise or, um, different kind of environment, thinking about what are the trade offs, you know, why should I, or what are considerations that I should put in place when I consider running this on prem versus in the cloud, uh, any specific challenges I should be watching out for.

[00:21:27] Lior Berry: Um, maybe we can elaborate a little bit.

[00:21:29] Nirmalya De: Sure. Look, from a pure, from an infrastructure perspective, probably when it goes to the data center, it's all curated and created for you. It's as simple as that. You bring your data and start, you know, either training or deploy, uh, deploying. Whereas when, when you were looking at it on prem, you have to bring up the whole cluster.

[00:21:51] Nirmalya De: Of course, NVIDIA provides guides, but these are not very easy systems. As I said, you have to look at the right interconnect, which is very important. It doesn't work, officially, on Ethernet; you have to have, like, InfiniBand, high speed, you know. So understanding those cluster management kinds of things is very important.

[00:22:12] Nirmalya De: But that does not mean that someone should not do it. Someone should decide what the total cost of ownership is. And especially we see where on prem matters: think financial institutions, healthcare institutions, where we see, you know, not only from a cost perspective, they are more sensitive about the data, which they do not want out even in a, you know, distributed scenario. So those are the things, and you definitely want to have the

[00:22:41] Nirmalya De: capability, right, to stand up a cluster. And that's why, as I mentioned, you know, if you don't want to stand up a SuperPOD cluster, a very complex cluster of infrastructure, you can start with edge, where you can do small fine tuning and, you know, inferencing. So those are the things, from an infrastructure perspective, someone can think about.

[00:23:02] Lior Berry: Yeah, these are great points. I wanted to actually, uh, maybe go in order, Nirmalya. Um, I was actually thinking, at the end of the day, you know, the reason somebody might be, uh, plugging in an LLM is to drive some business goal, right? We're trying to achieve something in that space. And, uh, how do we bring, you know, an organization that wants to get to maturity there and make sure that they're delivering business value with gen AI, but at the same time, you know, there's regulatory aspects, there's some of the risks that we talked about.

[00:23:30] Lior Berry: Like, what are some common best practices, metrics, um, I know, uh, uh, tools, processes that you guys think should, um, be considered and, and, uh, sort of instrumented? So, I don't know, Juan or Alan, um,

[00:23:47] Juan Bustos: Yeah, I would start, uh, putting back emphasis in the framework that I was talking about, uh, defining a lifecycle and understanding what are you measuring, uh, understanding that it's a very common approach to say, hey, I'm going to experiment with LLMs and then start thinking about fine tuning.

[00:24:06] Juan Bustos: And, uh, sometimes there are previous steps where you can have a higher, uh, return on investment without the need of fine tuning; maybe with RAG you can get where you need to be. Um, so putting a little bit of emphasis on what I said a couple of minutes ago: understanding what the lifecycle is, what the order is, um, that you are going to, uh, to approach.

[00:24:31] Juan Bustos: Uh, that, that is our recommendation and the best practice is start measuring, start, you start selecting your model, you start prompting it, then you may want to experiment seeing the evaluation of the results, then you may want to add some of the, uh, for example, RAG, uh, Retrieval Augmented Generation to see, uh, reducing hallucinations.

[00:24:52] Juan Bustos: Maybe then you may need to do fine tuning to get you there, but understand this as a process, as a life cycle and measure on every step of the way.

[00:25:03] Alan Ho: Yeah. The, from my perspective, you know, a lot of times, uh, and, and, and by the way, DataStax has made this mistake itself where even though we're providing infrastructure for, uh, large language models, we're using it to drive our business.

[00:25:15] Alan Ho: And, uh, one of the demos that we, not the demos, one of the efforts, was to use it to create a chatbot on our website that would help developers onboard. In fact, we were using the Google Vertex AI platform for this. Um, sometimes you also have to look at not overshooting your goals and, let's say, going for lower-level goals.

[00:25:41] Alan Ho: So initially we wanted to just not have a customer support person at all. But then it turns out there's too many hallucinations, too many errors, to the point that we couldn't actually do that. And instead, we use that as a mechanism to come up with, uh, as somebody's typing things in, we use it as a mechanism to suggest answers.

[00:26:03] Alan Ho: And then the customer service representative, like, lets the message go through or not. And it was more of a customer service rep, um, efficiency gain, rather than, uh, the full automation that we were hoping to get. So it does come down to business practicalities, like, what kind of use cases are you handling?

[00:26:24] Alan Ho: And what kind of error rate you can actually handle.

[00:26:30] Lior Berry: Makes a lot of sense. By the way, I think we have a session later on today, a workshop about how we at Fiddler created a chatbot for our own documentation. So there's a similar, in-a-similar-vein kind of understanding there. Um, I wanted to also ask, uh, maybe, Nirmalya, to you: how is sort of supporting or, uh, running LLMs in production different than, say, legacy prediction models?

[00:26:54] Lior Berry: Like how, what are the differences?

[00:26:56] Nirmalya De: Um, I, I saw that question and I, I was smiling. You know, I think that's, that's probably why a company like Fiddler exists, right? LLMOps is a different thing, right? Understanding LLMOps, and I, I will objectify it, right? I don't want to talk in the abstract.

[00:27:13] Nirmalya De: What are the three common things you see that you generally don't see in, uh, let's call it, operations that are not for large language models? I think everybody has talked about hallucination. The second thing is toxicity. The third is PII. These are the three fundamental things you have to kind of think about, especially from an output perspective. Because, you know, there is no natural language model trained with, uh, 3.5 trillion, you know, tokens, or probably 8 trillion tokens, that avoids these entirely, as we mentioned. So, how do you solve it? On hallucination, I think the other panelists touched on it. Toxicity: understanding the toxicity score. And it's not enough to say, "I will not provide toxic results."

[00:28:04] Nirmalya De: I mean, we do not subscribe to that, you know, philosophy. We rather subscribe to the philosophy that, you know, uh, an application should have access to the toxicity score and maybe have a chance to tune the toxicity filter, such that you understand what the right threshold of the toxicity score is.
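
That idea, surface the toxicity score and let each application choose its own threshold, is simple to sketch; the per-application contrast Nirmalya gives next (a video game versus a financial chatbot) maps directly to the thresholds below. `toxicity_score` is a hypothetical stand-in for whatever classifier is used, and the numbers are illustrative.

```python
# Per-application toxicity thresholds: the score is surfaced and the
# application decides what is acceptable. The classifier is a placeholder.

def toxicity_score(text: str) -> float:
    raise NotImplementedError("replace with your toxicity classifier")

THRESHOLDS = {
    "video_game_chat": 0.8,    # looser: banter is expected
    "financial_chatbot": 0.2,  # stricter: near-zero tolerance
}

def passes_toxicity(text: str, application: str) -> bool:
    return toxicity_score(text) <= THRESHOLDS[application]
```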

[00:28:25] Nirmalya De: As an example, a video game's toxicity threshold is different than a financial chatbot's, right? So those are the important considerations. And lastly, PII, you know, Personally Identifiable Information, either on the input side or on the output side. It's an extremely difficult problem, so you have to understand that and have, you know, safeguards.

[00:28:49] Nirmalya De: Everyone is writing a lot of scripts and parsers to detect PII. So, take the proper safeguards, take the proper, you know, ownership. And lastly, you asked the original question, Lior, about responsible AI. What are the policy questions? What are the ethical questions? Very important questions to think about and answer.
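
The scripts and parsers Nirmalya mentions for PII are often regex passes over model inputs and outputs, sometimes combined with NER models. A toy sketch, with patterns that are illustrative and far from exhaustive:

```python
import re

# Toy PII scrubber: regex passes over model inputs and outputs.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED {label.upper()}]", text)
    return text

print(scrub_pii("Reach me at jane.doe@example.com or 555-867-5309."))
# Reach me at [REDACTED EMAIL] or [REDACTED PHONE].
```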

[00:29:09] Nirmalya De: And that goes to the heart of the question: who owns the copyright of the output? Basically, we need to understand, the industries need to understand, that whoever owns the copyright of the output needs to have a clear understanding of the, you know, responsibility for the output. So, basically what I'm trying to say is that, you know, if you're creating an application, and you are responsible for the output, the copyright of the output, then you should define the proper, you know, responsible AI policy for that output, right?

[00:29:42] Nirmalya De: So these are some of the policy questions we're always thinking about and trying to address.

[00:29:48] Lior Berry: And since we kind of brought up the notion of LLMOps, um, I'm curious again, Alan, maybe you can, from your standpoint, what does LLMOps mean? And apart from the metrics, what, for instance, tools are missing out there?

[00:29:59] Lior Berry: Offline, online. Um, you know, visualization, exploration, um, alerting.

[00:30:07] Alan Ho: Yeah, um, so, so one thing, actually, I would recommend, uh, people who are looking at metrics to check out a paper from Google called LaMDA. Um, they have three specific types of metrics: um, a quality metric that takes into account, like, sensibleness, soundness, etc.

[00:30:27] Alan Ho: Uh, safety, very specific to what, uh, the other panelists talked about. And then also groundedness, which is reducing hallucinations. I totally recommend people take a look at that, uh, particular paper. Um, and, uh, I would say that, once you have the top-level metrics that you want to monitor, the second thing is: almost all the customers that I've seen

[00:30:53] Alan Ho: um, that are using, doing some sort of ChatGPT stuff, uh, sorry, some sort of language model, because we're a database vendor, they're using Retrieval Augmented Generation. And one of the biggest issues that we've seen is: the data is in the database, but the answer people are getting is not right. And the reason why is because when you are calling the database to get an answer, especially in a vector search use case,

[00:31:22] Alan Ho: you are doing this, um, it's a heuristic algorithm, and, uh, you're getting content from the database. And the k-nearest-neighbor algorithm, which is the underlying algorithm powering a lot of these things, is actually very, very susceptible to noise. So even things like accidentally including the left nav bar that's on every single page into the vector database: it gets duplicated 100,000 times.

[00:31:57] Alan Ho: And it's semantically matching pretty much every single query, and that, that can kill you. So being able to drill down into, for your queries, what's coming out of the database before it's being sent to the LLM, that, um, that becomes a very, very important thing to monitor for these retrieval augmented generation type applications.
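
Alan's nav-bar example suggests two cheap defenses: deduplicate chunks before indexing, and log exactly what the retriever returns before it reaches the LLM. A sketch under hypothetical `embed_and_index` and `vector_search` helpers:

```python
import hashlib
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("rag")

def embed_and_index(chunk: str) -> None:
    raise NotImplementedError("replace with your vector store's insert call")

def vector_search(query: str, k: int = 4) -> list[str]:
    raise NotImplementedError("replace with your vector store's search call")

def index_chunks(chunks: list[str]) -> None:
    # Defense 1: drop exact duplicates (for example a nav bar scraped from
    # every page) so one boilerplate chunk cannot dominate nearest neighbours.
    seen: set[str] = set()
    for chunk in chunks:
        digest = hashlib.sha256(chunk.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            embed_and_index(chunk)

def retrieve(query: str) -> list[str]:
    # Defense 2: log what leaves the database before the LLM sees it, so
    # "the data is there but the answer is wrong" can actually be diagnosed.
    chunks = vector_search(query)
    for i, chunk in enumerate(chunks):
        log.info("retrieved[%d] for %r: %.80s", i, query, chunk)
    return chunks
```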

[00:32:18] Lior Berry: And maybe kind of one question I was always curious about, there's quite a few, uh, different, you know, vector databases out there right now. It's, uh, we've seen sort of all sorts of like entrance to the, to that kind of market. Like, how do I even know which one to choose? What, what sort of criteria? I think you've kind of started touching on some of that, but again, if I'm somebody that's kind of new, I'm trying to put up my own LLM, maybe a RAG kind of base setup.

[00:32:42] Lior Berry: What should I be looking at?

[00:32:45] Alan Ho: Yeah, so the devils are in the details, um, and I think the ultimate question, uh, so from DataStax's perspective, we run 90 percent of the Fortune 100, you know, 700 enterprise customers, so we're obviously going to be serving the high end. A lot of customers, uh, when they begin their POCs, they're basically doing POCs on a static number of PDFs.

[00:33:12] Alan Ho: For that, to be honest, pretty much any p, any, any, uh, any vector database is, is good enough for that. But if you start getting into, so there's two, two components. When you start getting into more real time data, then it matters a lot how long it takes to index the data before it's available to be retrieved.

[00:33:34] Alan Ho: That's a super important thing, and nobody talks about it, very few people even think about it when they do that, because they, when they submit the, the, the, the data for indexing and embedding, embedding followed by indexing, they just assume it's available immediately. No, in the reality, it can take minutes, it can make, take hours, it can even take days.

[00:33:54] Alan Ho: And then it depends on the use case. So if you're, like, indexing static data, then it doesn't matter if it's inconsistent, but if you're using a vector database to do caching, having that immediate, uh, indexing becomes extremely important. So it depends on the particular use case that the customer is, is going for.

[00:34:16] Alan Ho: And unfortunately, it depends. It's all dependent on, uh, what use case, cost, etc etc you’re willing to maintain.

[00:34:25] Lior Berry: Probably a good topic for a blog post. I'm curious, I saw kind of some of the poll results and it looks like only about 3 percent have like mature, I'd say kind of LLMOps out there, with a lot of people sort of... Just entering, I wonder, kind of, Juan, from your point of view when you're talking to, um, potential customers or people that are looking for solutions, like, how do you, how do we uplift this or what, you know, make it more accessible, you know, better tools, simpler to use?

[00:34:53] Lior Berry: Is it kind of, uh, not finding the right tools, it's integration that's kind of the bottleneck? I'm curious, kind of, your thoughts there.

[00:35:00] Juan Bustos: Yeah, that's a great question, right? Um, I, I think it all begins with a very solid use case and understanding what value you are bringing. Because the conversation about leveraging LLMs and GenAI for companies, like, everybody's going through that conversation now.

[00:35:20] Juan Bustos: Now, the justification for, hey, we need to invest X on making this a reality: that conversation and the return on investment is a harder question. So my take is always about understanding what your use case is, understanding what value you're getting from that use case, and then understanding what you are going to bring.

[00:35:42] Juan Bustos: You know, Lior, it's very common that, uh, Our customers and partners think about a chatbot as the first use case and sometimes I push back a little in terms of that because I generally suggest, hey, let's start with an internal use case. Let's start with a use case that you can leverage summarization, metadata extraction.

[00:36:04] Juan Bustos: This will make your InfoSec teams way more comfortable understanding the platform, and your, uh, DevOps teams, your LLMOps team that you're starting with this, uh, will start to get familiar with the metrics, and you will start getting, um, into the flow of what it means to, uh, productionize, uh, a use case that's leveraging LLMs.

[00:36:27] Juan Bustos: Um, so I think that the biggest value is there. And again, I'm sounding like a broken record, but defining the life cycle, following a methodology for the growth and the maturity of the product is very important.

[00:36:43] Lior Berry: Um, clearly this is a world that's evolving and kind of, of course, we at Fiddler are building some of those, uh, tools as well.

[00:36:52] Lior Berry: Um, I'd love to, uh, you know, one challenge that I see that this kind of world is moving so fast that I almost wonder, you know, if I'm starting to build something, it sounds like within three, four months, you know, it might be in a way outdated or not, no longer using the best kind of practice. What's your advice for how to approach this, right?

[00:37:11] Lior Berry: It feels like, uh. Um, you know, no matter what I do in six months, it might be kind of, I'm no longer sitting on top of the best infrastructure that I could be, but at the same time, I don't want to kind of rattle the ship all the time. So, what's your advice there? Um, I don't know, maybe Alan, you want to take a, a stab?

[00:37:32] Alan Ho: Yeah, uh, so, I actually, there was a customer, there was an attendee who asked about prototyping with LangChain. Uh, and LangChain is, uh, it's, it's almost like the PyTorch of this space, because everybody hates it, but at the same time everybody uses it. And there's a good reason why people are using it.

[00:37:54] Alan Ho: And by the way, DataStax is a very big supporter of the LangChain project. I personally have talked with Harrison; he's a great manager of the, he manages the project very well. You want to make sure that you're part of a, you just, there's no time to read all the papers in the world. That's the problem.

[00:38:12] Alan Ho: Uh, it's, it's coming out so quickly. And so you want to make sure you have an infrastructure, uh, you're leveraging framework components that are implementing things very quickly, uh, and, you know, you just see like OpenAI, they're always adding new models, um, you know, uh, same, same with Meta, uh, same with Google.

[00:38:33] Alan Ho: You want to, you basically, if you don't want to get left behind. You do want to think about leveraging a provider that gives you access to a lot of the latest models. So that's on the model side. But then there's the technique side, such as, you know, self consistency, chain of thought, all the variations of that.

[00:38:52] Alan Ho: That's more in the frameworks, uh, that's in the frameworks regime, like LangChain, LlamaIndex, et cetera, et cetera. So I strongly recommend people to use one of those, because you get to pick up those, even with all the warts it has. Because it lets you pick up the, the latest, uh, the latest and greatest, um, so that, uh, you know, you're staying on top of things, and it also gives you the flexibility, because you may launch your product and realize, oh, there's too many hallucinations, but then a couple of months down the road, there might be some advancements in the LLMs or the frameworks itself.

[00:39:27] Alan Ho: Sorry, in the techniques, hallucination reduction techniques, uh, that, that makes practical and you want to turn that on as quickly as possible. Yeah,

[00:39:37] Nirmalya De: If I may add to that: you know, NVIDIA also has a large language model service, and it's very interesting when you see the data, when, when it comes to the data.

[00:39:51] Nirmalya De: It is traditional NLP applications migrating to the LLM. You look at the data, and it will surprise you, right? You would think that a lot of it would be content generation applications, but it's not; what the industry knows, the industry is migrating from one to another, to whichever is supposedly best. So my suggestion is very simple.

[00:40:14] Nirmalya De: Do not always look into the latest thing. Just figure out what you want to do and do it best in the LLM ecosystem. To make it practical, when you transition an NLP application to an LLM application, probably the understanding piece will be exactly the same, like the intent classification, but the response generation will be different.

[00:40:37] Nirmalya De: Now you have the power of LLM to do response generation. Take advantage of that. So for the next two years, whatever the applications you are familiar with, if you're migrating to LLM, you don't have to always think, you know, what is the next paper? What is all that? You know, just, just do the thing, what you want to do, know the, you know, what you know best.

[00:40:58] Nirmalya De: Take advantage of the LLM: besides the understanding, it does the generation. That will be my simple suggestion, and it works. The second thing I will tell you, on your first question, is where we are seeing the gaps in LLMOps. LLMOps, somehow, what we find in our company, is not integrated completely with the LLM chain.

[00:41:20] Nirmalya De: So tools have to be there where the LLM is, meaning if I am going to create a custom model, I should not have to take the custom model out into another tool. So that's one thing, as an industry, we can improve: make the LLMOps tools available where I am interacting with the whole LLM chain.

[00:41:41] Nirmalya De: Whether it is the evaluation, or it is the human tooling for instruction tuning like RLHF. That is where we are thinking, and I'm sure the whole industry is thinking, those are the improvement points, either from the application or from the LLMOps, you know, toolchain, as I see it.

[00:41:55] Lior Berry: Yep, definitely valuable points. One sort of question or comment that I see, uh, from the audience, which is, uh, there is also the regulatory side, which might change, right?

[00:42:05] Lior Berry: In fact, we know that there's some regulations coming; it's not exactly clear. So that alone might force somebody to kind of now, uh, tune the infrastructure, because now there's, like, a need to have an audit trail or, um, I don't know what. So that might be a kind of external force. Thinking about the legal implications, Juan, any advice there, or, um, how to sort of stay, um, compliant, right?

[00:42:30] Juan Bustos: Talk to your lawyers very often. I mean, um, this is latent space. Uh, of course, uh, data security is data security and that doesn't change. Uh, one of the things that we always tell our partners and our customers is like, if you're running your, and this is not a plug, but if you're running your workload on Vertex AI, your data is already secure.

[00:42:54] Juan Bustos: Uh, we don't use your data for training. Uh, but as the regulatory landscape evolves, uh, it is very important to be up to speed with what is being enforced and be able to instrument those changes as quickly as possible, as you said: adding, like, the audit trail, adding, uh, visibility. And there's also some pre-work done with Responsible AI, which actually, uh, goes in line with the regulatory requirements that are coming.

[00:43:27] Juan Bustos: Uh, but yeah, it's, uh, you need to be, uh, up with the news and this is not something that you will set and forget.

[00:43:35] Lior Berry: Yeah, thank you. I think this is more or less the time for us. Uh, it was great to hear all your, uh, competent, smart, uh, ideas, and pointers. I think we even have some paper, uh, references for people to read. Kirti?

[00:43:50] Kirti Dewan: Yes, awesome. Thanks, Lior, Juan, Alan, and Nirmalya. That was awesome. I love the fact that we talked about so many different points, uh, beyond infrastructure and all the way to compliance as well. So that was great.