AI Forward - Everyday Challenges and Opportunities with LLMs Pre- and Post-Production


In this panel, AI experts delved into the dynamic realm of generative AI within enterprise settings, highlighting key aspects such as the critical need for enhancing model efficiency, personalization strategies, and addressing the challenges confronting developers entering the AI domain. They also shed light on the pivotal role of retrieval-augmented generation (RAG) and the importance of robust evaluation and adaptation of AI models to align with specific business needs, underscoring the blend of technological innovation with practical application.

Key takeaways

  • Efficiency and Personalization of LLMs: The need for efficiency in managing large models at scale, including novel quantization techniques and the challenge of personalization in the context of universal model paradigms.
  • Ease of Use for Developers: The influx of developers into the generative AI space without extensive ML experience, emphasizing the need for easy deployment, understanding failures, and managing the complexity of AI systems.
  • The Importance of Evaluation and Benchmarking in AI Development: The critical role of evaluation and benchmarking in AI development, as teams need to measure and improve their applications continuously.
  • AI as a New Way of Writing Software: The concept of large AI models as a new form of software development, offering unique capabilities but also posing challenges in terms of reliability and determinism.
  • The Creation of Golden Evaluation Sets for Business-Relevant Tasks: Businesses should focus on creating golden evaluation sets tailored to their specific needs and tasks, rather than relying solely on academic benchmarks.
  • Retrieval-Augmented Generation (RAG) and its Impact: The impact of streamlined RAG architectures on the generative AI ecosystem and its uptake in enterprises, especially in terms of integration and data structuring.

Moderator:

Sree Kamireddy - VP of Product, Fiddler AI

Speakers:

  • Sara Hooker - Director of Cohere for AI, Cohere 
  • Jerry Liu - CEO and Co-founder, LlamaIndex
  • Jeff Huber - CEO and Co-founder, ChromaDB
Video transcript

[00:00:04] Sree Kamireddy: Hello and welcome to Everyday Challenges and Opportunities with LLMs. I'm Sree Kamireddy, VP of Product at Fiddler AI. Today we have the privilege of hosting a panel of experts, each with a unique perspective on the dynamic landscape of generative AI. Our esteemed panelists include Sara Hooker, Director of Cohere for AI, Jerry Liu, Co-founder and CEO at LlamaIndex, and Jeff Huber, Co-founder at Chroma.

[00:00:36] Sree Kamireddy: Let's dive into the discussion. Maybe we start with a hot take. A hot take is that we are only scratching the surface of enterprise generative AI's potential with many applications yet to move beyond prototyping. What challenges are keeping these prototypes from reaching the market and delivering real business value?

[00:01:01] Sree Kamireddy: Sara, what are your thoughts on this?

[00:01:05] Sara Hooker: Hey, everyone. It's lovely to be here. So, um, and especially with the perspectives on this panel. So, I lead research at Cohere for AI, so, um, I would say it's first interesting to note, like, why we're all here. Why there's a whole panel called, uh, Challenges, Opportunities with Generative AI, because it's been such a surreal year.

[00:01:27] Sara Hooker: As someone, like, who works on these type of models, it's just really been a roller coaster to have so much excitement in such a compressed amount of time. A lot of what my research team works on is, I would say, two things that I think are really critical as, as you think about how to bring what people are excited about with these large language models, the fluidity, the dynamic nature of the interactions, um, which really are a breakthrough in, uh, language technology.

[00:01:57] Sara Hooker: The first is efficiency. So a lot of the work in my lab is how do we make these models more manageable at scale? That includes things like novel quantization techniques, but also things like, do we need all the data we use to pre-train? How do we accelerate training? And this is because a lot of the, the breakthrough in these models has been caused by this trend of "bigger is better."

[00:02:20] Sara Hooker: And so a lot of what's interesting there is how do you, uh, condition that and how do you get away with less? It also involves things like adaptation, um, which is the second thing. I think the second big question which we work on is personalization. So we've moved, this is kind of a critical thing, one of the biggest differences in how we model is that we've moved from many small models.

[00:02:44] Sara Hooker: That each do one thing like toxicity or sentiment detection to a universal model paradigm. But what that means is that techniques which allow for personalization, where you use a general purpose model but then tailor it, are becoming much more critical. So a lot of my team works on that. I think, um, techniques like RAG, uh, but adapter-style fine tuning as well, things that do it more efficiently, are very critical in this space.

[00:03:08] Sara Hooker: Um, but yeah, maybe I can open up so, uh, Jeff and Jerry can also join in.

[00:03:16] Sree Kamireddy: Yeah, that's, uh, that's quite interesting. I think, uh, I think you're absolutely spot on, right? Like efficiencies are quite important. Yeah, Jerry, uh, love to hear your thoughts.

[00:03:27] Jerry Liu: Yeah, I mean, I think what Sara gave was a great answer. Um, I think just to add onto that,

[00:03:31] Jerry Liu: I think there's, you're, you're seeing a lot of developers enter the space and a lot of them don't have a ton of ML experience.

[00:03:36] Jerry Liu: Um, and so I think one of the, uh, beautiful things about LLMs is the fact that basically any engineering team can just deploy some capabilities of generative AI without having to invest in data collection, uh, training, uh, you know, engineering, and really the know-how, like data science and kind of evaluation, or at least not yet, uh, to realize some, like, um, like the 80/20 rule, right?

[00:04:00] Jerry Liu: Like some initial value from what they build. But I think that that's precisely the issue, is that the time to value to do something is a lot shorter, but then once you actually build something like a RAG pipeline, right, or, um, some basic agent workflow, you start realizing that, you know, there are certain failures like hallucinations, like retrieval problems, and the more parameters you tack onto the software system, the more stuff you have to tune.

[00:04:25] Jerry Liu: And I think most developers actually aren't, aren't like fully aware of that when, when they first, um, like, uh, start building these things, because a lot of them don't come from like a traditional ML background. And so as a result, I think it's interesting to see how the overall practices around this, like, AI engineering space evolve over time, as well as, like, how much, uh, you know, just, like, general evaluation principles that everybody has to internalize, versus, like, the core research problems that, like, mostly the ML engineers, or, sorry, the deep learning researchers, have to basically solve.

[00:04:56] Sree Kamireddy: Yeah, that makes sense. So, well, I, I do want to hear Jeff's take on this, but, but before we go there, yeah, you mentioned about, um, we don't, the teams don't necessarily need as many skills as say building a predictive model. But what do you, what do you feel are the core skills that the teams still need to have so that they can be successful in this generative AI journey?

[00:05:21] Jerry Liu: I think at the current moment in time, like from talking to enterprises, pretty much everybody cares about evaluation and benchmarking, and I think that is one of the issues, and so basically every team cares about this when they're building with LLMs, because once you build an app with LLMs, you realize it fails, then you realize you need a way to measure and then improve things, and I think that is something that probably everybody building apps at least needs to know right now.

[00:05:46] Jerry Liu: Um... There is a future where, you know, all these models get better and none of this really matters. Um, but like, I think as we're, as, as like people are progressing towards that future, I think people still need the ability to more principally, or in a more principled manner, kind of like track and, um, optimize it, like improve their app.

[00:06:07] Sree Kamireddy: Yeah, absolutely. Evaluations and benchmarks, so important. Jeff, both questions, I think, uh, I would love to hear your thoughts on both the questions. Uh, what challenges are keeping prototypes from reaching production, and, uh, what are the core skills?

[00:06:24] Jeff Huber: Yeah, I mean, to zoom out a little bit, you know, one of the ways that we think about this is really what large models represent is a new way of writing software.

[00:06:31] Jeff Huber: Um, it allows us to create programs and software that we could not have created before with just using code as a tool. But of course, code has all these advantages where code is, you know, fairly deterministic. As long as you're not having, you know, bits flipped by radiation in space, it's like your code will always run the same way.

[00:06:47] Jeff Huber: Uh, obviously that is not true about using these new primitives in the loop. Um, you know, you don't get these sort of, it's stochastic by definition, it's statistical by definition. This is part of the advantage, these sorts of systems can handle a lot more complexity. Um, they can take very dynamic inputs, right, versus conventional code, it's very strict about its inputs.

[00:07:04] Jeff Huber: But it's also a challenge to building, you know, reliable outputs. Um, and so I think that that, like, broad frame is pretty important. Obviously, if this is a new way of building software, it's gonna take time for, like, best practices to emerge. And so, in some sense, I don't know, I don't necessarily think that it's, like, super important to be, like, rushing to production as fast as possible.

[00:07:22] Jeff Huber: Um, like, it took, you know, software best practices, what, 10, 20, 30 years to really emerge, and even still, they're still emerging. Um, and I think the same will be, the same will be true for this arc as well. Um, so, in terms of skills, um, you know, I think that, like, the thing that, like, I would encourage every...

[00:07:38] Jeff Huber: Practitioner or aspiring practitioner to do is like, just play with the tools and understand them. Um, obviously if that's in, uh, you know, a sort of corporate-sanctioned way, that's great. Uh, if it's in, like, your spare time, that's maybe even better. Um, but, you know, just read about and go play with these things on your own, on your own laptop.

[00:07:55] Jeff Huber: It's really easy to run language models locally now. Um, you know, it's pretty easy to even, like, fine tune them on small amounts of data. And I just think that having that muscle memory and understanding, like, the shape of the tools, that is the most important thing for practitioners to do today. Um, I think that, you know, in some sense, like, trying to, uh, contort these things, like, make them do production, again, maybe it's a good idea if that's, if that makes sense for your organization.

[00:08:22] Jeff Huber: But I think that, like, from a longer term perspective, if you think about people's, like, career arc and career development. Yeah, it's just like learning as much as you can and playing around with the stuff.

[00:08:31] Sree Kamireddy: Yeah, that's, that's awesome. So, benchmarking, evaluations, learning, learning, maybe even go all the way to, uh, fine tuning and really get a hang of it.

[00:08:40] Sree Kamireddy: Sara, any thoughts on the, uh, core skills that, uh, you think a team should have to make them successful in this journey?

[00:08:48] Sara Hooker: Yeah, I really like Jerry's perspective on this. I think what I would say about benchmarking is that often end users become overwhelmed by this notion of benchmarking, because we're in this weird kind of collision where we have tons of academic benchmarks that were designed for kind of a different era, where the goal was to evaluate one model.

[00:09:08] Sara Hooker: And now they're all packaged as kind of collections of benchmarks. But often these academic tasks are very different from the tasks that you probably care about in your day to day. So I actually would go further and say, to Jerry's point about what, uh, what skills you need to gain: as domain experts, you probably know the three critical tasks in your business that you want a model to perform well at.

[00:09:33] Sara Hooker: And so, uh, really, the skill is creating golden evaluation sets that work for you and your use cases, not necessarily having to master all the different benchmarks that are out there, because that's probably less relevant. In fact, we often see tension between performance on academic benchmarks and performance in what we often think of as, like, the fluidity of these language models, or their ability to kind of deliver on dynamic discourse.

[00:09:59] Sara Hooker: It actually is in tension because academic benchmarks have favored these one word answers. And like complete precision on that. And so we're in a different era. And so what matters most is you take the time to actually collect your gold standard data and make sure that's never used for fine tuning, uh, but is used for you to do cross comparisons across time.
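
One concrete shape this can take (the questions, reference answers, and helper names below are purely illustrative): a small, versioned set of business-relevant prompts with reference answers, kept out of all fine-tuning data and re-run against each model or prompt change.

    # Sketch of a golden evaluation set and loop; all data and helpers are illustrative.
    # In practice the examples live in a versioned file (e.g. golden_set.jsonl) that is
    # never mixed into fine-tuning data.
    golden_set = [
        {"question": "What is the refund window for annual plans?", "reference": "30 days"},
        {"question": "Which regions do we ship to?", "reference": "US and EU"},
    ]

    def generate(question: str) -> str:
        # Placeholder for whichever model or provider you are comparing this week.
        return "We offer a 30 days refund window."

    def matches(prediction: str, reference: str) -> bool:
        # Simplest possible check; swap in an LLM-as-judge or human review where needed.
        return reference.lower() in prediction.lower()

    scores = [matches(generate(ex["question"]), ex["reference"]) for ex in golden_set]
    print(f"golden-set pass rate: {sum(scores) / len(scores):.0%}")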

[00:10:20] Sree Kamireddy: Yeah, that, that's, that's a great take. Uh, thank you so much. Uh, All of you, like, I think the Golden Evaluation datasets, I think that, that is clearly, uh, a unique skill and, uh, like Jeff mentioned, this is a completely new way of, uh, new way of writing code and, and I think that the insightful, uh, aspect that Jeff mentioned was, Hey, why do we even need to rush?

[00:10:44] Sree Kamireddy: Like, let's take our time and do this right. I think these are really, really great insights. So one aspect that I was wondering about was retrieval augmented generation has gained a lot of traction recently. Over the last few months, I've been seeing recent updates from AI platforms that have streamlined the approach through, say, automatic chunking, storing embeddings, retrieval, maybe sometimes along with citations.

[00:11:14] Sree Kamireddy: It would be great to hear your perspective on the potential effects on the broader LLM ecosystem and the uptake of generative AI within these enterprises because of some of this streamlining and simplification. Jeff, any thoughts on this?

[00:11:31] Jeff Huber: Yeah, I mean, I think it's really cool, actually, you know, referencing, like, OpenAI Demo Day on Monday, or Dev Day on Monday, um, right, like, you really saw for the first time, like, a formal, uh, sanctioning, uh, of RAG from OpenAI. Like, previously, they were sort of non-committal about it. You know, I remember talking to friends at OpenAI earlier this year about retrieval, and these friends were like, yeah, I don't get it, probably won't matter. Um, and now you're seeing, like, OpenAI sort of say, like, no, this is actually, we're going to educate developers on the technology and how to use it.

[00:12:00] Jeff Huber: So that's actually a big moment, I think, for, for Retrieval, um, and for our approaches with Retrieval. I mean, obviously we at, at Chroma, uh, maybe we're biased, but we really believe in Retrieval. Like, we think that, like, Retrieval really is a really important key, and we are in the very first, really the dumbest way of doing Retrieval today.

[00:12:16] Jeff Huber: There's a bunch of really exciting stuff happening in academia and coming down the pipe, um, which we think will lead to much greater interpretability and reliability of these models. Um, but without going too deep down that path, again, I think it's like, it's pretty cool to see, you know, OpenAI being like, dang, this is great.

[00:12:31] Jeff Huber: And obviously, Cohere was, uh, was, uh, before them, right? I think it was like a couple months ago that Cohere launched a managed service around RAG as well. Um, so yeah, I think that's great. Um, obviously at Chroma, like, that's sort of RAG as a service, right? Which is very much its own thing. Um, and, uh, you know, we at Chroma want to serve open source developers and help them build and host their own data and, you know, uh, serve a multiplicity of embedding providers and language model providers and integrations and plugins and all the things, right?

[00:13:00] Jeff Huber: And so, you know, I think there's a natural verticalization effort you see amongst, like, the large data providers, mostly in consumer applications, to consolidate the user experience, which, you know, naturally you can do, right, if you go very vertical. You can make a greater user experience for, like, a consumer or an end application.

[00:13:16] Jeff Huber: Um, and then you also have folks, obviously, like Jerry and the LlamaIndex project, Chroma, and others, right? Like building these, like, horizontal layers. And so, I think that it's all good, right? This is like a rising tide, um, and we're all, again, very early in all this stuff, so.

[00:13:28] Sree Kamireddy: Yeah, rising, rising tide it is. Absolutely. Jerry, any, any thoughts on this?

[00:13:34] Jerry Liu: Yeah. Yeah. I mean, this idea of RAG, uh, I guess we're calling it RAG now. It used to be called like in context learning or something like that. And then before that, I don't know what it was called. It was basically the foundation for how GPT index started. Like around this time last year, actually in November, I was thinking to myself, like, oh, you know, like, how do you build an application with GPT 3 and you're just like stuff stuff into the context window and then kind of like hack around it so that it actually understands the information.

[00:14:01] Jerry Liu: Right, so you figure out how to, like, stuff as much information or relevant information as possible into the context window so you can reason over new data. That kind of led to, basically, the project, like, GPT Index, uh, and now LlamaIndex. And now, now, yeah, I mean, I think we've been thinking about RAG problems since, since around last year.

[00:14:17] Jerry Liu: And I imagine, like, uh, you know, I, I talked to Anton from, from Chroma back in, like, January or February for the first time. It seems like we're all thinking about, like, similar things back then. Um, I think, I think it's, it's super interesting. Like, I think, um... It's basically the, uh, what I think is like a kind of new type of ETL stack for LLM systems, um, like traditional ETL and data pipelines is really about like, how do you take some data, transform it, uh, right, and, and then like put it into, for instance, like a Snowflake warehouse so that you can like, you know, get operational insights from it.

[00:14:46] Jerry Liu: Here, it's basically how do you take all this mess of, like, unstructured data, structured data that you have lying around in your data silos within the enterprise, or if you're just, like, a person, like, you know, your notes and stuff and, and your PDFs and, um, your PowerPoints, like how do you take all that and just, like, put it into some format so that basically whatever question you want to ask over your data, no matter how complex, you can basically answer that.

[00:15:09] Jerry Liu: To me, that's like kind of one of the promises of RAG and, and kind of what we've been trying to push for. I think right now, as Jeff said, RAG is very big. I think a lot of people building RAG systems, or at least, like, the easiest way to build RAG systems, is very basic. You just do, like, top-k lookup. But there's a lot of potential innovations, both on the retrieval side, and OpenAI, yeah, just launched the retrieval API.

[00:15:33] Jerry Liu: But that actually is relatively, like, you know, it's not like they invented some fundamentally new thing. It's just like, yeah, like, a decently well run, like, retrieval API. There's a lot of innovations there. And then there's also just like potential innovations on how do you actually combine LLMs with retrieval.

[00:15:48] Jerry Liu: And so I think there's been some interesting discussions on that lately that might disrupt the whole like pipeline and architecture of how these things are built. But basically like it's just like a novel way of like taking your data, transforming it, and getting insights from it. And I think that is something that like this whole LLM like hype wave has inspired that wasn't really possible before this.
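
As a rough illustration of the "basic top-k lookup" style of RAG described above, here is a minimal sketch; embed() and complete() are stand-ins for whatever embedding model and LLM you actually call.

    import numpy as np

    def embed(texts):
        # Placeholder: call your embedding model here; random vectors keep the sketch runnable.
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(texts), 384))

    def complete(prompt):
        # Placeholder: call your LLM here.
        return "stubbed answer"

    chunks = ["...chunk about refunds...", "...chunk about shipping...", "...chunk about pricing..."]
    chunk_vecs = embed(chunks)
    chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)

    def answer(question, k=2):
        q = embed([question])[0]
        q /= np.linalg.norm(q)
        scores = chunk_vecs @ q                      # cosine similarity to every chunk
        top = np.argsort(scores)[::-1][:k]           # the naive top-k lookup
        context = "\n\n".join(chunks[i] for i in top)
        return complete(f"Answer using only this context:\n{context}\n\nQuestion: {question}")

    print(answer("How long do refunds take?"))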

[00:16:09] Sree Kamireddy: Yeah, that, that makes sense. And hopefully, hopefully that, uh, that, uh, improves the reliability and the factual nature of the answers. One thing, Jerry, I wanted to, uh, uh, continue on was you mentioned, hey, a lot of teams are just, uh, just using top-k for retrieval and aspects like that. So I'm curious, um, if you, if you have a couple of top tips that you think are low hanging, low hanging fruits that the teams have not, uh, not used, at least for this architecture.

[00:16:40] Jerry Liu: Yeah, um, I think maybe just two. We have, like, entire presentations on this, and so I'm happy to pass along those resources, but maybe just, like, two to start with. One is, um, I've seen other people also tackle this idea of, like, decoupling the chunk used for retrieval versus the chunk used for the LLM. And so instead of embedding, like, a giant text chunk, maybe embed, like, a smaller reference or a summary of the underlying data you're trying to reference.

[00:17:04] Jerry Liu: Um, and then, you know, uh, embed that, and then when you actually feed the context to the language model, you retrieve the larger text chunk or whatever is referenced. Um, that tends to be pretty powerful for enhancing both retrieval and synthesis. Um, the other idea is just adding metadata, adding structured information, um, to, like, vector databases. Like, Chroma supports metadata filtering, and I would encourage people to take advantage of that, because, like, just storing flat chunks of text, uh, is probably like the most naive thing you can do.
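
A minimal sketch of both tips using the open-source chromadb client (the names and data are illustrative, and API details can differ between versions): the small summary is what gets embedded, structured metadata enables filtering, and the larger parent chunk is what gets handed to the LLM.

    import chromadb

    client = chromadb.Client()                        # in-memory; use a persistent client in production
    col = client.create_collection("doc_summaries")

    # Tip 1: embed a small summary and keep a pointer back to the larger parent chunk.
    parent_chunks = {
        "chunk-1": "...long section about the refund policy...",
        "chunk-2": "...long section about shipping times...",
    }
    col.add(
        ids=["chunk-1", "chunk-2"],
        documents=["Summary: refund policy details.",          # this is what gets embedded
                   "Summary: shipping time commitments."],
        # Tip 2: structured metadata stored alongside the text
        metadatas=[{"source": "policy.pdf", "year": 2023},
                   {"source": "faq.md", "year": 2023}],
    )

    res = col.query(
        query_texts=["How long do refunds take?"],
        n_results=1,
        where={"source": "policy.pdf"},               # metadata filtering, not just flat text lookup
    )
    parent = parent_chunks[res["ids"][0][0]]          # feed the larger chunk to the LLM, not the summary
    print(parent)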

[00:17:33] Sree Kamireddy: Yeah, that makes a lot of sense. I think you're basically saying embed the meaning, not the entire document. And with that embedding, when you retrieve, you can retrieve the full, full document, and use the metadata in a structured manner. So it's not always just unstructured, but both structured and unstructured, and take the power of that.

[00:17:53] Sree Kamireddy: That's, that's pretty interesting. Sara, interested to hear your thoughts on the, on the original question, um, which is about, uh, all of the, uh, the, uh, the AI platforms. Some of the AI platforms in the recent months, having streamlined this RAG architecture, how is it going to, say, impact the space in general, but also the uptake of generative AI in the enterprises, and maybe a couple of top tips you have for the teams to implement GenAI easily.

[00:18:21] Sara Hooker: Yeah, well, I love already the discussions from Jeff and Jerry. We're all like, oh, it's a bit cheeky that this is being discussed now because we've all been working on it for a while. So, yeah, Cohere has invested a lot in this space. So, um, I think we released our RAG, uh, kind of productionized version earlier this year.

[00:18:40] Sara Hooker: I think it's worthwhile thinking, why is this meaningful? Like it's, so the key challenges with large language models are personalization, but also falling out of date. So these are large models. So it's very expensive to retrain, to fine tune. In fact, like a lot of the work is like, how do you fine-tune these?

[00:18:58] Sara Hooker: So, RAG is really this idea that you can couple a large language model with an external database, and you can kind of overcome these issues with your data being very different from the large language model, or your data changing over time, which is like most of the world. Like when you actually move outside this like weird static kind of research use case.

[00:19:18] Sara Hooker: Most of how people actually use models involves new data over time. So that's why I think we're all so excited about RAG. I think that what's interesting in terms of what enterprise people should think about is integration. So your data probably as a company is in many different settings and I think a large question would be, how do you make integration really easy?

[00:19:42] Sara Hooker: And how do you make it easy to work with existing databases? So this is, uh, both a really interesting technical challenge, but also, for enterprises, something to think about as you decide what provider you want to go with and, you know, what technical support you have. The second is, um, I completely agree with, um, and I think this is what Jeff was alluding to, this idea of, like, how do you work with all these different setups.

[00:20:04] Sara Hooker: Um, to Jerry's point, I also think it's really important to think about what you put in your, uh, database. So, we've done a lot of research. We've actually, uh, released papers on, like, RAG for toxicity mitigation. And in general, showing how you can mitigate a lot of this drift over time by using retrieval combined with the model. Um, and I agree.

[00:20:23] Sara Hooker: The notion of how you retrieve and the lookup is also important, but perhaps most important is what you put in. And the way that you structure your data that you put in. And this is really interesting. It's worth thinking about and talking to someone who has experience with RAG because it makes a big difference in the overall performance.

[00:20:42] Sara Hooker: So these are just some things to think about, but I think the reason we're all so unanimous in the fact that, oh, it's nice that this is getting the attention that it deserves, is because it is a really meaningful way to pair large parametric models, which are very costly to retrain, with something which is more flexible and can be personalized for your use case.

[00:21:01] Sree Kamireddy: Makes sense. Makes sense. Um, um, let me, let me jump to, um, a, uh, audience question that just came up. Um, as, uh, as one of the panelists, um, I think it's, uh, Jeff, mentioned, reliability is a key thing. How, how to achieve it? How to create a fallback plan? Can the LLM just answer, hey, I don't know the answer, or give some confidence score and, uh, integration so that there is a fallback strategy?

[00:21:37] Sree Kamireddy: Pretty much, the question is saying, hey, LLM, if you don't know, please say so, right?

[00:21:44] Jeff Huber: Yeah, yeah, I think that, um, and I'd love to know Sara and Jerry's opinion here too, but, um, we have ideas here, um, that have not yet fully, uh, manifested yet. We've been busy with other things, but, uh, we think this is possible.

[00:21:56] Jeff Huber: And so, you know, intuitively, right, if you have a supporting sort of database of evidence, it's, you know, fundamentally pins on a map, right? That's the analogy for latent space. It's just pins on a map. It's a high dimensional map, but it's a map. And then if you land a query pin and there's no supporting evidence anywhere nearby on that map, you know, maybe you don't know about that, you know, maybe you shouldn't sort of move forward confidently, right?

[00:22:15] Jeff Huber: And so, there's something about that, like, density of embedding space, which we think may be one powerful tool here, um, related to that is an idea around, like, fencing, and again, these are ideas that I've not even seen, like, reach production yet, so we're kind of maybe reaching a few months forward here, but...

[00:22:28] Jeff Huber: Um, using embedding spaces for fencing as well, and so saying, like, okay, you know, these inputs, or this, this, this domain, this section of the map, both for inputs and outputs, I'm going to greenlight. Um, and then, you know, if the input or output falls out of distribution, falls out of that greenlit area, um, you know, again, it's going to raise an escalation somehow, or throw an error, or something else like that.
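
A minimal sketch of that "pins on a map" idea, assuming a stand-in embed() function and a similarity threshold that would have to be tuned per embedding space: if the query lands far from all supporting evidence, abstain instead of generating.

    import numpy as np

    def embed(text):
        # Placeholder for your embedding model; deterministic random vectors keep the sketch runnable.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        v = rng.normal(size=384)
        return v / np.linalg.norm(v)

    # The "pins on the map": embeddings of the supporting evidence you trust.
    evidence = np.stack([embed(t) for t in ["doc about billing", "doc about onboarding"]])

    def answer_or_abstain(question, min_similarity=0.75):
        q = embed(question)
        best = float((evidence @ q).max())            # similarity to the nearest evidence pin
        if best < min_similarity:                     # query lands in a sparse region of the map
            return "I don't know; no supporting evidence nearby."
        return "proceed to retrieval and generation with the nearby evidence"

    print(answer_or_abstain("What colour is the CEO's car?"))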

[00:22:48] Jeff Huber: And so, I think there's tools there that, um, like, again, to Sara's point, historically, a lot of this stuff has not been done in academia. Simply because the incentive structure wasn't set up for shifting datasets, right? You want static datasets, that's how benchmarks work. Um, but real world data is not like that, and so I think that sort of explains why there's like, a lack of these tools, um, today.

[00:23:08] Jeff Huber: That being said, obviously there's a lot of interest and effort in this direction. Um, and it's something that, you know, obviously at Chroma, we're interested in pursuing more on the research side. Um, there's plenty of, you know, there's open source, right? So like, you know, plenty of people all over the place that are doing interesting work here.

[00:23:21] Jeff Huber: Um, so that's like one thought and some ideas.

[00:23:24] Sree Kamireddy: Awesome, awesome. So essentially, um, if the, if the answer, um, is not in the space of what you've been trained on, hey, how can you be confident, right? Like that's, that's, that's one approach that you mentioned. Um, I'm, I'm curious, Sara, has, have you seen in, in your research anything around tagging the LLM response with confidence?

[00:23:47] Sara Hooker: Yeah, I think so. I guess it's a few different levels, right? There's one where typically productionized models have a layer which is fairly simplistic, which is rule based, where you don't want to produce a response. But then there's also a conditioning of the model space where you kind of train it to not respond to certain types of questions.

[00:24:06] Sara Hooker: Jeff, I, I think what you're talking about with fencing might actually be similar to what we saw with the Stable Diffusion safety filter, which was a little crude because it was open source, so it was easily, um, perhaps, subverted. But the idea there was embedding similarity with some embeddings that were not disclosed, but that, I believe, were inferred to be very related to violence or nudity.

[00:24:27] Sara Hooker: And so if the embeddings were too close, typically the stable diffusion would fail to produce. So these are existing ideas. I think we need to add, uh, you know, much more research momentum around them, but there are ideas which are currently productionized around this idea of where you abstain. What's the core problem here?

[00:24:45] Sara Hooker: The core problem here is that typically if you clamp down on low-certainty, uh, generations, you also get rid of what people really like about these models, which is their creativity. So we often talk about hallucinations in a bad framing when it's decoupled from factuality, but it's also what we tend to like about, like, oh, the ability to interpolate, and that is most creative when it tends to be in a sparse, sparsely populated space.

[00:25:10] Sara Hooker: So there tends to be this tension. So what I would say, there's a really interesting research question here, is how do you distinguish between those sources of uncertainty? Um, and the real goal is, you will have many sparsely populated spaces. Some are actually quite, um, uh, quite useful for how people think about these models as engaging or creative, um, or iterative.

[00:25:33] Sara Hooker: And so, um, we want to just decouple those from the ones that are, uh, really, uh, what annoys us when they get a fact wrong, or when I type in, tell me about Sara Hooker, and it gives me this amazing, uh, kind of story about me winning a championship, which is not true. So I think that's the interesting research question that we actually have to tackle.

[00:25:55] Sree Kamireddy: Yeah. Maybe soon Sara, maybe soon you'll win a championship.

[00:25:58] Sara Hooker: Oh, perhaps.

[00:25:59] Sara Hooker: Well, according to these LLMs, it's already happened. I won the championship twice, so I'm now retired.

[00:26:09] Sree Kamireddy: Oh, breaking news. LLMs can predict the future. That's what's happening right here. So one thing, a couple of questions I wanted to ask based on What we have discussed so far.

[00:26:22] Sree Kamireddy: One is, let's say I have, I've created this golden data set for evaluation and benchmarking that we all talked about, right? What, what are those metrics that now I as an enterprise should track so that I can deploy LLMs with confidence and responsibly? Jerry, you have any thoughts on that?

[00:26:45] Jerry Liu: Yeah, sure. I mean, I don't think, to be honest, I think everyone's kind of still figuring this out in terms of like standard, a standard set of metrics that works for like every use case.

[00:26:53] Jerry Liu: Um, and so the cop out answer is basically, like, some of this is going to be kind of use case specific. And so actually every enterprise team building this should probably figure out, like, the right business metrics that they want to optimize for. For instance, like if you're a consumer app, like a lot of people, you know, it might be okay just to test in production.

[00:27:08] Jerry Liu: Just run A B testing on whether consumers actually like, like the responses, right. And collect user feedback. For stuff that you can't really test in production or get consumer feedback, you might want to have good offline metrics, but some of it might be very use case specific. I can just talk about like some basic, um...

[00:27:23] Jerry Liu: Uh, metrics that I think, you know, we have within the core framework, uh, that I think, uh, like, there's evaluation libraries that people are using as well. Um, some basic ones is basically using LLMs to evaluate other LLMs and figuring out creative ways to do that. And so, um, If you have a ground truth golden dataset of like question and answer pairs for RAG, for instance, you can basically just use an LLM to compare the predicted response against the generator response.

[00:27:49] Jerry Liu: You could also compare that via, like, the right embedding representations, but that's kind of like an open problem. We haven't been able to track, like, how good, like, semantic similarity matches up to, like, the kind of relevance of the answer. Um, another one is, uh, if you're building, like, RAG, for instance, uh, you should probably just do a benchmark of retrieval on its own, which is not an LLM problem, it's just a retrieval problem.

[00:28:11] Jerry Liu: If you don't get back the right answers, you're not gonna get back the context that the LLM could use to actually synthesize something. And so that's just a ranking problem. Like you use, like, uh, NDCG, uh, MRR, you can basically create some sort of, uh, ground truth golden dataset of retrieval and then see if that works.

[00:28:28] Jerry Liu: Um, and then maybe the last point I'll make is, I think that is one part of it, and then the flip side is the fact that there's some interesting things with, like, um, interactions between retrieval and synthesis that happen in RAG, that is an argument for more end-to-end evals, because a lot of times, like, if you do, like, re-ranking, or if you do, like, different, like, chunk sizes or something, like, there's things that might improve your retrieval metrics that don't necessarily improve your generation metrics.

[00:28:55] Jerry Liu: And so you probably want to have both.
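
A hedged sketch of the retrieval half of that: a small golden retrieval set mapping each question to the chunk IDs that should come back, scored with hit rate and MRR. The stub retriever is a placeholder for your real vector-store query, and NDCG follows the same pattern with graded relevance.

    # Golden retrieval set: question -> ids of the chunks that should be retrieved.
    golden = {
        "How long do refunds take?": {"chunk-1"},
        "When do orders ship?": {"chunk-2"},
    }

    def hit_rate_and_mrr(retrieve, k=5):
        """retrieve(question, k) should return a ranked list of chunk ids from your retriever."""
        hits, rr = 0, 0.0
        for question, relevant in golden.items():
            ranked = retrieve(question, k)
            if any(cid in relevant for cid in ranked):
                hits += 1
                rank = next(i for i, cid in enumerate(ranked, 1) if cid in relevant)
                rr += 1.0 / rank
        n = len(golden)
        return hits / n, rr / n                    # (hit rate @ k, mean reciprocal rank)

    # Example with a stub retriever; swap in your real vector-store query.
    print(hit_rate_and_mrr(lambda q, k: ["chunk-1", "chunk-2"][:k]))

Generation quality is then scored separately (for example with an LLM judge against the golden answers), since, as noted above, better retrieval numbers do not automatically mean better end-to-end answers.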

[00:28:58] Sree Kamireddy: Yeah, that makes sense. That makes sense. Awesome. Um, so, Jeff, um, um, any thoughts on this? Anything to add? Uh, any, any areas we should deep dive on for this?

[00:29:11] Jeff Huber: I would, yeah, I would second everything Jerry said. I think it's right on point. Um, you know, I would, I would actually double down and say you should go even deeper on like benchmarking your, um, your retrieval system.

[00:29:22] Jeff Huber: So one thing that people don't necessarily realize about how most retrieval systems work is, the moment you flip from exact nearest neighbor, which is a brute force approach, to approximate nearest neighbor, which is fast, it is approximate, and how good your approximation is, is strongly dependent on the type of embeddings you have, how often you're updating those embeddings, and then the underlying algorithm, um, that powers that. And, uh, we've seen in some cases, um, like, not to name names, but, like, you know, there's certain scenarios where people's recall, which is, you know, if I brute force it, I would get back these 10, right?

[00:29:58] Jeff Huber: Okay, if I do an approximate nearest neighbor, I expect to get back eight or nine of those, at least, right? Um, and we've seen cases where people only get back, like, five, um, from those ten. And so, obviously, as Jerry was saying, if you don't get back the right retrieval, like, good luck with the generation.

[00:30:14] Jeff Huber: And so, uh, this is one thing that I think people are not sensitive enough to yet. is understanding that recall for approximate nearest neighbor vector search really is an important thing, and people are not paying attention to it, people are not monitoring it either, um, they don't know, right? Because this can also change over time as you update the graph.

[00:30:33] Jeff Huber: And so, uh, I plug that, at least, as something that's also important.

[00:30:38] Sree Kamireddy: Oh, yeah, that's definitely quite important and I hope teams start checking on the recall. So one thing, Sara, I wanted to go back to something you had mentioned a few times, which is about personalizing the models, adapting the models for your use cases and enterprise.

[00:31:01] Sree Kamireddy: I want to, I want to go, uh, quote, uh, Jerry, um, from, uh, from his past podcast where Jerry mentioned, Hey, RAG is an efficient hack, right? Um, so I want to, I'm curious under what circumstances you feel like RAG is sufficient and when the team should put in the effort to fine tune. So essentially, what is your recommendation?

[00:31:29] Sree Kamireddy: When should the teams fine tune? When should maybe even build from scratch or maybe RAG is sufficient for their scenarios?

[00:31:37] Sara Hooker: Yeah, I think that's worthwhile thinking about. Um, yeah, this is a fun question. Wow, Jerry, I'm excited for you to come on after this. This sounds like a good framing. Yeah, I mean...

[00:31:47] Jerry Liu: Wait, wait, wait, hold on. Just to clarify to the audience, like, I, I, I did a podcast with swyx Latent Space, okay? And then I said the words, RAG is a hack, and then I said the words, it's a very good hack. And he cropped out the second part, and only put the first part, and tagged it on a Twitter post. So just for some context.

[00:32:07] Jerry Liu: Anyway, back to you, Sara.

[00:32:08] Sara Hooker: Well, I think it's a fun starting point. So, I mean, the question really gets at this idea of why fine tuning is challenging. So, typically, the larger the model, um, firstly, the more cumbersome it is to fine tune, because you have to update more weights, but also, typically, the more data you need to meaningfully update.

[00:32:29] Sara Hooker: Our hack to this with fine tuning is adapter-style systems. So, you know, often we have small little weight pockets that can be kind of switched in and out depending on the use case. Why it's beneficial is it means that although you're adding a few weights, it means that from a device perspective you can fit onto fewer devices, because you don't have to host as many copies of the weights for different use cases.
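
A minimal sketch of that adapter idea in PyTorch (a LoRA-style low-rank update bolted onto a frozen layer). This shows the general pattern only, not any particular provider's implementation; the layer sizes and rank are arbitrary.

    import torch.nn as nn

    class LowRankAdapter(nn.Module):
        """Wraps a frozen linear layer and adds a small trainable low-rank update."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                      # the big base weights stay frozen
            self.down = nn.Linear(base.in_features, rank, bias=False)
            self.up = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.up.weight)                   # adapter starts as a no-op
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * self.up(self.down(x))

    # Only the tiny "weight pocket" is trained, and it can be swapped per use case.
    layer = LowRankAdapter(nn.Linear(4096, 4096), rank=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
    print(f"trainable adapter params: {trainable:,} vs frozen base params: {frozen:,}")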

[00:32:51] Sara Hooker: So, um, Is RAG an alternative to fine tuning? It can be, but it can also be complementary. I think people don't realize this, is that you might actually have various different, um, ways that you're adapting at the same time. It's actually not a clear, like, mutually exclusive option. What RAG allows is probably, um, in some ways, uh, a more traditional way of...

[00:33:14] Sara Hooker: updating a model that might be familiar to enterprises because your notion of how you're interacting might just be in terms of a database. Um, and, uh, that can be quite beneficial if it doesn't feel as foreign as like maybe fine tuning or adapting weights. But I actually think the most interesting combinations will come from both.

[00:33:32] Sara Hooker: So maybe you'll do, um, some ongoing fine tuning, uh, and maybe have adapters which are specialized to different parts of your business, because you might be using the same model in various different areas, and then you might combine that with RAG. So, um, I do want to kind of zoom out and kind of maybe, uh, speak to why Jerry might say it's a, uh, kind of, uh, an unsexy hack, if I could reframe it.

[00:33:58] Sara Hooker: Um, ideally, in the future, we have systems which are more flexible. Like, these models in some ways, it's pretty basic that we have to update all the weights at once, that we don't have specialization within these massive models. And this is a key limitation, and there's very interesting research directions around mixture of experts where you have more specialization.

[00:34:20] Sara Hooker: But frankly, mixture of experts right now delivers on efficiency, but it doesn't really create true specialization. You kind of have to squint to see it. Um, so I think there's just really interesting directions around that, where we avoid things like this, like imposing another database or a side set of weights after training, because in some ways, what it means is we're doing acrobatics around the fact that we haven't designed

[00:34:42] Sara Hooker: A modular, flexible architecture from the get go. Um, so that's maybe my interpretation of why I kind of, I could see why it might be cool to hack. But Jerry, I'm curious to hear what you think.

[00:34:54] Jerry Liu: No, I mean, I think, I think all your points I totally agree with. I'm trying to see if there's anything intelligent I can add.

[00:35:00] Jerry Liu: I think maybe, maybe the general abstraction I'm thinking is like, um, the way I think about it is If you think about LLMs as reasoning engines, um, every, like, vector database, like, database API system, uh, can be kind of thought of as just, like, a tool. Right, and if you think about a human, a human uses tools, um, many of these tools are imperfect, like if you search something on Google, you're not going to get back necessarily the, always the right results, but that means that you as a human will go back and, and enter like another search query or use Bing or something to actually find the results, or, or ChatGPT, right, to, to try and find the results that you're looking for.

[00:35:35] Jerry Liu: And so I think, like, what we're doing now with RAG is that we're kind of like bundling everything into one architecture, but really what RAG is, is it's an LLM that reasons over your data using a vector database as a tool. Um, and so I think, like, what's interesting to me is that, you know, if you think about an intelligent system, there will always be tools that aren't part of your own, like, system that you'll have to use and you'll get back information, um, but there will always also be, uh, potentially, like, stuff that is, like, within kind of you, right?

[00:36:09] Jerry Liu: Like, like, for instance, your own conversational memory or stuff. And so I just don't think we've figured out like the right ways to tackle some of these things like personalization and conversational memory and how to actually integrate that maybe as part of the architecture itself. Um, and, and I think of, like, kind of interacting with, like, a vector database, for instance, as just one example of tool use.

[00:36:28] Jerry Liu: But I think in the future, this will probably just evolve into some generic, like, a gigantic interaction with an external system, as well as being able to internalize concepts. Um, and, and so I think that that's, like, the part that I'm not completely sure where the, like, final architecture for that will be.

[00:36:44] Sree Kamireddy: It makes sense.

[00:36:46] Jeff Huber: I think in today's, like, current regime, you know, there's no, we don't yet know, and maybe we'll never know, maybe it's not possible to, like, deterministically update the weights of a neural network to get, you know, precise behavior to change, or precise facts to be changed, or whatever. Um, and this is why Jerry's alluding to this idea of, like, separating out the reasoning capabilities of the language model from the memory capabilities of the custom systems.

[00:37:06] Jeff Huber: Um. I think that's like a pretty compelling idea. I think like in today's reality, one mental model that I think is useful, maybe it's not entirely correct, Sara can debunk this if it's wrong, is that uh, you know, fine tuning is kind of teaching a model like how to think, and then the retrieval or other data sources is teaching the model like what to think.

[00:37:25] Jeff Huber: Um, and obviously those two things combined can be pretty powerful. I think the reason that most people today are not doing fine tuning, and mostly just retrieval, is because, turns out, like, general English language how to think knowledge, which is what has been put into the model based on the entire internet, works pretty well for most domains.

[00:37:41] Jeff Huber: Um, that being said, if you want to do like legal stuff or doctor stuff, that's a different way of thinking. It's a different ease. And so like, that's probably a good case for fine tuning. Um, I think in the future we'll see, so today is relatively little fine tuning. I think we'll see a lot more fine tuning in the future, particularly as models get much smaller and faster, um, and people want to run them like an open source and otherwise, um, like you'll see.

[00:38:04] Jeff Huber: I would, I would, I would wager to guess at least 50 percent or more of inference would be done by, uh, fine tuning models, um, versus, like, sort of generic models. Um, now, when does that happen? Maybe, I don't know, two or three years away, um, so it's certainly not tomorrow, but, um, I expect to see fine tuning become much more important in the future than it is today.

[00:38:23] Sree Kamireddy: Yeah, that makes sense. Fine tuning smaller models and making it efficient, and I think one of the other things is fine tuning and RAG, they can all work together and... Uh, really deliver the business impact. We, we have a few minutes left, but, uh, the results for one of the polls came back where we asked, Hey, what metrics do you care about?

[00:38:42] Sree Kamireddy: And, uh, uh, cost actually is one of the top, top metrics they care about. Um, and then hallucinations, right? Like almost 70 percent of the audience cares about hallucinations. Uh, 60 percent care about cost. And there's one, one, uh, audience question that I want to ask in this regard. Um, and Sara actually mentioned this, uh, which was, um, about efficiency, but, uh, primarily around quantization.

[00:39:08] Sree Kamireddy: And the question that the audience member is asking: how does quantization impact the quality of results? And how relevant are the techniques related to tuning deep learning models to tuning LLMs?

[00:39:21] Sara Hooker: I can speak to that. We've done a lot of research on quantization at scale. So typically, up until very recently, there would be this really interesting phenomenon where, if you go above 6 billion parameter models, if you try and quantize, you see marked cliffs in performance.

[00:39:36] Sara Hooker: Um, it, we actually released work this year that shows that a lot of this is because of the pre-training conditioning. So particularly if you use more robust representation during pre-training, like be float instead of float. Even if you're using the exact same representation, so if you're using bfloat16 versus float16, you actually see that, um, you have much better robustness when you quantize.

[00:39:59] Sara Hooker: So a lot of the quantization research now is like, how far can you push it? Can you go down to one bit? Can you go down to four bit without reducing precision? Why does this matter? Because quantization is one of those tools where... You really are having your memory each time. So if you go from 8 bits to 4 bits, you're having it.

[00:40:19] Sara Hooker: 16 to 8, you're having it. So it's a very powerful tool to quickly mean reduce the number of devices you have to host on. And typically when you're in a multi device regime, this is the biggest factor which is impacting your cost. And your latency, because what it means when you're in a multi device regime is that you essentially have to route, um, and you have communication overhead, um, but it also means you're paying for multiple devices.

[00:40:43] Sara Hooker: And so this is really the crux of, like, the efficiency problem at scale is how do we reduce it to fewer devices? Um, And my dream, which Jeff is alluding to, is to get it down to the size where you have much more flexibility to fine tune. I actually think fine tuning is really critical just for adapting to data change over time.

[00:41:05] Sara Hooker: Even forget the personalization use case, but even core developers of the large language models, what we want to do is always be up to date with the data that changes. Language evolves over time. That's what makes it so powerful for how we communicate and how we connect to each other. So fine tuning is really critical right now.

[00:41:23] Sara Hooker: It's pretty cumbersome. So like, how do we make that easier? One way is techniques like quantization that just make these models more nimble.

[00:41:30] Sree Kamireddy: Yeah, that makes a lot of sense. Jerry or Jeff, anything, any thoughts based on your work?

[00:41:40] Jerry Liu: No, I don't. Sara knows way more about this than I do.

[00:41:44] Sree Kamireddy: So one thing before we run out of time, Jeff, I wanted to, uh, I want to ask you, given that vector DBs have become such a crucial part in this generative, uh, generative AI journey for all the enterprises, what are the top three things that customers request that are currently not in the market?

[00:42:03] Jeff Huber: That's a good question. Um, I think that we're still, you know, a lot of my comments about, uh, being, being fairly early in the adoption cycle here is because when we talk to enterprises, sort of ask them about, okay, like, what are your requirements, you know, QPS this, latency that, and we ask them about, like, what features do they want, you know, we're expecting to hear back, like, oh, we really want to have, like, density and clustering, we want to have these visualization tools, and we want this, and basically everybody is silent about all requirements. Um, so, like, people are, like, people don't know, um, they don't have, they don't know what their QPS needs to be, and, uh, they also don't know, like, what other features would be beneficial.

[00:42:39] Jeff Huber: So, I think they're basically at the typical place that ML has been, deep learning has been for a long time, which is like, we made a thing with some demo data, or some data, we got into this demo state, and uh, now we want to put it into production, like, what are the tools that are going to get us to cross the chasm?

[00:42:55] Jeff Huber: Um, and of course this has been a thing for deep learning for a long time, is that gap between demo and production. And yeah, obviously it still exists today, using language models, using embeddings. Um, and so, I mean, that's like the biggest one if it's, you know, singular point of like frustration is like, Hey, it's soft, because in the software you can make things better, right?

[00:43:13] Jeff Huber: You kind of generally know, okay, it's not doing what I want, I should do this, this, or this, and that would be better. It's gonna require time and work, but it'll work. Um, I think with AI, it's a little bit different where it's like, okay, well, it's kind of working, how do I make it better? And you're, you're stuck, you're at a dead end.

[00:43:27] Jeff Huber: And so, I think, you know, if I had to choose like one thing, that's like the biggest sort of like, existential, like, fear or frustration with folks. That is what it is. Um, you know, beyond that, I think that, yeah, again, we're not, we're not seeing people, uh, you know, people are still... At best, people are asking for a faster horse, right?

[00:43:45] Jeff Huber: They're not asking for the car. Um, and I think in most cases, people aren't even asking for a faster horse. They're just saying like, why horse not good? Uh, so we're early. We're early. Yeah.

[00:43:55] Sree Kamireddy: That's, that's quite, quite an interesting analogy. Uh, thank you folks. This is, uh, we are out of time. Thanks Sara, Jeff, and Jerry.

[00:44:02] Sree Kamireddy: Wow, so much, so much knowledge, so much information. Thank you so much. Really appreciate it.

[00:44:08] Kirti Dewan: Thanks, Sree, Sara, Jerry, and Jeff. That was a great discussion. Awesome conversation on RAG and fine tuning and embeddings and all the goodness that is going to be coming out in the hours, days, and weeks to come.