AI Forward - Fireside Chat: CEOs in Conversation

In this fireside chat, Lukas and Krishna discussed the rising potential of search, summarization, and chatbot applications in leveraging LLMs, emphasizing their transformative impact on business operations. Key challenges highlighted include the complexities of LLM deployment at scale and the importance of strategic evaluation and testing in maximizing these technologies' effectiveness.

Key takeaways

  • Advancement and Integration of AI in Business: AI-driven search, summarization tools, and chat interfaces are having a significant impact on improving and innovating business operations, particularly through the application of LLMs.
  • Importance of Strategic Evaluation and Testing: Rigorous evaluation and testing methodologies are critical to improving LLM applications, with an emphasis on the balance between automated and human-led assessments.
  • Evolving Skills to Deploy and Manage LLMs: Deploying and managing LLMs in business requires a shift in skills, with traditional software engineers transitioning towards AI application development, aided by increasingly accessible APIs that reduce the machine learning expertise needed.
  • Practical Implementation of LLMs: Enterprises looking to implement LLMs should start with simple applications and iteratively incorporate more complex features and fine-tuning.


  • Lukas Biewald - CEO and Co-founder, Weights & Biases
  • Krishna Gade - CEO, Fiddler AI
Video transcript


[00:00:04] Kirti Dewan: Welcome everyone. Um, we are now in the fireside chat between Krishna Gade, CEO and co-founder of Fiddler, and Weights & Biases CEO, Lukas Biewald. They are going to discuss industry trends, customer buying patterns, and how the AI world is shaping up around ML. It definitely is shaping up, uh, with something new every day.

[00:00:27] Kirti Dewan: Um, so this should be pretty fun. Krishna, over to you.

[00:00:34] Krishna Gade: Thanks, Kirti. Um, hey, Lukas, uh, thanks for coming to this fireside chat. Uh, you know, I think, um, I'd love to give, uh, an intro to Lukas. Lukas, uh, has been pioneering this, you know, state of ML. Uh, we both, uh, were working on search, uh, many years ago. And, you know, funny story is I had the great opportunity to join his team when he was building a natural language search engine many years ago, almost 15, 16 years ago.

[00:01:06] Krishna Gade: And, um, it, you know, it didn't happen for whatever reason, but, you know, essentially I followed his career. He sort of went on to do great things, you know, CrowdFlower and now Weights & Biases. Lukas, you know, very nice to have you at our summit, you know, I'm excited for our conversation.

[00:01:24] Lukas Biewald: Thanks, Krishna.

[00:01:24] Krishna Gade: Awesome. So, uh, so I guess, uh, Lukas, you know, the talk of the town is generative AI. We had several things happen this week, you know, the AI safety executive order, the OpenAI demo day. Let's start from the beginning, right? So, you know, what are you seeing as some of the promising areas where, you know, your customers, or in general the people you're talking to, are applying generative AI?

[00:01:49] Krishna Gade: Yeah. Absolutely. And right now, and also in the future.

[00:01:53] Lukas Biewald: Sure. I mean, it's, it's, um, you know, the future is so hard to predict. Um, and even, you know, kind of getting a handle on what's happening right now is often, you know, a challenge, uh, you know, for us here at Weights and Biases.

[00:02:05] Lukas Biewald: But I think that, you know, the use cases that we see working today that we expect to see, you know, a lot more of in the near future, um, I think it comes down to, like, search and summarization. I mean, those are the things that seem to work just phenomenally better, um, than in the past, right? So we see a lot of our customers, you know, kind of building, um, uh, you know, essentially, like, chat interfaces to their documentation.

[00:02:30] Lukas Biewald: In fact, we've done that internally at Weights & Biases, and it works really great. So, you can go to our Discord channel and actually just, you know, chat with a bot that's pulling up, um, docs. And also, you know, internally, we use it to summarize feedback from our customers and, um, GitHub issues that come in.

[00:02:49] Lukas Biewald: And so I think there's, there's a whole slew of, um, summarization use cases in, in a business context that, that works super well. I mean, it's also pretty noticeable how well, um, generative AI works for, for writing code. I mean, you know, I use, um, Copilot all the time. I, I put tons of questions into, to GPT, you know, as I, as I run into issues.

[00:03:13] Lukas Biewald: So, um, you know, clearly that's a case that works super well, and I think a lot of people have switched to it. So, um, you know, those are the big ones, but I mean, there's so many, um, creative uses out there that it's kind of unclear what's going to really take off.

[00:03:29] Krishna Gade: Yeah. So, like, search, you know, chatbots seem to be top of mind. Yeah. Great. Great.

[00:03:35] Lukas Biewald: Is that the, that's the same stuff you see, or do you think I missed something?

[00:03:37] Krishna Gade: Yeah, I think it is, it is similar, you know, people want to do internal, you know, documentation searches, you know, customer support is something that we see quite often. So I guess like, you know, there are also some shortcomings with LLMs, right?

[00:03:51] Krishna Gade: And it's not all hunky-dory. I mean, obviously it's amazing technology. Uh, what are you seeing, you know, when you talk to your enterprise clients? What are some of the things that you have experienced yourselves in terms of shortcomings of LLMs today? And, you know, how do you think we need to address that?

[00:04:10] Lukas Biewald: Sure, well I will say, you know, the people that are least excited about LLMs that I talk to are execs at bigger tech forward companies, and I think the biggest issue that they're worried about is how hard it is to deploy these LLMs at massive scale. So, you know, I think in terms of, like, deploying it at small scale or deploying it in offline mode, you know, I think that works.

[00:04:33] Lukas Biewald: Not a problem at all. You know, I mean, I think, um, you know, you can run GPT, I mean, these days you can run Llama 2 or Mistral or whatever on your laptop. I do. Um, but I think that actually making this work at, like, internet scale is pretty challenging. I mean, I feel like OpenAI has done it, but hard to really argue.

[00:04:54] Lukas Biewald: I mean, maybe Bard, you know, maybe, you know, Azure, like there's a little bit of this, but it's really, really difficult. And so I think you see, like, a lot of fear, where I think a lot of execs at these companies are saying, like, you know, my CEO is like, okay, this changes everything, but I'm sort of responsible for making it happen.

[00:05:11] Lukas Biewald: Um, and I think that there's not, like, a clear path to that. And this stuff demos phenomenally well, but every time you try to make this actually work and actually handle, you know, all the cases that are meaningful, it's way harder and it's kind of unclear what to do, especially if you're using, like, an off-the-shelf

[00:05:31] Lukas Biewald: model. Like, you know, now you can fine tune, but that's a little challenging, and it's unclear how often that, you know, really works. And so it's easy to kind of get stuck and not know how to actually, you know, get these things deployed. And I think it's also, we're still learning what use cases really work and don't work.

[00:05:48] Lukas Biewald: So, um, you know, I get, I think like people get frustrated because these generative models can't solve literally every business problem that they have. And that bar is just ludicrously high.

[00:05:59] Krishna Gade: Absolutely. I think, you know, there's, like, an interesting play between the FOMO from the CEOs, uh, to deploy generative AI and the FOMU, what we call the fear of messing up, from the execs, as you just described, that something bad will happen.

[00:06:12] Krishna Gade: Right. I think let's dive into this a little deeper, right? You know, let's say an enterprise is thinking about, you know, seriously developing and deploying LLM applications. What are some of the common pitfalls that they can avoid? You know, how do they go about it?

[00:06:27] Lukas Biewald: Well, you know, I mean, I think one of the biggest things is, you know, how do you handle the cases where the LLM doesn't work perfectly, right?

[00:06:35] Lukas Biewald: And that's very, very product dependent, right? But I think if you look at, you know, the places where this stuff, um, you know, kind of works well, like search is a great example where, you know, if you don't get the answer as the top result, you know what I mean? Like there's a second result, you can look at a third result and we're, we're okay with search only working, you know, 80, 90 percent of the time, right?

[00:06:54] Lukas Biewald: And so those applications can be important, but if you have, like, a way, um, to handle the cases where it doesn't work perfectly, um, it's going to be so much easier, so much more feasible to deploy it right now, because it's very unclear what the true accuracy of these models is in lots of different applications.

[00:07:12] Lukas Biewald: In fact, it's even unclear how you should be measuring, um, you know, accuracy. I'll give you an example. Like, we have an internal chatbot, um, and you know, we think about, like, okay, is it, you know, returning a useful document to you, right? You know, you want that to be higher, obviously, but then, you know, it also occasionally, you know, hallucinates information, right?

[00:07:31] Lukas Biewald: And that's, like, a lot worse. Like, I mean, there's nothing worse than wrong documentation, you know, that you get. And so, you know, you have to kind of figure out, like, okay, how much to punish that. And as you try to pull up recall, you can also pull up hallucination at the same time. So even, like, what's the right metric here that, you know, you want to be optimizing for a good customer experience is really important.
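
The trade-off described here, rewarding useful answers while punishing hallucinated ones, can be sketched as a single score. This is only an illustration and not the metric used by either company: the `Judged` schema, the `score` function, and the penalty weight of 3.0 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Judged:
    """One chatbot answer with hand-assigned labels (hypothetical schema)."""
    useful: bool        # did the answer point to a helpful document?
    hallucinated: bool  # did the answer state something not in the docs?


def score(results, hallucination_penalty=3.0):
    """Mean per-answer score: +1 if useful, minus a penalty if hallucinated."""
    if not results:
        raise ValueError("need at least one judged answer")
    total = 0.0
    for r in results:
        total += 1.0 if r.useful else 0.0
        total -= hallucination_penalty if r.hallucinated else 0.0
    return total / len(results)


# A cautious system: fewer useful answers, no hallucinations.
cautious = [Judged(True, False)] * 6 + [Judged(False, False)] * 4
# An eager system: more useful answers, but two of them hallucinate.
eager = [Judged(True, False)] * 7 + [Judged(True, True)] * 2 + [Judged(False, False)]

print(score(cautious), score(eager))            # penalty applied
print(score(cautious, 0.0), score(eager, 0.0))  # usefulness only
```

With no penalty the eager system looks better; with the assumed penalty the ranking flips, which is why the weight has to be chosen deliberately for the product.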

[00:07:51] Lukas Biewald: And, you know, what I always say is get something lightweight working end to end first and then iterate on it versus I feel like a lot of teams have this instinct to sort of like nail each step of the process and then you get to the end and you know, you have something that's unusable.

[00:08:05] Krishna Gade: Yep. And there's a matrix of choices here, right?

[00:08:08] Krishna Gade: There is prompt engineering, there is RAG, there is fine tuning, there's building your own LLM, and then there is like, use cases we talked about, internal, external, search, chatbot, you know, is, are you seeing like a marriage between these things, you know, and you talked about starting small and getting, you know, getting to it, you know, what, what is like a good baby step to get started?

[00:08:28] Krishna Gade: You know, like, uh, you mentioned you created your own internal chatbot, you know, like, how did you go about that?

[00:08:34] Lukas Biewald: Yeah, I mean, a great place to start is taking, you know, GPT, even GPT-3.5, and just using it. And I think if you can do that, that's a very, very lightweight, you know, place to start. And you can literally just try it in the chat interface and get a flavor for, you know, how well things are working.

[00:08:49] Lukas Biewald: Now there's cases where you can't do that. Yeah. And in those cases, you know, you're going to want to use, um, you know, maybe Llama or Mistral. And I feel like every day that gets easier. So I think it's, like, you know, it's actually not that hard, um, to stand this stuff up, but it does take an engineer and it does take, you know, some effort.

[00:09:08] Lukas Biewald: And now, like, I think RAG is kind of orthogonal to that, right? So if you have a knowledge base of any significant size that you want to be, like, searching over, I think you have to use RAG, right? I think, like, fine tuning is not going to get you, you know, the accuracy you need if you need to feed in specific information that's not going to be available to, like, uh, you know, GPT models, or anything that's, like, in the uncrawlable web, or that they didn't crawl, or, you know, that was created in the last 12 months. You know, RAG is what you have to do.

[00:09:35] Lukas Biewald: Also, I don't think it's that hard. You know, I think it's, like, a new skill, but a lot of people are learning it, you know, a lot of options. I mean, now even, um, OpenAI will do it for you. So, um, you know, I think that's another good thing to do. And also, we're seeing more and more, um, fine tuning, and I expect to see even more of that, because I think it used to be hard to fine tune and, you know, QLoRA came along, which made it a lot more feasible.

[00:09:57] Lukas Biewald: Like, now you can fine tune on a single GPU in most cases. And now, you know, OpenAI even makes it, like, trivial. Um, Anyscale has an offering to make it trivial. And it really is just, like, adding examples into these models through an API. And, um, our experience is that that basically always, um, helps you. So for any application where you care about,

[00:10:19] Lukas Biewald: you know, accuracy, why not fine tune it on examples that you have? I think, like, you know, sort of, prompt engineering versus fine tuning versus RAG, it's like, well, you know, you should do all the things that you need to, to get, um, the quality you want. But of course the simplest thing to do is just try, you know, GPT-4 or 3.5. Why not, you know, start there and then kind of iteratively add, um, these pieces as you want to improve the accuracy.

[00:10:45] Krishna Gade: Yeah. One of the things that our teams are also fighting with along with where to start is like what skills do they need to have, right? You know, and now we are talking to companies that, you know, don't have a lot of machine learning folks.

[00:10:58] Krishna Gade: And even if they have, maybe they have data scientists probably trained on the old school ML. You know, what are you seeing from your perspective? You know, as they get started, do you think generative AI has, you know, made the landscape, like, you know, sort of, uh, easy for software engineers to make the transition and become, like, AI application developers?

[00:11:19] Krishna Gade: You know, how are you seeing this on your side?

[00:11:21] Lukas Biewald: Yeah, for sure. I mean, don't you think so? I mean, you know, I think obviously, you know, Weights and Biases, you know, started off as a product for, um, you know, machine learning experts. Exactly. So we love machine learning experts. And if you want to train your own LLM, we love that.

[00:11:36] Lukas Biewald: Like we totally support that. Should you train your own LLM? Probably not, right? I mean, there's, there's like a lot of amazing models out there. I don't think you need a special, you know, machine learning, you know, education or experience to do fine tuning. I mean, fine tuning is pretty simple. You add, um, you know, examples to your model.

[00:11:54] Lukas Biewald: I mean, like, you're kind of getting it to run. It used to be hard, but now with Anyscale and OpenAI, it's not hard. Um, and, you know, I think at that point, really, it's just sort of, like, you need to have the machine learning mindset, right, which is different than software engineering in the sense that you have to really look at lots of examples.

[00:12:11] Lukas Biewald: You have to be rigorous about evaluating the quality of your stuff. I think that is actually pretty different, um, than coding, but I think a data scientist, a software engineer, if you can hook up the API, you can measure and iterate. You know, that's really all you need, I think, to make these models, um, work.

[00:12:29] Lukas Biewald: I don't know if you, if you agree with that Krishna.

[00:12:32] Krishna Gade: Absolutely. I think, you know, I think the landscape has become very flat, right? I think the knowledge gap is reduced and, you know, in so many ways, with the APIs, uh, that the models have, you know, you can get started with. You know, traditional software engineers can make the transition to be AI engineers much more quickly.

[00:12:48] Krishna Gade: Uh, so I guess like we touched on evaluation, right? That actually is one of the big problems, you know, we both worked on search before and in many ways, I kind of liken the LLMs to be like a search engine type of problem where you have an open ended question coming at you and you provide an answer. You don't know if your search engine provided the right answer or not.

[00:13:08] Krishna Gade: And you rely on different ways to assess its performance. But now it's much more complicated because everyone is building a search engine in some ways, you know, now thousands of organizations. So how do you handle this? So, like, how do organizations approach evaluation and measurement, you know, uh, to make sure that their LLM experiments...

[00:13:26] Lukas Biewald: There's how organizations do do it and how they should do it, in my opinion, and, you know, I guess I'll tell you both here, right? So, you know, I've talked to a lot of people about this over the last couple of months. Um, and I've talked to, you know, some CEOs of large companies whose products you've probably used, who tell me that the way that they test their generative AI is literally by vibes, testing by vibes.

[00:13:51] Lukas Biewald: I've heard that more than once, actually, that line, like, testing by vibes. If you wouldn't do that with your software, you probably shouldn't do it, um, with, um, LLMs. Especially when you put it like that, it's kind of obvious, right? But, you know, it kind of points out that testing is hard, but when you set up a good testing infrastructure, you know, now you can systematically improve what you're doing.

[00:14:16] Lukas Biewald: So, you know, I was talking to, um, uh, I hope he doesn't mind, the CEO of, um, Casetext, which sold to, you know, Thomson Reuters, um, a few months ago. And actually this is, like, a really great example of, like, a legal use case. And he had a very, very sophisticated, um, you know, setup around testing at lots of different scales.

[00:14:36] Lukas Biewald: So, like, you know, kind of quick tests that people could use at the prompt engineering stage, and then sort of, like, longer and slower, more expensive tests that they would gradually do before they put, um, you know, something into production. I think that's actually a lot of the value that they, um, that they built there and a lot of why they, you know, sold for, I think, 650 million dollars, right?

[00:14:54] Lukas Biewald: It was like, you know, they set up a really good testing harness, and what that let them do was gradually improve the, um, the prompt engineering. And I think all of their stuff, all of their IP, was prompt engineering on top of, um, GPT, but they got it really good because they had really good evaluations.

[00:15:12] Lukas Biewald: That tells you right there, you know, you really should be thoughtful about the evaluation. And I think... You know, I think you want, like, kind of different scales of evaluation, and it really depends on the application, but, um, a very common pattern is to do the first pass of evaluation with an LLM itself.

[00:15:30] Lukas Biewald: And that might sound, you know, counterintuitive, that would work, right? Like, you know, asking GPT to evaluate GPT. But you know, in a lot of cases it does work. We've seen cases where that evaluation is really reliable and we've seen cases where it's really unreliable, at least compared to human, you know, evaluation.

[00:15:47] Lukas Biewald: If you can still call it the gold standard, maybe it won't be anymore. Um, but, um, but yeah, I think that really depends on the use case. But I think that the thing that's common across, like, all use cases is that you really should be taking evaluation seriously so that you know where things are improving.
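
One way to check whether an LLM grader is reliable for a given use case is to label a small sample by hand and measure how often the grader agrees. The sketch below computes raw agreement and Cohen's kappa (agreement corrected for chance) in plain Python. The label lists are made-up illustrative data, and the function names are assumptions for this example, not part of any product mentioned in the conversation.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of items where the LLM judge matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human_counts) | set(judge_counts)
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n) for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only: 1 = answer judged correct, 0 = judged incorrect.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]

print(agreement_rate(human_labels, judge_labels))
print(cohens_kappa(human_labels, judge_labels))
```

A low kappa on the hand-labeled sample would be a signal to keep humans in the loop for that use case rather than trusting the model-graded score.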

[00:16:06] Lukas Biewald: And I found that with prompt engineering. I don't know about you, Krishna, but, like, in my own, you know, limited experience with prompt engineering, I've been surprised at how prompt improvements that really seem like they should improve things make it worse. You know, and, um, you wouldn't, you know, know that if you didn't have, like, a reasonable set of data to test against.
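
The point about prompt changes that look like improvements but make things worse can be sketched as a tiny regression check: run the old and the new prompt over the same fixed test set and compare pass rates. Everything here is hypothetical. `fake_model` is a stand-in for a real model call, the substring check is a deliberately crude grader, and the test cases are invented for the example.

```python
from typing import Callable, Dict, Sequence, Tuple

TestCase = Tuple[str, str]  # (question, text the answer must contain)


def pass_rate(generate: Callable[[str, str], str], prompt: str,
              cases: Sequence[TestCase]) -> float:
    """Fraction of cases whose answer contains the expected text."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(
        1 for question, expected in cases
        if expected.lower() in generate(prompt, question).lower()
    )
    return passed / len(cases)


def compare_prompts(generate, baseline: str, candidate: str,
                    cases: Sequence[TestCase]) -> Dict[str, object]:
    """Score both prompts on the same cases and flag a regression."""
    base = pass_rate(generate, baseline, cases)
    cand = pass_rate(generate, candidate, cases)
    return {"baseline": base, "candidate": cand, "regressed": cand < base}


def fake_model(prompt: str, question: str) -> str:
    """Stand-in for a real model call; purely illustrative."""
    canned = {
        "How do I log a metric?": "Call the log function with a dict of metrics.",
        "How do I resume a run?": "Pass the run id when you initialize.",
    }
    answer = canned.get(question, "I don't know.")
    # Simulate a 'be brief' instruction that cuts off the useful detail.
    if "three words" in prompt:
        answer = " ".join(answer.split()[:3])
    return answer


CASES = [
    ("How do I log a metric?", "dict"),
    ("How do I resume a run?", "run id"),
]

report = compare_prompts(
    fake_model,
    baseline="Answer from the docs.",
    candidate="Answer from the docs in three words.",
    cases=CASES,
)
print(report)
```

In practice `fake_model` would be replaced by a call to whichever model is being tested, and the substring check by whatever grading the application needs.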

[00:16:26] Krishna Gade: Absolutely. I think the model graded evaluation is, uh, definitely something that we are seeing, uh, emerge, because I think it's literally not scalable, you know, to have, like, a human rating system available that can judge every response, every nuanced question. And so this is, uh, this is not a scalable thing.

[00:16:45] Krishna Gade: And, and I think that's, that's, that's the way forward it seems like. And, and, and then definitely prompt engineering is, uh, you know, engineering the prompts and supplying the context. You can navigate the LLM in base and you can mitigate some of the performance issues that you might see. That's something that we're also seeing.

[00:17:00] Krishna Gade: Great. I mean, I think, you know, for those who are listening, Lukas spent years building, you know, human evals with CrowdFlower and all, so if he's telling you evaluation is important, you all should take it very seriously in your organization. So this is a very important topic. So maybe, like, let's take a few questions from the audience as well, Lukas.

[00:17:20] Krishna Gade: So I guess, uh, you know, like, one of the things, you know, this is basically top of mind for everyone, related to evaluation. So many new LLMs are dropping every month. How should people pick an LLM to commit to and deploy in their production pipelines? Very realistic question. And I have this question myself, you know, what would you answer?

[00:17:40] Lukas Biewald: Well, I'll tell you what, you know, I go on Twitter and I see, like, you know, what gets the most retweets and I go with that one. The vibe. The vibes, yeah, the vibes. I mean, I guess, you know, I think that there's, like, a couple, um, you know, kind of ones that are sort of thought to be, you know, the best for a lot of, you know, applications.

[00:18:00] Lukas Biewald: Like, Llama 2 was that, um, you know, Llama 2 Chat, Llama 2 Instruct, and then, you know, Mistral, um, you know, came along and I think there's evidence that that's better. Um, you know, I think that this may change, like, day to day, which again is why you want, like, a good, you know, testing harness set up, but it's not that hard to swap these things in and out, right?

[00:18:21] Lukas Biewald: They have, like, basically the same API, all of them. I mean, you might want to change the prompt to match, um, the way they did the, um, RLHF, but, um, I mean, I think there's, like, a size of model that you care about, right? And that, like, you know, for some applications, the latency may matter there.

[00:18:43] Lukas Biewald: You know, the smaller models, um, will be easier to run, you know, in an embedded context, they'll be easier to run on your laptop, they'll be cheaper to run. You can run, you know, more of them if you care about, you know, throughput. Um, you know, the way to look at it is, like, what's the biggest, um, model that you can get running on one GPU, or what's the biggest model that you actually are able to physically run on your computer.
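
The "biggest model that fits on one GPU" question can be roughed out with arithmetic: weight memory is approximately the parameter count times the bytes per parameter at the chosen precision. The sketch below is a back-of-the-envelope estimate only. It counts weights and ignores activations and the KV cache, and the 80% headroom factor, the candidate sizes, and the 24 GB card are assumptions picked for illustration.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params_billion, precision):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params_billion * BYTES_PER_PARAM[precision]


def largest_fitting(sizes_billion, precision, gpu_gb, headroom=0.8):
    """Largest model whose weights fit in a fraction of GPU memory, or None.

    `headroom` reserves space for activations and the KV cache; 0.8 is an
    assumed rule of thumb, not a measured value.
    """
    budget = gpu_gb * headroom
    fitting = [s for s in sizes_billion
               if weight_memory_gb(s, precision) <= budget]
    return max(fitting) if fitting else None


candidate_sizes = [7, 13, 70]  # billions of parameters (illustrative)

for precision in ("fp16", "int4"):
    best = largest_fitting(candidate_sizes, precision, gpu_gb=24)
    print(precision, best)
```

Under these assumptions a 24 GB card holds the 7B model at 16-bit precision and the 13B model at 4-bit, which is the kind of trade-off that quantization and QLoRA-style fine tuning make available.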

[00:19:07] Lukas Biewald: Um, I think, like, once you've picked the size, I think then, you know, this is going to get out of date if you're, like, watching a recording of this. I think that the one that worked best for us, uh, for our applications recently was Mistral, which actually was an improvement over the Llama 2 model that we'd had, so we've kind of moved, you know, to that for our own internal stuff.

[00:19:26] Lukas Biewald: But, um, you know, I'm sure new stuff will come along. The thing is, though, it's actually not that hard to try a new one, you know, so I think it's okay to try different ones. I think, you know, Hugging Face can get a little complicated because there's just so many of these, um, you know, these models out there.

[00:19:43] Lukas Biewald: Um, but I think, I think Mistral would not be a bad choice right now.

[00:19:47] Krishna Gade: Yeah. Yeah. Yeah. I think, you know, it seems like there are two kinds of LLMs that are there for, like, these large use cases, like, you know, OpenAIs and Anthropics in one bucket and the Llamas and Mistrals in another bucket. And I think people are picking both in some cases for different use cases in their organization.

[00:20:03] Krishna Gade: So that's great. So I guess, like, you know, another fundamental question that people have is, like, okay, I want to get started on this LLM journey, but I want to collect my corpus, and, you know, collecting the corpus in an enterprise is a challenge. You know, how do I go about it? You know, any recommendations?

[00:20:20] Lukas Biewald: Well, it's, that's like, so that's so use case dependent, you know, it's like really hard to answer that. Um, you know,

[00:20:29] Lukas Biewald: generally I think it's

[00:20:31] Krishna Gade: more of a data, data, data, data problem. Yeah.

[00:20:34] Lukas Biewald: Yeah. I mean, one thing I guess I'll tell you, like for one application, um, you know, I asked GPT for it to generate some, um, you know, kind of relevant examples and that honestly was a pretty good start.

[00:20:47] Lukas Biewald: So, you know, I've actually heard of other people doing it. I did that myself. So, um, not a crazy, you know, kind of way to get your first examples. So, you know, you really want a human, like, in any product that you launch here, where the humans can give you feedback and then you can get more data to feed back into your model, if at all possible.

[00:21:05] Krishna Gade: So maybe a related question, you know, touching upon the problem of testing and evaluation, right? You know, for any test, like in software, you write unit tests and you create data sets for the testing software. And, you know, for this, you know, you may want to have a lot of test prompts, you know, for the questions that your chatbot needs to answer.

[00:21:21] Krishna Gade: How can customers go about creating large test sets? You know, do you think that's a problem that, you know, one needs to solve in terms of testing LLMs?

[00:21:31] Lukas Biewald: For sure. Um, and I think, like, you know, you may want, like, multiple test sets. I think, like, you know, I do think people get a little, well, I don't know,

[00:21:39] Lukas Biewald: daunted, right? Like, if you create, like, a hundred test cases, that will actually tell you if the model is getting a lot better or a lot worse. Right. So, like, you know, having, you know, some kind of sane, you know, sized thing to start with is, um, you know, much, much better than testing by vibes, which, you know, obviously is, like, the state of the art for a lot of people.
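
The claim that a hundred test cases is enough to see large changes can be checked with a rough confidence interval on the pass rate. The sketch below uses the simple normal approximation, which is only a rough guide at this sample size, and the pass counts are invented for illustration.

```python
import math


def pass_rate_interval(passed, total, z=1.96):
    """Normal-approximation interval (about 95% for z=1.96) for a pass rate."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    p = passed / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Illustrative results on a 100-case test set.
old_prompt = pass_rate_interval(70, 100)
small_gain = pass_rate_interval(73, 100)
big_gain = pass_rate_interval(90, 100)

print(old_prompt)
print(small_gain)
print(big_gain)
```

With 100 cases the interval is roughly plus or minus 9 points around a 70% pass rate, so a jump to 90% stands out clearly while a move to 73% is within the noise. That matches the idea that a set this size tells you about large improvements or regressions, not small ones.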

[00:22:02] Krishna Gade: Yeah, yeah, yeah. And so I guess this is also where I guess model graded evaluation could help, right? You could probably have models generate questions for you these days. Yeah.

[00:22:10] Lukas Biewald: Yeah, totally. Although, you know, I will say, like, um, this might be going back to my CrowdFlower days a little bit, but there is really, like, an anti-pattern where some people, like, don't look at what their model's doing, like, hardly at all.

[00:22:22] Lukas Biewald: You know, I think, like, I would just say, like, forcing yourself to look at a hundred cases of exactly what your model did is a really good idea. You'd be surprised, some people kind of skip that step. So labeling your own test cases is not, you know, the worst thing to do.

[00:22:37] Krishna Gade: Yeah, yeah. So, so maybe, uh, slightly, uh, you know, more, like, Weights & Biases related questions, right?

[00:22:42] Krishna Gade: You started the journey on experimentation with, you know, traditional ML and deep learning models. You know, how has Weights & Biases evolved? You know, what are some of the differences that you're noticing between experimentation with ML/DL and experimentation with LLMs?

[00:22:59] Lukas Biewald: Yeah, I mean, first of all, I mean, you know, just to brag a little bit, you know, we feel like Mistral was built with Weights & Biases, as was Llama 2, as was, you know, GPT.

[00:23:08] Lukas Biewald: So all these things were actually built using, um, Weights & Biases. Um, that said, it doesn't necessarily follow that, you know, if you're doing prompt engineering, you know, on one of these models, you want to use the same product, but we really want you to, right? So, you know, we've put in a lot of work to make, you know, something called Weights & Biases Prompts, right?

[00:23:28] Lukas Biewald: Which is a, you know, a prompt focused, um, you know, offering where you can, you know, see things like, when you, you know, chain together multiple prompts, you can, like, you know, kind of trace through it and see what happened, um, and, you know, see how, you know, your results compare to other situations. Um, also fine tuning has gotten super popular in the last couple months.

[00:23:50] Lukas Biewald: I mean, I think, like, you know, when Llama 2 came out, um, that was significantly better, I think, than the other big models that had been, you know, open sourced prior to that. And QLoRA made it a lot more feasible to fine tune models on your own hardware. Um, and so, you know, fine tuning is a lot like training a model, where you want a lot of the same metrics.

[00:24:13] Lukas Biewald: And so, um, at this point, I think fine tuning is the most common, um, use case of Weights & Biases. Yeah. I mean, the other thing that's interesting is, like, when we started Weights & Biases, you know, I think deep learning works super well for images and audio and actually biology and chemistry. And so, you know, the majority of our, um, use cases was more of this kind of rich

[00:24:36] Lukas Biewald: media stuff. And I think lately we're seeing a big shift more towards text because so many of these applications, like, suddenly work so well, um, on text. Although I think with the multimodal models, um, coming out, that may rotate, um, you know, back again. But I think, like, you know, bottom line is, like, a lot of new stuff works.

[00:24:53] Lukas Biewald: And so, you know, people do a lot more variety of stuff on top of Weights & Biases.

[00:24:58] Krishna Gade: Yeah, interesting. Yeah. I was playing with DALL·E 2 the other night to create a cartoon. And, you know, as you know, like it was not following my instructions that well. So it's pretty interesting. Awesome. So I guess like, you know, going into that hardware issue you touched upon, right?

[00:25:12] Krishna Gade: You know, uh, you know, there's like the large models and the small models today. And, you know, for ML teams in enterprise that may be GPU poor, you know, maybe they don't have access to the 100s. And what is a good way to get started on it and still deliver a differentiated user experience?

[00:25:29] Lukas Biewald: Are you talking about like using an LLM?

[00:25:30] Krishna Gade: Like, yeah, using LLMs or, you know, getting LLMs deployed. You know, do you think like, you know, if I don't have access to GPUs, I could still go ahead with Mistral or Llama 2 and productize something in my, in my, you know, in my workflow?

[00:25:42] Lukas Biewald: Oh yeah, definitely. Um, I mean, you know, you can, um, I mean, there's some great, um, open source projects out there like, um, llama.cpp.

[00:25:51] Lukas Biewald: Um, and, you know, like, again, if you really, really care about latency, this could be... a challenge for you. But, um, you know, I have no problem running, um, I can't run the 70 billion parameter model, but the smaller ones that are still enormous, right? Like, you know, 13 billion parameters, 6 billion parameters.

[00:26:10] Lukas Biewald: Those, um, those run fine on my laptop and on my, um, on my desktop. And, um, yeah, I don't, I think it's a, I think people are a little, I feel like they're a little overly focused on how hard it is to deploy these models, but they run fine on modern hardware.

[00:26:29] Krishna Gade: Awesome. Great. I think, you know, we are coming up on the clock.

[00:26:35] Krishna Gade: You know, I think, you know, this is a great conversation, Lukas. As you all have heard from Lukas, you know, get started on this journey. Start small, iterate your way towards the top and, you know, evaluation, don't forget about it. You know, try to make sure that you're testing your LLMs and, you know, and sort of, yeah.

[00:26:53] Krishna Gade: You know, I think this is a great opportunity for all of us to build some amazing apps. Um, thank you so much, Lukas, for spending time with us. Uh, you know, uh, and, um, yeah, you know, we'll share this, uh, with you later and, um, good luck with everything that's going on at Weights & Biases. Awesome.

[00:27:10] Lukas Biewald: Thanks, Krishna. Great chat.

[00:27:12] Kirti Dewan: Thank you. Thanks, Krishna. Thanks, Lukas. That was a jam packed session. Love the lots and lots of easy nuggets that we have, especially the how do organizations, how they actually do it and how they should do it. Testing with vibes. I think, Lukas, that's going to become a term that many people are going to start using now, uh, when they...

[00:27:33] Kirti Dewan: When they're training their sales teams or those who are external facing, hey, watch out for that. And then the FOMO and the FOMU, uh, is so cool as well. And then as Krishna said, plus one to the take baby steps and try all the models out there, and try the different methods across the different GPT models.

[00:27:50] Kirti Dewan: So really nice learnings. We covered a lot in, uh, 30 minutes.