AI Forward - The Science and Engineering Behind LLMs

In this keynote presentation, Hanlin Tang discusses the science and engineering behind LLMs in enterprise applications. He highlights the choice between buying LLM services from external providers and building custom models, emphasizing that specialized domain-specific models can provide unique advantages for enterprises. Tang also addresses best practices for data selection, cost-effectiveness, and production deployment of LLMs, and mentions the importance of a unified platform for handling various aspects of model production and governance within enterprises.

Key takeaways

  • Diverse Models in Enterprise: Enterprises are moving beyond the idea of a single dominant large language model (LLM) and are exploring a variety of specialized models for specific use cases and domains.
  • Buy vs. Build Decision: Both buying and building models have their advantages, depending on factors like complexity, cost constraints, and data ownership. Enterprises should decide whether to buy an external API or build their own models based on their specific use cases.
  • Cost-Effective Model Training: There’s a misconception that training large language models is prohibitively expensive; in practice, training is becoming more accessible and cost-effective, with examples of models being trained from scratch relatively inexpensively.
  • Increasing Value in Specialized Models: Smaller specialized models with 3 to 13 billion parameters are gaining importance for business applications, and can be more practical and cost-effective than extremely large models. The choice of model size should align with the use case and cost considerations.
  • Data Selection Best Practices: Data selection plays a crucial role in model training. Depending on the application, data can be fine-tuned, mixed with proprietary data, or curated to ensure domain relevance.
  • Standardization of LLMOps: Beyond model development, enterprises face challenges in deploying models into production, monitoring, data storage, and governance, which require unified platforms and tooling for successful AI solutions.

Speaker: Hanlin Tang - Co-founder of MosaicML and CTO of Neural Networks at Databricks

Video transcript

[00:00:04] Kirti Dewan: Welcome to our keynote on the science and engineering behind LLMs in the enterprise by Hanlin Tang. Hanlin is the CTO of Neural Networks at Databricks. Prior to that, he was the CTO and co-founder of MosaicML, which was acquired by Databricks in July of this year. Hanlin, thanks so much for being here today.

[00:00:25] The stage is now yours.

[00:00:26] Hanlin Tang: Hi, good morning, good afternoon, or wherever folks are. It's great to be here, um, and thanks a lot, uh, to Kirti for the, for the nice introduction. Um, as I said, my name is Hanlin Tang, uh, formerly CTO and co-founder of MosaicML, now very excited to be, uh, here, part of Databricks. Uh, let me share my screen really quickly, uh, one second here.

[00:00:56] All right. Yeah. So what I had prepared for today's keynote is a short discussion on the science and engineering behind LLMs in the enterprise. And what I mean by that really is: how do we see enterprises today approaching, uh, the problems, uh, in the LLM space? What are the best practices that, uh, people are using, uh, to successfully deploy these new technologies in production?

[00:01:21] And how do people navigate the choices of different technologies to use when determining what's the best for your specific use case? I think when, you know, ChatGPT first burst onto the world's consciousness, you know, a little less than a year ago, uh, the first sentiment was that there would be a few large models owned by a few players that would essentially capture the market.

[00:01:48] And there would be these billion parameter models that everyone would access behind closed APIs. I think what's been emerging that's been very exciting, that we've been privileged to be able to participate in over the last six to nine months, is the understanding that actually, um, it's not just one large model that will rule them all. There's going to be a Cambrian explosion of many different models for very specific use cases and specific domains, powered by very specific proprietary data, uh, that individual customers, uh, may, may have.

[00:02:24] And it's not just us. This is, um, a blog post from a few luminaries in the field, uh, arguing the same, uh, that we see a world... where there are these specialized AI systems that will solve very specific problems extremely efficiently. And for a lot of companies, including many of the enterprises participating here today, uh, there will be lots of unique data, um, that can become a clear advantage when training their own models.

[00:02:52] And so we believe in a world where for some applications, companies will buy, so use an external API like OpenAI or Cohere, many of the excellent models that are out there. But for other applications, they will also be empowered to build their own models, either fine-tuning from an open source model or pre-training your own model from scratch.

[00:03:16] Let me just check in. Can folks hear me now? Hopefully. I'll continue on. Definitely ping the chat if you're unable to hear me. We can hear you fine, Hanlin. Okay, fantastic. So as I said, there's a role for people to both buy and build. And at Mosaic, and now at Databricks, we want to help, uh, both sides of the coin.

[00:03:35] I think a very common thing that we see in the enterprise is, well, how do you navigate between those two? If you have a specific use case or application, when do you buy and when do you build? So I wanted to briefly talk through some of the pros and cons there. Well, the way to approach this problem is really around taking the right approach for each use case.

[00:03:54] Navigating: is this a very complex reasoning task? Are you extremely cost constrained? In that case, maybe building and serving your own may make more sense. Or are you working in a very tight latency environment, such as a code IDE assistant model, uh, where you may need to have very tight controls, so you might want a really, uh, very small model to be successful.

[00:04:16] And there are strategic questions about data ownership, vendor lock-in, and other sorts of challenges. So, buy is actually very well known, right? There are these providers out there that provide very powerful models. We just saw the OpenAI Dev Day earlier, um, this week. Uh, very fast and easy to prototype and iterate.

[00:04:35] On the other hand, uh, there are some cons as well. Um, these APIs can sometimes be very expensive to scale out, uh, in production. Uh, you have the challenge of not being able to own the model and take it to whoever you want in order to run inference. There are questions of potential reliability and scale you have to reason about.

[00:04:52] And then also, of course, data privacy concerns, uh, with sometimes using these API-based providers. On the build side, there are also pros and cons. Many reasons we've seen successful enterprises build their own models are, for example, uh, data provenance reasons. If you're in the finance space, maybe you want a model that hasn't been trained on, say, Reddit's WallStreetBets, right?

[00:05:15] You want to have full control over the exact training data that went into that model, either for regulatory reasons, for reducing hallucination, uh, or, or toxicity. Sometimes it's more cost effective, because for some applications, you just don't need a very large, powerful model. You're trying to solve a very specific business problem, and so training a smaller but equally capable model for your use case can be more cost effective to serve.

[00:05:42] Your data is your IP. If you're infusing this model with your own data, and importantly, potentially with your customers' data, having very tight control of those model weights can be very important. And then for very specific domains, such as coding, genomics, healthcare, we've seen more domain-specific models become quite powerful across many of these applications.

[00:06:08] And then lastly, uh, from a model ownership standpoint, you know, some customers, uh, with mission-critical applications, they want to be able to own the weights. They want to be able to deploy this model across many different regions for low latency. They want to be able to move this model across different inference providers, depending on what's the best technology at the time of the day, as opposed to, um, tying yourself, you know, to a particular, uh, provider that's, that's out there.

[00:06:34] And so these are the many reasons why someone may want to build their own model depending on their use case. Um, and it really centers around your data and wanting to use that to your own advantage. But what I hear a lot is, okay, that's great, but man, building is hard. And expensive, right? Um, there are some news articles out there saying that training GPT-3 costs $5 million to $15 million.

[00:06:59] Some of these models, you know, are training on tens of thousands, maybe even more, um, of GPUs. Um, this doesn't seem very, very tractable to do. I think the reality that we've seen, and it's very exciting to see the tooling emerge over the last, you know, six to nine months, is that training LLMs now is actually quite accessible.

[00:07:20] Um, just as an example, um, uh, Replit, uh, which is one of our customers, uh, they trained a three billion parameter, uh, code completion model from scratch. Um, it cost two engineers about $100,000 and a few days to train, uh, and by training this model completely from scratch and incorporating some of their own data, uh, they're able to meet a lot of the, uh, needs of their, of their customers, um, but in a much faster, lower latency, um, scenario.

[00:07:51] We have academic labs training large open source models like OLMo, uh, which, when done, will be very powerful, and a lot of people are using these open source models, like the LLaMA models, MPT models, etc., to then fine-tune models for specific use cases. Stanford has trained biomedical-specific large language models.

[00:08:11] Some of our customers have been able to easily train these models and deploy them into production and see the ROI, increasing the accuracy of the model or reducing time to solution. And training these models is not a multi-million dollar, um, uh, adventure. Um, now, with the tooling that's available from us and other people in the market, um, building your own model is as powerful as, as ever, which is very exciting to see, um, all these very domain-specific models starting to emerge into, into the market.

[00:08:42] The other thing, you know, I think early on it was sort of, oh, you know, GPT-3.5 or GPT-3 was 175 billion parameters. We're going to get trillion parameter models, 2 trillion, 5 trillion, 10 trillion parameter models. I think what's actually emerged over the last year is the understanding that smaller specialized models have very strong business use cases.

[00:09:04] And why do I say that? It's because for many business applications, you don't actually need a large model that can, say, write poems in a particular writer's style or translate a drawing into HTML code. You're trying to solve a very specific information retrieval problem, customer support problem,

[00:09:29] search problem, conversational problem that's very narrowly scoped to your products. And in those scenarios, we've seen a sweet spot within enterprises of the 3 to 13 billion parameter range, balancing the complexity of your use case against the cost of inference. Because sometimes, when you start deploying across tens of millions of users, these very, very large models no longer become viable at scale.

[00:09:59] And even if you were to train your own large models, it's not actually that expensive. We put out a blog last September, actually, pre-ChatGPT, uh, characterizing the training costs, uh, for different, uh, sizes of models. And you can see, you know, from 1 billion to 30 billion parameter models, these costs are less than $500,000.

[00:10:19] Many of these are within the range of even just doing an enterprise POC, right? Training a 7 billion parameter model from scratch, and this is to Chinchilla scale, um, costs around $30,000, and it can satisfy many, uh, many business use cases. And you can get to GPT-3 quality with around half a million dollars.

[00:10:37] Now, just to be clear, this is GPT-3 quality, not 3.5 Turbo, not ChatGPT. Those certainly require additional amounts of training. But for a lot of base use cases, this is actually very usable, and this allows you to really own the model, determine the training dataset mix, and move into production a lot cheaper on the inference side.
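
As a rough back-of-envelope check on figures like the $30,000 quoted above (the constants below are illustrative assumptions, not Mosaic's actual accounting): the common Chinchilla heuristic of roughly 20 training tokens per parameter, the standard ~6·N·D estimate of training FLOPs, and an assumed A100 at ~40% utilization and ~$2 per GPU-hour land in the same ballpark.

```python
# Back-of-envelope training cost for a 7B-parameter model at Chinchilla scale.
# All constants are illustrative assumptions, not quoted vendor pricing.
params = 7e9                      # model size: 7 billion parameters
tokens = 20 * params              # Chinchilla heuristic: ~20 tokens per parameter (~140B tokens)
flops = 6 * params * tokens       # standard ~6*N*D estimate of total training FLOPs

a100_peak = 312e12                # A100 BF16 peak throughput, FLOPs per second
utilization = 0.40                # assumed model FLOPs utilization (MFU)
gpu_hours = flops / (a100_peak * utilization) / 3600

price_per_gpu_hour = 2.00         # assumed cloud price in USD
print(f"GPU-hours: {gpu_hours:,.0f}")                             # ~13,000 A100-hours
print(f"estimated cost: ${gpu_hours * price_per_gpu_hour:,.0f}")  # ~$26,000
```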

[00:11:02] And just to demonstrate this, you know, we used our own stack to train foundational models, MPT-7B and MPT-30B, that were released earlier this year. Um, and, uh, these are, uh, very powerful models. Uh, I think the 7B model, I haven't checked in a while, so it may be out of date given how quickly the, the field moves, uh, but when I last checked, it was the most downloaded large language model on Hugging Face.

[00:11:26] And these are very accessible models to even train from scratch.

[00:11:32] I think probably one of the largest and most important questions we get from enterprise customers is, okay, that's great, but what are the best practices around data selection? Why would I want to train my own model from scratch versus taking an open source model and continuing to train, um, uh, some of these, uh, some of these models, right?

[00:11:54] Um, and here I am actually showing kind of the training data mix we used to train a 7 billion parameter model. Uh, and you can see it's actually a very fine-tuned, uh, fine-tuned is the overloaded word these days, a very curated, uh, proportioned selection of datasets that we selected to include in MPT-7B.

[00:12:14] Uh, we elected to focus only on English. Uh, we wanted it to be a little smarter at long-form, um, reasoning, so we included Semantic Scholar, which is this amazing, publicly available dataset of academic papers, uh, curated by our great friends at AI2.

[00:12:33] And I think what we see for data selection is: depending on what kind of end application you want to do, you go about your data selection very differently. Some customers want to pre-train their own models from scratch because they're operating in a very specific domain like finance. And they essentially want the whole trillion tokens being used to just completely be finance related, right?

[00:12:54] They don't want to, uh, have to deal with potential hallucinations and issues with other types of datasets entering this, this model. And so many of the best, uh, use cases we've seen balance about 50 percent public data, uh, so taking Common Crawl and filtering it to only include finance-specific articles and datasets and information, and then connecting that with your own proprietary data to really define this domain-specific large language model.
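
To make that 50/50 idea concrete, here is a minimal sketch of the kind of pipeline involved: filter a public corpus down to the domain with a crude keyword gate (real pipelines would typically use a trained domain classifier plus deduplication and quality filtering), then interleave the result roughly evenly with proprietary documents. The keyword list and proportions are illustrative assumptions.

```python
import random

FINANCE_TERMS = ("earnings", "equity", "bond", "derivative", "balance sheet")  # toy keyword list

def is_finance(doc: str, min_hits: int = 2) -> bool:
    """Crude domain filter; a production pipeline would use a trained classifier."""
    text = doc.lower()
    return sum(term in text for term in FINANCE_TERMS) >= min_hits

def build_mix(public_docs, proprietary_docs, public_share=0.5, seed=0):
    """Filter public data to the domain, then mix it ~50/50 with proprietary data."""
    rng = random.Random(seed)
    public_iter = iter([d for d in public_docs if is_finance(d)])
    proprietary_iter = iter(proprietary_docs)
    mix = []
    while True:
        source = public_iter if rng.random() < public_share else proprietary_iter
        try:
            mix.append(next(source))
        except StopIteration:   # one side ran out; stop here for simplicity
            return mix
```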

[00:13:21] We see the same thing on the language side, training a model that's sort of like 50 percent, uh, English, 50 percent Korean, uh, as an example of one enterprise customer that's serving the Korean market. And so this is really, uh, a good way to filter the data, to really tightly control the model. And that actually saves a lot of downstream pain in, uh, fine-tuning and alignment and RLHF.

[00:13:44] Um, because some of the toxic content just never entered the model itself. You can't take a model that's been trained on, say, a lot of health data and ask it how to make a bomb, because that's never been included in the, in the training dataset. The other common tactic we see is, hey, let's take an open source large language model and let's continue training it on my own proprietary internal data.

[00:14:06] And this can be very useful for downstream RAG applications, because you want to teach a large language model the jargon and specific context of your company. The acronyms you use in your operational manuals, the terms that may be very specific to your domain. Every company with a long history, um, has these developed, and that can make these models a lot more powerful at understanding and interpreting the data that's being fed in from a vector store somewhere inside a RAG application.

[00:14:38] And so here, this is really a mix of, say, like your internal Wikipedia pages, user manuals, user-facing documentation, code comments, we've seen some customers include in here as well, to really make this open source language model start to speak the language of your company. And the third, you know, is instruction fine-tuning, where you want to guide the LLM's behavior to a very specific task.

[00:15:03] But also we see best practices here in using it to adopt the voice of your company. Maybe you want this large language model, in conversation with your customers, to speak with a certain diction in a very particular way. Then you would use instruction fine-tuning, which is a combination of prompts and queries, to then guide the model.

[00:15:22] And what's interesting is we learned, over all of our science and experimentation, that the quality of the data required increases as you go across these three different levels. So pre-training models from scratch, the quality of the data is somewhat important, but not super important, because there are just so many tokens being pushed through the model.

[00:15:42] On the other hand, in instruction fine-tuning, the quality of those examples is super important, because the model will pick up on every single nuance there, and so filtering and cleaning up the examples for instruction fine-tuning has a huge impact on the quality of the generated models when it comes to downstream deployment.
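
Since instruction fine-tuning is where data quality matters most, much of the work is unglamorous filtering before any training happens. Below is a minimal sketch of the kind of checks involved; the thresholds, field names, and example records are hypothetical, not a prescribed recipe.

```python
def keep_example(example: dict) -> bool:
    """Hypothetical quality gate for one {'prompt': ..., 'response': ...} record."""
    prompt = example.get("prompt", "").strip()
    response = example.get("response", "").strip()
    if not prompt or not response:
        return False                                   # drop empty or truncated pairs
    if len(response.split()) < 3:
        return False                                   # drop trivial one-word answers
    if response.lower() in {"i don't know", "n/a"}:
        return False                                   # drop non-answers the model would imitate
    return prompt != response                          # drop degenerate copies

def dedupe(examples):
    """Exact-match dedup; real pipelines often add near-duplicate detection."""
    seen, out = set(), []
    for ex in examples:
        key = (ex["prompt"].strip(), ex["response"].strip())
        if key not in seen:
            seen.add(key)
            out.append(ex)
    return out

raw_examples = [                                       # toy records standing in for a real dataset
    {"prompt": "Summarize our refund policy.", "response": "Refunds are issued within 14 days of purchase."},
    {"prompt": "Summarize our refund policy.", "response": "Refunds are issued within 14 days of purchase."},
    {"prompt": "What is the uptime SLA?", "response": "n/a"},
]
cleaned = dedupe([ex for ex in raw_examples if keep_example(ex)])   # keeps only the first record
```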

[00:16:03] The other piece of best practices we've seen is that good data science is based on good infra. I mean, working with these large models can be very difficult because of the underlying GPUs and dealing with CUDA drivers and InfiniBand drivers and multi-node scaling and all the systems infrastructure challenges.

[00:16:22] You know, we put out many different blogs about ways that we've solved this problem here on Mosaic and now at Databricks. I encourage folks to go out and check those out. But that gives you superpowers, such as when I talk about different data mixtures for fine tuning, you can easily pick different streams of data.

[00:16:39] I want 30 percent from C4, 5 percent of code or, you know, 5 percent of markdown, 33 percent from this multilingual dataset, and we can easily mux these on the fly, so you can easily experiment with different dataset mixtures without having to, like, recreate and reformat the dataset all the time. Or to do live, complex evaluations during training: if you wanted the model to answer financial knowledge Q&A questions, you know, you can see how that evolves over the course of training.
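
The mixing tooling being described here is Mosaic's streaming data infrastructure; as a library-agnostic illustration of the idea, here is a tiny weighted multiplexer over dataset iterators. The stream names, weights, and toy documents are made up, and a real implementation would work over tokenized shards rather than Python lists.

```python
import random

def mux(streams: dict, weights: dict, seed: int = 0):
    """Yield samples from several dataset iterators according to target proportions,
    so the mixture can change without rewriting the underlying datasets."""
    rng = random.Random(seed)
    names = list(streams)
    probs = [weights[n] for n in names]        # random.choices normalizes the weights
    iters = {n: iter(streams[n]) for n in names}
    while iters:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:                  # a stream ran dry; drop it and keep going
            idx = names.index(name)
            names.pop(idx)
            probs.pop(idx)
            iters.pop(name)

# Toy stand-ins for real shards, echoing the proportions mentioned in the talk.
c4_docs = ["c4 doc"] * 6
md_docs = ["markdown doc"] * 1
mc4_docs = ["multilingual doc"] * 7
mixture = mux(
    streams={"c4": c4_docs, "markdown": md_docs, "multilingual": mc4_docs},
    weights={"c4": 0.30, "markdown": 0.05, "multilingual": 0.33},
)
for source, doc in mixture:
    print(source, doc)
```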

[00:17:06] And also ways to train models from scratch where we handle all the system infrastructure challenges. The industry is starting to develop all this awesome tooling, us included, to give data scientists like yourselves the great infrastructure to then go in and solve these problems. I'll close by saying that, um, a lot of the challenge we see moving into production is: oh, okay, great, I have a model.

[00:17:32] Either I fine-tuned it, I used an open source model, or I have an endpoint from OpenAI that I'm calling. That's fantastic. How do I get this thing into production? And many times it's about more than just the model that we see being the blocker to getting stuff into production. It's: how do I monitor it? How do I store the results of the output of these models

[00:17:52] so I can do, like, downstream backtesting and toxicity filtering and correction? How do I use this data to further improve my model? How do I use the human feedback and record it, um, to improve the model? You know, over time, how do I detect drift in the model? How do I govern all the data that's being captured and make sure that models are not leaked, that, like, datasets from customers are not inadvertently shared with other customers?

[00:18:15] That models that are trained for very specific customers are not inadvertently used and applied in other scenarios? How do I take care of all of that? And so we've learned that it's, yeah, it's really more than just the model. It's everything beyond that. And that's where we see a lot of companies get stuck

[00:18:32] moving into production, because they've spent a lot of work on the model, on the endpoint, on the prompting piece of the house, um, but then when they try to move into production, all these other roadblocks start, start emerging. Um, that's really one of the principal reasons why we were really excited, uh, at Mosaic about joining Databricks, is that Databricks does have this unified platform and tooling, uh, to handle all those pieces from the serving to the monitoring to the improvement, um, of the model.

[00:19:00] All kind of unified under a single data source. And that's really changed the game internally about even how we do our own model development. I would say like our data filtering and pre processing, pre acquisition and post acquisition has probably sped up by, you know, 5 to 10x in some cases. And we've seen a lot of enterprises be able to unblock themselves.

[00:19:23] by using tooling like, you know, the Databricks Lakehouse or other types of tooling that offer this kind of end to end journey, all based on really governed and unified data. And that really helped close that gap between going from model into production. Along those lines, you know, we're really excited.

[00:19:40] We jointly put out this blog on the integration between Fiddler and Databricks to help with a lot of this as well. You know, at Databricks, we want to integrate with as many different tools, uh, as much as possible. Um, and so I really encourage you, I put a shortened link, uh, a Bitly link, here, uh, to the blog, uh, to check out how we're accelerating, you know, production of these AI solutions by integrating Databricks, uh, with, with Fiddler as well.

[00:20:10] And so, as I said, we're really excited at Mosaic to now be part of Databricks. Uh, look forward to, you know, the future, uh, for some amazing announcements about how we're going to be, uh, working together and integrating our stack into the Databricks platform. Um, and so thanks a lot for, again, for the, for the invitation to, to present here today.

[00:20:29] I think, uh, there's about 10 minutes left, I was told, to reserve, uh, for questions. Uh, so we are, we are right on time. Uh, so let me go ahead. If you have any questions, you know, please feel free to pop them into the Q&A, uh, and, uh, I will, uh, answer these, uh, one by one. Uh, so the first great question is: do these costs include instruction tuning as well?

[00:20:51] Uh, they do not. Uh, fortunately, the cost of instruction fine-tuning is a fraction of the cost of, uh, pre-training the model. So when we trained MPT-7B, it was around $200,000 to train from scratch. Uh, we also released, uh, StoryWriter, like a 64k context length version, which was the first of its kind. And fine-tuning that particular model, even though it was a long context model, was only a few thousand dollars.

[00:21:16] Uh, so the costs there are relatively minimal. Uh, one question was: what does it mean to be less than one epoch? Does it mean you didn't use all of your data? This is looking at the best practices data selection slide. Uh, yeah, we sometimes don't use all the data, because some of the data is just too much to train the model through.

[00:21:37] So for example, right, mC4 here has 2.4 trillion tokens of data. We're only using 330 billion tokens of it. Uh, these things can be very compute intensive, and you start getting diminishing returns as you pour more and more tokens of data in, so oftentimes you may not necessarily want to use all of the data.

[00:21:58] Um, I'll switch to a different question. RAG is the popular use case nowadays. How do fine-tuning and prompting models, uh, interact with RAG? Great question. So, um, we've seen in many cases that fine-tuning models will actually help the downstream RAG application. Um, examples of this are, as I mentioned, kind of teaching it the jargon and specific context, uh, of your data, which can help the generation model better understand sort of what

[00:22:29] the retrieved material actually means. So that's one example where you might fine-tune a model, an open source model, and then use it inside a RAG application.
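
As a sketch of how a fine-tuned generator slots into a RAG pipeline: retrieve supporting passages, then ask the model to answer grounded only in that context. The retriever.search and generator.generate calls below are assumed interfaces standing in for whatever vector store and model serving stack you use, not a specific library's API.

```python
def answer_with_rag(question: str, retriever, generator, k: int = 4) -> str:
    """Retrieve top-k passages, then have the (fine-tuned) generator answer from them."""
    passages = retriever.search(question, top_k=k)          # assumed retriever interface
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generator.generate(prompt)                        # assumed generation interface
```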

[00:22:51] Another question I have here is: I am a student, and for students, what is the recommendation to train in a cost-effective manner while utilizing GPUs? Great question. I would strongly encourage you not to, kind of, DIY your own trainer code. There are a lot of great frameworks out there: PyTorch Lightning; there's one that we've built in the open source called Composer, which is very optimized for training large language models and at-scale training.

[00:23:12] Fast.ai has, like, an amazing layer on top of PyTorch. Please use those, because they have mixed in a lot of the best practices for doing really good scaling and effective use of GPUs, making sure your tensors are all moved to the GPU, and solving a lot of the data loader bottlenecks.
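
To make the "don't DIY your trainer" point concrete, here is a rough sketch of handing the training loop to Composer; the argument names are from memory of its documented Trainer and HuggingFaceModel wrappers, so treat them as assumptions and check the current Composer docs, and train_dataset is a placeholder for your own tokenized dataset.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from composer import Trainer
from composer.models import HuggingFaceModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # small model purely for illustration
hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
model = HuggingFaceModel(hf_model, tokenizer=tokenizer)  # Composer wrapper around an HF model

train_loader = DataLoader(train_dataset, batch_size=8)   # train_dataset: your tokenized dataset

trainer = Trainer(
    model=model,
    train_dataloader=train_loader,
    max_duration="1ep",                                  # one pass over the data
    optimizers=torch.optim.AdamW(model.parameters(), lr=1e-5),
    device="gpu" if torch.cuda.is_available() else "cpu",
)
trainer.fit()                                            # the framework handles the loop, scaling, etc.
```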

[00:23:34] When do you decide to pre-train a model from scratch versus fine-tune from open source models? Great question. So a lot of our customers follow this journey of, hey, let's start with an open source model and fine-tune it. And let's see if this can meet the quality of what I need in production, um, and reduce our costs. And if that returns positive, then they might start asking, okay, well, that was a LLaMA 13B model, fine-tuned.

[00:23:54] Can I cut my costs even more by pre-training a 3B model from scratch? And can that meet the same quality target? If so, I've then cut my inference costs in half, which is super important, and also the latency in half, for a much better customer experience. And so a lot of our customers will go through this progression to best understand what works.

[00:24:15] Um, let's see. Uh, what do you think of hallucination? Still a massive challenge. RAG is a great way to reduce hallucination, because you can ground the generation in real data that's retrieved from your own vector stores. Um, we'll find out over the next year, but our hypothesis also is that domain-specific models can also reduce hallucination, because they haven't seen a lot of stuff out there.

[00:24:40] It's very narrowly scoped to the domain that you're actually, um, interested in. Um, let's see, um, training an existing model using enterprise data, how can I control spillage into existing trained data? Um, so the, the way our customers approach this is they will mix, uh, their own enterprise data with public data into one large dataset to then train this model on.

[00:25:07] And then during inference generation, you just want to make sure that it's not, uh, you know, leaking any of your internal enterprise data. You might also want to filter the data for PII or sensitive information, uh, in order to reduce that, that concern as well. Um, can a model trained on Mosaic live outside of Mosaic?

[00:25:26] Can we export the model and run inference outside the Mosaic ecosystem? Yes, absolutely. This is one of the great benefits of building your own model, either within Mosaic or with other tools or with Databricks: you own the model weights. It is a PyTorch checkpoint that lives inside a datastore that you control, and then you can take it and do whatever you want.

[00:25:45] You can go deploy it on a laptop, you know, for all your, uh, engineers that are out in the field, uh, servicing, you know, solving internet issues; you can put it anywhere, anywhere you would like. Um, someone asks, can you do deep learning inference on CPUs, even if you don't have access to expensive GPUs?

[00:26:03] Uh, you can. In a prior life, I ran the Intel AI Lab at Intel Corporation and spent a lot of time, um, optimizing inference on CPUs, uh, so it's certainly possible. I would say that, uh, from a price-performance standpoint, at Mosaic and Databricks, we usually see GPUs being still the most popular and also the most cost-efficient.

[00:26:24] Um, but if you don't have a lot of queries or aren't distributing at large scale, then CPUs may still make sense. And then for the really small models, you know, a lot of laptops these days have an integrated GPU or compute engine inside, and you can do effective inference there as well.
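
Since the exported model is just a standard checkpoint you own, loading it and running it locally, even on CPU, is only a few lines. The sketch below assumes the checkpoint was exported in Hugging Face format; the path and prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "/path/to/your/exported-model"   # placeholder: wherever you exported the weights

tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir, torch_dtype=torch.float32)
model.to("cpu").eval()                            # no GPU required; fine for small models / low QPS

inputs = tokenizer("Summarize today's support tickets:", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```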

[00:26:43] Um, how can we use internal data for training a model? Do we just pump it directly into the model for training? Does it need to be formatted in some fashion for fine-tuning? Good question. So in the continued, uh, training section that I, that I have here on this slide, uh, you really just pass it straight to the model and ask it to predict the next token.

[00:27:03] For fine tuning, depending on what type, if you want to do instruction fine tuning, then you do have to structure it into a prompt and query. So, for example, if you have a long history of, say, customer conversations over the last 10 years, right, you can mold these into the right format that the model expects for instruction fine tuning.
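
As a concrete sketch of the two formats just described: raw internal text packed into fixed-length sequences for plain next-token prediction, versus historical conversations molded into explicit prompt/response records. The tokenizer is assumed to be Hugging Face-style, and the field names and template are hypothetical; match whatever format your fine-tuning stack actually expects.

```python
# 1) Continued (pre-)training: concatenate raw internal documents into fixed-length
#    token sequences; the training objective is plain next-token prediction.
def pack_documents(docs, tokenizer, seq_len=2048):
    ids = []
    for doc in docs:
        ids.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])  # assumes an HF-style tokenizer
    # Drop the ragged tail for simplicity; real pipelines often carry it into the next shard.
    return [ids[i:i + seq_len] for i in range(0, len(ids) - seq_len + 1, seq_len)]

# 2) Instruction fine-tuning: mold, e.g., past customer conversations into
#    prompt/response pairs in whatever template your stack expects.
def conversation_to_example(conversation):
    customer = " ".join(t["text"] for t in conversation if t["speaker"] == "customer")
    agent = " ".join(t["text"] for t in conversation if t["speaker"] == "agent")
    return {
        "prompt": f"You are a support agent for our company.\nCustomer: {customer}\nAgent:",
        "response": agent,
    }
```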

[00:27:22] Moving on here, wow, a lot of questions. Thanks for sending all these amazing questions here. I will try to walk through as many of these as possible. I might sub select a few and maybe I'll answer some offline later as well. Let's see, uh, for enterprise specific tasks, does instruction tuning data require a mix of general tasks like summarization, data extraction, enterprise specific demonstrations?

[00:27:48] For optimal, uh, quality, uh, yes. That would be, that would be ideal. Um, it really depends on the end use case that you want, uh, this specific task in your enterprise to do. For summarization, yes, you want to start teaching, well, how does my company generally summarize things? Do you want them to be summarized in bullet points?

[00:28:08] Paragraphs? Uh, what key points should a summarizer, a good summarizer, pay attention to? Um, that's really meaningful. How does it align with the values, uh, of your, of your company? Uh, whereas in, in other cases, yes, you may want to do more conversational, uh, instruction fine-tuning. Um, let's see here, uh, you mentioned governance briefly.

[00:28:29] What does that mean for the enterprise? Great question. And this might be the last question I have time to take, uh, given the time that's, that's left, uh, here. I'm happy to answer more offline. What does this mean for the enterprise? So one of the key challenges within enterprises is, okay, you have all this data sitting in a data warehouse somewhere.

[00:28:46] How do you govern it? Meaning, how do you do access control, right? Some of this is super secret internal financial data. Some of this is customer data that you want to keep segregated and only accessible to, say, customer support specialists. Some of this can be more general purpose data within the company that you want to be more broadly available.

[00:29:05] Sometimes in finance, you have buy side and sell side, and buy side and sell side cannot see each other's data for conflict of interest reasons. Maybe you want to segregate some really specific NDA data from data that your sellers can see. And so to govern all of this, you really do need a governance platform to easily track and manage and do the access control needed for the data.

[00:29:28] And because models are trained on data, you then also want to have governance on the downstream models as well. Okay, I think that's all the time I have. Thanks a lot for the invite. Uh, very excited to, to be here today. And, uh, I will try to get through some of these questions offline as well.

[00:29:47] Kirti Dewan: Thanks so much, Hanlin.

That was, that was really awesome. And that was, uh, that was a great set of questions flowing in, and we'll try to get them to you so that you can answer them offline as well. Um, that would be really helpful for the audience. I love what you said about the best practices around data selection, and then how you did the 50-50 splits between languages and between public and proprietary data.

But my favorite was where you said fine-tuned is an overloaded word nowadays and how about we use curated. Thanks so much.