In this workshop, Murtuza Shergadwala shares his experiences in building the RAG-based Fiddler Chatbot, and provides tips and tricks on prompt engineering, document chunking, managing hallucinations in responses, improving user trust through UI/UX design, and improving the chatbot with evolving documentation and user feedback.
Document Chunking and Retrieval Effectiveness: Efficiently chunking large datasets is essential for a chatbot's ability to accurately retrieve and process relevant information within its context window.
Iterative Process in Prompt Engineering: Developing chatbot prompts is an iterative process, essential for improving response effectiveness. Adapting these prompts based on real-world feedback significantly boosts the chatbot's precision and utility.
Strategies to Address Hallucinations and Maintain Accuracy: Addressing chatbot hallucinations involves strategies for identifying and reducing inaccurate responses, underscoring the importance of continuous monitoring and refinement for chatbot reliability and trustworthiness.
Speaker: Murtuza Shergadwala - Senior Data Scientist, Fiddler AI
[00:00:04] Karen He: All right. Welcome back. Our last workshop for today is on chatbots, where our senior data scientist Murtuza Shergadwala will share tips and tricks on building a chatbot. During his presentation and workshop, we'll have a poll for you to answer. And if you have any questions or comments throughout the workshop, please put them in the chat or the Q&A section, and we will address them afterwards.
[00:00:35] Take it over, Murtuza.
[00:00:40] Murtuza Shergadwala: Thanks, Karen. Hello, everyone. Let me share my screen first.
[00:00:48] All right. Hello, everyone. My name is Murtuza Shergadwala. I'm a senior data scientist at Fiddler. My background is in human-centered design and decision making, and today my presentation is going to focus on my learnings while designing the Fiddler chatbot. The Fiddler chatbot actually started as a side project during our company's hackathon, where we can do almost whatever we want for a day and just come up with something.
[00:01:17] And so I decided to use our documentation to build a chatbot as an LLM application. These are all my lessons. I have deliberately not updated them after the OpenAI Dev Day announcements, because that is going to help us appreciate the fast-moving nature of the space that we are in.
[00:01:41] And we'll also discuss which of these lessons remain relevant. Feel free to ask me questions about them. But before we start on all of that, let's first give a little introduction to RAG, or Retrieval Augmented Generation, with LLMs.
[00:02:05] Here is the canonical structure for this. You have a chatbot application where you want a user to be able to ask questions, and we want to leverage the ability of a large language model to form coherent responses to the questions being asked. But we also want to use the knowledge base that an enterprise might have, or the specific documents on which the user is asking questions.
[00:02:46] So we want to leverage both of them, right? That's where the Retrieval Augmented Generation structure comes in, where there are these different functionalities that need to be ensured so that you can get a good chatbot. You can see that the user asks a question, there is some prompt processing you might do, and there is a way of leveraging certain prompt styles. These prompts are essentially instructions that you need to send to the large language model.
[00:03:17] The retrieval part of Retrieval Augmented Generation is when the question the user is asking results in finding certain documents that are relevant to that question, going to your database, and retrieving that information. Once that information is retrieved, you give not only your prompt, which is your instructions, but also the query,
[00:03:46] as well as the documents that you have retrieved from a database, a vector database, let's say. You put all of that information into the large language model and then hopefully get a response which satisfies the question being asked by the user. And when the response is being built, you might have certain layers, not just for a retrieval augmented generation setting but for an LLM application in general, where you might want certain safety policies in place. Then you get a generated response which hopefully satisfies the user.
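To make that flow concrete, here is a minimal sketch of the canonical structure in Python. The word-overlap retriever and the stub in place of the LLM call are both placeholders I've made up for illustration; a real system would use embeddings for retrieval and an actual model API for generation.

```python
def retrieve_docs(query, knowledge_base, top_k=1):
    # Toy retriever: rank documents by word overlap with the query.
    query_words = set(query.lower().split())
    def overlap(doc):
        return len(query_words & set(doc.lower().split()))
    return sorted(knowledge_base, key=overlap, reverse=True)[:top_k]

def build_prompt(instructions, docs, query):
    # Prompt = instructions + retrieved context + the user's question.
    context = "\n".join(docs)
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {query}"

def answer(query, knowledge_base, llm):
    docs = retrieve_docs(query, knowledge_base)
    prompt = build_prompt("Answer only from the context below.", docs, query)
    return llm(prompt)

kb = [
    "Fiddler monitors ML models in production.",
    "The quickstart notebook has ten setup steps.",
]
# Stub "LLM": just echoes the last line of the prompt back.
stub_llm = lambda prompt: prompt.splitlines()[-1]
print(answer("How does Fiddler monitor models?", kb, stub_llm))
```

The point is only the shape of the pipeline: process the question, retrieve, assemble a prompt, generate.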
[00:04:23] Having said that, let's go a little bit technical. This is somewhere midway between the general architecture and what happened in the actual Fiddler chatbot. What I did was, for each of the documents that we have, I chunked our documentation and then generated vector embeddings for these documents.
[00:04:46] So essentially, I converted each document into a series of vectors. Then I create embeddings for the questions that the user is asking as well, do a similarity search between the question, which is encoded through these embeddings, and the relevant documentation in your vector database, and then feed those pages to the model.
[00:05:11] In my case, I used GPT-3.5 Turbo, and I used the Cassandra vector database from DataStax as my vector storage, to then feed all that information into GPT-3.5 Turbo and get a response. So this is just a basic overview. Let's jump into some of the lessons that I have learned.
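As a rough sketch of the similarity-search step: embed the query, score it against every stored document embedding with cosine similarity, and keep the top hits. The three-dimensional vectors and document names below are made-up toy values; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_embedding, doc_embeddings, top_k=2):
    # Rank documents by similarity to the query embedding.
    scored = sorted(doc_embeddings.items(),
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

docs = {
    "quickstart":  [0.9, 0.1, 0.0],
    "churn-guide": [0.2, 0.8, 0.1],
    "api-ref":     [0.0, 0.1, 0.9],
}
print(retrieve([0.85, 0.2, 0.05], docs, top_k=1))  # → ['quickstart']
```

A vector database performs exactly this kind of nearest-neighbor lookup, just at scale and with indexing.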
[00:05:38] We'll talk about 10 lessons here. It is going to be very interesting to interact with the audience and see how these lessons are influenced by recent changes from OpenAI. But you can see that these lessons touch on the different aspects of that canonical architecture we spoke about.
[00:06:00] So firstly, what framework do you use for designing a chatbot? I'm suggesting you start with LangChain. Then we will talk about how to process users' questions, how to do chunking, and how to do retrieval. We'll cover all these different aspects. With this, let's jump into the first lesson, which is: to LangChain or not to LangChain.
[00:06:19] The reason I'm talking about this lesson is because of how I initially approached designing a chatbot. I'm also going to keep looking at questions, just to keep it interactive. I initially started with the idea of designing everything on my own, all the functionalities. There's already a question about why you need to feed similar pages:
[00:06:48] can the model not find them on its own? That question is actually relevant to this lesson. I thought that I wanted to do everything from scratch and build every functionality on my own. So I was doing it, feeding things manually and so on.
[00:07:05] And then I ran into LangChain, which is an open-source framework that allows you to develop applications powered by language models. The reason I share this lesson is because you'll realize that there are lots of functionalities you need as you're developing a chatbot.
[00:07:26] For example, users may not ask the perfect question. You may want the question in a particular style, so you might want to preprocess it. You might want a memory of the chat session. There are so many other things that are required; in fact, you can see that in the canonical architecture. So I would suggest that if you are actually trying to get your hands dirty with a chatbot, start with the LangChain architecture so that you get an inside view.
[00:07:51] Obviously, here we have to talk about the recent update that OpenAI shared on Dev Day, which is GPTs. You can create your own assistants and you don't need to know code. You can just give instructions in chat, and they take care of everything on the backend.
[00:08:10] So that's great. But for this lesson of what framework to use, I think of GPTs as an assistants framework; you can use that as a start. I still feel that if you want to go into the more technical aspects of understanding every functionality, what's going on in the back, what the plumbing is, using a framework where you can code things up will be better than
[00:08:34] just using natural language to generate assistants. Obviously, I agree that it democratizes this process and allows everyone to participate in generating their own chatbot, so that's great. But here is an example screenshot where, on the right side, is my Jupyter notebook where I coded everything. I took OpenAI's cookbook, started creating all these functionalities, and started
[00:08:58] feeding pages on my own rather than keeping it automated, as compared to the LangChain framework, where you can create this conversational retrieval chain, put in a few lines of code, and get much the same response. So my suggestion would be to really use the LangChain framework and then dig into all the plumbing, rather than starting everything from scratch.
[00:09:21] So that's the first lesson. The next one goes into a very specific aspect at the beginning of this canonical structure, which is the prompt itself. Obviously, your users cannot really be controlled by you, right? There are humans who are going to be generating questions, and different people have different expertise with the same language.
[00:09:44] They might have errors in the way they phrase their questions. You might expect certain questions but get certain other questions. So having the functionality of processing the questions is going to be an important design decision that you will have to think about.
[00:10:00] You also have to think about whether you want to use the raw query that the user is asking, some processed query, or some combination of both to retrieve the relevant documentation. Here is a screenshot of a LangChain-based approach.
[00:10:26] This is my template, and then there is this question generator. What the question generator is doing in LangChain is using yet another LLM and instantiating it. The only role of that LLM is to process the questions the user is asking and reshape them to fit the prompt that you want, just for processing questions.
[00:10:49] Then you can input that into your conversational retrieval chain and use that question to do the search, rather than using the raw question. So this lesson is essentially about knowing your user and anticipating what might happen as you get all these different queries from your specific base of users.
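The question-generator idea can be sketched as a function that hands the raw question, plus chat history, to a second model whose only job is rewriting. The stub rewriter below is an invented stand-in for that second LLM call, so the example stays runnable.

```python
def rewrite_question(raw_question, chat_history, rewriter_llm):
    # Use a second model call purely to turn a raw, possibly messy user
    # question into a standalone query suitable for retrieval.
    prompt = (
        "Rewrite the user's question as a clear standalone question.\n"
        f"Chat history: {chat_history}\n"
        f"Question: {raw_question}"
    )
    return rewriter_llm(prompt)

# Stub rewriter for illustration; a real system would call an LLM here.
def stub_rewriter(prompt):
    question = prompt.rsplit("Question: ", 1)[-1]
    return question.strip().rstrip("?") + "?"

print(rewrite_question("what abt model info  ", [], stub_rewriter))
```

Retrieval then runs on the rewritten question instead of (or alongside) the raw one.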
[00:11:17] With that said, let's go into the next functionality, which is chunking documentation. Honestly, document chunking is really an art. Yes, there are certain things you can do, certain rules of thumb, but why do we need document chunking in the first place? Again, based on the updates by OpenAI, some of this might be outdated.
[00:11:38] Essentially, the reason you want to chunk your documents is because of the context window, which is the amount of information you can give as input to the LLM you are using. There might be certain limitations on the number of tokens it can hold.
[00:11:57] OpenAI recently announced a 128k-token context window, which is almost a 300-page book that can be put in. So you might not need doc chunking for those kinds of situations after all. But until just yesterday, in a way, there was this token limit.
[00:12:18] And so if you have really large documents that you're storing in your database, you want to be able to chunk them. How you chunk them becomes very important, because each of those chunks is then converted into embeddings. You want to be careful here. For example, in Fiddler there's a quickstart notebook, right?
[00:12:44] And that quickstart notebook has, let's say, 10 steps. If you're chunking that notebook into smaller parts just to take care of the token limit, you don't want your LLM to retrieve only a part of that chunk and then give you some half-baked answer on what the steps are.
[00:13:04] So it becomes very critical for you to think through how you're going to chunk the documents and how much you're chunking them. Are you chunking sentence by sentence, paragraph by paragraph, page by page? Those are things you'll have to think through and experiment with. For example, here is a snapshot of how I'm chunking the documents in my case.
[00:13:24] I'm doing a brute-force approach where I'm just counting the number of tokens; if it is greater than 750 tokens, I chunk the doc. This was obviously just a start, and I kept improving it. But even with this brute-force approach of chunking the doc based on token limits, it did a pretty good job of retrieving the right documents to give the right kind of answer.
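A minimal version of that brute-force approach might look like the following. Whitespace-separated words stand in for tokens here (the real bot counted model tokens, so treat the "tokenizer" as an assumption); the structure of the loop is the same either way.

```python
def chunk_document(text, max_tokens=750):
    # Brute-force chunking: emit consecutive chunks of at most max_tokens
    # "tokens" (whitespace words stand in for model tokens here).
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

doc = "word " * 2000  # a 2000-word document
chunks = chunk_document(doc, max_tokens=750)
print([len(c.split()) for c in chunks])  # → [750, 750, 500]
```

Note the weakness the talk goes on to describe: the split is blind to structure, so a list of steps can be cut in half mid-procedure.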
[00:13:48] But then you can see that I've done certain things here. There is this customer churn prediction notebook that we have; it's a notebook where we give an example of a churn prediction model and how you can do observability on such a model using Fiddler. So this is a whole quickstart notebook.
[00:14:07] It's a Jupyter notebook, and it has several parts to it. I have chunked it into three parts, and you can see that for the third and fourth rows in the screenshot, I have added metadata as the starting text of each chunk. The reason I have done that is because when you're converting these chunks into embeddings, you create continuity: a common piece of metadata connecting these text chunks.
[00:14:39] Here I've used the slug. The slug is essentially the last part of the URL of the documentation page, which ends with customer-churn-prediction. I have used that metadata across all the chunks. So essentially, when you convert them into embeddings, these points are definitely going to be closer to each other in the embedding space, just because they share these continuity words.
[00:15:00] And the reason you're doing this is because then, when somebody asks about a quickstart notebook on churn prediction, all these parts of the document can be retrieved, and the LLM has better context for framing an answer based on whatever the user is asking.
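The metadata trick can be as simple as prefixing every chunk of a page with its slug before embedding. The chunk texts and the bracketed tag format below are invented placeholders; the idea is just that sibling chunks share a common marker.

```python
def tag_chunks_with_slug(chunks, slug):
    # Prepend the page slug to every chunk so sibling chunks embed close
    # together and can be retrieved as a group.
    return [f"[{slug}] {chunk}" for chunk in chunks]

chunks = tag_chunks_with_slug(
    ["Step 1: upload the model.", "Step 7: publish events."],
    "customer-churn-prediction",
)
print(chunks[0])  # → [customer-churn-prediction] Step 1: upload the model.
```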
[00:15:22] So that was just a little tip on doc chunking as well. Obviously, OpenAI now does its own automated chunking of the docs depending on the kind of docs you give it, but to have more control over how it's done, it's important for you to have an idea of the different design decisions involved.
[00:15:41] With this, we'll go to the next lesson, which is a combination of everything we spoke about earlier: the user asks a question, you have your prompts, and then you have the documents that are being retrieved. The question is, how many times do you run information retrieval?
[00:15:54] Do you retrieve based on just one retrieval pass, or do you do some sort of sampling? You want to make sure that the documents being retrieved have enough relevancy to the question being asked. The reason you might want to run retrieval multiple times is because, let's say you follow the previous advice and the question is processed: it could be possible that the user was asking the right question, and the question that you processed is not the best version.
[00:16:38] So you want to make sure that you're retrieving on both the original question and the processed question, and then do some analysis, or make some deterministic rules, like taking some intersection, union, or combination of these docs, so that you can ensure the user is getting the right kind of information as a response.
[00:16:57] Here is an example where the user is asking a specific question: what is ModelInfo? This is very specific to Fiddler. ModelInfo is actually an object where you're storing information about your model, so ModelInfo is a relevant object. But because the questions are being processed by yet another LLM,
[00:17:23] the question gets processed into "what information can I store about the model?" Although these questions are related, they are not exactly the same, and the user is actually correct here; they're specifically asking what ModelInfo is. So here I've shown that I'm retrieving on both the processed question and the original question,
[00:17:44] and trying to see whether relevant pages show up in either case, and then adding another tool that can judge which documents to send. So there are things you can do within your information retrieval system itself, before anything even goes to the LLM, so that responses can be generated well.
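One deterministic way to merge the two retrieval passes is a plain set union or intersection over the retrieved document IDs. The lookup-table "retriever" and document names below are invented for illustration.

```python
def combined_retrieval(original_q, processed_q, retrieve, mode="union"):
    # Retrieve on both the raw and the LLM-processed question, then merge
    # the two result sets deterministically.
    a, b = set(retrieve(original_q)), set(retrieve(processed_q))
    return sorted(a | b) if mode == "union" else sorted(a & b)

# Toy retriever keyed by exact question text, for illustration only.
index = {
    "what is modelinfo?": ["model-info-ref"],
    "what information can i store about the model?": ["model-info-ref", "schema-guide"],
}
lookup = lambda q: index.get(q, [])

print(combined_retrieval("what is modelinfo?",
                         "what information can i store about the model?",
                         lookup, mode="intersection"))  # → ['model-info-ref']
```

Intersection favors precision (documents both phrasings agree on), union favors recall; which to use is itself a design decision.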
[00:18:06] I'm seeing another question here: what do you do when documentation versions change? That's a great question. I have personally been encountering this, because I am also stewarding Fiddler's documentation. So I am aware of all the changes being made in the Fiddler documentation, and I make sure that I take the pages that changed, convert them into embeddings, and update them in the database.
[00:18:41] Obviously, a brute-force approach you can take is to just regenerate the embeddings across the entire documentation if major changes are happening, if multiple parts of the documentation have been touched, so that you can essentially restart from scratch. I have done this a couple of times in the beginning, and since generating embeddings is very
[00:19:06] cheap, you can just do a brute-force approach. But obviously, in enterprises, if there is a lot more documentation, you will have to start developing strategies for figuring out which rows of info in your database are outdated, what new pages are being added, and what pages were updated. Maybe your API call changed, maybe some guide changed. Those are things you'll have to track, and then update your vector database accordingly.
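One simple strategy for finding stale rows is to keep a content hash per page alongside its embedding and re-embed only pages whose hash changed or that are new. This is a sketch under the assumption that pages are keyed by slug; the page texts are placeholders.

```python
import hashlib

def pages_to_reembed(current_pages, stored_hashes):
    # Compare a content hash per page against what was last embedded;
    # only changed or new pages need fresh embeddings.
    changed = []
    for slug, text in current_pages.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if stored_hashes.get(slug) != digest:
            changed.append(slug)
    return sorted(changed)

stored = {"quickstart": hashlib.sha256(b"old text").hexdigest()}
current = {"quickstart": "new text", "faq": "brand new page"}
print(pages_to_reembed(current, stored))  # → ['faq', 'quickstart']
```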
[00:19:41] We'll move on to the next lesson, which is obviously related to prompt engineering. In my view, it's really iterative prompt building rather than engineering. And here I say that prompt engineering reflects your thinking. But thinking about what? It reflects your thinking about multiple things: the task at hand, how you are viewing it, and how you break up the task in terms of articulating what it is and what your objective is.
[00:20:14] Do you understand the difference between your goals and your objectives, and can you write them clearly? Kind of like OKRs, objectives and key results. Formulating that is important, but so is thinking about how the transformer architecture actually works.
[00:20:32] How does an LLM work? Are you the kind of person who tries to humanize the model, or do you actually know how the inner workings of an LLM take place, so that you can use the right kind of words, instructions, and tokens that enable the large language model to do what you want it to do?
[00:20:55] And there are several studies, right? I think even from Google Brain, there are studies on the kind of prompts you should have, the style with which you should write your prompts. Apparently, research has shown that if you add emotional aspects to your prompting, like "answering this is very important to me" or "this is very critical," the LLM's
[00:21:23] responses seem to be better aligned with what you expected, which is interesting. So there are a bunch of prompt styles you can use, and that's where this prompt library part comes in. Here is an example from my own experience of how it started, when I was doing this
[00:21:42] chatbot building. I used a prompt like: "Use the below documentation from Fiddler to answer the subsequent question. If the answer cannot be found, just say I could not find an answer." So this is how it started, and then I started iteratively looking at the performance and the responses. I realized
[00:22:00] I wanted certain references. We are using ReadMe for the Fiddler documentation, so there are certain styles with which the links are created. I had to write into the prompt how to generate references, what it shouldn't do, and how to provide links to our Slack for further clarification, making sure the response is concise and the model is not making up answers, and things like that.
[00:22:27] So there was a lot of iteration that took place in this prompt building. It's important for you to really go through all these iterations. Don't try to optimize for the best prompt up front; just start with a prompt, build on it from there, and see what happens.
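Put together, an iterated prompt along those lines might look something like this template. The exact wording, the link format, and the `docs.fiddler.ai` URL shape are my reconstruction for illustration, not the actual Fiddler prompt.

```python
# Illustrative prompt template assembled from the iterations described above.
PROMPT_TEMPLATE = """\
Use the below documentation from Fiddler to answer the subsequent question.
- If the answer cannot be found, say "I could not find an answer."
- Format references as links, e.g. https://docs.fiddler.ai/docs/<slug>.
- Do not make up answers; keep the response concise.
- For further clarification, point the user to the Fiddler Slack community.

Documentation:
{documents}

Question: {question}
"""

prompt = PROMPT_TEMPLATE.format(documents="<retrieved chunks here>",
                                question="What is ModelInfo?")
print(prompt.splitlines()[0])
```

Each bullet in the template corresponds to one iteration: the fallback answer, the reference style, the no-fabrication rule, and the escalation path.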
[00:22:49] Let's see if there are any questions. "Why can't we extend vector database features to acquire similar content?" I think we can; I hope I have not said something that was confusing. "With LangChain, will the chatbot answer only from your datastore, or will it go back to the underlying LLM's learning to answer a question?"
[00:23:23] "Do you have any control to see if this is happening, or is it a hallucination when you see an answer that is not in your documentation?" This is a great question, actually. There is always this worry, right? Whether the LLM is using its own knowledge, or whether it's constraining itself to
[00:23:45] just the documentation you have provided. The prompt definitely matters here. You have actually covered a lot of questions there, including hallucination. We will talk about the topic of hallucination very soon, but I think your prompting, and how tightly you give your prompt, along the lines of "do not make up answers, just look at this documentation and answer,"
[00:24:14] things like that are going to be important for ensuring that it is using the docs you're providing and not its own knowledge. But it touches the concept of hallucination; we'll talk about that soon enough. Thanks for the question, it was a great one.
[00:24:33] I'll cover that, it's coming up. But let's go to the next lesson, which is human feedback design. I think this one is obvious, but it's still important to say that having human feedback, that is, your user feedback, is important, and the modalities with which you create feedback options are also important.
[00:24:59] What do I mean by that? Make sure that you have more than one way in which your user can express whether they are happy with the response or not. If you think about cognitive load, when your user is unhappy with a response, or they think something is outrageous, they will be willing to spend a lot more cognitive effort explaining why, because they're frustrated.
[00:25:26] So they will want to leave a comment or feedback saying that this is ridiculous, or they'll be outraged about it. If the response is good, they may not. They're less likely to give you text feedback on why the response is good; they'll just leverage that knowledge and go about their work.
[00:25:47] That's where things such as thumbs up, thumbs down help: there's a lower cognitive load for the person to just click whether the response was good or not. Understanding these behaviors of users is important so that you can have multiple modalities in your feedback
[00:26:06] mechanisms. That's just something I wanted to make sure is in these lessons. The next lesson is about not just storing responses and queries but, on top of that, making sure you are converting your responses and queries into embeddings. The reason is that if a user is asking some question, you could develop a "maybe you meant this" sort of feature, where, given the question or query from the user, you start suggesting better questions to ask, or related questions to ask.
[00:26:44] The way you can do that is if you already have a repository of questions that you think are relevant to the context in which you're developing your LLM app: you can go back to your data, look at all the questions being asked and the queries being generated, and do a search within that space as well.
[00:27:05] I've not implemented this in the Fiddler chatbot yet, because we're still collecting data on relevant queries, though I could generate synthetic data, for sure. But it is there in the Bing bot, so I just wanted to show that here. Here, I'm asking about the weather in Palo Alto,
[00:27:23] and it's suggesting questions like the weather in San Francisco, or will it rain tomorrow. These sorts of functionalities, nudging the user to create better questions, or to explore, or to keep them engaged on your app, might be something important for you to explore. So make sure that you're not only storing the raw responses and queries but also converting them into embeddings.
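A "maybe you meant" feature can then search over the stored queries. In this sketch, string similarity from the standard library stands in for an embedding search over stored query embeddings, and the stored questions are invented examples.

```python
import difflib

def suggest_related(query, stored_queries, n=2):
    # "Maybe you meant": match the new query against previously stored
    # queries; string similarity stands in for embedding similarity here.
    return difflib.get_close_matches(query, stored_queries, n=n, cutoff=0.3)

stored = [
    "what is the weather in san francisco?",
    "will it rain tomorrow?",
    "how do i upload a model?",
]
print(suggest_related("what is the weather in palo alto?", stored, n=1))
```

With real embeddings, this becomes the same nearest-neighbor search used for document retrieval, just run against the query history instead of the docs.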
[00:27:47] With this, let's move to the next lesson, which I think is quite an interesting topic for everyone: detecting hallucinations, and making sure you're reducing the hallucinations in your chatbot or LLM app. There's a question here: "Can you explain the process of integrating external tools into the RAG process?
[00:28:12] For example, handling numerical data for generative analysis." All right, I'll take this a little later. Let's talk about this topic of hallucination reduction. Here I'll talk about how, as I was developing the bot, the Fiddler chatbot was also generating certain hallucinations.
[00:28:31] There is a workflow that I developed as it was generating hallucinations, and at Fiddler right now we are working on developing tools that are more automated in nature, that can detect hallucinations and help you reduce them. But here is a sneak peek at how you might be able to do it manually.
[00:28:50] Let's say you notice that your chatbot is hallucinating and you want to quickly resolve that. Here is an example from the Fiddler chatbot where I'm asking the question: could an LLM be an ML model? And it says that no, local linear model and ML, which is machine learning, are not the same things.
[00:29:09] So it is doing one type of hallucination, an abbreviation hallucination, where you mean one thing but it interprets the abbreviation in some other way. You realize that maybe the bot doesn't have enough context in the documentation itself about what LLMs are.
[00:29:27] When I actually started looking at this query and the response, I realized that the docs did not have enough context about LLMs. So then I started adding these things which I call caveats. Caveats are extra information that might not be in your database in a very obvious way, even though it might be obvious to you or to your enterprise, with your expert knowledge.
[00:29:54] If what you're trying to say is kind of obvious in this context, you may want to go ahead and start appending that information to your database. And when I started appending this information, you can see what happened once I added the caveat about what LLMs are. I picked up the blog by our head of product, Sri, and I converted that blog.
[00:30:18] So that's actually not in our documentation, right? It's just an extra piece of information from our blog, and I added that information. Now you can see that when you ask the same question, it is able to answer it. In fact, the answer flips: LLMs are a type of ML model, and it gives you a better answer there.
[00:30:37] So here is a sneak peek into how I was doing it manually at first. You can see that there are these caveats, which are Fiddler-specific information. We talked about that ModelInfo object: there are certain fields in the ModelInfo object that can be updated, but that information is not given very directly in the documentation.
[00:30:59] So I started adding these. Once I added those caveats, converted them into embeddings, and appended them to the database, the reduction in hallucination was pretty obvious. Obviously you will then start thinking about how to automate this process of discovering topics on which hallucinations are occurring.
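To make the caveat workflow concrete, here is a minimal sketch of the "embed the extra text and append it to the store" step. The bag-of-words embedding, document IDs, and texts are all illustrative stand-ins for a real embedding model and vector database, not the actual Fiddler pipeline:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real pipeline would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

store = []  # stands in for the vector database: (id, text, vector)

def add_doc(doc_id, text):
    store.append((doc_id, text, embed(text)))

# A regular documentation chunk.
add_doc("docs/monitoring", "fiddler monitors ml models in production")

# The caveat is appended exactly like any other chunk: embed it, store it.
add_doc("caveat/llm", "an llm large language model is a type of ml model")

def retrieve(query):
    """Return the id of the best-matching chunk for a query."""
    q = embed(query)
    return max(store, key=lambda rec: cosine(q, rec[2]))[0]

print(retrieve("is an llm an ml model"))  # -> caveat/llm
```

With the caveat chunk in the store, a question about LLMs now retrieves the appended expert knowledge rather than an unrelated doc, which is what flips the answer.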
[00:31:23] And to automate it, how do you actually detect whether a response is hallucinated in the first place? These are questions that still do not have a very clear answer; everybody is working towards it. We at Fiddler have developed responses and strategies for some types of hallucinations and how to address them.
[00:31:43] So stay tuned for all of that to come out. But here is a snapshot of how I was manually adding them for the Fiddler chatbot. Next, we are going to go into this notion of trust with the bot. Oh, we have a poll here; it'll be nice to see who is using RAG-based chatbots, so please do answer the poll.
[00:32:10] So the next lesson is related to building trust with your chatbot, and how you, as the designer of the chatbot, should make sure it is designed in a way that enables trust with the user. And here the UI/UX, the user interface and user experience, is extremely important for the user to feel that they can trust the application you have developed.
[00:32:36] So here is a GIF I created with the Fiddler chatbot, where the user enters a question and, when they press enter, the chatbot streams responses from the LLM. Designing that streaming ability may be trivial to some who already have the technical background to develop these applications.
[00:33:00] But for those of you who are starting out, having this sort of streaming response matters. Okay, this is interesting: the poll results are here. Is your organization using RAG chatbots? We have "yes" at around 30 to 35%. "Maybe, if we can figure out how to avoid hallucinations" is the highest response, and "no" is also around 30%.
[00:33:24] So it's roughly a uniform distribution across folks, and that's good to know about who's here. But going back to building trust: you can see that when the chatbot is streaming the response, it's not as if the application is hung, waiting for the entire response from the LLM before showing you anything.
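A minimal sketch of that streaming behavior, with a fake generator standing in for a real streaming LLM API; the delay value and answer text are just illustrative:

```python
import sys
import time

def fake_llm_stream(answer, delay=0.01):
    """Stand-in for a streaming LLM API: yields tokens as they are generated."""
    for token in answer.split():
        time.sleep(delay)  # simulated generation latency
        yield token + " "

def render(stream, out=sys.stdout):
    """Write and flush each token immediately, so the UI shows partial
    output instead of looking hung while the full answer is generated."""
    parts = []
    for token in stream:
        out.write(token)
        out.flush()  # the key step: paint the partial response right away
        parts.append(token)
    out.write("\n")
    return "".join(parts)

render(fake_llm_stream("Yes, an LLM is a type of ML model."))
```

The `flush()` call is what makes the difference between a UI that appears frozen and one that visibly types the answer out.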
[00:33:47] And these kinds of little features are important. It's kind of the opposite of what airlines do when you're searching for tickets, where the site shows a little spinner. It could actually give you a response very quickly, but sometimes a delay helps the user think that, okay, the website is working hard to find a good flight for me.
[00:34:10] Here it's the opposite: you don't want your app to hang, because the user might think this is not a good app and might not trust your responses. So these are things that are important to consider. And then the final lesson is the need for memory and summarization capabilities.
[00:34:33] Your users will ask questions using words like "this", "that", and "it", because they'll talk about a topic and then refer back to it that way. So you need this concept of memory and summarization, where you use yet another LLM to come up with a summary of the conversation and feed it into your RAG-style chatbot,
[00:34:58] so that there is enough context about what the user is asking. Here is an example where I'm asking the question, "How can I use them?", and the context is that I was actually asking about dashboards, which is a Fiddler feature. So there is a chat history
[00:35:14] being parsed, where the user has asked questions about the dashboard, and all of this can be done in LangChain itself. It's important to consider these things, because once users get used to having a real conversation with your chatbot, you need these aspects of memory and summarization baked into the design.
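Here is one way the condense-the-question step could be sketched. The prompt wording and the stub LLM are assumptions for illustration; a real deployment would call an actual model here, and LangChain's conversational retrieval chains wire up this same pattern:

```python
def condense_question(history, question, llm):
    """Use a second LLM call to rewrite a follow-up such as "how can I
    use them?" into a standalone question, given the chat history."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    prompt = (
        "Given the conversation below, rewrite the follow-up question "
        "so it makes sense on its own.\n\n"
        + transcript
        + f"\n\nFollow-up: {question}\nStandalone question:"
    )
    return llm(prompt)

def stub_llm(prompt):
    # Canned reply so the sketch runs offline; purely a placeholder.
    return "How can I use dashboards in Fiddler?"

history = [
    ("user", "What are dashboards in Fiddler?"),
    ("bot", "Dashboards let you arrange monitoring charts in one view."),
]
standalone = condense_question(history, "How can I use them?", stub_llm)
print(standalone)  # this rewritten query is what goes into retrieval
```

The rewritten standalone question, not the raw follow-up, is what gets embedded and used for document retrieval, which is how "them" resolves to "dashboards".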
[00:35:33] So with that said, here is a summary of all these lessons. These are obviously things I experienced when designing the Fiddler chatbot, and I'm sure there are lots of other topics within this RAG architecture that raise questions without answers yet. But we started with what framework to start with, then how to process the queries, what kind of prompts to generate, how to retrieve information, how to create the right kind of instructions, how to incorporate human feedback, what information you should store and how you should store it, and how to deal with this notion of hallucination, at least manually if you don't yet have tools for it,
[00:36:14] plus building trust with the user, and then memory. With that said, I'll end here with a thank-you slide, but I'm around to chat with the audience. So thank you.
[00:36:30] Karen He: Thank you so much, Murtuza, for presenting your experience on the chatbot. If you haven't had the chance, you can go in and ask any questions to the Fiddler chatbot.
[00:36:44] I've included the link in the chat and also in the resources section. Let me go through; there are definitely some questions that came in, and I know you addressed some of them already. So first off, one of the audience members asks: how do you handle hallucination?
[00:37:08] Murtuza Shergadwala: Yeah, so that's definitely a topic of interest for everyone.
[00:37:14] Hallucination, obviously, is a response by the bot that convincingly says something that is not true, not accurate, or not truthful. And there are different types of hallucinations. Hallucination is just one word, but it's an umbrella term for a bunch of things that could happen.
[00:37:40] So it's firstly important to frame the tools you use to reduce hallucination around an understanding of what types of hallucinations you care about and which are happening frequently in your specific application.
[00:38:02] For example, do you have a lot of abbreviations? Is it a chemistry sort of application, with lots of chemical elements, where the bot changes the names of those elements or incorrectly expands certain abbreviations?
[00:38:20] Is it a factual bot? We saw Amal give an example earlier with Fiddler Auditor about the kind of fees a hypothetical bank was charging; if your bot is going to produce some number for data analysis, it's of course important that numerical information is handled correctly. So there are different things that can happen within this vast space of hallucinations, and it's important to appreciate that when you ask how to address hallucinations, you first want to think about which kinds are frequently occurring, or have a chance of occurring, in your application before you try to solve them.
[00:39:09] And at least for now, I don't think any tool addresses hallucination as a whole, broad topic completely. It might handle bits and parts of it. There are already tools that look at the responses and compare them to the docs you retrieved in your RAG-style architecture, to see whether that information matches what is presented in the documents.
[00:39:39] That, I think, is a great start: doing some checking, be it token-wise checking, sentence-wise checking, topical extraction, and things like that. But when you start going into more complex situations, like factual knowledge that may not be directly retrieved from the docs but may be a combination,
[00:40:09] that's where things get dicier, and it's still an open research topic.
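The simplest version of the doc-comparison idea mentioned above can be sketched as a token-overlap groundedness score. This is a deliberately crude illustration, not a real hallucination detector, and the example texts are made up:

```python
def grounding_score(response, retrieved_docs):
    """Fraction of response tokens that appear somewhere in the retrieved
    docs; a crude token-wise groundedness check, nothing more."""
    doc_tokens = set()
    for doc in retrieved_docs:
        doc_tokens.update(doc.lower().split())
    tokens = response.lower().split()
    if not tokens:
        return 0.0
    return sum(t in doc_tokens for t in tokens) / len(tokens)

docs = ["fiddler monitors ml models and llms in production"]
print(grounding_score("fiddler monitors ml models", docs))           # fully grounded
print(grounding_score("fiddler charges a fee of 42 dollars", docs))  # mostly ungrounded
```

A low score flags a response for review; real tools refine this with sentence-level entailment checks rather than raw token overlap.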
[00:40:15] Karen He: The next question: can you explain the process of integrating external tools into the RAG process? For example, handling numerical data for generative analytics?
[00:40:27] Murtuza Shergadwala: Sure. You're giving the example of handling numeric data. I myself developed the Fiddler chatbot, and it's still easier to develop a chatbot that answers questions about documentation than one that does generative analytics.
[00:40:44] I'm not completely well versed in all the kinds of mathematical errors you could encounter with those kinds of bots. But when you think about integrating tools, this canonical architecture is very helpful to me: looking at your RAG architecture and seeing all the different plugins you can have at each function level, and asking questions such as which vector database to use, how you are pre-processing the prompts, or
[00:41:20] whether you're sampling things from a library or searching the internet. Where are you retrieving your information from? What kind of metrics are you using for information retrieval? In my case, I just used a very simple cosine-similarity way of retrieving docs, but there are obviously more complex ways to rank and retrieve
[00:41:41] the pages that are relevant, and you'll have to consider what tools you use for that. Plus, you'll have to see how much control you have over the architecture on which you're developing your chatbot, because that will help you figure out whether you can integrate certain external tools or not.
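As an illustration of keeping the retrieval metric pluggable, here is a sketch where cosine similarity over toy bag-of-words vectors is just the default scorer and could be swapped for a BM25-style metric or a reranker. Everything here (the docs, the scoring function) is a stand-in, not the actual Fiddler implementation:

```python
import math
from collections import Counter

def bow(text):
    """Toy bag-of-words vector; a real system would use embeddings."""
    return Counter(text.lower().split())

def cosine_score(query, doc):
    q, d = bow(query), bow(doc)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def rank(query, docs, score=cosine_score, k=2):
    """Retrieve the top-k docs under a pluggable scoring function."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

docs = [
    "how to create dashboards in fiddler",
    "monitoring model drift over time",
    "alerting on data integrity issues",
]
print(rank("create a dashboard in fiddler", docs, k=1))
```

Because `score` is just a parameter, you can upgrade the ranking without touching the rest of the chain, which is the architectural point about plugins at each function level.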
[00:42:01] Karen He: All right, there are a few more questions; we may be able to get through two. How do you measure the success of your chatbot, that it's actually helping your users?
[00:42:12] Murtuza Shergadwala: Yes, this is a great question. I think the success of a chatbot is definitely connected to user experience, but also to whether it's giving the right kind of information to your users.
[00:42:26] So you get feedback from your users and try to quantify success through that, and obviously through usage of your chatbot. If people are not finding it useful at all, or if it's causing more problems than it solves, then you might reconsider why you're building this app.
[00:42:47] So success has multiple dimensions there, but I would make it user-centric and measure success that way.
[00:42:57] Karen He: All right. I believe that's all the questions we can get to. So with that, we conclude the AI Forward Summit. It was a pleasure hosting each and every one of you, and thank you, Murtuza, for your awesome chat on chatbots.