Back to blog home

Building Generative AI Applications for Production

Teams across industries are building generative AI applications to transform their businesses, but there are numerous technical challenges that need to be addressed before these applications can be deployed into production. Founder and CEO of BentoML, Chaoyu Yang, joined us on AI Explained to share his real-world experiences and key aspects of generative AI application development and deployment. Watch the webinar on-demand below and check out some of the key takeaways.

Open Source vs. Closed Models

When deciding between open-source LLMs like Llama 2 and commercial options like OpenAI's GPT-4, LLMOps teams must consider various factors. For example, while GPT-4 boasts impressive general performance, domain-specific tasks may benefit from finely-tuned open-source models. Data privacy and operational control are paramount, with open-source models offering self-hosting benefits, greater transparency, and reduced vendor lock-in. However, self-hosting requires in-house expertise, longer development cycles, and high initial setup costs. Teams should also factor in the different cost structures between the two approaches: commercial models often charge per token, while open-source models tie costs to hosting compute power. Scalability and efficiency, such as the ability to scale to zero during inactivity but instantly start upon request, are also crucial for optimal LLM performance and cost-effectiveness.

Leveraging existing commercial LLMs from providers like OpenAI can offer a straightforward start to building a generative AI application. But as product interest grows, open-source LLMs become more appealing due to benefits like data privacy, potential cost savings, and regulatory considerations. For specific applications like chatbots that rely on existing documentation, simpler models combined with information retrieval can suffice, making high-end models like GPT-4 unnecessary. Many successful open-source users first experiment with commercial LLMs for their advantages and to prove business value.

Before implementing an open-source LLM in a commercial context, it's essential to review the model's license, particularly concerning commercial usage, and consult with a legal team. While major providers like Microsoft and Google might offer indemnification against legal liabilities, users relying on open-source LLMs may face potential risks, such as copyright violations linked to training data. This becomes even more intricate with platforms like Hugging Face, which host multiple fine-tuned versions of models; these, often developed by individuals or small companies, can present a spectrum of uncharted legal risks.

Developing applications that can seamlessly switch between different LLMs may offer the best flexibility, with OpenAI's APIs often serving as a starting point before transitioning to open-source LLMs. But the real challenge lies in adapting prompts tailored for specific models, especially when involving fine-tuned versions. Licensing considerations must also be assessed, with BSD, Apache, and GPL emerging as commercially friendly options, as well as potential copyright claims on the training data. Ensuring adherence to licenses, especially those explicitly allowing commercial use, is pivotal in mitigating potential legal risks.

LLM Risk Management

Data privacy remains a top priority in the deployment of generative AI applications, with a special emphasis on understanding the datasets used for training. LLMs, while advanced and potent, are also unpredictable, making AI observability crucial. Issues like model hallucinations demand close scrutiny of their behavior and user engagement. Unlike traditional predictive models where some monitoring delay was acceptable, the potential risks associated with LLMs — such as producing toxic content or leaking personal data — necessitate real-time model monitoring to immediately detect and address problematic outputs before they impact the business.

Some LLMOps teams are leveraging traditional machine learning models, such as fine-tuned BERT, to classify LLM outputs, detect toxicity and analyze user queries. Validating the accuracy of images generated by LLMs presents unique challenges, especially given the lack of a singular "correct" image in many scenarios. One method involves using another model to describe the generated image and comparing the outputs. However, to truly verify if an image meets its descriptive prompt, employing object detection or identification models can be effective. While these automated approaches offer some insight, human evaluation is still most effective at capturing subtle details and determining the "best" image based on a prompt.

Validating prompts and responses against benchmarks or ground truth datasets is vital to ensure accurate outputs. LLM robustness must be measured against prompt variations and security threats, such as prompt injection attacks. To adhere to responsible AI practices, teams must conduct stress tests to identify potential model bias and PII leakages. Fine-tuning can help address inconsistencies in outputs, especially for more complex instructions in larger models, alongside continuous monitoring and validation to detect drops in model performance. While there are general LLM benchmarks, such as the multi-turn Benchmark, LMSYS leaderboard, and those provided by Hugging Face, they might not cater to the unique business needs of a specific application or domain.

Leveraging LLMs to Build Applications

Developers can leverage LLMs in various ways, from prompt engineering to fine-tuning and training proprietary LLMs. While fine-tuning is effective for domain-specific use cases, it requires in-depth expertise and can involve a complex setup. On the other hand, retrieval-augmented generation (RAG) offers transparency and control, particularly in data governance and lineage. Both RAG and fine-tuning can coexist in an LLM application, serving distinct purposes, and the frequency of fine-tuning depends on the goals set for the model, changes in source data, and the desired output format. Think of fine-tuning as a doctor's specialization and RAG as the patient's medical records.

Teams should prioritize model governance and model risk management when considering their LLM deployment options, balancing performance metrics with concerns over toxicity, PII leakage, and model robustness. The application's domain and its audience (customer-facing vs. internal) will determine how to approach these considerations.

The GPU Bottleneck

Generative AI models rely on GPU resources for inference, but securing consistent GPU availability is challenging. Given the high costs associated with GPUs, it's essential to select an appropriate cloud vendor that fits your needs, ensure efficient auto-scaling for traffic variations, and consider cost-saving strategies such as using spot instances or purchasing reserved instances. When optimizing AI workloads, it's vital to clearly define performance goals, focusing on metrics like tokens per second, initial token generation latency, and end-to-end latency for structured responses.

Operability of LLMs such as Llama-2 7B is highly dependent on GPU memory, and while it's possible to run the model on a minimal T4 GPU, using quantization can optimize this fit. However, for optimal performance in terms of latency and throughput, especially in high-demand scenarios, a more generous GPU memory allocation is necessary to accommodate inferences and required caching.

Many use cases might be better addressed with domain-specific traditional ML models instead of LLMs; smaller models, despite their limited parameter space, can be just as effective for certain tasks.

Request a demo to see how Fiddler can help your team deploy LLM applications into production.