Enterprise generative AI needs to be reproducible, scalable, and responsible, while minimizing risks. Ali Arsanjani, PhD, Head of the AI Center of Excellence at Google, explains why this requires an augmentation of the ML lifecycle.
Watch this session on Enterprise Generative AI - Promises vs Compromises to learn:
The importance of explainability and adaptability for enterprise generative AI
How to minimize risks to safety, misuse, and model robustness
Key elements of the generative AI lifecycle
Mary Reagan (00:07):
So our next talk is by Ali Arsanjani, who's the Director of Cloud Partner Engineering at Google Cloud, and he's also the Head of the AI and ML Center of Excellence. He specializes both in AI and in data analytics. He tends to focus on building strategic partnerships and co-innovation with large partners. Previously, he's held key roles at AWS, where he was the Chief Principal Architect, and at IBM, where he was the CTO of Analytics and Machine Learning. Dr. Ali Arsanjani is also an adjunct professor of computer science, and he teaches machine learning at San Jose State University. Prior to that, he taught at UC San Diego. And Ali, hello, welcome. With that, I'm gonna turn it over to you.
Ali Arsanjani (01:00):
Mary Reagan (01:02):
Ali Arsanjani (01:03):
Great to be here. And enjoyed the panel just now and some of the rest of the talks. Very beautiful. I'm going to share my screen here so that we can kinda get started. Very good. And alright, so hopefully you can see the presentation here.
Mary Reagan (01:36):
Yes, I can.
Ali Arsanjani (01:37):
Very good. Excellent. So we're gonna be talking today about responsible generative AI in the context of the adoption in the enterprise and the promises and challenges associated with that. We're gonna look at the evolution of generative AI and then look at the LLM and foundation models background, some research done at Google, and how we can proceed forward in this domain. So in terms of the introduction for generative AI we'd like to make a distinction between predictive and generative AI. So that in predictive AI, we typically make classifications, clustering, regression, forecasting, types of prognostication, statistically speaking, or using deep learning models. Typically, we train on historical data and then we apply that data to new situations. And the training data is often very unique and specific to the task, to the model that it's gonna be used for.
And we generally require supervised learning labels in terms of input-output pairs, and basically pre-training is not commonly used in predictive AI. In contrast, in generative AI, we're generating content. We're generating content based on patterns that we've learned from the vast array of training data. And these can be multimodal. And the training data is typically non-specific, non-unique data. It's very general: Wikipedia, Common Crawl, you know, types of situations where you do want to generate out-of-distribution information, probabilistically predicting the next word, the next token, and this is generally unlabeled data. The way we employ these in terms of generative AI is through an engineered prompt. Whereas in predictive AI, we're using a function approximator. So we give it an input and an output and train on it, and the machine learning model needs to understand what the function would have been, could have been.
And in that sense it is a function approximator, so it predicts the label. Whereas in generative AI, we are creating new data. And with that I'm kind of going back in history, back to 2017, when Google Brain released the "Attention Is All You Need" paper, which kind of was the pioneer in this area of the encoder-decoder architecture with multi-headed attention that we all know and enjoy today in all these various types of models that have evolved from it, with the first use case being language translation. And then of course at Google, you know, there's been a continuous thread of building different types of models, culminating in models for conversational AI. And, you know, the backends of many of the applications and products, and Google Search, that we all use and love are backed by these large language models, and our researchers continue to develop these models over time.
And as you can see, there's a very long, you know, list of different types of models, like the T5 model, which was open-sourced back in 2019 and probably forms the cornerstone of many of these other models that have been developed since. And then LaMDA was a model that we built for dialogue-based applications very specifically. And then we, you know, we've moved to various versions of PaLM, the Pathways Language Model, which is a dense, decoder-only model of about 540 billion parameters. And moving into 2023 with Bard, powered by LaMDA. In 2021, researchers at Stanford, you know, many, many researchers here, wrote a 200-page document, which basically summarized the opportunities and risks of foundation models. And I think this is something that we should all take a look at and study.
They're looking at examples of the key papers that were out there, kind of extrapolating from that, deriving conclusions from that. And as we know, foundation models themselves are of two types. Some are more geared towards representational models, where the data and the patterns underlying it are derived, but they are not so good in terms of generating new content. Whereas in the generational view of foundation models, we don't really understand the structure of the underlying data as much, but we're more prone to generating content from that foundation model. So in training these foundation models, often a balance needs to be struck between these two, if you will, polar opposites of building foundation models. And then one of the key aspects is that when you build a foundation model, you don't necessarily want to put it out in the wild. You can, you know, put an interface, a very nice, very user-friendly, consumer-based interface in front of it, and it will generate, it will hallucinate, it will provide, you know, things that are not necessarily accurate but probabilistically viable.
So it's necessary to provide an adaptation of these foundation models before the actual use, before the actual downstream task is performed. A lot of companies nowadays, including Google, are putting these adaptations in to safeguard against toxicity and safety issues or even hallucination issues that come out of these models naturally. And obviously the iconic examples of these models are, you know, diffusion models, which generate images from text. You know, some of them show conceptual understanding, which basically means distinguishing between cause and effect and understanding conceptual combinations in an appropriate context. Again, context is king. And then of course, there's chain-of-thought prompting, which allows us to unlock capabilities of large language models to be able to tackle complex problems by breaking them down and providing a rationale for why a result was obtained.
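As a minimal sketch of the chain-of-thought prompting mentioned above: the prompt shows the model a worked example whose answer includes its reasoning, then leaves a new question open. The function name `build_cot_prompt`, the "Q:/A:" layout, and the example questions are assumptions made for illustration; they are not from the talk or from any specific model's API.

```python
def build_cot_prompt(exemplars, question):
    """Assemble a few-shot chain-of-thought prompt as plain text.

    Each exemplar carries a rationale before its answer, so the model is
    nudged to write out its own intermediate steps for the new question.
    """
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['rationale']} The answer is {ex['answer']}."
        )
    # The final question is left open so the model continues with a rationale.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical exemplar, written only for this sketch.
exemplars = [
    {
        "question": "A team has 3 servers and adds 2 racks of 4 servers each. "
                    "How many servers are there now?",
        "rationale": "2 racks of 4 servers is 8 servers. 3 + 8 = 11.",
        "answer": "11",
    }
]

prompt = build_cot_prompt(
    exemplars,
    "A store has 5 boxes of 6 pens and sells 7 pens. How many pens are left?",
)
print(prompt)
```

The resulting string would be sent to whatever model is in use; because the rationale is part of the output, each step can be inspected separately when a final answer looks wrong.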
I think one of the key things in moving forward in research in terms of these large language model outputs is to basically have a path, a traceable path back to how that was generated. So in terms of explainability, if we can have a verification and initially have the model explain why it arrived at a certain conclusion, you can then break that down into more fundamental parts and figure out if one or two of the parts are faulty. So there's a greater ability to essentially monitor and look at the provenance and look at the background of these models. So speaking of these models, just talking about two of the main models here. One is the Pathways Language Model, which is able to perform thousands and thousands of downstream tasks: question answering tasks, code completion, you know, language understanding, summarization, et cetera.
And the key here is that the model itself is not necessarily fully connected. There is a pathway through the model that is essentially traversed, and so therefore you don't have to have the entirety of the model, you know, firing, if you will, at the same time. In contrast to that, LaMDA was a model that Google built back in 2020, the Language Model for Dialogue Applications. Now, in addition to generating output, it's able to produce classifier outputs such as, for example, the degree of safety, the degree of sensibility, the degree of specificity, or the relevance of the answer or the response that it's given. Interestingly enough, as we see, you know, these language models have an emergent nature, meaning that there is not a linear relationship between the size of the model and the number of things it can do.
In fact, the emergent behavior of these models indicates that as you build larger and larger models, there are quantum leaps that occur, so to speak. Meaning that there is potentially a different kind of a distribution, possibly an exponential distribution, between the size of the model and the emergent behavior. Emergent in the sense of complex adaptive systems, where the large number of parameters allows you to do many more things than a commensurate size model would do. However, these models have huge computational, you know, costs and requirements in terms of both training and inference, and they basically require very, very high-quality data for their training. So one of the research results that was put forward in a paper in March of 2022 by DeepMind was about having a fixed FLOP budget, a floating-point operations budget, where you want to make a trade-off on the model size: not build as large a model as possible, but instead train it more, with more tokens.
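The fixed-budget trade-off can be sketched with a common rule of thumb that is not stated in the talk: training compute for a dense transformer is roughly C ≈ 6 · N · D floating-point operations, where N is the parameter count and D is the number of training tokens. Under that assumption, holding C fixed means that shrinking the model by some factor frees up the same factor in tokens. The parameter and token counts below are illustrative values chosen for the example.

```python
def training_flops(params, tokens):
    """Approximate training compute with the C ~ 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def tokens_for_budget(flop_budget, params):
    """Number of tokens a model of the given size can see under a fixed budget."""
    return flop_budget / (6.0 * params)


# Illustrative starting point: a 280B-parameter model trained on 300B tokens.
budget = training_flops(280e9, 300e9)

for params in (280e9, 140e9, 70e9):
    tokens = tokens_for_budget(budget, params)
    print(f"{params / 1e9:.0f}B parameters -> {tokens / 1e9:.0f}B tokens")
```

Under this approximation, a model one quarter the size can be trained on four times as many tokens for the same compute, which is the shape of the trade-off described above.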
And so in this case, one model called Gopher was utilized as a basis, for example, and then another model, Chinchilla, was derived. And in comparison, you can see that it's a fourth of the size, yet trained with about four times the number of tokens. So this indicates that the quality of data has a material impact on what we do. And there are various, you know, DeepMind models that were put out in terms of papers and in terms of open source. One in particular is interesting to look at from the perspective of the conversational model: Sparrow, which is a model fine-tuned with human feedback. It's an open dialogue agent trained to be more helpful, more correct, and less harmful. And so the initial impetus for that was reinforcement learning from human feedback, with evidence-supported claims and fine-grained checks using rules, very deterministic rules. And the paper for this was "Improving alignment of dialogue agents via targeted human judgements."
The way Sparrow was trained was, it was given a number of rules, and these rules were essentially to avoid aggression, avoid threats, harassment, toxicity, insults, et cetera, things that were generally negative. And the model was ranked, it was asked to rank itself in this regard and to rank itself in terms of preference reward modeling. And then the results of these were fed into reinforcement learning from human feedback and looped back into the Sparrow model. And the objective, again, was to build safer dialogue agents. So the way it was trained was very iconic in that it's not just immediate human feedback, but some kind of scoring of the model itself based on deterministic rules, which is very important because as we adopt large language models in the enterprise, we'd like to be able to throttle them.
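To make the idea of combining a preference score with deterministic rule checks concrete, here is a toy sketch. It is not Sparrow's implementation: the rule names, the keyword lists, the penalty value, and the preference scores are all hypothetical, and simple phrase matching stands in for whatever rule checker a real system would use.

```python
# Hypothetical rules: each name maps to phrases that count as a violation.
RULES = {
    "no_threats": ("i will hurt", "you will regret"),
    "no_insults": ("idiot", "stupid"),
}


def violated_rules(response):
    """Return the names of the rules that the response breaks."""
    text = response.lower()
    return [
        name
        for name, phrases in RULES.items()
        if any(phrase in text for phrase in phrases)
    ]


def combined_reward(preference_score, response, penalty=1.0):
    """Preference score minus a fixed penalty for every violated rule."""
    return preference_score - penalty * len(violated_rules(response))


# (response, preference score) pairs; the scores are made up for the example.
candidates = [
    ("That is a stupid question.", 0.9),
    ("Here is a summary of the policy, with a source.", 0.7),
]

ranked = sorted(
    candidates,
    key=lambda pair: combined_reward(pair[1], pair[0]),
    reverse=True,
)
print(ranked[0][0])
```

In this sketch the rule penalty outweighs the higher preference score of the insulting response, so the rule-abiding response ranks first. A signal like this could be fed back as a reward during training, or used at serving time to constrain what is returned.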
We'd like to be able to constrain the way they operate if they're producing negative, harmful, or these types of unwanted outputs. So here in part two we're gonna explore a little bit more of the challenges. The opportunities are definitely there in terms of machine learning: the creation of data, the generation of content, text data. In enterprise applications, you know, it has been revolutionizing and supercharging, accelerating the pace at which we produce content in a very effective way, whether that's product designs, marketing content, or simulated virtual environments. As individual consumers, we've all been enamored by the personal experiences we've had in terms of the various types of images, text, or audio that have been produced. In education and research, the ability to generate new data sets, simulate new environments, and assist in the discovery of new connections between existing knowledge has accelerated research and discovery to some extent.
And so if we look at this, the generative lifecycle is something that we'd like to explore. And before we do that, you know, kind of looking at the state of the art of the models getting larger, again, I'd like to emphasize that the insight here is that the data and the data efficiency are really the key to building these conversational agents or, you know, the models behind the agents. So pre-training has on the order of 1 billion examples, 1 trillion tokens. The fine-tuning has on the order of 10,000 examples. So in contrast, obviously, much, much less. And then in prompting, you can use tens of examples, few-shot data, to awaken the superpowers within at the right time. So the pre-training, the fine-tuning, and prompting are the key insights where the data and data efficiency of training these models play a primary role.
And as we know, prompting really awakens the superpowers within that can give us chain-of-thought reasoning, prompting to elicit the very specific feedback that we're looking for. In this next part, I'd like to go a little bit into detail about the generative AI lifecycle. If you scan this, it'll give you a URL to a blog, so if you wanna know more about this, there's more data there. So in the traditional machine learning lifecycle, which we all know and love, we have these various loops of collecting data, training models, and deploying them and monitoring them, and of course the experimentation lifecycle. But in this era of large models, you know, transfer learning is very important. Pre-training has become a thing. And so as we utilize these mechanisms, it's important for us to understand that at the enterprise level, when we're trying to adopt large language models, the emphasis on the traditional lifecycle may be inadequate, essentially.
And if we wanna do iterative adaptation for model deployment, we need to consider factors such as AI safety, misuse mitigation, and the robustness of the models, and to basically update our model management techniques, whether it's data curation, model metadata, and/or the monitoring of this content for adaptation drift. Meaning not just the model drift or the data drift. Let's say you've mitigated a model, but then the model starts to drift. It's not necessarily that the inference has drifted, it's the mitigation of the toxicity, let's just say, that has drifted. So the adaptation drift is also another type of drift that we should be starting to monitor at the enterprise level. So it's not just skew of data and model, but of the adaptation that we've provided it. And this can be exemplified in the model cards and the data that is associated with them.
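One simple way to watch for the adaptation drift described here is to track how often a safety check flags the adapted model's outputs, and compare a recent window against a baseline window recorded right after the adaptation was validated. This is a minimal sketch under stated assumptions: the function names, the 0/1 flag representation, and the tolerance of 0.02 are hypothetical, and the flags would come from whatever toxicity or safety classifier is already in place.

```python
def flag_rate(flags):
    """Share of outputs that were flagged (flags are 0 or 1)."""
    return sum(flags) / len(flags) if flags else 0.0


def adaptation_drift(baseline_flags, current_flags, tolerance=0.02):
    """Compare flagged-output rates between two windows.

    Returns (drifted, delta), where delta is the increase in flag rate and
    drifted is True when that increase exceeds the tolerance.
    """
    delta = flag_rate(current_flags) - flag_rate(baseline_flags)
    return delta > tolerance, delta


# Made-up windows: 2% of outputs flagged at baseline, 7% in the recent window.
baseline = [0] * 98 + [1] * 2
current = [0] * 93 + [1] * 7

drifted, delta = adaptation_drift(baseline, current)
print(f"drifted={drifted}, increase in flag rate={delta:.2f}")
```

The point of keeping this separate from data-drift and model-drift checks is that the inputs and the task accuracy can look stable while the mitigation itself degrades.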
So if we look at the lifecycle very broadly we have the data pipeline, the data preparation we have the ability to essentially look at training and experimentation. And then ultimately this key thing that I put smack in the center, which is the adaptation of the model. A new element that is critically important to augment our traditional model, including prompt engineering. So these two are the paramount new ones here. Of course there are impacts to the other elements. Here we can see the repositories that are involved in where we're at. These could be our data lakes or lake houses, feature stores, model registries, known biases regarding responsible AI, potential model garden, model hub of foundation models. And down here, prompt example databases that can be used for prompt engineering.
If we start with the whole notion of the data pipeline, we can see that, you know, beyond the traditional mechanisms that we use for data collection and preparation, programmatic data labeling in fact becomes an important part of this. In terms of engineering the features, if there is feature engineering involved, selection of the actual LLM itself becomes an important part. So imagine you have a large model garden: which one are you gonna select? That selection process needs to become automated, or at least parameters for automation of that process need to be taken into consideration. Depending on the data you have, the specific domain data, let's say it's in FinTech, or let's say it's in insurance, what are the downstream tasks that you're vying for? And in fact, what are the safety, privacy, and bias kinds of constraints on the data, even before training, that need to be taken into consideration before something is actually curated and put in a feature store?
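The parameters for automating model selection could look something like the sketch below: each entry in a model garden is described by the domains and tasks it covers, its size, and whether it has passed a safety review, and selection filters on those fields. Everything here is hypothetical, including the `Candidate` fields, the model names, and the choice to prefer the smallest eligible model; a real model garden would expose its own metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    domains: frozenset
    tasks: frozenset
    params_billions: float
    passed_safety_review: bool


def select_model(candidates, domain, task, max_params_billions):
    """Pick the smallest reviewed model that covers the domain and task."""
    eligible = [
        c
        for c in candidates
        if domain in c.domains
        and task in c.tasks
        and c.params_billions <= max_params_billions
        and c.passed_safety_review
    ]
    # Returns None when no candidate satisfies every constraint.
    return min(eligible, key=lambda c: c.params_billions, default=None)


# Hypothetical model garden entries.
garden = [
    Candidate("general-large", frozenset({"fintech", "insurance"}),
              frozenset({"summarization", "qa"}), 500.0, True),
    Candidate("small-finance", frozenset({"fintech"}),
              frozenset({"summarization"}), 8.0, True),
    Candidate("unreviewed-finance", frozenset({"fintech"}),
              frozenset({"summarization"}), 3.0, False),
]

chosen = select_model(garden, "fintech", "summarization", max_params_billions=100.0)
print(chosen.name if chosen else "no eligible model")
```

Returning `None` rather than falling back to an unreviewed or oversized model keeps the safety and cost constraints as hard requirements instead of preferences.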
From there, yes, we do experiments. We fiddle with the experiments, pun intended. We evaluate the models, we change the parameters, retrain, et cetera, or we allow an AutoML capability to do so for us. If we're using a pre-trained model, we need to decide whether we should use that model as is or whether we should do something else. And in that case, you know, we're going beyond the data collection, creation, curation, and feature storage capabilities to the tasks of determining whether to pre-train the model or utilize a pre-trained model, whether to fine-tune it or just pluck it out of the registry and use it. And these elements depend on how you would like to train, whether it's data-parallel, model-parallel, or federated training that you conduct on your models. You essentially need to adapt these models. This is kind of the crux of the message here.
The adaptation can be with the fine-tuning, but it's not just the fine-tuning or prompt-tuning of these models. It's potentially the distillation of these models into smaller elements. And then, of course, there are the privacy, security, and generally AI safety and misuse considerations that we would have. So in terms of these transfer, if you will, model adaptation tasks, we're gonna focus on the AI safety element at this point. So looking at the issues of security, I'd like to say that, you know, LLMs could potentially look like an operating system, have that same role that an operating system has in traditional software, where they could potentially be, if they're governing a lot of interactions, a single point of failure due to their generality and ubiquity. The foundation model could potentially be that. There's the issue of data poisoning, where permissive data collection and labeling could allow injection of poison into the model, such as, for example, injecting hateful speech targeted at a specific company, a specific individual, a specific target.
And it could be exacerbated by the growing size of the content that these models produce. So imagine the large language model creating content again and again. If it's bad content, it'll just amplify that negative content. There is a potential for function creep and dual usage. This is an example from a research paper, where the model card, let's say, didn't explicitly mention facial recognition and other surveillance technologies that were deemed out of scope. But CLIP, for example, could be repurposed for such tasks, as research in 2021 has shown. So basically the initial task was just to predict image-text pairs, but in order to predict image-text pairs, the model actually had to capture and learn rich facial features. So even though the model card didn't explicitly describe the intent of the authors, the model itself was essentially biased towards learning those facial features because it would contribute to the better fulfillment of its task, which was image-text pair prediction.
So these types of things, such as the multimodal inconsistencies, could be critically important. For example, the famous example that I typically use is where you have an autonomous driving system that relies primarily on vision. It sees a billboard, and that billboard has the word "green" written on it. So does it, or should it, interpret that as a green light? What if it does, and it's not supposed to interpret that sign as a green light? There are other concerns around robustness to distribution shifts in the real world, where the test distribution differs from the training distribution. This may seem obvious to people, yeah, don't mix the training and test, but in fact, there's research that shows that this, essentially, poisoning of the training data with the test data and vice versa, the leakage of this data, has actually occurred inadvertently in many, many of the experiences that people have had in using ML in life sciences.
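A basic guard against the train/test leakage mentioned here is to fingerprint every training example and check whether any test example has the same fingerprint. The sketch below is one simple way to do that for text records, assuming that lowercasing and collapsing whitespace is an acceptable notion of "the same example"; the function names and sample records are made up. It only catches exact duplicates after normalization, not paraphrases or near-duplicates.

```python
import hashlib


def fingerprint(text):
    """Hash a text record after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_examples(train_texts, test_texts):
    """Return the test records whose fingerprint also appears in training."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [t for t in test_texts if fingerprint(t) in train_hashes]


# Made-up records for illustration.
train = ["Patient A responded to treatment", "Patient B relapsed"]
test = ["patient a  responded to treatment", "Patient C was stable"]

leaked = leaked_examples(train, test)
print(f"{len(leaked)} of {len(test)} test records also appear in training")
```

Running a check like this before evaluation helps keep reported test performance from being inflated by examples the model has already seen.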
And so these types of distribution shifts can actually cause large drops in performance, even in some of these state-of-the-art models. So foundation models trained on large and diverse unlabeled data sets that can be adapted to downstream tasks need to be checked for their robustness. The AI safety issues are not just about fine-tuning the model. There is a control problem, where you need to be able to look at an advanced AI system and allow there to be kind of a fail-safe, with humans in sufficient control of the deployed system, so that, let's say, an impact on financial markets doesn't occur, or an impact on factories doesn't occur, or in socio-technological legislation, measures, and institutions that have safety algorithms that are actually operationalized, where unsafe systems could potentially be deployed inadvertently.
Some of these risks have to do with the carrying out of goals, for example, engagement. If that's an objective of a social media platform, if that's the sole objective given as a target to the language model or the large model, then it will overemphasize and over-index on engagement versus other factors. So if engagement and profit on a social media platform is a paramount objective for that model, then a lot of other safety issues may come in secondary. And so that is a very important problem to solve. Beyond that, there are the issues pertaining to catastrophic robustness, for example, how models behave in unexpected ways when they face new kinds of data that they've never experienced before, which matters if they are controlling certain systems that have human life implications or human harm implications.
These could have catastrophic implications when you are actually seeing out-of-distribution data. Mis-specified goals are another risk: the risk of optimizing misaligned yet easy-to-specify goals, the short-term goals, the low-hanging fruit. I gave the example of engagement. The negative effect of some recommender systems may be polarization and media addiction if engagement is solely the objective. As specified by the iconic Goodhart's Law, optimizing misaligned yet easy-to-specify short-term goals, versus longer-term, broader humanistic goals, could become an issue. Then there is the mitigation of misuse. In terms of the quality of content, think of deep fakes. In terms of the cost of content creation, as the barrier is lowered, malicious actors can more easily carry out harmful attacks. And personalization reduces the obstacles to creating personalized content, where social media posts that were crafted, let's say hypothetically, to push a certain narrative that may be aligned with a political objective may be co-opted and utilized in another set of sociopolitical scenarios, which research has shown has actually been done.
And there is the idea that misinformation and disinformation propagation can be essentially amplified through the use of language generators that generate misinformed and disinformed content, whether that's fake profiles, abusive content, or news that is generated. In sufficient quantity, it could be highly contagious, if you will, on the political scene. So on the positive side, adaptation is not just about fine-tuning. Foundation models are very powerful, but they are also powerful potential detectors of harmful content. So we can potentially utilize them against the production of harmful content, and we can rethink human interventions by essentially allowing human interventions once the foundation models have basically attacked or defended against malicious practices, whether they're on social media or other kinds of plagiarized content or disinformed, misinformed kind of content. So the net of this is that foundation models as detectors within the enterprise are a key area that enterprises can start looking at as they adapt, essentially, these models for deployment. And with that, I think we have arrived at the conclusion time-wise here, and I'm more than happy to take questions.
Mary Reagan (30:46):
Yeah, thank you so much. I learn so much every time I hear a talk from you, I must say. I'm gonna start with this. We only have time maybe for one quick question. This is from Ram, who says: do you believe that the democratization of LLMs is on the horizon, especially in an enterprise setting? Generic LLMs like GPT-4 may be overkill, but an LLM that is built for a particular application/data is more useful?
Ali Arsanjani (31:12):
I would agree. In fact, think of an ensemble model where you have multiple smaller models, and each smaller model is cheaper, it's more focused, it's more controllable, and the objective functions can be refined to an acceptable domain. And in doing so in an enterprise, you can then pick and choose which foundation model you utilize or which fine-tuned or prompt-tuned model you use. And those smaller models may be training wheels that we as an industry may need to put on the enterprise LLMs that we adopt, so that we're not opening ourselves to risk on large language models immediately, other than for very creative things that are very constrained. So I would agree absolutely with that supposition.
Mary Reagan (32:09):
Wonderful. Well, thank you so much, Dr. Ali Arsanjani, for being with us. And I just wanna say there was an anonymous comment that said, "Fantastic talk." Thank you, Ali.