This post walks through a Retrieval-Augmented Generation (RAG) app built with Gradio, MLflow Tracing, and Databricks Mosaic AI: a quick chat interface for testing and rapid customer feedback, with support for streaming and citations from a vector database. It extends the app built in the Databricks demo “Build High-Quality RAG Apps with Mosaic AI Agent Framework and Agent Evaluation, Model Serving, and Vector Search.”
Let’s cover how I’ve added support for streaming and citations, in addition to hosting on Databricks Apps.

The Gradio app code is available on my GitHub: https://github.com/eumarassis/rag_app_gradio_dbx
The Gradio App
Gradio is a convenient framework for demoing machine learning models behind a friendly web interface. It is not a replacement for a fully fledged chat experience built with robust UX frameworks like React or native app development, but it is a great tool for early demonstrations that prove out the model’s functionality rather than chasing a pixel-perfect interface from the start.
Customizing Gradio can be tricky, but basic CSS and HTML5 skills are all you need. In this demo, I wanted a modal popup that renders citations based on content retrieved from the vector database, Databricks Vector Search. All it took was a mix of Python code to render HTML content plus some CSS. Example below:
# Parse the MLflow trace out of the accumulated streaming response and build citation links
json_databricks_response = extract_databricks_output(accumulated_response)
references_text, urls = generate_reference_section(json_databricks_response)
sup_links = ''.join([f'<sup><a href="{url}" target="_blank">{i + 1}</a></sup>' for i, url in enumerate(urls)])  # One superscript link per retrieved source
references_section_html = f"""
<div style="display:none" id="references-section-div-hidden">{references_text}</div>
<p id="references-container">
<span id="references-section-title" onclick="openModal();" title="View Reference Sources">VIEW SOURCES</span>
<span id="sup-links">{sup_links}</span>
</p>
"""
The code for the extract_databricks_output and generate_reference_section functions used above lives in dbx_endpoint_util.py; they essentially parse the MLflow Tracing response (their implementation is shown further down).
Adding streaming support was also straightforward, provided the model endpoint supports it. Databricks Model Serving and AI Gateway fully support MLflow Tracing and streaming; you just need to set 'stream': True and 'databricks_options': {'return_trace': True} in the request payload (ds_dict):
def call_model_stream(messages: List[Message], endpoint_name: str, token: str) -> Generator[str, None, None]:
    headers = {'Authorization': f'Bearer {token}', 'Content-Type': 'application/json'}  # Set auth and content-type headers
    ds_dict = {'stream': True, 'messages': messages, 'databricks_options': {'return_trace': True}}  # Payload: enable streaming and ask for the MLflow trace
    data_json = json.dumps(ds_dict, allow_nan=True)  # Serialize payload to JSON
    response = requests.post(url=endpoint_name, headers=headers, data=data_json, stream=True)  # Send streaming POST request
    for line in response.iter_lines(decode_unicode=True):  # Read the server-sent event stream line by line
        if line:
            yield line  # Yield each raw chunk; the caller parses it with parse_chunk
The Gradio backend implementation to support streaming is a plain Python generator function that yields history, message on every update, as shown below:
def respond(message: str, history: List[Tuple[str, str]]) -> Generator[Tuple[List[Tuple[str, str]], str], None, None]:
    if len(message.strip()) == 0:
        yield history, "ERROR: The question should not be empty"  # Return an error if the message is empty
        return
    try:
        messages = []  # Initialize messages list
        if history:
            for human, assistant in history:
                messages.append({"content": human, "role": "user"})  # Add prior user turns
                messages.append({"content": assistant, "role": "assistant"})  # Add prior assistant turns
        messages.append({"content": message, "role": "user"})  # Add the latest user message
        history.append((message, ""))  # Append the new message to history
        response_stream = call_model_stream(messages=messages, endpoint_name=endpoint_name, token=token)
        accumulated_response = ""
        accumulated_content = ""
        for chunk in response_stream:
            accumulated_response += chunk  # Accumulate the raw response (used later to extract the trace)
            content = parse_chunk(chunk)  # Parse the chunk for displayable content
            if content:
                accumulated_content += content  # Append content to the accumulated answer
                history[-1] = (message, accumulated_content)  # Update the last chat turn
                yield history, ""  # Yield updated chat history and clear the input box
    except Exception as e:
        yield history, f"ERROR: {e}"  # Surface any failure in the message box
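For completeness, here is a minimal sketch of how a generator like respond can be wired into a Gradio Blocks layout so each yield streams into the chat window; the component names and layout are illustrative and may differ from the repo.

import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(label="RAG Chat")         # Renders the (user, assistant) history tuples
    question = gr.Textbox(label="Ask a question")  # User input; also used to surface error messages

    # Every value yielded by respond() updates the chatbot and the textbox, producing the streaming effect
    question.submit(respond, inputs=[question, chatbot], outputs=[chatbot, question])

demo.launch()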
MLflow Tracing
MLflow Tracing is a feature that enhances observability in LLM-powered applications by capturing detailed information about the execution of your application’s services. Tracing records the inputs, outputs, and metadata associated with each intermediate step of a request, so you can easily pinpoint the source of bugs and unexpected behavior. MLflow Tracing also provides fully automated integrations with various GenAI libraries such as LangChain, OpenAI, LlamaIndex, DSPy, and AutoGen.
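For example, if the agent behind the endpoint is built with LangChain, as in the Databricks demo, one call enables automatic tracing. A minimal sketch, assuming the chain is logged with the LangChain flavor:

import mlflow

# Enable automatic MLflow Tracing for LangChain: each chain, retriever, and LLM call
# is captured as a span on the resulting trace.
mlflow.langchain.autolog()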
MLflow Tracing works with both Databricks and open source MLflow. As mentioned above, to return the trace in the response from a model served on Databricks, all we need to do is pass 'databricks_options': {'return_trace': True} in the request payload.
The code below shows how to parse this response. Note that this response may change at any time as it is not documented; I’ll update this code once I implement a “more robust” approach.
def extract_databricks_output(string: str) -> Optional[dict]:
    pattern = r'"databricks_output":\s*({.*?})\s*data:\s*\[DONE'  # Regex pattern to extract the trace payload
    match = re.search(pattern, string, re.DOTALL)  # Search for the pattern in the accumulated response
    if match:
        databricks_output_str = match.group(1)[:-1]  # Get matched output
        databricks_output_json = json.loads(databricks_output_str)  # Convert to JSON
        return databricks_output_json
    return None  # No trace found in the response
def generate_reference_section(response: dict) -> Tuple[str, List[str]]:
    references = ""
    urls = []
    for item in response['trace']['data']['spans']:
        if item['name'] == 'VectorStoreRetriever':
            string_references = json.loads(item['attributes']['mlflow.spanOutputs'])  # Extract reference data
            count = 1
            for ref in string_references:
                url = ref['metadata']['url'][1:] if ref['metadata']['url'].startswith('/') else ref['metadata']['url']
                urls.append(url)
                references += f"""**{count}. {url}** <br />
{ref['page_content']} <br />
"""
                count += 1
            return references, urls
    return "", []
Databricks Mosaic AI and Hosting on Lakehouse Apps
How to build a RAG model from scratch on Databricks is covered in detail in the demo mentioned above: Mosaic AI Agent Framework and Agent Evaluation, Model Serving, and Vector Search. That end-to-end tutorial walks through building and deploying a real-time Q&A chatbot using Databricks Mosaic AI tools, including serverless capabilities and the DBRX Instruct foundation model for smart, context-aware responses without fine-tuning your own LLM.
Below are the steps needed to host the Gradio app above on Databricks Lakehouse Apps; the Databricks Apps getting-started tutorial is a good companion reference. Make sure the latest version of the Databricks CLI is installed. Summary of the steps:
1. Clone the repo: git clone https://github.com/eumarassis/rag_app_gradio_dbx.git
2. Edit the app.yaml file to add your Databricks Model Serving endpoint and access token (a sample layout is sketched after this list)
3. Create the Databricks App: databricks apps create yourapp
4. Sync local repo with Databricks app: databricks sync --watch . /Workspace/Users/user@databricks.com/[yourapp]
5. Deploy: databricks apps deploy yourapp --source-code-path <app-path>
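For reference, the app.yaml for a Gradio app typically looks like the sketch below. The environment variable names are placeholders; check the repo’s app.yaml for the exact keys the app reads:

command: ["python", "app.py"]
env:
  - name: "SERVING_ENDPOINT_URL"   # placeholder name: the invocations URL of your Model Serving endpoint
    value: "https://<workspace-host>/serving-endpoints/<endpoint-name>/invocations"
  - name: "DATABRICKS_TOKEN"       # placeholder name: an access token with permission to query the endpoint
    value: "<access-token>"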
In conclusion …
This demo can serve as a template Gradio app for testing any model deployed on Databricks, or any OpenAI-compatible model endpoint. You are free to change the theme and customize the colors; the crux is the streaming and citation code.
But wait, why not just use the Databricks Agent Evaluation App? You certainly can: it lets any user in your identity provider, such as Azure AD or Okta, test your models, since testers do not need to be Databricks users to access it. The Gradio app, however, can reach users outside your identity provider and supports additional customization; as of now, the Evaluation App does not support customization.
Also, don’t forget that Databricks AI/BI Genie offers a great no-code conversational experience that lets business teams engage with their data through natural language without building a model from scratch, and it can be embedded in your app. So consider starting with Genie and, if needed, evolving into a custom app built with Gradio or another UX framework.