Build Your Own Chatbot and Image Caption Generator with Google Gemini API & Streamlit

Asad iqbal
8 min read · Mar 16, 2024

A Comprehensive Guide to Build Chatbot and Image Caption Generator with Gemini API

Image By Author

Ever felt like tapping into the mind-blowing power of AI for your own projects? Look no further than Google Gemini AI! This isn’t your average AI assistant — we’re talking cutting-edge capabilities like text generation and image captioning, all wrapped up in different models like Gemini Pro and Gemini Pro Vision.

Intrigued? Me too! So, I decided to take things a step further and build a full-fledged web app using Streamlit to tap into Gemini’s magic. ✨ The best part? This blog will guide you through the entire process, from absolute beginner to web app extraordinaire!

Let’s build something amazing together! ️✨

Video Tutorial:

Create Virtual environment:

I prefer using Conda to set up a virtual environment, but feel free to use any tool that suits you. To create the environment, simply open Anaconda Prompt and enter the following command:

conda create --name myenv
Image by Author

To activate your Conda environment, simply type:

conda activate myenv
Image by Author
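Note: if you want to pin a specific Python version inside the new environment (which also ensures pip is available in it), you can create it like this instead; 3.10 below is just an example version, not a requirement of this tutorial.

conda create --name myenv python=3.10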

Install libraries:

Now that the environment is set up, let’s create a requirements.txt file and install the necessary libraries. 🛠️✨

google-generativeai==0.3.2
streamlit==1.30.0
python-dotenv==1.0.1
streamlit-option-menu==0.3.12
pillow==10.2.0

Now, open the terminal and run this command:

pip install -r requirements.txt

Get Google Gemini API:

Now that we’ve got our libraries installed, let’s grab a Google Gemini API key! 🚀 Here’s how you can get it:

  1. Head to Google AI Studio: Navigate to https://cloud.google.com/generative-ai-studio and sign in with your Google account.
  2. Create a New Project (Optional): If you don’t have an existing project, click “Create new”.
  3. Get Your API Key: Look for the “Get API key” button in Google AI Studio. Clicking this will create a new Google Cloud project (if you haven’t already) and generate a unique API key for you.
  4. Copy and Paste the Key: Make sure to copy this API key and store it securely. You’ll need it later when integrating the Google Gemini API into your Streamlit web application.

After copying the API key, create a .env file in your project and paste the key into it. ✂️📝
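For reference, the .env file only needs a single line. The variable name must match the name the app reads later with os.getenv("api_key"); the value shown here is just a placeholder for your own key.

api_key=YOUR_GOOGLE_GEMINI_API_KEY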

Image by Author

Code For The APP:

Let’s start coding our Streamlit app and integrate it with Google Gemini AI! 🚀 Start by creating an app.py file in your project.

import os
import streamlit as st
from dotenv import load_dotenv
from streamlit_option_menu import option_menu
from PIL import Image
import google.generativeai as genai
# Load environment variables
load_dotenv()
GOOGLE_API_KEY = os.getenv("api_key")

# Set up Google Gemini-Pro AI model
genai.configure(api_key=GOOGLE_API_KEY)

1. Retrieving the API Key:

  • GOOGLE_API_KEY = os.getenv("api_key")

This retrieves the value of an environment variable named "api_key" and assigns it to the variable GOOGLE_API_KEY. Here's what each part does:

  • os.getenv("api_key"): This uses the os module's getenv function to access the environment variable named "api_key".
  • GOOGLE_API_KEY = ...: This assigns the retrieved value (your API key) to the variable GOOGLE_API_KEY.

2. Configuring the Google Gemini-Pro AI Model:

  • genai.configure(api_key=GOOGLE_API_KEY): This passes your API key to the google-generativeai library so that every model call that follows is authenticated.
# Load the Gemini Pro model (text-only tasks)
def gemini_pro():
    model = genai.GenerativeModel('gemini-pro')
    return model

# Load the Gemini Pro Vision model (text + image tasks)
def gemini_vision():
    model = genai.GenerativeModel('gemini-pro-vision')
    return model

# Get a response from the Gemini Pro Vision model for a prompt + image pair
def gemini_vision_response(model, prompt, image):
    response = model.generate_content([prompt, image])
    return response.text

1. Loading Gemini Models:

  • gemini_pro() function:
  • Loads the Gemini Pro model using genai.GenerativeModel('gemini-pro') specifically for text-based tasks.
  • Returns the loaded model for further use.
  • gemini_vision() function:
  • Loads the Gemini Pro Vision model using genai.GenerativeModel('gemini-pro-vision'), designed for tasks involving images.
  • Returns the loaded model for later utilization.

2. Generating Responses:

  • gemini_vision_response(model, prompt, image) function (a quick standalone usage sketch follows this list):
  • Designed to obtain a response from the Gemini Pro Vision model.
  • Takes three arguments:
  • model: The Gemini model to use (presumably loaded using gemini_vision()).
  • prompt: A text prompt to guide the model's response.
  • image: An image to provide as input for the model.
  • Uses model.generate_content() to generate a response based on both the prompt and image.
  • Extracts and returns the text-based response from the model’s output.
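To see these two helpers working outside of Streamlit, here is a minimal sketch. It assumes genai.configure has already been called with a valid key, and example.jpg is a hypothetical local image file used purely for illustration.

from PIL import Image

model = gemini_vision()
img = Image.open("example.jpg")  # hypothetical local image file
caption = gemini_vision_response(model, "Describe this image in one sentence.", img)
print(caption)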
# Set page title and icon
st.set_page_config(
    page_title="Chat With Gemi",
    page_icon="🧠",
    layout="centered",
    initial_sidebar_state="expanded"
)

with st.sidebar:
    user_picked = option_menu(
        "Google Gemini AI",
        ["ChatBot", "Image Captioning"],
        menu_icon="robot",
        icons=["chat-dots-fill", "image-fill"],
        default_index=0
    )

Configuring the Web App Appearance:

  • st.set_page_config

This function sets various visual aspects of your web app:

  • page_title: Defines the title displayed in the browser tab (set to "Chat With Gemi" here).
  • page_icon: Sets the small icon shown in the browser tab (the 🧠 brain emoji here).
  • layout: Determines how content is positioned within the app window ("centered" in this case).
  • initial_sidebar_state: Controls the initial visibility of the sidebar ("expanded" means it's open by default).

Creating a Sidebar Menu:

  • The code inside the with st.sidebar: block defines a sidebar element in your web app.
  • option_menu function creates a menu with selectable options:
  • "Google Gemini AI": Sets the overall title for the menu.
  • ["ChatBot", "Image Captioning"]: Defines the list of options users can choose from.
  • menu_icon: Sets the icon displayed next to the menu title (the Bootstrap "robot" icon here).
  • icons: Defines a list of icons corresponding to each menu option (the Bootstrap "chat-dots-fill" and "image-fill" icons here).
  • default_index: Sets the initially selected option (0 means "ChatBot" is chosen by default).
def roleForStreamlit(user_role):
    if user_role == 'model':
        return 'assistant'
    else:
        return user_role

The roleForStreamlit function maps the role names stored in the Gemini chat history onto the role names Streamlit's chat interface expects. Here's a breakdown of its purpose (with a quick illustration after the list):

Function: roleForStreamlit(user_role)

  • Input: It takes a single argument named user_role, which represents the user's original role in the application.
  • Logic: It checks if the user_role is equal to "model" (notice the quotes, indicating a string comparison).
  • If it is "model" (the role Gemini assigns to its own replies), the function returns the string "assistant", which is the role name st.chat_message understands.
  • If the user_role is anything other than "model", the function simply returns the original user_role without any modification.
  • Output: The function returns a string representing the user’s role within the Streamlit web application.
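A quick illustration of the mapping:

roleForStreamlit('model')  # returns 'assistant'
roleForStreamlit('user')   # returns 'user' unchanged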
if user_picked == 'ChatBot':
    model = gemini_pro()

    if "chat_history" not in st.session_state:
        st.session_state['chat_history'] = model.start_chat(history=[])

    st.title("🤖TalkBot")

    # Display the chat history
    for message in st.session_state.chat_history.history:
        with st.chat_message(roleForStreamlit(message.role)):
            st.markdown(message.parts[0].text)

    # Get user input
    user_input = st.chat_input("Message TalkBot:")
    if user_input:
        st.chat_message("user").markdown(user_input)
        response = st.session_state.chat_history.send_message(user_input)
        with st.chat_message("assistant"):
            st.markdown(response.text)

This code builds and runs the chatbot interface within the Streamlit web application. Here is a step-by-step breakdown:

  1. Checking User Choice: It first checks if the user has chosen “ChatBot” from a previous interaction (possibly through a selection menu).
  2. Initializing the Model: If the user selected “ChatBot,” the code calls the gemini_pro() function to load the model used to interact with the Google Gemini Pro API (a standalone usage sketch follows this list).
  3. Initializing Chat History: It checks if a key named “chat_history” exists in the Streamlit session state. This session state persists data across app reruns. If it doesn’t exist, it initializes the chat history using the start_chat method of the model with an empty history.
  4. Setting the Title: The code then sets the title of the app to “TalkBot”, indicating the chatbot functionality.
  5. Displaying Chat History: It iterates through the existing chat history stored in the session state. For each message, it uses the st.chat_message function with a dynamic role based on the message sender (likely "user" or "assistant") and displays the message text using markdown formatting.
  6. Getting User Input: The code provides a chat input field where the user can type their message to the chatbot.
  7. Processing User Input: If the user enters a message, it displays it in the chat history as a user message. Then, it uses the send_message method of the current chat history object (from Gemini Pro API) to send the user input to the model.
  8. Displaying Bot Response: Finally, the response received from the model is displayed in the chat history as an assistant message using Markdown formatting.
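To make the chat flow easier to follow, here is a minimal sketch of the same start_chat / send_message pattern outside of Streamlit, again assuming genai.configure has already been called with a valid key.

chat = gemini_pro().start_chat(history=[])
reply = chat.send_message("Hello, who are you?")
print(reply.text)          # the model's first answer
print(len(chat.history))   # history now contains the user turn and the model turn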
if user_picked == 'Image Captioning':
    model = gemini_vision()

    st.title("🖼️Image Captioning")

    image = st.file_uploader("Upload an image", type=["jpg", "png", "jpeg"])

    user_prompt = st.text_input("Enter the prompt for image captioning:")

    if st.button("Generate Caption"):
        load_image = Image.open(image)

        colLeft, colRight = st.columns(2)

        with colLeft:
            st.image(load_image.resize((800, 500)))

        caption_response = gemini_vision_response(model, user_prompt, load_image)

        with colRight:
            st.info(caption_response)

This code handles the “Image Captioning” functionality within the Streamlit web application. Let’s break it down step by step:

  1. Conditional Check: It first checks if the user has selected “Image Captioning” from a sidebar.
  2. Model Initialization: If “Image Captioning” is chosen, the gemini_vision() function is called. It loads the Google Gemini model designed for vision tasks such as image captioning.
  3. UI Elements:
  • The code creates a title (st.title) displaying "🖼️Image Captioning" to inform the user of the current functionality.
  • It uses st.file_uploader to allow the user to upload an image file. Supported formats are restricted to JPG, PNG, and JPEG.
  • An additional text input field (st.text_input) is created with the label "Enter the prompt for image captioning:". This allows the user to optionally provide a prompt or additional context for the image captioning process.

4. Button and Processing:

  • A button labeled “Generate Caption” is created using st.button. Clicking this button triggers the following actions:
  • The uploaded image is opened using Pillow’s Image.open function and stored in the load_image variable.
  • The code splits the display area into two columns using st.columns.
  • The left column (colLeft) displays the uploaded image resized to a width of 800 pixels and a height of 500 pixels using the resize method. This is achieved with st.image.
  • The right column (colRight) utilizes st.info to display the caption generated by the Gemini vision model. This is done by the helper function gemini_vision_response, which takes the model, the user prompt (if provided), and the loaded image as inputs, sends them through the model, and returns the generated caption.

Now let's run our app.py file. Open the terminal and type this command:

streamlit run app.py

Text generated by the Google Gemini Pro model:

Image by Author

Caption generated by the Google Gemini Pro Vision model:

Image by Author

Conclusion:

In this tutorial, we explored how to build a Streamlit web app that leverages the power of Google Gemini AI for a chatbot and image captioning, and now it’s your turn to explore further. Experiment with different prompts, try uploading various images, and witness the fascinating capabilities of Google Gemini AI in action. You can even extend this codebase to incorporate other functionality offered by Google Gemini AI, like text embedding!
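As a pointer in that direction, here is a minimal sketch of text embedding with the same google-generativeai library. The model name "models/embedding-001" is the text embedding model available at the time of writing; check the official docs for the current name.

import google.generativeai as genai

result = genai.embed_content(
    model="models/embedding-001",  # Google's text embedding model at the time of writing
    content="Streamlit makes building AI demos easy.",
)
print(len(result["embedding"]))  # length of the returned embedding vector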

Thanks for reading! If you liked my content and want to support me, the best way is to support me on Patreon, and you can also:

  • Subscribe to my YouTube channel
  • Connect with me on LinkedIn and GitHub, where I keep sharing free content to help you become more productive and effective at what you do using technology and AI.
  • Need help with ML & DL? Check out my Fiverr services!
