On-Demand Training

Docker and GenAI

 

Transcript

Hello, welcome to this tech talk session about the Docker GenAI Stack. On the agenda: I will do a quick introduction with some reminders, I will introduce the Docker GenAI Stack and its core components, and I will explain how you can build your own Docker GenAI Stack, with two demonstrations. You don't need to be a data scientist to follow this presentation; I'm giving it from a developer's perspective.

 

Table of Contents

 

Developers use AI every day (0:38)

So, as developers, we use AI every day. On my side, as an old developer, I don't really like AI code suggestions, like Copilot in VS Code, but I really like AI sidekicks like Copilot Chat, Gemini, ChatGPT, and so on. What if we could easily develop our own AI sidekicks and other AI applications? My perspective for this presentation, again, is a developer's perspective.

My day-to-day personal uses of AI are numerous and varied. I use Grammarly a lot (Grammarly uses AI) for writing English documentation. I use AI a lot for translation between French and English. I often ask ChatGPT or Gemini to generate source code samples about a specific topic; it's like documentation, for example. I use it for writing summaries, for explaining concepts, or even for gardening my ideas.

 

What is Generative AI (1:59)

Generative AI (or generative artificial intelligence) refers to a branch of AI focused on creating new content like text, images, music, or video. It's mainly based on large language models. You can find a lot of use cases: chatbots, content creation tools, code generation tools; you can even use it for medical research, and so on. Some well-known examples are Midjourney to generate images; of course GitHub Copilot, GitLab Duo, and Codeium (my favorite, again, is GitHub Copilot Chat); the Gamma application to generate fancy presentations; and so on.

 

Introduction to the Docker GenAI Stack (2:53)

Now, let's talk about the Docker GenAI Stack. At DockerCon 2023, Justin Cormack, the CTO of Docker, was on stage to announce the GenAI Stack project. The GenAI Stack project is a collaboration between Docker, Neo4j, LangChain, and Ollama to facilitate the development of GenAI applications. Now, let's talk about the advantages of using the Docker GenAI Stack. The Docker GenAI Stack is a Docker Compose project made to orchestrate the various components of a GenAI development project. So you get a quick start: with the GenAI Stack, Docker ensures developers can initiate their AI project quickly, eliminating the typical roadblocks associated with blending diverse technologies. You get a comprehensive set of components: the stack comes bundled with a range of pre-configured components, primed for immediate coding. And you foster innovation: you bootstrap your project faster, which means you have more time to innovate and experiment.

 

Core components of the Docker GenAI Stack (4:10)

About the core components of the Docker GenAI Stack: the Docker GenAI Stack is a comprehensive set of tools and resources designed to facilitate the development and deployment of GenAI applications. In other words, it's a Docker Compose project that helps you start a GenAI project with Ollama, Neo4j, and LangChain. If you don't know Docker Compose: Docker Compose is a tool for defining and running multi-container Docker applications, using a YAML file to configure all the application's services. Docker Compose comes with Docker Desktop. About Ollama: Ollama is an open source application that helps with the local deployment and execution of LLMs, like Llama 2, Llama 3, Mistral, Gemma, and so on. With Ollama, you get an API (with an OpenAI-compatible endpoint) that you will use to create your GenAI application. Neo4j is a graph and vector database, very flexible, and with it you can enable retrieval-augmented generation (RAG) for LLMs. I will speak about RAG in a moment.
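
To make the Compose part concrete, here is a minimal, hedged sketch of what such a Compose file can look like, with just the Ollama and Neo4j services (image names and settings are standard defaults, not the official project's actual file; the real GenAI Stack adds the sample applications on top of these two services):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"          # Ollama's HTTP API
    volumes:
      - ollama-data:/root/.ollama

  neo4j:
    image: neo4j:5
    ports:
      - "7474:7474"            # Neo4j browser UI
      - "7687:7687"            # Bolt protocol, used by the application
    environment:
      - NEO4J_AUTH=neo4j/password

volumes:
  ollama-data:
```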

 

Ollama can run LLMs (5:36)

But let's see first how Ollama works. Ollama is like ChatGPT on your workstation. Most of the time you have a prompt, a question, and you send your question to Ollama and the LLM, thanks to the LLM API. Ollama queries the LLM and returns the answer. So in this example, I ask "Who was Robin Hood?". Ollama queries the LLM with my question and answers something like "Robin Hood is a legendary figure from English folklore, celebrated as a heroic outlaw," and so on. But it's not magic: if you ask a second question, "Who is his best friend?", very often you will get an answer like "What?", because Ollama and the LLM did not make the link between the previous question and the new one. That means you need to provide a more detailed prompt. So you put the first question, "Who was Robin Hood?", in the prompt, you add the first answer as well, and then you add the new question, "Who is his best friend?". In this case, the LLM is able to make the link between the two questions, so it will answer something like "By the way, Robin Hood has two best friends, Little John and Friar Tuck." Of course, it's only an example.
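
Here is a minimal sketch of that idea in Java with LangChain4J's Ollama integration (the model name, URL, and the generate-style calls are assumptions tied to the pre-1.0 LangChain4J releases; newer releases renamed some of these methods). The point is simply that the second call resends the first question and its answer:

```java
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.ollama.OllamaChatModel;

import java.util.List;

public class RobinHoodChat {
    public static void main(String[] args) {
        // Assumption: Ollama listens on localhost:11434 and the model is already pulled.
        OllamaChatModel model = OllamaChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("llama2")
                .build();

        // First question: no history needed.
        UserMessage q1 = UserMessage.from("Who was Robin Hood?");
        AiMessage a1 = model.generate(q1).content();

        // Second question: resend the first question and its answer,
        // otherwise the LLM cannot link "his" to Robin Hood.
        UserMessage q2 = UserMessage.from("Who is his best friend?");
        List<ChatMessage> history = List.of(q1, a1, q2);
        AiMessage a2 = model.generate(history).content();

        System.out.println(a2.text());
    }
}
```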

But sometimes it's not enough. You have a list of questions in your prompt, and every time you update the prompt with each question and answer. But at some point, Ollama and the LLM won't have all the information, all the data, so the answer will be something like "I'm sorry, I'm afraid I don't have this information." That's why you need to add more data into the prompt: you add some extracts of a document, for example, to complete the existing knowledge of the LLM. And then Ollama is able to answer that "the worst enemy of Robin Hood is the Sheriff of Nottingham." So you need to provide more context with your prompt. But the size of the prompt is limited; we cannot put everything in the prompt.

 

RAG: Retrieval-Augmented Generation (8:39)

Now, I will speak about RAG, which stands for Retrieval-Augmented Generation. RAG helps the LLM to understand the real-world context; it helps the LLM to generate more accurate answers. So you get data from various sources, like your documentation, the internet, other projects' documentation, etc. You chunk the data and then you store the result in a database; in the case of the Docker GenAI Stack, we are using Neo4j. When you prompt your application with a new question, the application first searches for similarities in the database, then retrieves all the chunks related to your question, and with them it is able to build a smarter prompt to query the LLM. For that, you need to calculate embeddings: they are like vector coordinates for every chunk and for your question. The database or your application is then able to do a kind of distance calculation to find which chunks are related to your question. Sometimes this part can be complicated if you have to do it from scratch.
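
The "distance calculation" is typically a cosine similarity between embedding vectors. Here is a tiny, self-contained Java sketch of that idea (toy three-dimensional vectors; real embedding models produce hundreds of dimensions): the chunks whose embeddings score closest to the question's embedding are the ones injected into the prompt.

```java
public class Similarity {

    // Cosine similarity between two embedding vectors:
    // close to 1.0 means "very similar", close to 0 means unrelated.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy "embeddings" of a question and two document chunks.
        double[] question = {0.9, 0.1, 0.0};
        double[] chunkAboutRobinHood = {0.8, 0.2, 0.1};
        double[] chunkAboutDocker = {0.0, 0.1, 0.9};

        System.out.println(cosine(question, chunkAboutRobinHood)); // high score -> keep this chunk
        System.out.println(cosine(question, chunkAboutDocker));    // low score  -> ignore it
    }
}
```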

For that, we have programming frameworks like LangChain. LangChain is a powerful framework designed to simplify the development of generative AI applications. It helps to combine various tools and technologies seamlessly, and it supports several programming languages: there is an official Python SDK for LangChain, and an official SDK for JavaScript as well. You can also find other frameworks not created directly by the LangChain team, like LangChain4J, to help you develop a GenAI application, in this case in Java. I will use LangChain4J in my demonstrations.

Let's go back to the Docker GenAI Stack. In the Docker GenAI Stack project, you will find several samples of Python GenAI applications. Again, Docker Compose is the orchestrator of the GenAI components. So you clone the project from GitHub, and then, with a simple Docker Compose command, you can start the whole stack and the applications. In the Compose stack, you have an Ollama server to serve the API; by default, the stack downloads the Ollama model. You have a Neo4j database for the RAG samples. The first sample is a loader that fetches some information from Stack Overflow. The second sample is a support bot that uses the data from Stack Overflow and Neo4j (the loader puts the data inside Neo4j) to give you accurate answers; so it's a technical bot that uses the Stack Overflow data to have a conversation with you about technical topics. There is a PDF bot too: you can upload PDF documents to the bot, and after that the bot can give you information about the document; it's as if you were chatting with your documentation. And there are two other samples: an API one, so you can use an API to connect to the Ollama server and use it from your own Python application, and a front-end sample project that uses that API. So you have a lot of samples, and you can use them to start your own project, especially if you are building a Python application.

 

Make your Docker GenAI Stack (14:03)

So now let's see how to make your own Docker GenAI Stack, with Docker Compose, of course. You can hack the existing GenAI Stack if you want, especially if you are using Python. I'm not a Python developer, but you can find ideas and tips in the source code of the existing project, the official Docker GenAI Stack, and start your own stack from scratch with your favorite language. For example, I made a small chatbot with Vert.x and LangChain4J. Initially, I'm a JavaScript developer, but I want to learn other programming languages like Golang, and I need a little teacher to help me. To create this GenAI application, I used the same logic as the Docker GenAI Stack blueprint. It's exactly the same Docker Compose structure, but in this case I use Java and LangChain4J, and I use a small LLM, deepseek-coder. With this application, you can change the model. I use deepseek-coder because it's more oriented towards developer tasks, but you can use TinyLlama, which is a small LLM too, and then you can start a GenAI application even if you have no GPU on your machine.

My Compose project is split into three services: the first is the Ollama service, so I start Ollama inside a container; the second one downloads the LLM if it doesn't already exist on my laptop; and the third one is my web application, which I built with LangChain4J and Vert.x.
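
As an illustration, here is a hedged sketch of that three-service structure (service names, profile names, and the model-download command are assumptions, not the demo's actual file):

```yaml
services:
  ollama:
    profiles: ["container"]              # only started when Ollama runs inside the stack
    image: ollama/ollama
    volumes:
      - ollama-data:/root/.ollama

  download-llm:
    profiles: ["container"]
    image: ollama/ollama
    environment:
      - OLLAMA_HOST=ollama:11434         # point the ollama CLI at the service above
    entrypoint: ["ollama", "pull", "deepseek-coder"]
    depends_on:
      - ollama

  web-app:
    profiles: ["container", "webapp"]    # "webapp" alone = use Ollama running on the host
    build: .
    ports:
      - "8888:8888"

volumes:
  ollama-data:
```

With Compose profiles you would then start everything with something like `docker compose --profile container up`, or only the web application with `docker compose --profile webapp up`.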

Another remark: you can run Ollama in a container. In that case, you can access Ollama from outside with the URL localhost:11434, and you can reach the application with your browser on port 8888 (localhost:8888). Inside the Compose stack, the web application communicates with the Ollama service using a URL whose hostname is the name of the service defined in the Compose file. But there is another way to run Ollama: you can run it outside the container, locally on your workstation, and use it from a container. In this case, only the web application runs in a container, and I use the DNS name host.docker.internal to access the externally running Ollama, the one running on my laptop (I'm using a MacBook Pro M2, for example). If I use this approach, I benefit from the GPU of my laptop.
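
In Compose terms, the switch between the two modes is just the base URL the web application uses; the variable name below is illustrative (the official GenAI Stack uses OLLAMA_BASE_URL, but your own stack can name it anything):

```yaml
  web-app:
    environment:
      # Ollama running as a service in the same Compose stack (the service name is the DNS name):
      - OLLAMA_BASE_URL=http://ollama:11434
      # Ollama running directly on the workstation, reached through Docker's special DNS name
      # (lets the model benefit from the laptop's GPU):
      # - OLLAMA_BASE_URL=http://host.docker.internal:11434
```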

 

Demonstration 1 (18:19)

Well, let's see the source code of the first demonstration. I wrote a Compose file with three services: one to run Ollama, one to download the LLM for Ollama, and one for the web application. Thanks to Compose profiles, I can start only some of the services. For example, if I want to use the Ollama that is running directly on my workstation, I start only the web app service; otherwise, I use the "container" profile to start all three services. The Dockerfile is simple; it's there to Dockerize the web application. I have a first stage to build the application, a second stage to build a Java runtime, and a third stage to create a new image with the Java runtime and the Java web application. The source code with LangChain4J is simple to read. I need the URL of Ollama and the name of the model. I create what I call a streaming model: it's an object that allows you to stream the answer of the model to a client. I also use an object to manage the memory of the conversation between the user and the LLM. And at the end, I have a specific route, /prompt, to post the question, the system instruction for the model, and optionally some context. With that, I create a list of messages, plus the memory messages, and then I can send the prompt to the model and, with the streaming response handler, stream the answer of the model to the client.
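
Here is a hedged sketch of that logic with LangChain4J (pre-1.0 style API; the Vert.x HTTP route is left out, and the system instruction and question are placeholder values standing in for what the /prompt route would receive):

```java
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.SystemMessage;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.StreamingResponseHandler;
import dev.langchain4j.model.ollama.OllamaStreamingChatModel;
import dev.langchain4j.model.output.Response;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class StreamingPrompt {
    public static void main(String[] args) throws InterruptedException {
        // Assumptions: Ollama reachable at this URL, deepseek-coder already pulled.
        OllamaStreamingChatModel model = OllamaStreamingChatModel.builder()
                .baseUrl("http://localhost:11434")
                .modelName("deepseek-coder")
                .build();

        // Keeps the last exchanges so the model can link follow-up questions.
        MessageWindowChatMemory memory = MessageWindowChatMemory.withMaxMessages(10);

        // In the demo these values arrive through a POST /prompt route (Vert.x).
        String systemInstruction = "You are a helpful Golang teacher.";
        String question = "Give me a simple explanation of a Go hello world program.";

        // Build the list of messages: system instruction + memory + the new question.
        List<ChatMessage> messages = new ArrayList<>();
        messages.add(SystemMessage.from(systemInstruction));
        messages.addAll(memory.messages());
        UserMessage userMessage = UserMessage.from(question);
        messages.add(userMessage);

        CountDownLatch done = new CountDownLatch(1);
        model.generate(messages, new StreamingResponseHandler<AiMessage>() {
            @Override
            public void onNext(String token) {
                System.out.print(token);            // the demo streams each token to the web client
            }

            @Override
            public void onComplete(Response<AiMessage> response) {
                memory.add(userMessage);            // remember the question...
                memory.add(response.content());     // ...and the answer, for the next /prompt call
                done.countDown();
            }

            @Override
            public void onError(Throwable error) {
                error.printStackTrace();
                done.countDown();
            }
        });
        done.await();
    }
}
```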

So let's start the demonstration. I will use the deepseek-coder model, and I will use the Ollama that is running locally on my laptop. OK, let's switch to Docker Desktop to check that everything is running. Yes, so I can reach my web app from here. Now I switch to the browser, and you can see my web app. I can give some instructions to the model; you can post them from the web app, or you can write these instructions on the back end, in the Java source code. I don't need context for this demonstration. And my question is here: I ask for a simple explanation of a Golang "hello world" program. So I submit my prompt; the prompt is the human question plus the system instruction. You see that the LLM was able to generate a "hello world" program. I will ask for more. For example, I keep the same instructions and I ask it to add a struct "Human" with the following fields: the name and the age. I submit again, and you see the LLM was able to update the main function with a Human struct, and it uses the new struct. I will try a last question: "Can you add a greetings method to the struct?" I click on submit again, and you see the LLM added the greet method and calls it in the main function. If I open the developer tools and print the conversation summary to the console, you can see that, in fact, I keep the whole conversation in memory, and that is how the model is able to make the link between the questions. That was the first demonstration. Now let's go back to the presentation for the second demonstration.

 

Demonstration 2 (24:36)

The second demonstration is about RAG, again with Vert.x and LangChain4J. The Compose file is almost the same as in the previous demonstration, but this time I use another model: Gemma, a small model from Google DeepMind. I will use the concept of embeddings, but the Compose file is the same. Let's look at the source code of the new demonstration. The Compose file is pretty similar to the Compose file of the first demonstration, and the same remark applies to the Dockerfile of the application. I changed some things in the Java source code: the application is a kind of bot that is able to give me answers about the rules of a role-playing game. The rules are very short. The application chunks the document, puts the embeddings of the document in a vector store, and then I can do a similarity search to build the prompt I send to the LLM. I will add something to the document: a new monster named Keegorg (in fact, it's my handle on Twitter), and Keegorg is a Senior Solutions Architect at Docker. I edit the rules like this, I add the details at the end of the document here, and I save the document.

Now, let's have a look at the Java source code. Again, it's Vert.x and LangChain4J. LangChain4J comes with very interesting tools. I read the rules document; that's just classical Java source code. Then I use the document splitters to create the chunks (DocumentSplitters is a class from LangChain4J). LangChain4J also comes with in-process, in-memory embedding models, so it's very easy to create an application to experiment with RAG: with the embedding model's embedAll method, I can create the embeddings for every chunk of the document. After that, LangChain4J also provides an in-memory vector store, which again is very useful for experimenting (for a real application, you would use Neo4j, for example, or another vector database), and I can add all my embeddings to the in-memory vector store. There is another object, a content retriever, again thanks to LangChain4J, and with the content retriever I can find all the chunks of the document that are similar, or near, to my question.
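
A hedged sketch of those LangChain4J building blocks (package names and the embedding model artifact vary between LangChain4J versions; the rules text and chunk sizes here are placeholders):

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.AllMiniLmL6V2EmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.rag.content.Content;
import dev.langchain4j.rag.content.retriever.ContentRetriever;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.rag.query.Query;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

import java.util.List;

public class RulesRag {
    public static void main(String[] args) {
        // The rules document (in the demo it is read from a file).
        Document rules = Document.from("Keegorg is a monster. He is a Senior Solutions Architect at Docker. ...");

        // 1. Split the document into chunks.
        List<TextSegment> chunks = DocumentSplitters.recursive(500, 50).split(rules);

        // 2. Compute an embedding (vector) for every chunk with an in-process embedding model.
        EmbeddingModel embeddingModel = new AllMiniLmL6V2EmbeddingModel();
        List<Embedding> embeddings = embeddingModel.embedAll(chunks).content();

        // 3. Store chunks + embeddings in an in-memory vector store
        //    (a real application would use Neo4j or another vector database).
        InMemoryEmbeddingStore<TextSegment> store = new InMemoryEmbeddingStore<>();
        store.addAll(embeddings, chunks);

        // 4. Retrieve the chunks that are most similar to the question.
        ContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
                .embeddingStore(store)
                .embeddingModel(embeddingModel)
                .maxResults(3)
                .build();

        List<Content> similarChunks = retriever.retrieve(Query.from("Who is Keegorg?"));
        similarChunks.forEach(c -> System.out.println(c.textSegment().text()));
    }
}
```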

After that, I create an LLM object and I use it in the /prompt route. From the question and the system instruction (values coming from the web app page), I create the system message and the human message. With the similarities, I create a context message: for every similar chunk, I add a record to the context. With all the messages, the system instruction, the conversation memory, the context message, and the human message, I create the prompt. And again, with the streaming response handler, I can stream the answer of the LLM to the web client. So I will start my application with the Gemma model, and I will use again the Ollama instance that is running locally on my laptop.
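
And a small, self-contained sketch of how such a prompt can be assembled from the four kinds of messages (using a second system message to carry the retrieved context is my illustrative choice, not necessarily what the demo code does; the rules extract and questions are placeholders):

```java
import dev.langchain4j.data.message.ChatMessage;
import dev.langchain4j.data.message.SystemMessage;
import dev.langchain4j.data.message.UserMessage;

import java.util.ArrayList;
import java.util.List;

public class PromptAssembly {

    // Builds the final prompt: system instruction + conversation memory
    // + context made of the similar chunks + the human question.
    static List<ChatMessage> buildPrompt(String systemInstruction,
                                         List<ChatMessage> memoryMessages,
                                         List<String> similarChunks,
                                         String question) {
        StringBuilder context = new StringBuilder("Use only the following extracts of the rules document:\n");
        for (String chunk : similarChunks) {
            context.append("- ").append(chunk).append("\n");
        }

        List<ChatMessage> messages = new ArrayList<>();
        messages.add(SystemMessage.from(systemInstruction));
        messages.addAll(memoryMessages);
        messages.add(SystemMessage.from(context.toString()));
        messages.add(UserMessage.from(question));
        return messages;
    }

    public static void main(String[] args) {
        List<ChatMessage> prompt = buildPrompt(
                "You are a dungeon master. Answer with the rules document.",
                List.of(),    // empty conversation memory for this example
                List.of("Keegorg is a monster, a Senior Solutions Architect at Docker."),
                "Give me the list of the monsters of the game.");
        prompt.forEach(System.out::println);   // this list is then sent to the streaming model
    }
}
```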

So let's start the application and go to Docker Desktop. You can see that the application is running, so I can launch the browser and go to my application. This is my new web application. You can see the instruction for the system: I explain to the model that it is a dungeon master and that its job is to give information based on the rules document. And this is my prompt. I can ask something, for example the list of the players of the game, and click submit. So I have the list. Now I will try the same question, but with the monsters of the game, and I submit it again. So you get all the monsters, including me, with my knowledge of Docker Compose and Kubernetes, and my weird English. So you see, you can use AI for various use cases. The RAG one is honestly very interesting, because you can use it for your project documentation, for games like this, and so on. Thank you very much for your attention.

 

Learn more

Dive into the world of GenAI app development, starting with the components of this new stack and how it can easily fit into your existing containerized development workflow.

Our speakers

Philippe Charrière

Senior Solutions Architect
Docker