DockerCon

AI Anytime, Anywhere: Getting Started with LLMs on Your Laptop Now

Matt Williams, Maintainer, Ollama.ai

Recorded on November 25th, 2023
This presentation shows how to use the open source Ollama tool to download and run an LLM model on your laptop. From there, you’ll learn how to customize the model and build AI applications you can use right away using NodeJS and Python.

Transcript

Hello everyone. This session is called AI anytime, anywhere, getting started with LLMs on your laptop or whatever your computer is today. I’m going to start with a question, which is how many here are using some sort of AI service today? Yes, well, most people. Cool.

How many of you are using some sort of local AI option? Now by local, I mean not using ChatGPT or Claude or even Copilot. Cool. Okay. And then how many of you are using Ollama? Okay. We got one of the maintainers, and he’s using Ollama. That’s great. That would be scary if he didn’t raise his hand.

Table of Contents

    Introduction

    I’m one of the maintainers of a project called Ollama.ai, and I want to share with you a little bit of background on LLMs, but also get into how to use Ollama and what kind of things you can do with Ollama. So, AI is not a new thing. It’s been around for a long time. There’s been computer science research into artificial intelligence for decades and decades. Probably 20 or 30 years after things really got started, I took a course when I was in college, an introduction to artificial intelligence programming. And that was 32 years ago. Things have progressed a lot since then. The AI that I learned in school is nothing like what we’re seeing today.

    But, as with any long road that we are on, any long road that we take, where are we on this road? Are we at the beginning? Are we at the end? Is AGI (artificial general intelligence) tomorrow? Or are we just at the beginning, so it’s going to be another 100 years, 200 years before we see something that’s really that AGI ideal?

    Trusting without verification

    As with any long road that we take, sometimes a big rock falls in front of us. There’s a big obstacle, and we have to adjust to the new landscape. One of the problems is that people are using these AI solutions, and they’re asking questions, and they’re trusting, without verification. It’s mind-boggling that people will just ask ChatGPT a question, get back an answer, look at it, and think, okay, that must be true, and just go with it.

    There are problems when you do this. One of the classic examples is that lawyer, a few months ago. He’s going into a court case and wants to generate a document to hand off to the judge and to the other party. And he uses ChatGPT to create this document, and it looks really impressive. I mean, that looks great. That’s better than what I could do. Hands it off to the judge and to the other side. They look at it, and they say, whoa, these guys have a really great case. And they start clicking on the links for the previous cases, and none of them exist.

    Other problems

    So, there are problems — just like, you know, as a developer, you never trust the input you’re given. You’ve always got to verify what you’re seeing. And there are other tools that are integrating with these online services. I use a tool called Obsidian, which is, I think, great. When I originally started looking into this presentation, there were like six or seven plugins that use ChatGPT. I looked this morning, and there were 22 plugins that use ChatGPT. And they’re submitting all your notes up into ChatGPT or other OpenAI services to index and to tokenize. And then they’re asking questions against that data. Maybe that’s a good thing, because personal data means more personalized results. So this is awesome, right? Not always. Because sometimes you submit information to ChatGPT and it becomes part of the model. Maybe it becomes part of the model right away or a little bit later.

    There’s an example that happened with Samsung. This was back in April, May. A couple of engineers were involved. One of them didn’t attend a meeting, or maybe he did, but there were notes that were generated from that meeting. He just took the whole set of notes — he didn’t want to read all the notes — submitted it to ChatGPT and said, summarize this. And it did! It came up with this great summary. If you search OpenAI’s website, you’ll eventually find a link to a Google form that allows you to opt out of sharing your data with the model to make the model better. And these guys hadn’t done that. And so they had a meeting about this internal project, and sometime later a journalist called Samsung and said, hey, can you comment on project XYZ? They said, well, how do you know about that project? You shouldn’t know. And the journalist said, I just asked ChatGPT. So now Samsung has a policy that no one uses ChatGPT.

    I’ve heard this from other companies, where either they don’t allow access to ChatGPT at all, or you have to use their special hosted version that’s on their Azure. But some other researchers saw this and thought, well, if good results are making it into the model, what if we have some fun with it and put in some bad information? What happens then? And, yeah, they’ve seen that bad information can go in and skew those results for future requests.

    Now I love this. This is from Midjourney’s licensing or terms of service. You scroll down; it’s a long legal document. You scroll down. It says, if you’re not a paid member, you don’t own the assets you create. Instead, Midjourney grants you a license to the assets under the Creative Commons Noncommercial Attribution International license. So if you’re creating an icon, and you use it for your business, and you created it on the free tier, you’ve got some issues. I guess you have to get caught, but then you’ve got some issues.

    So there are lots of these problems, and lots of companies are generating policies around who can use the online services, because we don’t know who owns the IP in the answers that come out. Or, we don’t know if we even have legal rights to those answers. So maybe, not the entire solution, but a solution for a lot of these problems is perhaps going local. Thankfully, the tool that a lot of the local runners are built on is called llama.cpp. And Mr. Gerganov from Bulgaria was nice enough to give us a code sample. Here’s how you might run it if you wanted to host your own model. And it’s only 1,300 lines of kind of intense C++ code. So, an easy afternoon read. And so you could implement it too, but there are other ways to do this, and there are easier ways to work with this. But first, I have some definitions that I want to cover, and I might as well tell you a little bit about myself.

    Definitions

    So I’m Matt Williams, one of the maintainers of a project called Ollama.ai. We’re a team of seven people; about five of us are at the booth. So after this session, you can go ask them for more information. I see some familiar faces, so I’ve seen a lot of you at our booth, which is awesome. If you have any questions afterwards, and we don’t get a chance to answer them here, then you can reach me on all the socials, where I’m “technovangelist”. The one place I’m not “technovangelist” is on Ollama.ai. We’ll talk about the idea of being able to push models, but I have a bunch of models on Ollama.ai under mattw.

    Large language models

    Let’s talk about some terms. Large language model, LLM. What is that thing? The whole goal of a large language model is basically to predict the next word. A word gets plopped down; based on the context, based on the prompt that was asked, and the word that’s there, what’s the most likely next word? And then based on that word, those two words, and then the context, what’s the most likely next word? And so it keeps building out all those words until it ends up with the final sentence. Yeah. So it just builds out everything as we go. So when it writes down that first word, it doesn’t know necessarily what that last word is going to be, or how long it’s going to be. It just kind of figures it out, just like we do, as we start saying — at least for me — I have no idea where I’m going to end up with a sentence until I get there, although I knew roughly what I’d say for this presentation.
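    To make that loop concrete, here is a tiny toy sketch in TypeScript. The predictNextToken function is a stand-in for the model’s real forward pass over billions of weights; everything in it is made up for illustration.

    ```typescript
    // A toy "model": given the words so far, return the most likely next word.
    // This stands in for the real forward pass over billions of weights.
    function predictNextToken(context: string[]): string {
      const continuations: Record<string, string> = {
        blue: "because", because: "of", of: "scattering", scattering: "<end>",
      };
      const last = context[context.length - 1];
      return continuations[last] ?? "<end>";
    }

    // Keep appending the most likely next word until the model decides it is done.
    function generate(prompt: string, maxTokens = 100): string {
      const context = prompt.toLowerCase().split(" ");
      for (let i = 0; i < maxTokens; i++) {
        const next = predictNextToken(context);
        if (next === "<end>") break;
        context.push(next);
      }
      return context.join(" ");
    }

    console.log(generate("The sky is blue")); // "the sky is blue because of scattering"
    ```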

    Tokens

    That’s generally what an LLM does. LLMs are made up of lots of pieces. And one of the main pieces that goes into an LLM is a token. Now, a token is kind of like a word or a word part. It’s really the most common words and parts of words that are around. So you’re all sitting on chairs. A chair, well, the word chair is probably going to be a token all on its own. A table would be probably a token on its own. But podium, I’m at a podium, maybe that might not be a word or that might not be a token. Maybe it would be pod and ium as two different tokens. And hippopotamus might be — I don’t know why I chose hippopotamus — but three or four tokens might go into that word. So, it tries to find those most common word parts and turns those into tokens. And tokens don’t get stored as chair; it’s not CHAIR that gets stored in the LLM. It is a token. It is a number, a numeric value that gets associated with that word. And every time that word shows up in the stuff that gets trained on, we’re just storing that token or referring to that token.
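    As a purely illustrative sketch of that idea (the token IDs below are made up, and real tokenizers learn their word pieces from frequency statistics rather than hand-written tables), it looks something like this:

    ```typescript
    // Illustrative only: a tiny vocabulary of common word parts mapped to numeric IDs.
    const vocab: Record<string, number> = {
      chair: 412, table: 977, pod: 88, ium: 1503, hippo: 221, pot: 54, amus: 1890,
    };

    // "chair" is common enough to be one token; "podium" splits into "pod" + "ium".
    console.log([vocab["chair"]]);                              // [412]
    console.log([vocab["pod"], vocab["ium"]]);                  // [88, 1503]
    console.log([vocab["hippo"], vocab["pot"], vocab["amus"]]); // [221, 54, 1890]
    ```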

    Weights and biases

    Then there are weights and biases. In the model, there are all these tokens. And all these tokens are linked to each other by a weight. The weight just says how related these two word parts or tokens are. And it’s not just on a single dimension. It is thousands of dimensions. There are all these links between all these different words or different tokens. And, initially, when we first build out the model, all those weights are the same. But, gradually, we start to train the model. And training adjusts those weights so that some of the dimensions are really close to each other. And then other dimensions are really far apart from each other. So those are the weights; that’s basically how the model figures it out: it goes through all the nodes, goes through all the different layers, and figures out, based on those weights, how close those words are together and, based on the context, what the most likely next word is.

    Quantization

    Now you add up all the weights and biases, and you get parameters. So when we talk about a 70-billion parameter model, the parameters are the weights that connect the different nodes together inside the LLM. So quantization comes along. This is a big word, a big complicated topic. But at a high level, those weights are stored as 32-bit floating point numbers. And as I said, you add all those weights up, and that’s how many parameters there are. So 70 billion 32-bit floating point numbers adds up to a lot of data. Even just 7 billion. So 7 billion weights at 32 bits, which is four bytes each, roughly adds up to about 28 gigabytes of storage, just for the 7-billion parameter model. For 70 billion, add another zero to that. So these are huge file sizes. But quantization takes those 32-bit floating point numbers and basically drops all the weights into a box, in groups, and at the bottom of that box we’ve set up a bin for each of our quantization levels.

    Usually, the models are quantized, often with four-bit quantization. Four-bit quantization means there are four bits, and four bits gives you 16 possible values, zero to 15. So at the bottom of our box, we have 16 bins the values can go into. So, if maybe all of our values are between minus one and positive one, that first bin is minus one to minus 0.8. And the next bin is minus 0.8 to minus 0.6. And the next bin is minus 0.6 to minus 0.4. Something like that, all across the bottom of that box. And all the weights just kind of fall into the right bins at the bottom of the box. And that way, the first bin, well, that’s bin zero. And that’s the value that we store for that weight; rather than that long 32-bit floating point number, we just store zero or one or two. It turns out, when you do this, and when you’re actually doing inference, going down to the 16-bit numbers or four-bit numbers still works really well. It’s shocking that even sometimes down at two-bit quantization, it’s shockingly good for how small you can get these models.
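    Here is a rough sketch of both the storage math and the binning idea in TypeScript. The numbers and the bin mapping are illustrative; real quantization schemes use block-wise scales and offsets and are more involved than this.

    ```typescript
    // Rough storage math: parameters x bytes per weight.
    const params = 7e9;
    console.log(`fp32:  ${(params * 4) / 1e9} GB`);   // ~28 GB at 32 bits per weight
    console.log(`4-bit: ${(params * 0.5) / 1e9} GB`); // ~3.5 GB at 4 bits per weight

    // Illustrative 4-bit binning: map a weight in [-1, 1] to one of 16 bins (0..15).
    function quantize4bit(weight: number): number {
      const clamped = Math.max(-1, Math.min(1, weight));
      return Math.min(15, Math.floor(((clamped + 1) / 2) * 16)); // store the bin index, not the float
    }

    console.log(quantize4bit(-0.93)); // 0  -> the "-1.0 to -0.875" bin
    console.log(quantize4bit(0.2));   // 9
    ```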

    Then there’s another thing with quantization called k-means. And the idea of k-means is that sometimes when you put all those weights into a box and drop them into those different bins, there are a lot of weights towards one end and a lot of weights towards the other end, and the middle is kind of empty.

    With k-means, we’re basically figuring out where the clusters of weights are and creating more precise bins for the ones at this end, more precise bins for the ones at that end, and skipping all the numbers in the middle. We can be a lot more precise, but that also requires storing kind of a map so that the system can figure out, well, what does this quantization value actually mean? So, that’s a little bit of how that works. That’s probably one of the more complicated topics in all of this.

    Before ChatGPT, we would go to a site like Hugging Face to download our models. Here’s a screenshot from Hugging Face, and if we actually clicked in here, we’d see the different quantized versions — we see 7 billion, 13 billion, 70 billion. So we see the parameters listed in the file name, and if we clicked into any one of these, we’d see the Q, or what quantization number it is.

    Demo

    At this point, I want to get into a demo, and I forgot to pull out my little demo script to ensure I cover all my stuff. So let’s make sure I’ve got Ollama. Okay, I’ve already actually downloaded Ollama, because I’m not going to do a download right here, because, you know, internet, Wi-Fi, conferences, not so good.

    But, I’ve already downloaded it, so what I do is run Ollama, and as soon as we run it, it’s going to say, hey, do you want to move this to the Applications directory? Yes, I do. And when we do that, it’s going to pop up a box that says, hey, welcome to Ollama; click on next; install the command line? Yes, I do. And I’m just going to type in my password, and now I can run my first model. Now, the installation process is a little bit different on Linux. There, we have a simple script that you can run: you go to the Ollama website, click on download, and here’s the script that you would run on Linux. It’s the same thing if you’re on WSL on Windows.
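    For reference, the Linux/WSL install was a one-liner along these lines at the time; check the download page for the current command.

    ```sh
    # On macOS, download the app from the website and let it install the CLI.
    # On Linux (or WSL on Windows), the download page gives you a one-line script,
    # roughly this:
    curl https://ollama.ai/install.sh | sh
    ```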

    So, I’ve downloaded Ollama, and now I want to run it. Let’s create a new… there we go. Where’s my terminal? Hi, term, come on, where’d you go? Okay, it’s not there, not anywhere. Nice, this is always an awesome part of the demo where my terminal doesn’t even pop up. Let’s use this thing, oh, that’s weird. Okay, we’ll use this one over here on the side, and we’ll bump that up.

    Okay, let’s clear the screen. I want to do Ollama run, and I’ll use Llama 2. At the end of this string, I can add a tag that says I want to use a different version of the model other than the default. I’m just going to stick with the default, and now I can run my prompts. Why is the sky blue? And it thinks about that for a second, and then spits out an answer to why the sky is blue. And the time that it’s taking, sometimes that’s just loading up the model, or it sometimes just takes a bit of time. So, why is the sky blue? Okay, here’s why it is blue. But we also keep the context, just like if you were using ChatGPT. We keep the context here, so I can say, is it ever green? And it knows I’m talking about the sky, because I mentioned that in the previous question. So we keep that conversation.
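    That interaction at the terminal looks roughly like this:

    ```sh
    ollama run llama2          # or a tag like llama2:13b for a non-default variant
    >>> Why is the sky blue?
    (the answer streams back word by word)
    >>> Is it ever green?
    (the conversation context is kept, so "it" is understood to mean the sky)
    ```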

    Models and layers

    I just used Llama 2. Llama 2 is one of many models you can install. If you come to Ollama.ai and click on the models link, here are all the models that we have in our library today. You can see Llama 2, Llama 2 Uncensored, Code Llama, CodeUp, yada, yada; we’ve got a lot of models. SQLCoder, StarCoder, Vicuna, Wizard, a bunch of Wizard variants. And if we come back up, Mistral was one that was added a week ago.

    That was pretty exciting. So, yeah, we’ve got a lot of different models. In fact, we also allow you to push your own models up as well. If I go in here and do Ollama pull Llama 2, we already have it downloaded. But you’re going to see something that might look a little familiar if you use Docker. Each one of these lines here is basically another layer within the model. So we have this idea of a model file, which I’ll show you in a second, and you define the model file much like a Dockerfile. Then we store the actual contents in a series of layers, and you’re downloading those layers. So if one model shares a layer with a different model, you won’t have to download that layer again.

    Let’s see. Now, I’ve actually got a lot of models on my laptop because I’m trying everything. It’s good that I have four terabytes, because I would run out of disk space every day. Because if I run Ollama list, you’ll see a lot of models that are mine, but also a lot of the library models, including lots of versions of Llama 2.

    There’s all the models that I’ve got on my machine, but let’s try making a model file. I’m going to go into Visual Studio. So here is a simple model file. That might look kind of familiar. It looks a little bit like a Dockerfile. Right? At the top, I’ve got FROM Llama 2. This is from a model that I am pulling out of the library. Now, if you have downloaded a model from Hugging Face, and it’s already quantized and you want to use that, you can say instead of FROM Llama 2, you say FROM where it is on your disk so that the model file or Ollama can find it. So you could use those as well.

    Sometimes these models use parameters. So in this case, I’m setting the temperature parameter to 1. Then there is a template that’s associated with Llama 2. I am inheriting that template because I’m not overriding it. If I included template in here, I would overwrite that parent. And here I’ve got system. You’re an experienced product manager at a tiny software company, only speak English when you respond to anyone, only offer a single open question rather than a list of needs… Okay, so if I want to create this model, all I have to do, because I’ve got the model file, and I’m in this directory, no, I’m not. Let’s go into this directory. I can just do Ollama create. And I’ve called this XPM. I’ve actually done this before, but I can do it again. And it reads the model file, figures out if there’s any layers that need to be created. I’ve already created this, so it knows that all this stuff already exists. So, I use these already existing layers. Now I can do Ollama run XPM. And I want to build a tool to track water intake, because I need to take some more water. Interesting. What kind of insights are you hoping to gain? Every time the person asks you a question, answer back with one other question. You keep building, keep providing more and more information, and maybe it will help you think a little bit about what you’re doing.
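    Pieced together from what’s on screen, that Modelfile and the commands look roughly like this; the system prompt wording here is paraphrased from the description above.

    ```
    FROM llama2
    PARAMETER temperature 1

    SYSTEM """
    You are an experienced product manager at a tiny software company. Only speak
    English when you respond, and only offer a single open question rather than a
    list of them.
    """
    ```

    Then the model is built and run from the directory containing the Modelfile:

    ```sh
    ollama create xpm -f Modelfile   # reads the Modelfile and creates any missing layers
    ollama run xpm                   # chat with the customized model
    ```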

    5 Whys

    Let’s go on to another, a little bit more complicated example: five whys. Sometimes, you know, one of those tools you can use to figure out why something is happening is to ask a question, and then somebody asks, well, why? You answer it, and they ask why again. And, you know, by five whys, you’ve probably really understood why something is happening. So here I just do a simple model file that’s FROM Llama 2, sets the temperature, and sets a stop word, which just says, if you see this value, stop. The system prompt says: when you receive a prompt, you’ll ask why the user thinks what they said; the goal is to probe for more information. After five rounds of questions and answers, you’ll summarize the entire conversation. If I try that, let’s go into zero, zero, two. And I’ll do Ollama create five whys.
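    Based on that description, the Modelfile for this one would look something like the sketch below; the temperature value and the stop word are placeholders, and the prompt wording is paraphrased.

    ```
    # Placeholder values; the real Modelfile sets its own temperature and stop word.
    FROM llama2
    PARAMETER temperature 1
    PARAMETER stop "SUMMARY"

    SYSTEM """
    When you receive a prompt, ask why the user thinks what they said; the goal is
    to probe for more information. After five rounds of questions and answers,
    summarize the entire conversation.
    """
    ```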

    Again, it creates that model. It sees that most of the layers already exist; oh, there was one new layer, so it’s created that model. But if I do Ollama run five whys, and I tell it that I want to build a tool to track water intake: great, can you tell me more about what you want to do? And so if I keep going, okay, that’s three, four, and five. That should be the last one, but it keeps asking me more questions, so this isn’t really working. I’m going to have to do something else; this particular prompt doesn’t understand that there are five questions. I need to do something at this point. So I’ve got to write some code to do this. I’ve created this simple application using TypeScript. I’m also using a module that I’ve created.

    So I’m creating a new Ollama object. Before I get to that, let me go over to the API, the actual API. If I go to GitHub, this is our full GitHub repo for this project. Everything is there. You can download the whole thing. You can see how we actually build stuff. In the documentation, we’ve got an API document. In that API document, we talk about all the RESTful endpoints that are installed on your machine as soon as you install Ollama. You see, there’s a client and a server, and that server is serving out these endpoints. There’s an endpoint just for generating a completion. So you do a POST to /api/generate on localhost:11434. Anybody see the relevance of 11434? Llama, right? If you go with a four being an A and a three being an M, then it’s llama, which is kind of goofy. But anyway, go to localhost:11434/api/generate, pass it a body where the model is llama2 (in this case, we’re specifying 7b) and the prompt is why is the sky blue, and it’s going to start streaming back a response to you. Then we’ve got other endpoints for creating a model, listing models, showing model information.
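    That request, as described, looks like this with curl; the response comes back as a stream of JSON objects, one per line.

    ```sh
    curl http://localhost:11434/api/generate -d '{
      "model": "llama2:7b",
      "prompt": "Why is the sky blue?"
    }'
    # Each streamed line looks roughly like:
    #   {"model":"llama2:7b","response":" The","done":false}
    # and the final line has "done": true.
    ```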

    Actually, one of our maintainers, who’s sitting right here, also wrote a Python binding to make this easier to work with if you’re a Python developer. You can look at that code there. I’m more of a Node.js and TypeScript kind of guy. That’s why I’m using this. I’ve created my own library that’s kind of based on his. I first create an object called Ollama, and I’m just setting the model to be my five whys model that I created just a few seconds ago. Then I’m taking in some input from a readline: what’s your idea? Then it’s going to start streaming out the answer, and I’m having it print out each word as it goes along. And it’s going to repeat that for five total questions. At the end of the fifth question, I’m going to send it a new prompt, which is: you’ve reached the end of the five whys; summarize all the information so far. And it’s going to print all the words as it spits them out.
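    The library used on screen isn’t reproduced here, but a similar loop can be written against the REST API directly. The sketch below is a rough reconstruction of that flow in TypeScript (Node 18+ for the built-in fetch); the model name fivewhys and the exact prompts are stand-ins, not the code from the talk.

    ```typescript
    import * as readline from "node:readline/promises";
    import { stdin, stdout } from "node:process";

    // Stream one /api/generate call, printing tokens as they arrive.
    // Returns the context array from the final message so the conversation can continue.
    async function generate(prompt: string, context?: number[]): Promise<number[] | undefined> {
      const res = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        body: JSON.stringify({ model: "fivewhys", prompt, context }),
      });
      const decoder = new TextDecoder();
      let buffered = "";
      let nextContext: number[] | undefined;
      for await (const chunk of res.body as unknown as AsyncIterable<Uint8Array>) {
        buffered += decoder.decode(chunk, { stream: true });
        const lines = buffered.split("\n");
        buffered = lines.pop() ?? ""; // keep any partial line for the next chunk
        for (const line of lines.filter(Boolean)) {
          const part = JSON.parse(line);
          if (part.response) stdout.write(part.response);
          if (part.done) nextContext = part.context; // the final message carries the context
        }
      }
      stdout.write("\n");
      return nextContext;
    }

    const rl = readline.createInterface({ input: stdin, output: stdout });
    let context: number[] | undefined;
    let answer = await rl.question("What is your idea? ");
    for (let i = 0; i < 5; i++) {
      context = await generate(answer, context); // the model asks its next "why"
      answer = await rl.question("> ");          // the user answers it
    }
    await generate(
      "You have reached the end of the five whys. Summarize all the information so far.",
      context,
    );
    rl.close();
    ```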

    Let’s go in here. Cool. And then, yes, five whys. I want to create a water intake tracker. That’s good. Can you tell me more? Why do you want to create this? Awesome. So I can keep asking/answering the questions. So that’s two. This is three. Four. Five. Okay. Then it does the summary of all the information I collected.

    That’s just an interesting, simple example of how I might use Ollama in some code without needing anything else. It’s all just purely Ollama. Let’s go on to section three. You know, if you wanted to use Python: if you want to see how you use Patrick’s Python binding, it’s pretty simple as well. We’re just importing the client.py file (import client) and then calling client.generate with llama2 and the prompt “Why is the sky blue?” Again, super simple to use from Python. There’s a lot more you can do from here. In fact, there’s a demo in the repo: if you go to examples, there are a couple of examples using LangChain. LangChain built an integration for Ollama a few weeks back, and they’ve provided some documentation and examples. Actually, I’m not sure if they wrote this one, or I wrote this one. Here is an example of how you use LangChain to bring in a document. This is pulling in a PDF document, embedding it, generating all the tokens, and storing it in a Chroma DB.

    This morning at the keynote, they used Neo4j as a vector database. In this example, we’re using Chroma DB as a vector database. There are a lot of vector databases out there. Once I’ve got that all in Chroma, then I’m going to ask a question. I’m going to do a search inside Chroma to pull out the relevant stuff from that PDF document, submit that as input to my model, and be able to ask questions to that model. It’s definitely a lot more complicated than the previous examples I showed. But you could expand this out to do some really neat things. And you can take a look at our examples in our repo and also on the LangChain website.

    Mentors example

    Let’s go into one other example that is definitely more of a fun one for me. It’s my mentors example. So a while back, I was thinking, well, it’d be really cool if I could create some mentors — different people — that I could ask a question and say, “You know, I’m thinking about this idea, what do you think?” And it would respond as different people. Could we do that? Could we load different models that allow me to ask different people and almost have a conversation with all three people? So I’m going to ask for your help to come up with an idea. I need an animal, and don’t say Llama. Any animal? Buffalo. Excellent. I need an action verb. What? Dance? Okay. Buffalo and dance. I need a thing. Any object. Something you’d hold. Party hat. Okay.

    We want to create a business that generates party hats for dancing buffaloes. So we’re going to do that. I’m going to use Tribe. And I want to make party hats for buffaloes when they dance. Very common. This is going to make us millions. I’m sure of it. Thank you for the idea. Let’s see what comes out.

    Looks like Neil deGrasse Tyson has come in first for an answer. And so we got his answer, which is great. And you can read that if you want. But he answers in the context of astrophysics. And then Gary V. comes along and sees that question from Tyson and says, okay, well, here’s my input. I’m going to continue this conversation. And sometimes they refer to each other. Now Owen Wilson, another great mentor of mine, can come in and go, whoa, or whatever he’s going to say. That’s, oh, and sometimes it’s not perfect.

    So let’s try that one more time, because that’s kind of fun sometimes. I want to make party hats for buffaloes when they dance. Kathryn Janeway from the Enterprise is going to answer. And she’s maybe a little bit in the future. But she’s going to answer what she thinks about buffaloes and party hats. And then Martha Stewart. That’s awesome. And oh, man, I never see this error. And I’ve seen it twice now. Well, let’s do it one more time, because I’m a glutton for punishment. We don’t have time. Eight minutes. Oh, Donald Trump’s going to answer. I’m scared of what I’m going to see. Buffaloes aren’t exactly known for their dancing skills, but hey, if that’s what you want to do, who are we to judge? We move on to Kathryn Janeway. And, again, we see that error. Interesting. I have to figure out what’s going on there. I know exactly what’s going on there. It’s something stupid I’ve done. So that’s my mentors.

    Obsidian example

    How are we doing? We’ve got seven minutes. Let’s see what else I can do. So I mentioned Obsidian. So here, this is Obsidian. I just grabbed some text from a few different notes from different websites. And I created a simple plug-in. This was an example that I did for a blog post about a week ago, about how I would incorporate Ollama into Obsidian. What could I do? So I created a simple plug-in that will summarize and create flashcards, because that’s one of the things I often use Obsidian for: I generate flashcards, I put them inside Obsidian, and then I have an integration with Anki, which is an SRS or spaced repetition tool, to then show me flashcards. Well, that’s the thing I keep meaning to do because I think it would be really great. I just never get around to it because creating those flashcards is always a pain.

    So I created this thing that creates flashcards, except this doesn’t always work. We’ll see if it does right now. And so hopefully this is going to go through all this text, bring it in, summarize it. Ideally, if I were smart, I would bring it all in and store it somewhere, which I’m not doing. But here are my flashcards. What’s the name of the tech company that’s being sued by the US Justice Department for antitrust? Google. Hey, that worked pretty well. So that’s good. Then I can do another example of summarize. I think summarize is actually broken. But, outline this document. And it brings that in, and it comes up with a nice outline of this document. Again, I should have saved it. Okay, so here’s the bulleted points of the document, which is awesome. And I’ve got my talk for DockerCon; I could do that same summarization or that same outline. So these are different examples. Actually all the code for this is in that blog post from, I think it was about a week or two ago, on Ollama.ai.

    Let’s just go to Ollama.ai, and I go to the blog. There is “Leveraging LLMs for your Obsidian notes.” We’ve got a bunch of other things — we put out a blog post today about the Docker image. I mentioned that you can install it for Linux and for Mac; we also have an official Docker image. So it’s docker run ollama/ollama, and that’s available; a rough example is shown below. So, I think that’s about all I wanted to cover. Hopefully, you got something out of this. Got some cool examples. Got some fun stuff.
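    For reference, the typical commands for the official image look roughly like this; see the blog post and the Docker Hub page for the exact, current flags.

    ```sh
    # Start the Ollama server in a container, keeping models in a named volume
    # and exposing the API port.
    docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

    # Then run a model inside that container.
    docker exec -it ollama ollama run llama2
    ```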

    Q&A

    Now if you have any questions, you can ask them in four minutes and 44 seconds. Don’t all rush up at once. Okay.

    What is the footprint of running Ollama on your laptop? And, if we’re thinking of deploying it in production on Linux, what is the kind of scaling we need to run this thing?

    So, how much memory is required? What are the resources required? Ollama itself requires very little, but the models require all the space, and it depends on the size of the model that you’re using. For a seven-billion parameter model, we generally say you need at least eight gigs of memory, eight gigs of either unified memory on the Mac or memory on a Linux box. And we’re going to try to push as much of that into the video memory as possible. Really, the LLM model itself takes up maybe three and a half gigs, but, you know, there’s the operating system that also has to run. We need the OS in there as well, so that needs some overhead. That’s why we say eight gigs for the seven-billion parameter model. For the 13-billion, we say either 16 or 32 gigs; for the 34-billion parameter, we definitely say 32 gigs; and for the 70-billion parameter, we definitely say 64 gigs. These are all super rough numbers, because every model is a little bit different and every quantization is a little bit different.

    Was there another part to that question? CPU power? Yeah, so the other part of the question was what’s the CPU that you need?

    If you don’t have a GPU, you’d better have a really fast CPU, because it’s going to be better on a GPU. You know, there are two parts to answering a prompt in Ollama. There’s the first part, where we just try to process the prompt itself, figure out what’s in the prompt, and tokenize that prompt, and that’s CPU bound. Even if you have the most amazing GPU, it’s still going to use purely the CPU. But it’s pretty fast. And then actually generating the answer, that’s all GPU. So if you don’t have a GPU, it’s going to be slower. But it’s better with a GPU, and the bigger the better, the newer the better; it’s hard to say. I have a two-year-old M1 Mac with 64 gigs of memory, and with a seven-billion parameter model, I’m getting 60 tokens per second, which is a really good response. Your experience may vary. Anything else?

    I saw you can alter the process a little bit. Have you thought about making Ollama talk to each other? Actually, not like talk talk; I mean having the models communicate.

    So, can we get the models to talk to each other? That’s kind of what I was trying to do with that mentors example. Sometimes you actually see one conversation bleed into the next conversation. But it’s not something that Ollama does; I mean, it’s not something that we have built into the product. But it’s something that you would build in whatever code you write. So take the answer that comes back. We also provide a context, which is basically the previous answer and all the previous questions in the conversation, already tokenized. And you can hand that, along with the next prompt, to the next model. But it’s something that you need to manage as a developer.
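    A minimal sketch of that hand-off against the REST API, assuming the context field on /api/generate and the non-streaming option; whether the second call goes to the same model or a different one is up to the application code.

    ```typescript
    // Sketch: carry the context returned by one generate call into the next one.
    // "stream": false makes each call return a single JSON object instead of a stream.
    async function ask(model: string, prompt: string, context?: number[]) {
      const res = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        body: JSON.stringify({ model, prompt, context, stream: false }),
      });
      const body = await res.json();
      return { answer: body.response as string, context: body.context as number[] };
    }

    const first = await ask("llama2", "Why is the sky blue?");
    // Hand the tokenized conversation so far to the next prompt.
    const second = await ask("llama2", "Is it ever green?", first.context);
    console.log(second.answer);
    ```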

    So, in your prompt, like your Dockerfile, your model file that you were building, you can give it a single prompt as to what it is. Can you give it multiple prompts, so I have different use cases for a single model, or do I need a separate model for each one?

    Can you have multiple prompts, so that depending on the question that comes in, it goes to one of those different prompts? No, you have one system prompt per model. But, that said, if you’re building a tool, you know, using TypeScript or Python or any language to build some application, it would be up to you to process that question that’s coming in. And then maybe you always send it to Llama 2, but in the code, you can give it a new system prompt based on what kind of question was asked. You can change these things at runtime. Ideally, you have it in the model file, but you can change these things at runtime in whatever code you’re writing. Okay, looks like I’m over time. Thanks so much.

    Learn more

    This article contains the YouTube transcript of a presentation from DockerCon 2023. “AI Anytime, Anywhere: Getting Started with LLMs on Your Laptop Now” was presented by Matt Williams, Maintainer, Ollama.ai.
