Build your first AI model — without the overwhelm.
You don't need a PhD, a supercomputer, or years of experience. With an open-source model like Mistral and a free tool called Ollama, you can have an AI running on your own computer today. This guide walks you through it, one calm step at a time.
How to use this site.
Nine tabs, roughly in the order you'd want them. Jump around freely — each stands on its own.
Overview · How It Works · Technical
What a model is, how Mistral turns words into replies, and — if you want it — the real architecture under the hood.
Build It · Train It
Get Mistral running on your computer step by step, then (later) adapt it to your own data with fine-tuning.
Applications · Hardware · Glossary · Help
Who uses this and why, what your machine needs, every term defined, and fixes for common snags.
First, what even is a model?
Before touching anything, let's clear up the words. Two minutes here saves hours later.
The "brain"
A large file of learned patterns. Mistral is one of these, trained on huge amounts of text. On its own it just sits there as data, waiting to be run.
The "engine"
Software that loads the model and lets you talk to it. Ollama is the friendliest one for beginners — download, click, done.
The "face"
How a person interacts — a terminal, a chat box, a web page. You add this last, once the brain and engine are working.
"Making a model" means three very different things.
Most beginners mix these up. Here's what's realistic to start with — and what to save for later.
Using a model as-is
Take Mistral exactly as it comes and put it to work. This is where you start, and it's genuinely powerful on its own. → the Build It tab.
Fine-tuning
Teaching an existing model your own style or data. Very doable as a second project using a free online GPU. → the Train It tab.
Training from scratch
Building a model from nothing. This costs millions and needs warehouses of hardware — not a beginner project, and that's fine.
When people say "we built an AI," they almost always mean A — wrapping a ready-made model in something useful. That's a brilliant first project. Do that first, then graduate to B when you're curious.
What you'll need on your machine.
Nothing exotic. Most laptops from the last few years can do this.
Windows 10 or 11
Windows 10 needs version 1903 or newer. Windows 11 is fine as-is. Mac and Linux work too.
8 GB minimum
8 GB runs Mistral 7B; 16 GB is comfier for doing other things at the same time.
~10 GB
The model file is about 4 GB; leave headroom. An SSD loads models noticeably faster than an old hard drive.
A graphics card (GPU) is optional. Without one, Mistral still runs on your CPU — just slower, generating a few words per second. That's completely fine for learning and testing. If responses feel sluggish, that's why, and it's not a mistake on your part.
How Mistral turns your words into a reply.
The gentle version — no math, just the core idea. (The Technical tab goes deeper if you want it.) Understanding this makes every later choice make sense.
At its heart, Mistral is a very sophisticated next-word predictor. It was trained on an enormous amount of text and learned the statistical patterns of language — which words and ideas tend to follow which. When you type a prompt, it predicts the most fitting next piece of text, then the next, then the next, building its answer one small chunk (a "token") at a time.
It isn't looking up stored answers or copying from a database — it's generating, guided by the patterns it absorbed in training. That's why it can write things it has never seen before, and also why it can occasionally state something wrong with total confidence: it's predicting plausible text, not consulting facts.
What happens when you press Enter
Every reply flows through these stages, in a fraction of a second per token.
Tokenize
Your text is split into tokens — roughly word-pieces the model can work with.
Embed
Each token becomes a list of numbers capturing its meaning and position.
Attention
The model weighs which earlier tokens matter most for what comes next.
Predict
It outputs a probability for every possible next token and picks one.
Repeat
The chosen token is added and the loop runs again — until the reply is done.
Because step 4 produces probabilities, the model can pick the single most likely token (predictable, sometimes dull) or sample a little more loosely (more varied, more creative). That dial is called temperature — low for focused factual answers, higher for brainstorming. Same model, different flavor of output.
That's the whole engine. Everything Mistral does — answering, summarizing, coding, role-play — is this predict-a-token-and-repeat loop running very fast. The Technical tab explains the clever tricks (grouped-query attention, sliding windows) that make Mistral do this efficiently enough to run on your laptop.
Get Mistral running, step by step.
Five small steps, each with a checkpoint so you always know it worked before moving on. Do it once yourself before teaching anyone — that's what makes it click.
Install Ollama (the engine that runs the model), then type one command that downloads and starts Mistral. That's the whole job — everything below is just doing it carefully.
Install Ollama
- Go to ollama.com/download and get the Windows installer.
- Run it — double-click and click through, no settings to change. Tip: right-click → "Run as administrator" avoids path issues.
- Ollama now runs quietly in the background.
Ollama is the "engine" from the Overview tab. It handles all the hard parts of loading and running a model for you.
Open your command line
- Press the Windows key.
- Type
PowerShelland press Enter. - A dark window opens — that's where the next commands go. (If you installed as administrator, open a fresh window so it picks up the change.)
Download and run Mistral
Type this one line and press Enter. The first run downloads the model (about 4 GB — give it a few minutes); after that it's instant.
PS> ollama run mistral pulling manifest... success — talk to your model below >>> Hello! Who are you?
>>> prompt — the model is alive and waiting.Chat with it, then take notes
Type a question and press Enter. That's it — you're running your own AI.
- Ask it to write, explain, brainstorm — notice what it's good and bad at.
- Jot down what surprises you; those notes become your teaching material.
- Type
/bye(or close the window) to stop. Runollama run mistralagain anytime — no re-download.
Add a simple web face (optional)
Once it works in the terminal, you can put a real chat box in front of it with a free Python tool like Gradio or Streamlit — a few lines of code becomes a working app.
# install once PS> pip install gradio ollama # ~10 lines later, a web chat box opens in your browser PS> python app.py Running on http://127.0.0.1:7860
The natural next milestone once you're comfortable in the terminal.
Under the hood: how Mistral actually works.
A deeper look than the rest of the guide, with the real architecture terms. You don't need any of this to run the model — it's here for when you want to understand what you're running.
A language model like Mistral is, at heart, a next-token predictor. Text is chopped into "tokens" (roughly word-pieces); the model reads the tokens so far and outputs a probability for every possible next token, picks one, appends it, and repeats. Everything it does — answering, coding, reasoning — is that loop running fast. It isn't looking answers up; it's generating them from statistical patterns learned in training, which is also why it can occasionally be confidently wrong.
The architecture underneath is a decoder-only Transformer — the same broad family as GPT and Llama. What makes Mistral notable is a set of efficiency innovations that let a 7-billion-parameter model match much larger ones. Here are the real ones:
- Base
- Decoder-only Transformer, ~7.3B parameters, released under the open Apache 2.0 license.
- Normalization
- RMSNorm for stable, efficient layer normalization.
- Position encoding
- RoPE (Rotary Position Embeddings) — encodes token position by rotation, generalizing well to longer sequences.
- Feed-forward
- SwiGLU/SiLU activations, replacing the original Transformer's ReLU for better quality.
Grouped-Query Attention (GQA)
In standard attention, every "query head" has its own "key" and "value" heads — accurate but memory-hungry. The opposite extreme (one shared key/value) is fast but lower quality. GQA splits the difference: query heads are grouped, and each group shares one key/value head. The result is much faster inference and a smaller memory footprint during generation, with almost no quality loss. This is a big part of why Mistral feels snappy even on modest hardware.
Sliding Window Attention (SWA)
Normally every token attends to all previous tokens, so cost grows steeply with length. Mistral instead lets each token attend only to a fixed window of recent tokens (e.g. 4096). The clever part: because Transformers stack layers, information still propagates further than the window — a token in a higher layer indirectly "sees" tokens up to window × layers back. So you get most of the reach of full attention at a fraction of the compute.
Rolling Buffer KV Cache
During generation the model caches the keys and values it has computed so it doesn't redo work. Paired with the sliding window, that cache can be a fixed-size rolling buffer: once it's full, the oldest entry is overwritten rather than the cache growing forever. Memory use stays flat no matter how long the conversation gets — another reason it runs comfortably on a laptop.
Mixtral: Sparse Mixture-of-Experts (MoE)
Mistral's bigger sibling, Mixtral 8×7B, swaps each feed-forward block for 8 parallel "experts" plus a small router that picks just 2 of them per token. So although Mixtral holds ~47B parameters total, only about 13B are active for any given token — you get the knowledge of a large model at the inference cost of a much smaller one. This same "total vs. active parameters" idea now appears across many frontier models.
Everything in this guide is running and configuring an existing model. Fine-tuning (the Train It tab) adapts one to your data and is the realistic next step. Training from scratch — what Mistral AI did to create these — costs millions in compute and isn't a hobby project. Knowing where that line sits is part of understanding the technology honestly.
Will your computer handle this?
Short answer: almost certainly yes, to start. Here's the honest breakdown so you know where you stand and what (if anything) is worth upgrading later.
If your computer has 16 GB of RAM, you can run Mistral 7B and even do free fine-tuning in the cloud. You probably already have enough to begin. Don't buy anything until you've hit a real wall.
One idea makes sense of all the hardware talk: a model runs fast when it fits entirely in fast memory, and slow when it doesn't. A graphics card's memory (VRAM) is the fastest. Your system RAM is the fallback. When a model is too big for what's available, it "spills over" and slows down a lot — often 10 times slower. That's the whole game. Everything below is just detail on that one rule.
System RAM, tier by tier
This is your computer's main memory — the number most laptops advertise. Here's what each level lets you do.
8 GB — the bare entry
Runs a 7B model like Mistral, but it'll be tight and you won't want much else open. Fine for a first taste; you'll feel the squeeze quickly.
16 GB — the comfortable start
The realistic sweet spot for beginners. Runs 7–8B models smoothly and lets you keep a browser and notes open at the same time. If you're buying nothing, aim to at least have this.
32 GB — breathing room
Comfortably handles bigger models and heavier multitasking. A great target if you're buying a machine you want to grow into without overspending.
64 GB — serious headroom
For large models or running several things at once. More than a first project needs — don't pay for this unless you know you'll use it.
The graphics card (GPU)
Optional for running, the big speed-up for training. This is what turns "a few words per second" into "instant."
Runs on CPU
Totally fine for learning. Mistral still works, just slower — a few words per second. Most laptops are here, and that's okay to start.
The sweet spot
An NVIDIA card in this range (e.g. an RTX 4060-class) runs 7–8B models fast and is the most practical target if you choose to buy.
Room to grow
Runs larger models and makes a genuinely capable fine-tuning machine. More than you need on day one, but future-proof.
On Windows, an NVIDIA graphics card has the smoothest software support. On Mac, Apple Silicon (M-series) chips share memory between the system and graphics, so a 32–64 GB Mac can punch above its weight. Both are fully supported by Ollama — it's a preference, not a right-or-wrong.
You can do the entire learning journey — running Mistral and fine-tuning on Google Colab's free GPU — without buying anything. Start on what you own. Buy hardware only after you've confirmed you're hooked and hit a real limit. That's the order that saves money.
Teaching the model your data.
Once running Mistral feels easy, this is the exciting next step: shaping it toward a voice, a topic, or a task you care about.
First, the honest truth about words. "Training from scratch" — building a brain from nothing — is not what you'll do, and you shouldn't want to; it costs millions. What you'll actually do is fine-tuning: taking the smart model that already exists and nudging it with your own examples so it leans in a direction you choose. Think of it as coaching a talented employee, not raising a child from birth.
Instead of rewriting the whole 7-billion-parameter brain (which needs monstrous hardware), a technique called LoRA trains a tiny set of "adapter" layers on top — like sticky notes on a textbook rather than rewriting the book. This is what makes fine-tuning possible on free, everyday hardware.
The beginner's path, in five honest steps
Decide what you're teaching
Be specific and small. "Answer questions about our school's rules in a friendly tone" beats "be smarter." A narrow goal is far easier to reach and to test.
Build a small dataset
Fine-tuning learns from examples, usually pairs of "here's an input, here's the ideal response." Even 50–200 good examples can teach a style or a topic. You and your teammate can write these in a simple file. Quality beats quantity every time.
{"instruction": "When does the library close?", "output": "The library closes at 9pm on weekdays!"} {"instruction": "Can I bring food inside?", "output": "Drinks with lids are fine, but please no hot food."}
Borrow a free GPU
You don't need to buy hardware. Google Colab gives you a free graphics card in your browser. This is genuinely the part that makes the whole thing accessible to beginners — no purchase, no setup, just a web page.
Use a ready-made notebook
You don't write the training code from scratch. A free tool called Unsloth publishes beginner notebooks where you essentially drop in your dataset and click "Run All." It's built to be fast and to fit inside Colab's free tier. Their ready-made Mistral 7B notebook is here: Unsloth Mistral 7B Colab notebook, and the full collection is at github.com/unslothai/notebooks.
Your first run might take 30–60 minutes and may hit a few errors — that's normal and part of learning. Change one thing at a time, re-run, repeat. This is where you'll learn the most.
Bring it home to Ollama
Here's the satisfying part: the notebook can export your fine-tuned model in a format (called GGUF) that Ollama understands. You copy it back to your computer, register it with Ollama, and now ollama run launches your custom model — the same simple workflow from the Build It tab, but it's yours.
Train in the cloud (free GPU) → export → run locally with the exact same commands you already learned. Nothing you learned in Build It goes to waste.
Before full fine-tuning, try a system prompt — a few sentences telling the model how to behave ("You are a friendly library assistant. Keep answers short."). It's free, instant, and often gets you 80% of the way. Reach for fine-tuning only when prompting isn't enough.
What a local model like Mistral is actually used for.
Running an AI on your own machine isn't just a hobby exercise — it solves real problems for individuals and companies. Here's the practical landscape, plus project ideas to try yourself.
People choose a local open model over a cloud AI service for three concrete reasons: privacy (your data never leaves your machine — decisive for sensitive or regulated work), cost (no per-message API fees; run it as much as you like for the price of electricity), and control (it works offline and can be customized). These drivers explain everything below.
Writing & editing
Drafting emails, summarizing long documents, rewriting and proofreading — fully offline and private.
Coding help
Explaining code, generating snippets, debugging. A private pair-programmer with no subscription.
Language practice
A patient partner to practice a new language with — it corrects gently and never tires.
Learning & tutoring
Explaining concepts, working through problems, generating study questions — a tutor on your laptop.
Private notes & journaling
Organizing or querying personal writing you'd never want uploaded to anyone's servers.
Brainstorming
Ideas, outlines, planning — a thinking partner that's available offline with no usage meter running.
For organizations, the privacy angle becomes a hard requirement, not a preference. Healthcare providers, law firms, and government contractors often cannot send patient records, legal files, or sensitive data to a third-party cloud API. Running a model locally means the data never leaves their network — frequently the only compliant way to use AI at all.
The same logic recurs everywhere: a small open model run locally handles high-volume, privacy-sensitive, or cost-sensitive work, while teams reserve big cloud services for the hardest reasoning. Most real setups are hybrid — and "run it yourself," which this guide teaches, is the foundation.
Projects to try yourself
Want to build rather than just understand? These are great first projects, sorted by how approachable they are.
A themed chatbot
A chat box with a personality you design via system prompt — a study buddy, recipe helper, or polite support demo. The classic, satisfying first build.
A writing assistant
Paste rough notes, get back a tidy email or summary. You can judge quality instantly and it's useful day to day.
"Chat with your documents"
Feed in a PDF and ask questions about it. This pattern is called RAG — the model reads your files before answering. Hugely practical.
An auto-sorter / tagger
Feed it messages or reviews and have it label them (topic, sentiment, urgency). A gentle intro to using AI on data instead of chat.
A quiz generator
Give it a topic or your notes; it writes practice questions and checks answers. A great fine-tuning showcase.
A mini help desk
Fine-tune on a club, school, or small business's FAQs so it answers in the right voice. The natural payoff of the Train It tab.
Pick the smallest project you'd personally find useful or fun. Real motivation beats an impressive-sounding idea you abandon — and you can always grow it once the first version works.
The jargon, demystified.
Every scary word on this site, in one place, explained like you're a smart friend — not a computer science exam.
- LLM
- "Large Language Model." The kind of AI that reads and writes text. Mistral is one. The "large" refers to how much it learned, not its file size.
- Mistral
- A family of free, open-source LLMs made by a French company. "Open" means anyone can download and run it — that's why it's perfect for learning.
- Ollama
- The free app that downloads and runs models on your computer with one command. The "engine" that powers everything here.
- Parameters (e.g. "7B")
- The model's adjustable knobs, learned during training. "7B" = 7 billion of them. More usually means smarter but heavier to run.
- Prompt
- What you type to the model. A "system prompt" is a hidden instruction that sets its behavior before the conversation starts.
- Token
- A chunk of text the model reads and writes — roughly ¾ of a word. "Tokens per second" is how speed is measured.
- Fine-tuning
- Adjusting an existing model with your own examples so it leans toward your style or topic. Coaching, not rebuilding.
- LoRA / QLoRA
- A clever shortcut that fine-tunes a small "adapter" instead of the whole model — so it fits on free hardware. QLoRA is the memory-saving version.
- GPU
- A graphics card. Great at the math AI needs, so it makes models much faster. Optional for running, very helpful for training.
- RAM / VRAM
- Your computer's working memory (RAM) and your graphics card's memory (VRAM). Models need enough of it to fit while running.
- Quantization
- Shrinking a model by storing its numbers less precisely. Slightly less sharp, but far smaller and faster — how a 7B model fits on a laptop.
- RAG
- "Retrieval-Augmented Generation." The model looks things up in your documents before answering, so it can talk about your specific stuff.
- GGUF
- A model file format Ollama can run. When you fine-tune in the cloud, you export to GGUF to bring the result home.
- Google Colab
- A free website that gives you a borrowed GPU inside your browser. Where beginners do fine-tuning without buying hardware.
- Gradio / Streamlit
- Free Python tools that turn a few lines of code into a working web page with buttons and chat boxes. How you give your model a "face."
- Inference
- The fancy word for "the model actually running and producing an answer." Training is learning; inference is doing.
- Transformer
- The neural-network design behind almost all modern language models, including Mistral. "Decoder-only" is the text-generation variant.
- Tokenizer
- The component that splits your text into tokens before the model reads it, and stitches tokens back into text on the way out.
- Embedding
- Turning each token into a list of numbers that captures its meaning, so the model can do math with language.
- Attention
- The mechanism that lets the model weigh which earlier words matter most when predicting the next one — the core of a Transformer.
- GQA (Grouped-Query Attention)
- A Mistral efficiency trick where attention heads share key/value data, cutting memory use and speeding up replies.
- Sliding Window Attention
- Each token attends only to a recent window of tokens rather than all of them — far cheaper, with reach extended across stacked layers.
- Temperature
- A dial for randomness in the model's word choices. Low = focused and predictable; high = varied and creative.
- System prompt
- Hidden instructions that set the model's behavior or persona before your conversation begins (e.g. "You are a friendly tutor").
- Mixture-of-Experts (MoE)
- An architecture (used by Mixtral) with many "expert" sub-networks where only a few activate per token — big capacity, smaller running cost.
- Context window
- How much text the model can consider at once — your prompt plus its reply. Measured in tokens.
- Open weights / Apache 2.0
- Mistral's weights are downloadable under a permissive license — free to use, modify, and even use commercially.
Common snags, and how to fix them.
Everyone hits these. They're not failures — they're rites of passage. Tap a question to open it.
"ollama is not recognized" in PowerShell
The model is painfully slow
The download keeps failing or stalling
"Out of memory" errors
Colab disconnected in the middle of training
My fine-tuned model didn't really change
How do my teammate and I work on this together?
Is any of this going to cost money?
The "we did it" checklist.
Keep this open and tick each box with your teammate.
- Both of us installed Ollama on our computers
- We opened PowerShell and ran ollama run mistral
- We saw the >>> prompt and chatted with the model
- We asked it 5+ different things and noted what it did well
- We picked one idea from the Applications tab to aim for
- (Bonus) We tried writing a system prompt to change its personality
- (Stretch) We read through the Train It tab together
Everything you need, linked.
The official sources for every tool in this guide. Links open in a new tab. Tools update often, so these official pages will always have the current version.
ollama run mistral command and version tags.You're closer than you think.
The hardest part of any project is the first command. Open PowerShell, type one line, and you've already started. Everything after that is just curiosity.
Get Ollama → start now