A guide for complete beginners

Build your first AI model — without the overwhelm.

You don't need a PhD, a supercomputer, or years of experience. With an open-source model like Mistral and a free tool called Ollama, you can have an AI running on your own computer today. This guide walks you through it, one calm step at a time.

100% free to start Works on Windows No prior experience Nothing to break

installrunexperimenttrainbuildshareinstallrunexperimenttrainbuildshare

— 00 / What's inside this guide

How to use this site.

Nine tabs, roughly in the order you'd want them. Jump around freely — each stands on its own.

Understand

Overview · How It Works · Technical

What a model is, how Mistral turns words into replies, and — if you want it — the real architecture under the hood.

Build It · Train It

Get Mistral running on your computer step by step, then (later) adapt it to your own data with fine-tuning.

Go further

Applications · Hardware · Glossary · Help

Who uses this and why, what your machine needs, every term defined, and fixes for common snags.

— 01 / The big picture

First, what even is a model?

Before touching anything, let's clear up the words. Two minutes here saves hours later.

The model

The "brain"

A large file of learned patterns. Mistral is one of these, trained on huge amounts of text. On its own it just sits there as data, waiting to be run.

The runner

The "engine"

Software that loads the model and lets you talk to it. Ollama is the friendliest one for beginners — download, click, done.

The interface

The "face"

How a person interacts — a terminal, a chat box, a web page. You add this last, once the brain and engine are working.

— 02 / Setting expectations

"Making a model" means three very different things.

Most beginners mix these up. Here's what's realistic to start with — and what to save for later.

Using a model as-is

Take Mistral exactly as it comes and put it to work. This is where you start, and it's genuinely powerful on its own. → the Build It tab.

Start here

Fine-tuning

Teaching an existing model your own style or data. Very doable as a second project using a free online GPU. → the Train It tab.

Project two

Training from scratch

Building a model from nothing. This costs millions and needs warehouses of hardware — not a beginner project, and that's fine.

Not now

Good news

When people say "we built an AI," they almost always mean A — wrapping a ready-made model in something useful. That's a brilliant first project. Do that first, then graduate to B when you're curious.

— 03 / Before you start

What you'll need on your machine.

Nothing exotic. Most laptops from the last few years can do this.

Operating system

Windows 10 or 11

Windows 10 needs version 1903 or newer. Windows 11 is fine as-is. Mac and Linux work too.

Memory (RAM)

8 GB minimum

8 GB runs Mistral 7B; 16 GB is comfier for doing other things at the same time.

Free disk space

~10 GB

The model file is about 4 GB; leave headroom. An SSD loads models noticeably faster than an old hard drive.

Heads up

A graphics card (GPU) is optional. Without one, Mistral still runs on your CPU — just slower, generating a few words per second. That's completely fine for learning and testing. If responses feel sluggish, that's why, and it's not a mistake on your part.

— How It Works / The mechanism

How Mistral turns your words into a reply.

The gentle version — no math, just the core idea. (The Technical tab goes deeper if you want it.) Understanding this makes every later choice make sense.

At its heart, Mistral is a very sophisticated next-word predictor. It was trained on an enormous amount of text and learned the statistical patterns of language — which words and ideas tend to follow which. When you type a prompt, it predicts the most fitting next piece of text, then the next, then the next, building its answer one small chunk (a "token") at a time.

It isn't looking up stored answers or copying from a database — it's generating, guided by the patterns it absorbed in training. That's why it can write things it has never seen before, and also why it can occasionally state something wrong with total confidence: it's predicting plausible text, not consulting facts.

What happens when you press Enter

Every reply flows through these stages, in a fraction of a second per token.

STAGE 1

Tokenize

Your text is split into tokens — roughly word-pieces the model can work with.

STAGE 2

Embed

Each token becomes a list of numbers capturing its meaning and position.

STAGE 3

Attention

The model weighs which earlier tokens matter most for what comes next.

STAGE 4

Predict

It outputs a probability for every possible next token and picks one.

STAGE 5

Repeat

The chosen token is added and the loop runs again — until the reply is done.

Why "temperature" and randomness exist

Because step 4 produces probabilities, the model can pick the single most likely token (predictable, sometimes dull) or sample a little more loosely (more varied, more creative). That dial is called temperature — low for focused factual answers, higher for brainstorming. Same model, different flavor of output.

That's the whole engine. Everything Mistral does — answering, summarizing, coding, role-play — is this predict-a-token-and-repeat loop running very fast. The Technical tab explains the clever tricks (grouped-query attention, sliding windows) that make Mistral do this efficiently enough to run on your laptop.

— Build / Hands on

Get Mistral running, step by step.

Five small steps, each with a checkpoint so you always know it worked before moving on. Do it once yourself before teaching anyone — that's what makes it click.

What you're doing

Install Ollama (the engine that runs the model), then type one command that downloads and starts Mistral. That's the whole job — everything below is just doing it carefully.

Install Ollama

Go to ollama.com/download and get the Windows installer.
Run it — double-click and click through, no settings to change. Tip: right-click → "Run as administrator" avoids path issues.
Ollama now runs quietly in the background.

What this does

Ollama is the "engine" from the Overview tab. It handles all the hard parts of loading and running a model for you.

Open your command line

Press the Windows key.
Type PowerShell and press Enter.
A dark window opens — that's where the next commands go. (If you installed as administrator, open a fresh window so it picks up the change.)

Check: a PowerShell window is open and waiting.

Download and run Mistral

Type this one line and press Enter. The first run downloads the model (about 4 GB — give it a few minutes); after that it's instant.

Windows PowerShell

PS> ollama run mistral
pulling manifest...
success — talk to your model below
>>> Hello! Who are you?

Check: you see the >>> prompt — the model is alive and waiting.

Chat with it, then take notes

Type a question and press Enter. That's it — you're running your own AI.

Ask it to write, explain, brainstorm — notice what it's good and bad at.
Jot down what surprises you; those notes become your teaching material.
Type /bye (or close the window) to stop. Run ollama run mistral again anytime — no re-download.

Add a simple web face (optional)

Once it works in the terminal, you can put a real chat box in front of it with a free Python tool like Gradio or Streamlit — a few lines of code becomes a working app.

a taste of what's ahead

# install once
PS> pip install gradio ollama

# ~10 lines later, a web chat box opens in your browser
PS> python app.py
Running on http://127.0.0.1:7860

The natural next milestone once you're comfortable in the terminal.

Advanced — for the technically curious

Under the hood: how Mistral actually works.

A deeper look than the rest of the guide, with the real architecture terms. You don't need any of this to run the model — it's here for when you want to understand what you're running.

A language model like Mistral is, at heart, a next-token predictor. Text is chopped into "tokens" (roughly word-pieces); the model reads the tokens so far and outputs a probability for every possible next token, picks one, appends it, and repeats. Everything it does — answering, coding, reasoning — is that loop running fast. It isn't looking answers up; it's generating them from statistical patterns learned in training, which is also why it can occasionally be confidently wrong.

The architecture underneath is a decoder-only Transformer — the same broad family as GPT and Llama. What makes Mistral notable is a set of efficiency innovations that let a 7-billion-parameter model match much larger ones. Here are the real ones:

Base: Decoder-only Transformer, ~7.3B parameters, released under the open Apache 2.0 license.
Normalization: RMSNorm for stable, efficient layer normalization.
Position encoding: RoPE (Rotary Position Embeddings) — encodes token position by rotation, generalizing well to longer sequences.
Feed-forward: SwiGLU/SiLU activations, replacing the original Transformer's ReLU for better quality.

Grouped-Query Attention (GQA)

In standard attention, every "query head" has its own "key" and "value" heads — accurate but memory-hungry. The opposite extreme (one shared key/value) is fast but lower quality. GQA splits the difference: query heads are grouped, and each group shares one key/value head. The result is much faster inference and a smaller memory footprint during generation, with almost no quality loss. This is a big part of why Mistral feels snappy even on modest hardware.

Sliding Window Attention (SWA)

Normally every token attends to all previous tokens, so cost grows steeply with length. Mistral instead lets each token attend only to a fixed window of recent tokens (e.g. 4096). The clever part: because Transformers stack layers, information still propagates further than the window — a token in a higher layer indirectly "sees" tokens up to window × layers back. So you get most of the reach of full attention at a fraction of the compute.

Rolling Buffer KV Cache

During generation the model caches the keys and values it has computed so it doesn't redo work. Paired with the sliding window, that cache can be a fixed-size rolling buffer: once it's full, the oldest entry is overwritten rather than the cache growing forever. Memory use stays flat no matter how long the conversation gets — another reason it runs comfortably on a laptop.

Mixtral: Sparse Mixture-of-Experts (MoE)

Mistral's bigger sibling, Mixtral 8×7B, swaps each feed-forward block for 8 parallel "experts" plus a small router that picks just 2 of them per token. So although Mixtral holds ~47B parameters total, only about 13B are active for any given token — you get the knowledge of a large model at the inference cost of a much smaller one. This same "total vs. active parameters" idea now appears across many frontier models.

The honest boundary of "build your own model"

Everything in this guide is running and configuring an existing model. Fine-tuning (the Train It tab) adapts one to your data and is the realistic next step. Training from scratch — what Mistral AI did to create these — costs millions in compute and isn't a hobby project. Knowing where that line sits is part of understanding the technology honestly.

— Hardware / What you need

Will your computer handle this?

Short answer: almost certainly yes, to start. Here's the honest breakdown so you know where you stand and what (if anything) is worth upgrading later.

The one-line version

If your computer has 16 GB of RAM, you can run Mistral 7B and even do free fine-tuning in the cloud. You probably already have enough to begin. Don't buy anything until you've hit a real wall.

One idea makes sense of all the hardware talk: a model runs fast when it fits entirely in fast memory, and slow when it doesn't. A graphics card's memory (VRAM) is the fastest. Your system RAM is the fallback. When a model is too big for what's available, it "spills over" and slows down a lot — often 10 times slower. That's the whole game. Everything below is just detail on that one rule.

System RAM, tier by tier

This is your computer's main memory — the number most laptops advertise. Here's what each level lets you do.

8 GB — the bare entry

Runs a 7B model like Mistral, but it'll be tight and you won't want much else open. Fine for a first taste; you'll feel the squeeze quickly.

Workable

16 GB — the comfortable start

The realistic sweet spot for beginners. Runs 7–8B models smoothly and lets you keep a browser and notes open at the same time. If you're buying nothing, aim to at least have this.

Recommended

32 GB — breathing room

Comfortably handles bigger models and heavier multitasking. A great target if you're buying a machine you want to grow into without overspending.

Great

64 GB — serious headroom

For large models or running several things at once. More than a first project needs — don't pay for this unless you know you'll use it.

Overkill to start

The graphics card (GPU)

Optional for running, the big speed-up for training. This is what turns "a few words per second" into "instant."

No dedicated GPU

Runs on CPU

Totally fine for learning. Mistral still works, just slower — a few words per second. Most laptops are here, and that's okay to start.

8–12 GB VRAM

The sweet spot

An NVIDIA card in this range (e.g. an RTX 4060-class) runs 7–8B models fast and is the most practical target if you choose to buy.

16–24 GB VRAM

Room to grow

Runs larger models and makes a genuinely capable fine-tuning machine. More than you need on day one, but future-proof.

Two viable paths if you do buy

On Windows, an NVIDIA graphics card has the smoothest software support. On Mac, Apple Silicon (M-series) chips share memory between the system and graphics, so a 32–64 GB Mac can punch above its weight. Both are fully supported by Ollama — it's a preference, not a right-or-wrong.

Before you spend a cent

You can do the entire learning journey — running Mistral and fine-tuning on Google Colab's free GPU — without buying anything. Start on what you own. Buy hardware only after you've confirmed you're hooked and hit a real limit. That's the order that saves money.

— Train / Project two

Teaching the model your data.

Once running Mistral feels easy, this is the exciting next step: shaping it toward a voice, a topic, or a task you care about.

First, the honest truth about words. "Training from scratch" — building a brain from nothing — is not what you'll do, and you shouldn't want to; it costs millions. What you'll actually do is fine-tuning: taking the smart model that already exists and nudging it with your own examples so it leans in a direction you choose. Think of it as coaching a talented employee, not raising a child from birth.

The key idea: LoRA

Instead of rewriting the whole 7-billion-parameter brain (which needs monstrous hardware), a technique called LoRA trains a tiny set of "adapter" layers on top — like sticky notes on a textbook rather than rewriting the book. This is what makes fine-tuning possible on free, everyday hardware.

The beginner's path, in five honest steps

Decide what you're teaching

Be specific and small. "Answer questions about our school's rules in a friendly tone" beats "be smarter." A narrow goal is far easier to reach and to test.

Build a small dataset

Fine-tuning learns from examples, usually pairs of "here's an input, here's the ideal response." Even 50–200 good examples can teach a style or a topic. You and your teammate can write these in a simple file. Quality beats quantity every time.

example: a few rows of training data

{"instruction": "When does the library close?",
 "output": "The library closes at 9pm on weekdays!"}
{"instruction": "Can I bring food inside?",
 "output": "Drinks with lids are fine, but please no hot food."}

Borrow a free GPU

You don't need to buy hardware. Google Colab gives you a free graphics card in your browser. This is genuinely the part that makes the whole thing accessible to beginners — no purchase, no setup, just a web page.

Use a ready-made notebook

You don't write the training code from scratch. A free tool called Unsloth publishes beginner notebooks where you essentially drop in your dataset and click "Run All." It's built to be fast and to fit inside Colab's free tier. Their ready-made Mistral 7B notebook is here: Unsloth Mistral 7B Colab notebook, and the full collection is at github.com/unslothai/notebooks.

Set expectations

Your first run might take 30–60 minutes and may hit a few errors — that's normal and part of learning. Change one thing at a time, re-run, repeat. This is where you'll learn the most.

Bring it home to Ollama

Here's the satisfying part: the notebook can export your fine-tuned model in a format (called GGUF) that Ollama understands. You copy it back to your computer, register it with Ollama, and now ollama run launches your custom model — the same simple workflow from the Build It tab, but it's yours.

Full circle

Train in the cloud (free GPU) → export → run locally with the exact same commands you already learned. Nothing you learned in Build It goes to waste.

A gentler alternative first

Before full fine-tuning, try a system prompt — a few sentences telling the model how to behave ("You are a friendly library assistant. Keep answers short."). It's free, instant, and often gets you 80% of the way. Reach for fine-tuning only when prompting isn't enough.

— Applications / Who uses this and why

What a local model like Mistral is actually used for.

Running an AI on your own machine isn't just a hobby exercise — it solves real problems for individuals and companies. Here's the practical landscape, plus project ideas to try yourself.

Why local, specifically

People choose a local open model over a cloud AI service for three concrete reasons: privacy (your data never leaves your machine — decisive for sensitive or regulated work), cost (no per-message API fees; run it as much as you like for the price of electricity), and control (it works offline and can be customized). These drivers explain everything below.

For individuals

✍️

Writing & editing

Drafting emails, summarizing long documents, rewriting and proofreading — fully offline and private.

💻

Coding help

Explaining code, generating snippets, debugging. A private pair-programmer with no subscription.

🌐

Language practice

A patient partner to practice a new language with — it corrects gently and never tires.

📚

Learning & tutoring

Explaining concepts, working through problems, generating study questions — a tutor on your laptop.

🔒

Private notes & journaling

Organizing or querying personal writing you'd never want uploaded to anyone's servers.

🧩

Brainstorming

Ideas, outlines, planning — a thinking partner that's available offline with no usage meter running.

For companies

For organizations, the privacy angle becomes a hard requirement, not a preference. Healthcare providers, law firms, and government contractors often cannot send patient records, legal files, or sensitive data to a third-party cloud API. Running a model locally means the data never leaves their network — frequently the only compliant way to use AI at all.

Healthcare

Summarizing and drafting clinical documentation on-premises, where patient data can't legally go to an outside service.

Legal

Searching and summarizing long contracts and case files (often via RAG), with confidential documents staying inside the firm.

Finance

Drafting reports and compliance checks in settings where data-sovereignty rules forbid external APIs.

Software teams

Code generation, review, and documentation kept in-house so proprietary source never leaves the network.

Customer support

Internal chat assistants and draft-reply tools trained on the company's own knowledge base and tone.

Operations / admin

Turning messy notes, forms, and transcripts into clean structured text — high volume, low cost, on local hardware.

The pattern to notice

The same logic recurs everywhere: a small open model run locally handles high-volume, privacy-sensitive, or cost-sensitive work, while teams reserve big cloud services for the hardest reasoning. Most real setups are hybrid — and "run it yourself," which this guide teaches, is the foundation.

Projects to try yourself

Want to build rather than just understand? These are great first projects, sorted by how approachable they are.

💬

A themed chatbot

A chat box with a personality you design via system prompt — a study buddy, recipe helper, or polite support demo. The classic, satisfying first build.

BeginnerPrompt only

📝

A writing assistant

Paste rough notes, get back a tidy email or summary. You can judge quality instantly and it's useful day to day.

BeginnerWeb face

📚

"Chat with your documents"

Feed in a PDF and ask questions about it. This pattern is called RAG — the model reads your files before answering. Hugely practical.

IntermediateRAG

🏷️

An auto-sorter / tagger

Feed it messages or reviews and have it label them (topic, sentiment, urgency). A gentle intro to using AI on data instead of chat.

IntermediateClassification

🎓

A quiz generator

Give it a topic or your notes; it writes practice questions and checks answers. A great fine-tuning showcase.

IntermediateFine-tune friendly

🏢

A mini help desk

Fine-tune on a club, school, or small business's FAQs so it answers in the right voice. The natural payoff of the Train It tab.

IntermediateFine-tune

How to choose

Pick the smallest project you'd personally find useful or fun. Real motivation beats an impressive-sounding idea you abandon — and you can always grow it once the first version works.

— Glossary / Plain English

The jargon, demystified.

Every scary word on this site, in one place, explained like you're a smart friend — not a computer science exam.

LLM: "Large Language Model." The kind of AI that reads and writes text. Mistral is one. The "large" refers to how much it learned, not its file size.
Mistral: A family of free, open-source LLMs made by a French company. "Open" means anyone can download and run it — that's why it's perfect for learning.
Ollama: The free app that downloads and runs models on your computer with one command. The "engine" that powers everything here.
Parameters (e.g. "7B"): The model's adjustable knobs, learned during training. "7B" = 7 billion of them. More usually means smarter but heavier to run.
Prompt: What you type to the model. A "system prompt" is a hidden instruction that sets its behavior before the conversation starts.
Token: A chunk of text the model reads and writes — roughly ¾ of a word. "Tokens per second" is how speed is measured.
Fine-tuning: Adjusting an existing model with your own examples so it leans toward your style or topic. Coaching, not rebuilding.
LoRA / QLoRA: A clever shortcut that fine-tunes a small "adapter" instead of the whole model — so it fits on free hardware. QLoRA is the memory-saving version.
GPU: A graphics card. Great at the math AI needs, so it makes models much faster. Optional for running, very helpful for training.
RAM / VRAM: Your computer's working memory (RAM) and your graphics card's memory (VRAM). Models need enough of it to fit while running.
Quantization: Shrinking a model by storing its numbers less precisely. Slightly less sharp, but far smaller and faster — how a 7B model fits on a laptop.
RAG: "Retrieval-Augmented Generation." The model looks things up in your documents before answering, so it can talk about your specific stuff.
GGUF: A model file format Ollama can run. When you fine-tune in the cloud, you export to GGUF to bring the result home.
Google Colab: A free website that gives you a borrowed GPU inside your browser. Where beginners do fine-tuning without buying hardware.
Gradio / Streamlit: Free Python tools that turn a few lines of code into a working web page with buttons and chat boxes. How you give your model a "face."
Inference: The fancy word for "the model actually running and producing an answer." Training is learning; inference is doing.
Transformer: The neural-network design behind almost all modern language models, including Mistral. "Decoder-only" is the text-generation variant.
Tokenizer: The component that splits your text into tokens before the model reads it, and stitches tokens back into text on the way out.
Embedding: Turning each token into a list of numbers that captures its meaning, so the model can do math with language.
Attention: The mechanism that lets the model weigh which earlier words matter most when predicting the next one — the core of a Transformer.
GQA (Grouped-Query Attention): A Mistral efficiency trick where attention heads share key/value data, cutting memory use and speeding up replies.
Sliding Window Attention: Each token attends only to a recent window of tokens rather than all of them — far cheaper, with reach extended across stacked layers.
Temperature: A dial for randomness in the model's word choices. Low = focused and predictable; high = varied and creative.
System prompt: Hidden instructions that set the model's behavior or persona before your conversation begins (e.g. "You are a friendly tutor").
Mixture-of-Experts (MoE): An architecture (used by Mixtral) with many "expert" sub-networks where only a few activate per token — big capacity, smaller running cost.
Context window: How much text the model can consider at once — your prompt plus its reply. Measured in tokens.
Open weights / Apache 2.0: Mistral's weights are downloadable under a permissive license — free to use, modify, and even use commercially.

— Help / When things go sideways

Common snags, and how to fix them.

Everyone hits these. They're not failures — they're rites of passage. Tap a question to open it.

"ollama is not recognized" in PowerShell

The install didn't get added to your system path yet. Close PowerShell completely and open a fresh window — that fixes it most of the time. If not, reinstall Ollama by right-clicking the installer and choosing "Run as administrator."

The model is painfully slow

You're almost certainly running on CPU instead of a GPU, which is normal and fine for learning. Responses of a few words per second are expected. To speed up: close other heavy apps, or try a smaller model. It's not broken — just thinking hard.

The download keeps failing or stalling

It's a ~4 GB download, so a shaky connection can interrupt it. Just run ollama run mistral again — it resumes rather than starting over. Make sure you have enough free disk space (around 10 GB to be safe).

"Out of memory" errors

The model needs more RAM (or VRAM) than is free right now. Close other programs, especially browsers with many tabs. If it persists, your machine may be below the 8 GB comfort line for a 7B model — try a smaller model like phi3 instead.

Colab disconnected in the middle of training

Free Colab times out if you're idle or after long sessions. Save your work to Google Drive often, keep the browser tab active, and if it drops, you can usually re-run from your last saved checkpoint. Annoying, but expected on the free tier.

My fine-tuned model didn't really change

Usually the dataset was too small or too inconsistent. Add more examples, make sure they all demonstrate the same behavior clearly, and check they're formatted the same way. Also confirm you actually loaded the fine-tuned version, not the original.

How do my teammate and I work on this together?

Keep your site and any code in a free GitHub account — it lets you both edit without emailing files around, and it keeps a history so nothing is ever truly lost. Netlify can publish straight from GitHub automatically. Split roles: one drives the technical steps while the other writes them up, then swap to check each other.

Is any of this going to cost money?

Running Mistral locally: free. Fine-tuning on Colab's free tier: free. Hosting this guide on Netlify: free. The only thing that costs money is running a model on a cloud server so strangers can use it 24/7 — and you save that for the very end, if ever.

— Help / First session

The "we did it" checklist.

Keep this open and tick each box with your teammate.

Both of us installed Ollama on our computers
We opened PowerShell and ran ollama run mistral
We saw the >>> prompt and chatted with the model
We asked it 5+ different things and noted what it did well
We picked one idea from the Applications tab to aim for
(Bonus) We tried writing a system prompt to change its personality
(Stretch) We read through the Train It tab together

— Help / Resources & downloads

Everything you need, linked.

The official sources for every tool in this guide. Links open in a new tab. Tools update often, so these official pages will always have the current version.

Ollama — download

The engine that runs Mistral locally. Get the Windows installer here. (Build It, step 1.)

ollama.com/download

Mistral model page

The official model listing, with the exact ollama run mistral command and version tags.

ollama.com/library/mistral

Browse all models

Other models you can run with the same one command — Llama, Phi, Gemma, Qwen and more.

ollama.com/search

Google Colab

Free GPU in your browser for fine-tuning — no hardware purchase needed. (Train It, step 3.)

colab.research.google.com

Unsloth notebooks

Ready-made fine-tuning notebooks — add your data, click "Run All." Includes the Mistral 7B notebook.

github.com/unslothai/notebooks

Unsloth docs

Beginner-friendly fine-tuning guides and the full notebook index, kept current.

unsloth.ai/docs

Gradio quickstart

Turn your model into a web chat box in a few lines of Python. (Build It, step 5.)

gradio.app/guides/quickstart

Mistral on Hugging Face

The official source for Mistral's open model weights and model cards.

huggingface.co/mistralai

You're closer than you think.

The hardest part of any project is the first command. Open PowerShell, type one line, and you've already started. Everything after that is just curiosity.

Get Ollama → start now