I’ve been thinking about adding this to my “Fuck it, I’ll do it myself” / SHTF pile. I have a spare 10-15GB for a good selection of basic articles (across sciences, history, pop culture trivia etc).

https://get.kiwix.org/en/solutions/hotspots/content-bundles/

https://get.kiwix.org/en/solutions/hotspots/imager-service/

There’s something inherently cool about having Wikipedia in a box (yes, you’d likely need to refresh it once a year), but I’ve never heard of anyone actually self-hosting a Kiwix instance.

  • surfrock66@lemmy.world · 2 days ago

    Yes, and I actually use it to train a local LLM so I’m not hammering the internet. I have a ton of storage and like to keep my kids in the sandbox, so we have Wikipedia, Project Gutenberg, Khan Academy, and a bunch of others, all hosted behind an Apache reverse proxy that uses Mellon, so there’s LDAP auth.

    • Domi@lemmy.secnd.me · 1 day ago

      Do you actually train the LLM or use RAG? I have been looking for a local LLM + Wikipedia RAG solution for a while now.

      For now I just have kiwix-serve + searxng doing a simple search but the Kiwix search is…questionable.

      • SuspciousCarrot78@lemmy.worldOP · edited · 20 hours ago

        Found it. Old chat with ChatGPT. Lemme know if you want the how-to.


        Here’s the clean handoff note.

        ZIM / Kiwix idea — summary

        Core idea: use Wikipedia ZIMs as a local retrieval substrate to make a small 4B model act smarter without touching weights. The win is not “the model now knows Wikipedia.” The win is that the system can consult a large local corpus before priors, under deterministic rules, with no internet dependency. Kiwix uses ZIM as compressed offline content, and kiwix-serve can expose that content over HTTP; libzim can also address entries directly by title or path. (Kiwix)

        Important distinction: Kiwix/ZIM can absolutely be part of an always-on layer, but the true hot path should not be “read full article over HTTP every turn.” The lowest-latency design is to query the archive directly in-process via libzim/python-libzim using exact title/path resolution first, keep the archive handle hot, and lean on the built-in title index plus dirent/cluster caches. kiwix-serve is still useful beside that as a human-facing browser/search layer. (libzim Documentation)

        Minimal-latency tricks

        • Poll the archive directly on the hot path. Use getEntryByTitle / getEntryByPath first. That avoids HTTP overhead and keeps lookup deterministic. libzim supports exact title/path fetches, and python-libzim exposes whether the archive has a title index or full-text index. (libzim Documentation)
        • Keep Kiwix HTTP for browse/search, not core inference. kiwix-serve exposes /search and /raw as public endpoints, plus /suggest, /content, and /viewer as private endpoints. That means you can answer fast from direct lookup, then attach a browser link to the full page for the human. (Kiwix Tools)
        • Use exact title lookup first; use search only as fallback. kiwix-serve’s /suggest uses the title index and can add a full-text-search option when the ZIM includes a full-text index. /search performs full-text search and returns links with snippets, which is useful for ambiguous/descriptive queries, but it should not be the first move on the hot path. (Kiwix Tools)
        • Use --nodatealiases for cleaner stable links. That lets wikipedia_en_all_2026-03 also resolve as wikipedia_en_all, which is handy if you want stable URLs across snapshot refreshes. (Kiwix Tools)

        How the model should use it

        Do not let the model answer by regurgitating a full page. That is the dumb path.

        The adapter should turn a page into bounded evidence units, for example:

        • canonical title
        • path
        • short lead
        • headings list
        • one named section
        • one small paragraph window
        • full_article_url

        Then the model answers from those bounded units. That keeps ctx use sane and forces the system to populate an answer rather than vomiting page text back at the operator.
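        As a sketch, an adapter along these lines could do the slicing, assuming the article arrives as HTML; the class, the 800-character cap, and the localhost viewer URL are all illustrative assumptions:

```python
# Evidence-slicer sketch: turn raw article HTML into bounded evidence
# units instead of dumping the whole page into context. Illustrative only.
from html.parser import HTMLParser

class _Slicer(HTMLParser):
    """Collect h2/h3 headings and paragraph texts from article HTML."""
    def __init__(self):
        super().__init__()
        self.headings: list[str] = []
        self.paragraphs: list[str] = []
        self._tag = None
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3", "p"):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = "".join(self._buf).strip()
            if text:
                (self.headings if tag in ("h2", "h3") else self.paragraphs).append(text)
            self._tag = None

def slice_evidence(title: str, path: str, html: str, max_chars: int = 800) -> dict:
    """Return bounded evidence units for the model, plus a deep-read link."""
    s = _Slicer()
    s.feed(html)
    return {
        "canonical_title": title,
        "path": path,
        "lead": (s.paragraphs[0] if s.paragraphs else "")[:max_chars],
        "headings": s.headings,
        "section_window": " ".join(s.paragraphs[1:3])[:max_chars],
        "full_article_url": f"http://localhost:8080/viewer#{path}",  # assumed kiwix-serve port
    }
```

        The model then receives the `slice_evidence(...)` dict as context, never the raw page.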

        In other words:

        • Kiwix/ZIM = corpus
        • adapter = evidence slicer
        • model = synthesizer
        • HTTP link = human deep-read escape hatch

        That is how you get useful “smarts” from a 4B model instead of article-scale mush.

        Corpus reality

        mini is not full article text. Kiwix defines mini as only the introduction plus infobox. nopic is full articles without images. maxi is full fat. Also, Kiwix currently says incremental ZIM updates are not available; operationally, updates are snapshot-swap, not rolling in-place refreshes. (Kiwix)

        For English Wikipedia right now:

        • wikipedia_en_all_mini_2026-03 is about 12.4 GB raw bytes, roughly 11.6 GiB, and only gives you intros + infoboxes. (Wikimedia Downloads)
        • wikipedia_en_all_nopic_2026-03 is about 51.9 GB raw bytes, roughly 48.4 GiB, and gives you full article text without images. (Wikimedia Downloads)
        • wikipedia_en_wp1-0.8_nopic_2026-04 is about 2.0 GB and is the nicest “broad but sane” full-text English subset we discussed. wikipedia_en_top_nopic_2026-03 is about 2.22 GB and is another broad compact option. (Wikimedia Downloads)

        Best bang-for-buck bundles

        ~5 GB class, full text, no images

        A very sane bundle is:

        • WP1 0.8 nopic: 1.995 GB
        • medicine nopic: 0.862 GB
        • history nopic: 0.640 GB
        • astronomy nopic: 0.352 GB
        • physics nopic: 0.308 GB
        • geography nopic: 0.521 GB

        That totals about 4.68 GB decimal. Practical read: broad core plus medicine/history/hard science/geography, all as full article text without images. (Wikimedia Downloads)

        ~10 GB class

        A clean upgrade is the same bundle plus movies:

        • movies nopic: 2.129 GB

        That brings the bundle to about 6.81 GB decimal, leaving plenty of room under a 10 GB target for one or two more topic packs later. (Wikimedia Downloads)
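        The quoted totals check out; a quick arithmetic sanity check (all sizes in decimal GB, taken from the lists above):

```python
# Verify the quoted bundle totals (decimal GB, per the lists above).
five_gb_bundle = [1.995, 0.862, 0.640, 0.352, 0.308, 0.521]  # WP1, medicine, history,
                                                             # astronomy, physics, geography
base = sum(five_gb_bundle)
print(f"~5 GB class:  {base:.2f} GB")            # 4.68 GB
print(f"~10 GB class: {base + 2.129:.2f} GB")    # + movies nopic = 6.81 GB
```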

        Operational model

        The sane maintenance model is:

        1. download dated ZIM snapshots
        2. validate them
        3. swap the served file/library
        4. keep old snapshot until the new one is verified
        5. delete old snapshot later

        That is because Kiwix currently says no incremental updates for Wikipedia ZIMs. So your biannual plan is viable, but it is a snapshot-swap regime, not “keep matching pages and only patch deltas.” (Kiwix)
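        The swap step itself can be a few lines of shell. Below is a runnable simulation using dummy files; the filenames are illustrative, and a real run would fetch the snapshot from the Kiwix mirror and compare it against its published SHA-256, which this sketch only fakes:

```shell
#!/bin/sh
# Snapshot-swap sketch with dummy files; filenames are illustrative.
set -eu

ZIM_DIR=$(mktemp -d)                  # stand-in for the served ZIM library
OLD="$ZIM_DIR/wikipedia_en_all_2025-09.zim"
NEW="$ZIM_DIR/wikipedia_en_all_2026-03.zim"
printf 'old snapshot' > "$OLD"
printf 'new snapshot' > "$NEW"

# 1) validate the download (here we fake the published checksum)
PUBLISHED=$(sha256sum "$NEW" | cut -d' ' -f1)
ACTUAL=$(sha256sum "$NEW" | cut -d' ' -f1)
[ "$PUBLISHED" = "$ACTUAL" ] || { echo "checksum mismatch, keeping old"; exit 1; }

# 2) swap: repoint the stable name to the new snapshot
ln -sfn "$NEW" "$ZIM_DIR/wikipedia_en_all.zim"

# 3) keep $OLD until the new snapshot is verified in service; delete later
echo "serving: $(readlink "$ZIM_DIR/wikipedia_en_all.zim")"
```

        In practice you would also restart kiwix-serve (or reload its library) after the swap, since it holds the old file open.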

        Bottom line

        Yes, this can increase the effective intelligence of your 4B model without touching weights. But only if you treat ZIM/Kiwix as a deterministic local corpus with a thin adapter, not as “giant article dump goes straight into context.”

        The winning design is:

        • hot path: direct libzim title/path lookup
        • fallback: Kiwix search/suggest
        • model input: bounded evidence units, not whole pages
        • human deep read: HTTP link to /viewer or /raw
        • updates: snapshot-swap twice yearly
        • best small bundle: WP1 + curated nopic topic packs

        PS: ZIM files are here btw

        https://dumps.wikimedia.org/kiwix/zim/wikipedia/

      • SuspciousCarrot78@lemmy.worldOP · edited · 1 day ago

        Somewhere in my documents, I have a scoped ticket for how to use Kiwix as the source for the LLM to pull information from directly, populate its answer organically, and naturally respond to the question at hand, without word-vomiting a complete wiki entry. The last I looked, you can poll the Kiwix DB directly without using the search engine.

        I can dig that up for you if it still exists; it’s actually why I’m looking at kiwix (back burner project for now but the spirit moved me).

        PS: You’re aware of LLM-wiki? That might suit your purposes better, if your corpus is bespoke and updating. Works nicely.

        https://tinyurl.com/llmwiki

    • SuspciousCarrot78@lemmy.worldOP · 2 days ago

      That was actually my immediate thought. I already have Wikipedia as a trusted source for the LLM, but I would prefer to self-host and not hammer them.

      130GB to fit the entirety of Wikipedia is basically nothing, and I’m mildly embarrassed not to have done it already.

      • surfrock66@lemmy.world · 2 days ago

        I also try to participate in some of the farms, running zimit and mwoffliner to help make more archives. Feels like I’m helping.

  • Iced Raktajino@startrek.website · edited · 2 days ago

    Yep, and I love it.

    I’ve got a little Banana Pi M4 Zero (PiZero form factor but much more powerful and with 4 GB RAM) loaded up with, among other useful tools, Kiwix and the full Wikipedia dump. I just refreshed it with the 2026-02 full dump, so I’m caught up for the year. I’ve also got a lot of other offline docs loaded up (React, Bun, and the devdocs for several libraries I use) and it’s nice to have local copies of those instead of googling every time.

    Surprisingly, the full ~130 GB Wikipedia dump works fine on a regular Pi Zero 2 with 512 MB RAM. I don’t know how ZIM works but it does work very very well.

    • clif@lemmy.world · 1 day ago

      Similar setup here. An Orange Pi Zero that starts the Kiwix server at boot and switches the WiFi to AP mode. Just plug it in, connect to the kiwix WiFi, access kiwix.local via phone browser, and shazam.

      • Iced Raktajino@startrek.website · 1 day ago

        Nice! Those Allwinner boards are a little tricky to get going and have some quirks, but the price is great for the extra horsepower you get. Granted, I use the latest Armbian since the manufacturer’s images are all quite old.

        • clif@lemmy.world · 1 day ago

          When I saw the default configured repos were hosted by Huawei I did a double take, then installed Armbian too : D

    • SuspciousCarrot78@lemmy.worldOP · edited · 2 days ago

      I actually have a spare pi4 doing nothing, so was thinking of adding this to its jobs list.

      130GB for the entire thing? And the pi doesn’t choke on indexing / searching it?

      On that: how capable is the search engine (I assume it has one?)

      • Iced Raktajino@startrek.website · 2 days ago

        130GB for the entire thing? And the pi doesn’t choke on indexing / searching it?

        That was my thought. I knew it couldn’t hold it in RAM but thought it would be doing crazy IO and limited by being on SD, but it seems to not be a problem. Like I said, I don’t know how ZIM does it, but it does it well. Must have some kind of index that lets it fast travel to the correct blocks or something. I dunno lol.

        how capable is the search engine (I assume it has one?)

        Yep, it has search. It’s…okay but kind of primitive. It’s not slow, and if you’re searching for something that’s fairly unique (as far as keywords go), it does well. But if you’re searching something like an acronym where it shows up as a regular word in other entries, it’s a lot more hit or miss.

  • Decronym@lemmy.decronym.xyz [bot] · edited · 19 hours ago

    Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:

    AP: WiFi Access Point
    HTTP: Hypertext Transfer Protocol, the Web
    NAS: Network-Attached Storage
    VPN: Virtual Private Network

    4 acronyms in this thread; the most compressed thread commented on today has 6 acronyms.

    [Thread #261 for this comm, first seen 29th Apr 2026, 12:00]

  • comrade_twisty@feddit.org · 2 days ago

    I do, on my TrueNAS in a Docker container. I have about 1 TB of ZIM files hosted, including pre-LLM copies of the German, English, and French Wikipedia, as well as the last two current versions in those languages.

    Additionally, I have Project Gutenberg books in German and English, as well as lots of random technical, medical, survival, etc. stuff that I came across. A lot of that is trash, but sorting is too time-consuming and my NAS has 48 TB, so who cares…

    • SuspciousCarrot78@lemmy.worldOP · 2 days ago

      That’s awesome. If I understand correctly, the Kiwix server creates a local site you can access from anything on your WLAN, as a transparent website? I take it it auto-populates with your ZIM files, and that you can add to it (eg: Project Gutenberg).

      If so, that’s a hell of a thing.

    • DishaweslemOride@lemmy.org · 2 days ago

      Humorously, you could use an agent to help you sort things. If there’s anything it’s good at, it’s sorting.

      How do you like TrueNAS? I’m too locked in to Synology at this point, with almost 800 TB in physical drives (less actual capacity because of redundancy) and several devices.

      • comrade_twisty@feddit.org · 2 days ago

        Pretty happy with TrueNAS. I actually came from Synology and bought a UGreen DXP4800 Plus, didn’t like the UGOS on it, and pretty much immediately switched to TrueNAS. It’s been absolutely flawless for about 15 months now; Docker integration in the OS is a bit limited, but I run my compose stacks managed through Dockge anyway.

        I won’t let LLMs crawl my data, it’s mine and mine alone :)

  • skip0110@lemmy.zip · 2 days ago

    Yes, I self host the English Wikipedia dump, as well as a few cooking sites and topic specific stack exchange dumps available in zim format.

    My goal is:

    • reduce dependence on public internet. In the event of an outage or restriction I’d like some books and other content I can use to entertain myself
    • locally preserve a snapshot of information before it is possibly diluted by LLM edits
  • shems@piefed.social · 2 days ago

    I switched to an N150 some time ago, but I previously had it running perfectly on a Pi 4 with only 2GB of RAM. There’s actually a lot more content available than just Wikipedia! You can even archive your own websites using https://zimit.kiwix.org/

    It’s fun, and Kiwix is impressively lightweight; it uses less than 50 MB of RAM, even with an article loaded.

    https://imgur.com/a/DmmqJdh