How to download a documentation website with Wget
Sometimes you want to download a whole website so you have a local copy that you can browse offline. When programming, this is often useful for documentation sites that do not provide downloadable versions and are not available in offline tools like DevDocs.
One tool for making such copies is Wget (from “web get”). With the right flags, Wget can download a whole website and convert it for offline browsing.
Wget is widely available from platform package managers.
On macOS, you can use Homebrew:
$ brew install wget
On Windows, you can use Chocolatey:
> choco install wget
On Linux, most distributions have Wget pre-installed. If not, it’s normally installable from a
How to download a website
You can invoke this single big Wget command to download a site, replacing <website> with the URL of the site:
$ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent <website>
The URL may be either the full domain, such as https://www.example.com, or have a path prefix, such as https://www.example.com/tutorial/en/. (We’ll take apart all those flags in a few sections.)
Downloading a website can take a little while, even on a fast connection. This is because Wget downloads pages one at a time, in order to discover links as it goes.
Wget stores the downloaded pages in a directory named after the website’s domain name, such as www.example.com. After Wget has completed, you can open pages from there in your web browser, and navigate as usual.
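If you script your downloads, it can help to know the output directory up front. Here is a minimal sketch of a hypothetical helper, site_dir (not part of Wget), that derives the directory name from a URL using shell parameter expansion. It assumes Wget's default behaviour described above and a URL with no username or explicit port.

```shell
# Hypothetical helper: print the directory Wget is expected to create for a URL.
# Assumes default Wget settings and a URL without userinfo or an explicit port.
site_dir() {
    url="$1"
    host="${url#*://}"   # drop the scheme, e.g. "https://"
    host="${host%%/*}"   # drop everything from the first "/" onwards
    printf '%s\n' "$host"
}

site_dir "https://www.example.com/tutorial/en/"
# prints: www.example.com
```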
Example: the Django REST Framework documentation
The DRF documentation is available on DevDocs, but it can be out of date. And unfortunately, the DRF site doesn’t provide downloads.
You can use the above Wget command to download the Django REST Framework documentation like so:
$ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://www.django-rest-framework.org/
Wget prints a lot of output, starting:
--2021-10-27 10:56:12-- https://www.django-rest-framework.org/
Resolving www.django-rest-framework.org (www.django-rest-framework.org)... 184.108.40.206, 220.127.116.11, 18.104.22.168, ...
Connecting to www.django-rest-framework.org (www.django-rest-framework.org)|22.214.171.124|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30663 (30K) [text/html]
Saving to: ‘www.django-rest-framework.org/index.html’
www.django-rest-fr 100%[==============>] 29.94K --.-KB/s in 0.002s
2021-10-27 10:56:12 (13.2 MB/s) - ‘www.django-rest-framework.org/index.html’ saved [30663/30663]
Loading robots.txt; please ignore errors.
--2021-10-27 10:56:12-- https://www.django-rest-framework.org/robots.txt
Reusing existing connection to www.django-rest-framework.org:443.
HTTP request sent, awaiting response... 404 Not Found
2021-10-27 10:56:12 ERROR 404: Not Found.
...
…after downloading every file, Wget finishes by converting links:
...
Converting links in www.django-rest-framework.org/community/3.9-announcement/index.html... 109.
109-0
Converting links in www.django-rest-framework.org/css/prettify.css... nothing to do.
Converting links in www.django-rest-framework.org/css/bootstrap.css... 2.
2-0
Converting links in www.django-rest-framework.org/css/default.css... 1.
1-0
Converting links in www.django-rest-framework.org/css/bootstrap-responsive.css... nothing to do.
Converted links in 73 files in 0.3 seconds.
…and it’s done.
Once Wget has finished, you can check the downloaded files:
$ ls www.django-rest-framework.org
api-guide  css  index.html  search  tutorial
community  img  js          topics
Things seem in place. To read the offline copy, you can open index.html in the browser, and browse away as usual.
Read offline documentation with Python’s web server
Some websites do not work when opened as a .html file in the web browser. This is because they use web features that browsers block on file:// URLs, for security. To make such offline copies work, you need to open them over http:// URLs, via a local web server, and luckily there’s one built in to Python.
For example, take the Django Girls Tutorial at https://tutorial.djangogirls.org/en/. After downloading the site with Wget, you can open its pages in the browser, but navigation doesn’t work. If you open the browser’s developer console, you’ll see errors from clicking links, such as:
Security Error: Content at file:///.../tutorial.djangogirls.org/en/index.html may not load data from file:///.../tutorial.djangogirls.org/en/intro_to_command_line/index.html.
Uncaught DOMException: The operation is insecure.
You can fix these errors by loading the site through Python’s built-in web server. (This server is only suitable for local development, like Django’s runserver.)
To do so, navigate to the site folder:
$ cd tutorial.djangogirls.org
$ ls
en gitbook
…then, start the web server:
$ python -m http.server 8001
Serving HTTP on :: port 8001 (http://[::]:8001/) ...
Note this command explicitly uses port 8001, to avoid colliding with Django’s runserver, which you probably have running. Both http.server and runserver default to port 8000.
With the server running, open http://localhost:8001 in the browser, and you’ll find the documentation loads with working navigation. Huzzah!
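If you do this often, you could wrap the server command in a small shell function. The sketch below is one possible shape, not an established tool: serve_site is a hypothetical name, and it assumes Python 3.7 or later is available as python (on some systems the command is python3). It uses http.server's --directory option, so you don't need to cd first, and --bind 127.0.0.1, so the server only listens on your own machine.

```shell
# Hypothetical helper: serve a downloaded site on localhost only.
# Assumes Python 3.7+ is installed as "python" (it may be "python3" for you).
serve_site() {
    dir="${1:-.}"        # directory that Wget created; defaults to the current one
    port="${2:-8001}"    # 8001 avoids clashing with runserver's default of 8000
    python -m http.server "$port" --bind 127.0.0.1 --directory "$dir"
}

# Usage (runs until you press Ctrl+C):
#   serve_site tutorial.djangogirls.org 8001
```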
An Explanation of All the Flags
Wget has very many options. Here’s a brief explanation of the flags we’re using:
--mirror activates several behaviours for downloading (or mirroring) a whole site.
--convert-links converts links in HTML, CSS, and other files to be relative, so they refer to the offline copies.
--adjust-extension adds appropriate filename extensions where the site does not use them, so that your browser can correctly interpret them.
--page-requisites downloads included resources such as CSS and images, rather than just the HTML.
--no-parent prevents recursion to parent paths, so that Wget only downloads the given sub-path of the website. This doesn’t prevent it from downloading resources that live in a parent path.
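To make the “several behaviours” of --mirror concrete, the Wget manual describes it as currently equivalent to -r -N -l inf --no-remove-listing. The sketch below spells that out using the long option names. The wrapper name mirror_expanded is hypothetical, and the equivalence may change between Wget versions, so treat it as illustrative.

```shell
# Hypothetical wrapper: the same download with --mirror written out as the
# options the Wget manual currently lists as its equivalent.
mirror_expanded() {
    wget --recursive --timestamping --level=inf --no-remove-listing \
        --convert-links --adjust-extension --page-requisites --no-parent "$1"
}

# Usage:
#   mirror_expanded https://www.example.com/tutorial/en/
```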
Another flag that you may find useful is --wait <n>, which limits bandwidth consumption by adding a delay of <n> seconds between requests. This can lighten the load both for others on your internet connection and the web server you’re downloading from.
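As a sketch of how --wait might be combined with the main command, here is a hypothetical wrapper function, mirror_politely. The one-second default delay is an arbitrary choice for illustration, not a recommendation from the Wget documentation.

```shell
# Hypothetical wrapper: the mirroring command from this post plus a delay.
mirror_politely() {
    url="$1"
    delay="${2:-1}"   # seconds between requests; 1 is an arbitrary default
    wget --mirror --convert-links --adjust-extension --page-requisites \
        --no-parent --wait "$delay" "$url"
}

# Usage: wait 2 seconds between requests
#   mirror_politely https://www.example.com/tutorial/en/ 2
```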
For more info see the Wget documentation.