How to download a documentation website with Wget

Sometimes you want to download a whole website so you have a local copy that you can browse offline. When programming, this is often useful for documentation sites that do not provide downloadable versions and are not available in offline tools like DevDocs.
One tool for making such copies is Wget (from “web get”). With the right flags, Wget can download a whole website and convert it for offline browsing.
Install Wget
Wget is widely available from platform package managers.
On macOS, you can use Homebrew:
$ brew install wget
On Windows, you can use Chocolatey:
> choco install wget
On Linux, most distributions have Wget pre-installed. If not, it’s normally installable from a wget package.
How to download a website
You can invoke this single big Wget command to download a site, replacing <website> with the URL of the site:
$ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent <website>
The URL may be either the full domain, such as https://www.example.com, or have a path prefix, such as https://www.example.com/tutorial/en/. (We’ll take apart all those flags in a few sections.)
Downloading a website can take a little while, even on a fast connection. This is because Wget downloads pages one at a time, in order to discover links as it goes.
Wget stores the downloaded pages in a directory named after the website’s domain name, such as www.example.com. After Wget has completed, you can open pages from there in your web browser, and navigate as usual.
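If you would rather script the download, the same command can be assembled from Python. This is only a minimal sketch, not something Wget provides: the helper names build_wget_command, output_directory, and mirror are made up for illustration, and mirror assumes that wget is on your PATH.

```python
import shlex
import subprocess
from urllib.parse import urlparse

# The flags from the command above, in the same order.
MIRROR_FLAGS = [
    "--mirror",
    "--convert-links",
    "--adjust-extension",
    "--page-requisites",
    "--no-parent",
]


def build_wget_command(url):
    """Return the argument list for mirroring the given URL."""
    return ["wget", *MIRROR_FLAGS, url]


def output_directory(url):
    """Return the directory Wget saves into by default: the URL's host name.

    Assumption: the URL uses the default port for its scheme.
    """
    return urlparse(url).hostname


def mirror(url):
    """Run Wget and return its exit status (assumes wget is on PATH)."""
    # check=False: Wget exits non-zero if any request fails, for example a
    # broken link that returns 404, even when the rest of the mirror is fine.
    return subprocess.run(build_wget_command(url), check=False).returncode


# Print the command rather than running it, so the sketch has no side effects.
url = "https://www.example.com/tutorial/en/"
print(shlex.join(build_wget_command(url)))
print("Wget would save into:", output_directory(url))
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems if the URL contains characters such as & or ?.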
Example: the Django REST Framework documentation
The DRF documentation is available on DevDocs, but it can be out of date. And unfortunately, the DRF site doesn’t provide downloads.
You can use the above Wget command to download the Django REST Framework documentation like so:
$ wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://www.django-rest-framework.org/
Wget prints a lot of output, starting:
--2021-10-27 10:56:12-- https://www.django-rest-framework.org/
Resolving www.django-rest-framework.org (www.django-rest-framework.org)... 185.199.108.153, 185.199.109.153, 185.199.110.153, ...
Connecting to www.django-rest-framework.org (www.django-rest-framework.org)|185.199.108.153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30663 (30K) [text/html]
Saving to: ‘www.django-rest-framework.org/index.html’
www.django-rest-fr 100%[==============>] 29.94K --.-KB/s in 0.002s
2021-10-27 10:56:12 (13.2 MB/s) - ‘www.django-rest-framework.org/index.html’ saved [30663/30663]
Loading robots.txt; please ignore errors.
--2021-10-27 10:56:12-- https://www.django-rest-framework.org/robots.txt
Reusing existing connection to www.django-rest-framework.org:443.
HTTP request sent, awaiting response... 404 Not Found
2021-10-27 10:56:12 ERROR 404: Not Found.
...
…after downloading every file, Wget finishes by converting links:
...
Converting links in www.django-rest-framework.org/community/3.9-announcement/index.html... 109.
109-0
Converting links in www.django-rest-framework.org/css/prettify.css... nothing to do.
Converting links in www.django-rest-framework.org/css/bootstrap.css... 2.
2-0
Converting links in www.django-rest-framework.org/css/default.css... 1.
1-0
Converting links in www.django-rest-framework.org/css/bootstrap-responsive.css... nothing to do.
Converted links in 73 files in 0.3 seconds.
…and it’s done.
Once Wget has finished, you can check the downloaded files:
$ ls www.django-rest-framework.org
api-guide css index.html search tutorial
community img js topics
Things seem in place. To read the offline copy, you can open index.html in the browser, and browse away as usual.
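For a larger site, a directory listing may not tell you much about what was actually downloaded. As a rough sketch (the function name summarise_mirror is hypothetical), you could count the downloaded files by extension with Python’s standard library:

```python
from collections import Counter
from pathlib import Path


def summarise_mirror(root):
    """Count the files under root, grouped by filename extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Directory name taken from the example above; adjust it for other sites.
    summary = summarise_mirror("www.django-rest-framework.org")
    for suffix, count in summary.most_common():
        print(f"{suffix}: {count}")
```

A mirror with no .html files at all, or no .css files for a styled site, is a hint that something went wrong during the download.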
Read offline documentation with Python’s web server
Some websites do not work when opened as a .html file in the web browser. This is because they use web features that browsers block on file:// URLs, for security. To make such offline copies work, you need to open them over http:// URLs, via a local web server, and luckily there’s one built in to Python.
For example, take the Django Girls Tutorial at https://tutorial.djangogirls.org/en/. After downloading the site with Wget, you can open its pages in the browser, but navigation doesn’t work. If you open the browser’s developer console, you’ll see errors from clicking links, such as:
Security Error: Content at file:///.../tutorial.djangogirls.org/en/index.html may not load data from file:///.../tutorial.djangogirls.org/en/intro_to_command_line/index.html.
Uncaught DOMException: The operation is insecure.
These messages are the browser reporting that it is blocking the website’s use of JavaScript for navigation.
You can fix these errors by loading the site through Python’s built-in web server. (This server is only suitable for local development, like Django’s runserver.)
To do so, navigate to the site folder:
$ cd tutorial.djangogirls.org
$ ls
en gitbook
…then, start the web server:
$ python -m http.server 8001
Serving HTTP on :: port 8001 (http://[::]:8001/) ...
Note this command explicitly uses port 8001, to avoid colliding with Django’s runserver, which you probably have running. Both http.server and runserver default to port 8000.
With the server running, open http://localhost:8001 in the browser, and you’ll find the documentation loads with working navigation. Huzzah!
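If you want to serve the copy without changing directory first, the same standard-library server can be started from a short script. This is a sketch under assumptions, not a replacement for the command above: make_server and serve are hypothetical names, and unlike python -m http.server, this version deliberately binds to 127.0.0.1 only.

```python
import functools
import http.server


def make_server(directory, port=8001):
    """Create a local HTTP server for the files under directory."""
    # SimpleHTTPRequestHandler accepts a directory argument (Python 3.7+),
    # so there is no need to cd into the site folder first.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Bind to localhost only, so other machines cannot reach the server.
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


def serve(directory, port=8001):
    """Serve directory until interrupted with Ctrl+C."""
    with make_server(directory, port) as server:
        print(f"Serving {directory} at http://127.0.0.1:{port}/")
        try:
            server.serve_forever()
        except KeyboardInterrupt:
            print("Stopped.")
```

For the example above, calling serve("tutorial.djangogirls.org") from the directory where you ran Wget would make the tutorial available at http://127.0.0.1:8001/en/. As with the command-line server, this is only suitable for local use.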
An explanation of all the flags
Wget has very many options. Here’s a brief explanation of the flags we’re using:
--mirror activates several behaviours for downloading (or mirroring) a whole site.
--convert-links converts links in HTML, CSS, and other files to be relative, so they refer to the offline copies.
--adjust-extension adds appropriate filename extensions where the site does not use them, so that your browser can correctly interpret them.
--page-requisites downloads included resources such as CSS and images, rather than just the HTML.
--no-parent prevents recursion to parent paths, so that Wget only downloads the given sub-path of the website. This doesn’t prevent it from downloading resources that live in a parent path.
Another flag that you may find useful is --wait <n>, which slows the download down by adding a delay of <n> seconds between requests. This can lighten the load both for others on your internet connection and for the web server you’re downloading from.
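To illustrate where the extra flag goes, here is a small sketch that assembles the mirror command with a delay. The function name polite_command and the one-second delay are assumptions chosen for the example; pick a delay that suits the site you are downloading.

```python
import shlex

# The mirror command from earlier in this post, without the URL.
BASE_COMMAND = [
    "wget",
    "--mirror",
    "--convert-links",
    "--adjust-extension",
    "--page-requisites",
    "--no-parent",
]


def polite_command(url, wait_seconds):
    """Return the mirror command with a delay between requests."""
    if wait_seconds < 0:
        raise ValueError("wait_seconds must not be negative")
    # Options go before the URL, matching the commands shown above.
    return [*BASE_COMMAND, "--wait", str(wait_seconds), url]


# Print the command so it can be copied into a terminal.
print(shlex.join(polite_command("https://www.example.com/tutorial/en/", 1)))
```

The trade-off is time: with a delay between every request, a site with many pages and resources takes noticeably longer to download.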
For more info see the Wget documentation.