Word Counting My Whole Site
2019-09-30
My site is static HTML, built with Jekyll (more details in my colophon). This means I have a folder that contains the whole site in HTML files.
I wanted to find the total word count. I found this combination of commands works great:
find . -iname "*.html" | parallel pandoc -t plain | wc -w
find, to list the relative paths of all the HTML files locally.
parallel, to run a command on each file in parallel.
pandoc, the document converter, to convert the input HTML to plain text.
wc, to calculate the total word count.
It took about 2 seconds on my computer to tell me my site currently has about 75,000 words. More than I expected, though this counts words in footers etc. many times over.
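To see which pages contribute most to that total, a per-file breakdown can help. The following is only a sketch: the function name words_per_file is made up for illustration, it runs pandoc serially rather than through parallel, and it assumes no file path contains a newline.

```sh
# Print "<word count><TAB><path>" for every HTML file under a directory,
# sorted from fewest words to most.
# Assumptions: pandoc is installed, and no path contains a newline.
words_per_file() {
  dir="${1:-.}"
  find "$dir" -iname "*.html" |
    while IFS= read -r file; do
      # tr strips the padding that some wc implementations add
      count=$(pandoc -t plain "$file" | wc -w | tr -d ' ')
      printf '%s\t%s\n' "$count" "$file"
    done |
    sort -n
}
```

Running words_per_file . from the site folder lists every page, with the longest ones at the bottom.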
Thanks to pandoc’s universality, you can also use this to count words in many file formats: markdown, reStructuredText, MS Word, etc.
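As a rough sketch of that idea, the function below takes a filename pattern so the same pipeline can be pointed at other formats. The name count_words and its arguments are hypothetical; it relies on pandoc guessing the input format from each file's extension, and it assumes no path contains a newline.

```sh
# Total word count for all files matching a name pattern under a directory.
# Assumptions: pandoc is installed and recognises the files' extensions.
count_words() {
  pattern="$1"
  dir="${2:-.}"
  find "$dir" -iname "$pattern" |
    while IFS= read -r file; do
      pandoc -t plain "$file"
    done |
    wc -w | tr -d ' '
}
```

For example, count_words "*.md" _posts would total the words in the Markdown sources under a _posts folder, if your site has one.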
If your site is more dynamic, but still small enough to download, you might consider using GNU wget. Its --recursive flag will let you download every page as HTML locally, following links to find everything on the website.
For an example, see this gist.
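A minimal sketch of the download step, assuming GNU Wget is the tool in question. The function name mirror_site and the default folder name mirror are invented for illustration; the flags themselves are standard Wget options.

```sh
# Download a site into a local folder so its pages can be word counted.
# Assumption: GNU Wget is installed. The URL you pass is your own site's.
mirror_site() {
  url="$1"
  dest="${2:-mirror}"
  # --recursive         follow links to fetch every reachable page
  # --no-parent         never climb above the starting URL
  # --adjust-extension  save HTML pages with a .html suffix so find matches them
  # --directory-prefix  put everything under the destination folder
  wget --recursive --no-parent --adjust-extension \
    --directory-prefix="$dest" "$url"
}
```

After something like mirror_site https://example.com/ mirror, the find, parallel, pandoc and wc pipeline above can be run from inside the mirror folder.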
Tags: commandline, jekyll
© 2019 All rights reserved.