Word Counting My Whole Site

My site is static HTML, built with Jekyll (more details in my colophon). This means I have a folder that contains the whole site in HTML files.
I wanted to find the total word count. I found this combination of commands works great:
find . -iname "*.html" | parallel pandoc -t plain | wc -w
It uses:
- UNIX find to list the relative paths of all the HTML files locally.
- GNU parallel to run a command on each file in parallel.
- pandoc document converter to convert the input HTML to plain text.
- UNIX wc to calculate the total word count.
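If GNU parallel isn't installed, or some file names contain spaces, the same idea can be sketched with xargs instead. This is a rough variant rather than what I actually ran: the function name count_site_words is made up for illustration, it assumes pandoc is on your PATH, and it converts files one at a time rather than in parallel.

```shell
# Hypothetical helper: total word count of every HTML file under a directory.
# Assumes pandoc is installed. Uses xargs in place of GNU parallel.
count_site_words() {
    dir="${1:-.}"
    # -print0 and -0 pass paths NUL-separated, so spaces in names are safe.
    # -n 1 runs one pandoc process per file, sequentially.
    find "$dir" -iname "*.html" -print0 |
        xargs -0 -n 1 pandoc -t plain |
        wc -w
}
```

Calling `count_site_words _site` would then print a single total for that folder (`_site` being Jekyll's default output directory).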
It took about 2 seconds on my computer to tell me my site currently has about 75,000 words. That's more than I expected, though the total counts repeated page elements such as footers once for every page they appear on.
Thanks to pandoc’s universality, you can also use this to count words in many file formats: markdown, reStructuredText, MS Word, etc.
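As a sketch of that, here is one way to get a count per format. pandoc picks the input format from each file's extension, so only the find pattern changes. The function name count_words_by_format is hypothetical, and pandoc is assumed to be installed.

```shell
# Hypothetical helper: print a word count for each file extension given.
# Usage: count_words_by_format DIR EXT [EXT...]
# Assumes pandoc is installed; it infers the input format from the extension.
count_words_by_format() {
    dir="$1"
    shift
    for ext in "$@"; do
        count=$(find "$dir" -iname "*.$ext" -print0 |
            xargs -0 -n 1 pandoc -t plain |
            wc -w |
            tr -d ' ')   # some wc implementations pad the number with spaces
        printf '%s: %s\n' "$ext" "$count"
    done
}
```

For example, `count_words_by_format . md rst docx` would print one line per extension.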
If your site is more dynamic, but still small enough to download, you might consider using GNU wget. Its --recursive flag will let you download every page as HTML locally, following links to find everything on the website.
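A minimal sketch of that download step follows. The function name mirror_site and the choice of extra flags are my assumptions, not something I've tuned for any particular site.

```shell
# Hypothetical helper: download a site's pages into the current directory.
# Assumes GNU wget is installed.
mirror_site() {
    url="$1"
    # --recursive         follow links to find further pages
    # --no-parent         never ascend above the starting URL's directory
    # --adjust-extension  save HTML pages with a .html suffix, so that
    #                     find's "*.html" pattern matches them afterwards
    # --wait=1            pause one second between requests
    wget --recursive --no-parent --adjust-extension --wait=1 "$url"
}
```

wget saves the pages under a directory named after the host, so after something like `mirror_site https://example.com/` you can run the word count pipeline inside that directory.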
Tags: commandline, jekyll