Word Counting My Whole Site
My site is static HTML, built with Jekyll (more details in my colophon). This means I have a folder that contains the whole site in HTML files.
I wanted to find the total word count. I found this combination of commands works great:
find . -iname "*.html" | parallel pandoc -t plain | wc -w
find to list the relative paths of all the HTML files locally.
parallel to run a command on each file in parallel.
pandoc, a document converter, to convert the input HTML to plain text.
wc to calculate the total word count.
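If GNU parallel isn't installed, the same idea can be sketched with find's own -exec, which runs pandoc once per file, one after another. This is slower but needs one less tool. The function name count_html_words is made up for illustration, and the sketch assumes pandoc is on your PATH:

```shell
# Hypothetical helper: count words across every HTML file below a folder.
count_html_words() {
  # $1 is the folder to scan; defaults to the current directory.
  # -exec ... \; starts one pandoc process per file, sequentially.
  find "${1:-.}" -iname "*.html" -exec pandoc -t plain {} \; | wc -w
}
```

Run it as count_html_words _site, or with no argument from inside the built site folder.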
It took about 2 seconds on my computer to tell me my site currently has about 75,000 words. That's more than I expected, though the total counts repeated elements such as footers once for every page.
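To see where the words come from, a per-page breakdown can help. Here is a rough sketch, not something I'd call polished: it loops over the files one at a time and prints each page's count next to its path. The name words_per_page is hypothetical, and it assumes no file paths contain newlines:

```shell
# Hypothetical helper: print "<word count><TAB><path>" for each HTML file,
# sorted so the longest pages come last.
words_per_page() {
  find "${1:-.}" -iname "*.html" | while IFS= read -r page; do
    # tr strips the padding some wc implementations add around the number.
    count="$(pandoc -t plain "$page" | wc -w | tr -d ' ')"
    printf '%s\t%s\n' "$count" "$page"
  done | sort -n
}
```

For example, words_per_page . | tail -n 10 lists the ten longest pages. Each count still includes the shared footer text on that page.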
Thanks to pandoc’s universality, you can also use this to count words in many file formats: markdown, reStructuredText, MS Word, etc.
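As a sketch of that, the pipeline only needs a different set of file name patterns, since pandoc guesses the input format from each file's extension. The function name count_doc_words and the choice of extensions are my assumptions here; adjust them to whatever your folder contains:

```shell
# Hypothetical helper: count words in Markdown, reStructuredText,
# and MS Word files below a folder, using GNU parallel as before.
count_doc_words() {
  # pandoc infers the input format from .md / .rst / .docx.
  find "${1:-.}" \( -iname "*.md" -o -iname "*.rst" -o -iname "*.docx" \) \
    | parallel pandoc -t plain \
    | wc -w
}
```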
If your site is more dynamic, but still small enough to download, you might consider using GNU Wget. Its --recursive flag will let you download every page as HTML locally, following links to find everything on the website.
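A minimal sketch of that approach, assuming the downloader is GNU Wget and that pandoc is installed. The function name count_remote_site_words is made up, and everything beyond --recursive is my own addition: --no-parent keeps the crawl below the starting URL, --adjust-extension saves HTML pages with a .html suffix so find can match them, and --wait=1 pauses a second between requests to go easy on the server:

```shell
# Hypothetical helper: mirror a small site into a temp folder, then count its words.
count_remote_site_words() {
  url="$1"
  download_dir="$(mktemp -d)" || return 1
  # wget may exit non-zero if a few links are broken; the pages it did fetch still get counted.
  wget --quiet --recursive --no-parent --adjust-extension --wait=1 \
    --directory-prefix="$download_dir" "$url"
  find "$download_dir" -iname "*.html" -exec pandoc -t plain {} \; | wc -w
}
```

Call it with your own site's address, for example count_remote_site_words https://example.com/ (a placeholder URL). Only point it at sites you're allowed to crawl.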
Tags: commandline, jekyll