Simple link checking describes how I check internal links, but what about checking external links? You need to traverse every page on your website to find the external links to check, but you definitely must not traverse every page on external sites! I found a tool that supports this: linkchecker. (Note that development seems to have stopped there and moved to a new group, but I’ve linked to the version packaged by Debian stable because that’s what I’m using.)
The key to checking external links without recursing through external websites with linkchecker is to pass the flags --check-extern and --no-follow-url=!://DOMAIN-BEING-CHECKED/ (e.g. --no-follow-url=!://www.johntobin.ie/); this will check external links but will not recurse on any URL that doesn't match ://DOMAIN-BEING-CHECKED/.
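Putting those flags together, the full command looks roughly like this (a sketch - substitute your own domain):

# Recursively check www.johntobin.ie; external links are checked but never recursed into.
linkchecker --check-extern '--no-follow-url=!://www.johntobin.ie/' \
    https://www.johntobin.ie/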
I ran linkchecker like this for a month or so, but after the initial cleanup where I fixed some broken links it was too noisy - temporary failures happen frequently enough that the signal-to-noise ratio was very low. Some sites consistently fail, e.g. Wikipedia and Amazon, and consistent failures can easily be excluded with --ignore-url=//en.wikipedia.org, but most failures are transient. (Wikipedia and Amazon block the default linkchecker User-Agent; setting the User-Agent to match Chrome's fixes them.) linkchecker supports a simple output format of failure_count URL that is updated on each run, but the counters are never reset and it doesn't track when the failures occurred, so the signal-to-noise ratio for alerts based on that would decline over time.
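Both workarounds are just extra flags on the same command line, something like this (the Chrome User-Agent string below is only an illustration, not the exact string I send):

# Exclude a consistently failing site and/or present a browser-like User-Agent.
linkchecker --check-extern '--no-follow-url=!://www.johntobin.ie/' \
    --ignore-url=//en.wikipedia.org \
    --user-agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36' \
    https://www.johntobin.ie/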
I decided to write a wrapper to post-process the results and only warn about URLs that fail multiple times in a short period. Happily linkchecker supports SQL output, so I can import the failures into an SQLite database and easily query it. The schema that linkchecker uses is fine except that it doesn't have a timestamp, but that was easy to solve with SQLite: when creating the database I add an extra column defined as timestamp DATETIME DEFAULT CURRENT_TIMESTAMP, which is automatically populated when rows are inserted without a value for it. I arbitrarily picked 3 failures in 2 days as the threshold for warning about a URL, but increased it to 12 failures in 2 days (1 failure every 4 hours) after too many false positives. I only count each URL as failing once per hour, regardless of how many times it failed within that hour, to avoid alerting when a URL that is linked many times has a temporary failure (there's a sketch of that query below). The output looks like this:
Bad URLs for https://www.johntobin.ie/ since 2019-06-08
https://www.example.org/directory
https://www.example.org/directory/
Output in /tmp/linkchecker-cron.t8EB6fzK2F
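The heart of the wrapper is a query implementing those thresholds. Here's a rough sketch, assuming the failures are imported into the linksdb table from linkchecker's create.sql (with valid = 0 marking a failure) plus the extra timestamp column; the database filename is made up, and the real wrapper may differ:

# Warn about URLs with at least 12 failures in the last 2 days, counting each
# URL at most once per hour.
sqlite3 linkchecker.sqlite <<'EOF'
SELECT urlname
  FROM (SELECT DISTINCT urlname,
                        strftime('%Y-%m-%d %H', timestamp) AS failure_hour
          FROM linksdb
         WHERE valid = 0
           AND timestamp >= datetime('now', '-2 days'))
 GROUP BY urlname
HAVING COUNT(*) >= 12
 ORDER BY urlname;
EOF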
To investigate further I can use SQL queries. The output files are also available for debugging when linkchecker fails; otherwise they are cleaned up. Both the output files and the database contain the referring URL for each failure, so it's easy to go edit the page and fix the link if there is a genuine failure, e.g. several links in my blog needed to be updated because the destinations had moved over the years.
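For example, something like this (same assumed schema as above, where parentname is the referring URL and result is the error message) lists the recent failures for a single URL and the pages linking to it:

# When did this URL fail recently, and which of my pages link to it?
sqlite3 linkchecker.sqlite <<'EOF'
SELECT timestamp, parentname, result
  FROM linksdb
 WHERE urlname = 'https://www.example.org/directory'
   AND valid = 0
 ORDER BY timestamp DESC;
EOF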
The wrapper program is linkchecker-cron, and my linkcheckerrc might also be useful.