Why you should have a program check your links
Last updated: Sep 28, 2023
Broken links make a good site look unreliable. But as a site grows, manually keeping track of link quality is bound to fail.
It’s hard to exaggerate how important hypertext is to the internet. It puts the “web” in the World Wide Web. It’s the “H” in HTML.
Early search engines focused on semantic content. The big innovation of Google’s search was to focus on the link. If people link to a page, it’s probably interesting, right? Even today, the number of links that point to a site is the most important factor in that site’s search rankings.
The link is the vehicle of the internet. Unfortunately, like cars, links break. They break all the time.
If you run an information-rich site, you should care about broken links. When a link on your site breaks, your reader can’t go where you tell them to go. And information-rich sites probably have a lot of links. A human can’t keep track of all this: way too many links, breaking way too often.
The only reliable way to handle this problem is to use a program to check your site for broken links.
Reasons to check links
A documentarian’s first responsibility is always to the reader. Since broken links disrupt the reading experience, they violate the responsibility of documentation.
But other, less reader-centered reasons exist too:
- Links break at shocking rates
- Wikipedia maintains an overview of the research on link rot.
- It makes your organization look amateur.
- If your links break, what other more complicated systems are you breaking behind the scenes?
- Fixing broken links by hand is a huge amount of labor.
- Look through the commit history of a popular docs repo and count how many commits say something like “link fix.” If you find a lot, think about how much time was lost to fixing them (and don’t forget the cost of context-switching). If you don’t find any, the maintainers probably already have a link checker.
- Broken links affect SEO
- SEO is mysterious, but the general consensus is that broken links hurt search results.1 Besides, if a page on your own site goes dead, the people who linked to it will notice and may find some other place to link. And having quality sites link to your pages is the most important factor in search-engine rankings.
How to avoid broken links
Links break all the time. There’s only one surefire way to prevent it from happening.
The risk-free solution: going linkless
To prevent a site from having broken links, link to nothing.
You can have a long, single page, with no citations and no navigation. Users can arrive by typing your URL into their address bar, and navigate by scrolling or using the arrow keys. When they want to leave, they can press the back button, close their browser, or simply switch off their machine.
If this approach is too spartan for you, you risk having a site with broken links. But that’s okay! There are ways to mitigate the risks.
- Be smart about how you handle links
- Automate your link checking
Take care of your own house first
Other people will let links break. They’ll even say, “Who cares?” I care. You should care. Don’t be a link nihilist!
Cool documentarians know that cool URIs don’t change. (I will probably link this again).
Ideally, internal links shouldn’t break, for two reasons:
- Links to your own pages won’t work, making you look like a real amateur.
- People who link to your site expect the pages they link to will stay there.
I know: things move, get deleted, etc. But we should have ways to handle that with server redirects and clever path names. Each time you program a redirect, think about whether you could have selected a more durable path name in the first place.
I will not lie though: I’ve been uncool a time or two. And I’m not even sure how to add redirects (upcoming Nginx-themed post, I hope)!
But, if we have a periodic link checker, at least we can catch where we went wrong.
Be smart about your external links
For external links, the only thing to do is be mindful of what you link to. Is the link necessary? Does it seem stable? That “cool URIs don’t change” link from 1998 is probably safer than a ToS-breaking tweet from 3 a.m. this morning.
But everything could break. We live in a state of total flux.
If you really want to preserve a link, make sure you:
- Make a backup on the Internet Archive (a sketch of this follows below).
- Where possible, link to the root site instead of some deep URL path: prefer example.com over example.com/long/ugly/.../path.
- Don’t link to secondary sources that just summarize a primary source.
The “Prevention” section in the Wikipedia article on link rot has more advice.
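If you want to automate that first backup step, a request to the Internet Archive’s public Save Page Now endpoint (https://web.archive.org/save/<url>) is usually enough. Here is a rough sketch, assuming the `requests` package; the function name is my own, and the redirect-to-snapshot behavior is an assumption worth verifying against the current API.

```python
import requests

def archive_url(url: str) -> str:
    """Ask the Wayback Machine to snapshot `url` and return the snapshot's URL."""
    # Assumption: a GET on the Save Page Now endpoint triggers a crawl and
    # redirects to the archived copy (e.g. /web/<timestamp>/<url>).
    response = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
    response.raise_for_status()
    return response.url  # final URL after redirects should be the snapshot

if __name__ == "__main__":
    print(archive_url("https://www.w3.org/Provider/Style/URI"))
```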
But even the Internet Archive, even the “cool URIs don’t change” URL, probably won’t survive the heat death of the universe. We live in a state of total flux. Links break and rot.
The only reliable way to catch broken links is to check for them automatically.
Let the computer check links for you
If you have a lot of links, they will break one day. This is a natural part of the internet. But it’s much better if you find your broken links before your readers do.
In human hands, the task of finding all broken links would not only be sad and boring, but also error-prone and inefficient. This is why I advocate for using a program to periodically check links. Fortunately, many good link-checking applications exist (I benchmarked some).
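To make that concrete, here is a minimal sketch of what such a program does. It assumes the third-party `requests` and `beautifulsoup4` packages, and the function name and example URL are placeholders; real tools add crawling, retries, rate limiting, and reporting on top of this core loop.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def find_broken_links(page_url: str, timeout: float = 10.0) -> list[str]:
    """Fetch one page and return the links on it that do not resolve."""
    html = requests.get(page_url, timeout=timeout).text
    links = [
        urljoin(page_url, a["href"])  # resolve relative links against the page
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)
        if not a["href"].startswith(("#", "mailto:"))
    ]

    broken = []
    for link in links:
        try:
            # HEAD is cheap; some servers reject it, so fall back to GET.
            response = requests.head(link, allow_redirects=True, timeout=timeout)
            if response.status_code >= 400:
                response = requests.get(link, timeout=timeout)
            if response.status_code >= 400:
                broken.append(link)
        except requests.RequestException:
            broken.append(link)
    return broken

if __name__ == "__main__":
    for url in find_broken_links("https://example.com/"):
        print(f"broken: {url}")
```

The core loop is just that: collect the links, request them, and flag anything that errors out. A dedicated tool handles the crawling, parallelism, and nicer reporting for you.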
Some organizations have regular audit periods where they run these programs, and fix all broken links. This is still too manual. Better, more programmatic strategies exist.
Strategies to automate link checking
So far I’ve thought of three ways to automate link checking. Note that all assume that you’ve already found a good link-checking tool.
- Every commit, check every page
- Every commit, check only the modified pages
- Schedule a link-checking program to run periodically
Each strategy has a tradeoff
As is often the case with tradeoffs, one approach’s weakness highlights another approach’s strength, and vice versa.
- Every commit, check every page
- Pros: If you use a CI, this is simple to set up. Every time you commit, the check will be thorough.
- Cons: If you have a large site, CI time will be slow. This will get especially annoying if you commit frequently. And if you don’t commit frequently, a long time can pass before you find a dead link.
- Every commit, check only the modified pages
- Pros: If you have a big site, this could save time. It also addresses the most probable break: a new, incorrectly written link. (A sketch of this approach follows after this list.)
- Cons: You’ll have to figure out which pages changed in your CI script. Old links, especially external ones, could break at any time, and you won’t know.
- Use a cron job to scan links periodically
- Pros: This is simple to set up, and it will catch breaks in new links and old ones alike. You can run it independently of your CI, so it won’t slow anything down. In fact, you don’t even need a CI or git.
- Cons: You need a computer that is always running (not a big deal if you already host your own site). There will always be a window between one run and the next where a dead link can go unnoticed. And you have to make sure the job reports broken links in a way you’ll actually see.
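Here is the sketch promised above for the second strategy: ask git which content files the latest commit touched, then hand only those pages to a checker. The file extensions and the `linkcheck` command are stand-ins for whatever your site and tooling actually use.

```python
import subprocess
import sys

def changed_content_files() -> list[str]:
    """Return the Markdown/HTML files touched by the most recent commit."""
    # Assumes the repo has at least two commits, so HEAD~1 exists.
    out = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [path for path in out.splitlines() if path.endswith((".md", ".html"))]

if __name__ == "__main__":
    files = changed_content_files()
    if not files:
        sys.exit(0)  # nothing to check on this commit
    # Hand the changed pages to your link checker of choice; `linkcheck` here
    # is a hypothetical command-line tool, not a real program.
    result = subprocess.run(["linkcheck", *files])
    sys.exit(result.returncode)  # a nonzero exit fails the CI job
```

The periodic strategy looks much the same, minus the git step: crawl the whole site on a schedule (a cron job, say) and send the report somewhere you will actually read it.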
What is the best way?
I don’t know! Now that I’ve typed this up, it seems like a combination of checking new links on each commit and periodically checking the entire site is the best approach. This is the opposite of what I do now, but my current approach works for my tiny site.
Further reading
Jakob Nielsen’s “Fighting Link Rot” (archive). Ironically, this article’s original URL is broken; only archives are available now. Do you see my point? Even links about preventing link rot, written by famous web authorities, rot. So check your broken links!
1. “Does fixing broken links matter?” Moz still mentions that fixing broken links helps SEO, even in an article about the times it doesn’t matter that much.