Informal benchmarks of some link-checking tools
Last updated: Sep 28, 2023
The project that I’m working on needs a link checker. To research the options, I tested out some different tools and made some informal benchmarks.
I’ve been working on the docs site for k6. Lately, we’ve been running into a lot of broken links, enough that everyone agrees it’s time to get an automatic checker.
I want to find the best link-checking tool for my situation. I need a tool that:
- Can check links recursively. I want to set only one target.
- Can be embedded in a CI. I want checking to be automatic.
- Is fast. I don’t want to bloat CI time.
- Has output that makes it easy to notify about broken links. I want to know when links break.
- Is accurate. A bunch of false positives would cause too much trouble.
- Is easy to use and configure. I don’t want to fiddle. I don’t even know how to fiddle.
I don’t need any special configuration. I’m not worried about handling redirects, TLS, or anything like that (I can’t worry if I don’t understand).
“Experimental” conditions and goals
I’m not pretending that this is a scientific study. But at least I can be honest about the test conditions.
Test conditions
- This test happened on 2022-09-08.
- Each link checker was installed fresh.
- Because this doc is part of my research for my work on the k6 docs, I’m running the tests on https://k6.io/docs.
- I benchmarked the commands with `perf stat` (see the example after this list).
- I ran this on my computer (Linux), using my home network. It was raining that day, off and on.
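For the record, `perf stat` just wraps a command and prints performance counters (cycles, task-clock, and so on) plus elapsed wall-clock time to stderr when the command exits. A trivial example:

```bash
# perf stat runs the given command and reports counters
# and elapsed time once the command finishes.
perf stat sleep 1
```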
“Experimental” goals
I want to do as little configuration as possible. At first, I just wanted to do a recursive check and time it.
After a smoke test, I discovered two more things to configure:
- Apparently all LinkedIn links return a status of `999`. I guess this is an anti-bot mechanism. But that LinkedIn page link is in the metadata for every page. It would generate a lot of noisy false positives. (A quick curl check follows this list.)
- I wanted to save the output to analyze and compare it.
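The `999` is easy to reproduce with a one-off `curl` (whether you actually get it depends on LinkedIn’s bot detection):

```bash
# Print only the HTTP status code for the page that's linked in the
# site metadata. From a script, LinkedIn tends to answer with 999.
curl -s -o /dev/null -w "%{http_code}\n" "https://www.linkedin.com/company/k6io"
```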
So, I configured each tool to do the following:
- Recursively check all links on https://k6.io/docs (I assume the tool knows not to do recursive checks on non-k6* domains)
- Ignore the linkedin.com domain
- Redirect the output to a file
Beyond that, I didn’t do anything else. I would’ve refused to do more. I might have been able to use the tools more efficiently, but it would’ve violated my easy-to-use-and-configure criteria.
The link checkers that I tested
The README for the Lychee link checker has a nice feature-comparison table. One of its rows identifies the tools that can search recursively, of which there were five.[^1] Of those five, one seems to work only on markdown, so I ignored it.
I also found htmlproofer, but the first time I tried it, it bailed with a Ruby traceback and that was enough for me. No disrespect at all! I’ve heard it’s a great, full-featured tool, but I didn’t want to fiddle around.
That left my test with the following tools. Here are links to the repos and the commands I ran to benchmark them.
- Muffet

  ```bash
  perf stat muffet \
    --exclude="https://www.linkedin.com/company/k6io" \
    "https://k6.io/docs" > muffet.txt
  ```
- Broken-link-checker

  ```bash
  perf stat broken-link-checker \
    --exclude "https://www.linkedin.com/company/k6io" \
    --recursive https://k6.io/docs > broken-link-checker.txt
  ```
- Linkinator
  - Redirecting the output directly looked weird (the file was full of gobbledygook), so I used the built-in CSV format.

  ```bash
  perf stat linkinator \
    --skip https://www.linkedin.com/company/k6io \
    --recurse "https://k6.io/docs" \
    --format CSV > linkinator.txt
  ```
- Linkchecker

  ```bash
  perf stat ~/.local/bin/linkchecker \
    --ignore-url="https://linkedin.com/company/k6io" \
    --file-output=text/linkchecker.txt https://k6.io/docs
  ```
I ran each command one by one. I didn’t pay any attention to the other processes running on my computer.
This test is pseudoscientific
Don’t take these numbers too seriously. I don’t want to understate the informality of this test or my lack of credentials:
- This test is not reproducible, as the target will have different links in the future.
- It’s possible the number of links even changed from test to test.
- I’m not sure that each tool is testing the same thing.
- I only ran the tests ~~once~~ twice, so random network issues could’ve caused huge variation.
- I have no experience benchmarking; I wouldn’t even bet much money I’m using the word correctly.
- I don’t know how `perf stat` works. I was going to use `time`, but right before I did this, I read on a forum somewhere that `perf stat` was better. That’s all the research I did. `perf` was already installed.
- I’m biased towards linkinator, because I already use it.
The results
Of the final four, Muffet and Linkinator were the clear winners. Actually, they were the only winners—the other two took too long and I aborted them.
I decided to look only at `404`s in the results. I realized I was getting a lot of HTTP status codes that maybe I don’t care about or were misleading. For example, lots of links to GitHub repos were returning `429`. I’m not sure whether that affected test speed.
Tests that finished
I also tested that failed links would cause the test to exit with a non-zero code, which I think makes it easier to handle in the CI.
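That non-zero exit code is what a CI step would key off. A minimal sketch, using the Muffet command from above (the notification script is a hypothetical placeholder):

```bash
# If the checker exits non-zero, record the failure and fail the job.
# "./notify-team" is a made-up helper; swap in whatever your CI uses.
if ! muffet --exclude="https://www.linkedin.com/company/k6io" \
    "https://k6.io/docs" > muffet.txt; then
  ./notify-team "Broken links found" muffet.txt
  exit 1
fi
```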
To count `404`s, I just used `grep` and `awk` and removed some obvious false positives. I didn’t look hard. I’m not sure why Muffet found more `404`s: maybe I included more false positives; maybe it found more breaks.
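I didn’t save the exact pipeline, but it was roughly this shape (the patterns here are approximations, and the real pass also dropped a few false positives by hand):

```bash
# Naive pass: match lines that mention a 404, de-duplicate, count.
# This also catches URLs that merely contain "404", hence the cleanup.
grep "404" muffet.txt | sort -u | wc -l

# For linkinator's CSV output, filter on the status column instead.
# Assuming status is the second field; check the header row first.
awk -F',' '$2 ~ /404/' linkinator.txt | wc -l
```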
The supplemental section links to the directory with the output files. If you want to compare the two outputs to find the discrepancies, feel free!
| Link checker | Time elapsed (seconds) | Cycles (millions) | `404`s found | `echo $?` |
| --- | --- | --- | --- | --- |
| Muffet | 68.52 | 55.343 | 121 | 1 |
| linkinator | 137.29 | 73.575 | 103 | 1 |
I ran these twice. I don’t know why linkinator got so much faster the second time.
| Link checker | Time elapsed (seconds) | Cycles (millions) |
| --- | --- | --- |
| Muffet | 63.92 | 54.94 |
| linkinator | 34.14 | 65.66 |
Tests I aborted
These tools were taking too long, so I aborted their runs. They would take too much time in our CI.
| Link checker | Time elapsed (seconds) | Cycles (millions) | `404`s found | `echo $?` |
| --- | --- | --- | --- | --- |
| Linkchecker | 746.46 | 444.404 | 6 | 1 |
| broken-link-checker | 384.43 | 57.357 | 9 | 0 |
Discussion
Given that Muffet is written in Go and linkinator in TypeScript, I expected Muffet to perform best. While this is true of CPU usage, the results are inconclusive about which tool is most time-efficient.
I’m not sure why linkinator did so much better on the second run. The first run, linkinator was 68.77s slower than Muffet. On the second run, it was 29.78s faster.
Congrats to Muffet and linkinator
In the end, Muffet and linkinator both seem like great choices for my use case.
- They both are fast.
- They both check recursively.
- They both have output that’s reasonable to filter.
- They both can be stuck in a CI.
Both have GitHub Actions, though neither action seems very active.
Linkinator seems a little more full-featured, but it might be less efficient. Muffet looks more spartan, but that could be a good thing for some, especially if they already prefer to work with a Go codebase.
Next steps
This was fun to do. Maybe I could expand this and make the results more reliable.
Perhaps I could come up with a little set of shell scripts to run these repeatedly, to make multiple benchmarks. Before I do that, I’d need to make sure the testing is more accurate: at the very least, I need to figure out how to count `404`s across tools!
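A rough sketch of what that runner could look like (the run count, output paths, and file layout are all made up here; the commands are the ones from the test):

```bash
#!/usr/bin/env bash
# Hypothetical repeat-runner: benchmark each checker several times and
# keep every perf stat report so the runs can be compared later.
set -euo pipefail

RUNS=5
OUTDIR=bench-results
mkdir -p "$OUTDIR"

for i in $(seq 1 "$RUNS"); do
  # "|| true" because the checkers exit non-zero when they find broken
  # links, and set -e would otherwise abort the loop.
  perf stat -o "$OUTDIR/muffet-$i.perf" \
    muffet --exclude="https://www.linkedin.com/company/k6io" \
    "https://k6.io/docs" > "$OUTDIR/muffet-$i.txt" || true

  perf stat -o "$OUTDIR/linkinator-$i.perf" \
    linkinator --skip https://www.linkedin.com/company/k6io \
    --recurse "https://k6.io/docs" --format CSV \
    > "$OUTDIR/linkinator-$i.txt" || true
done
```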
Supplemental notes
- I stuck the output files in a data directory in the repo of this site.
- Blog about using Muffet in GitHub Actions
- Blog about using linkinator in GitHub Actions
[^1]: At the time of writing, lychee itself didn’t support recursive checking. I thank the maintainer for the table, though. Today wasn’t the first time I’d looked at it.