Count 404's in a group of access logs

Please be aware that this blog post was published multiple years ago. It is very possible that links on this page will not work as intended.

I found myself needing to see all of the 404 errors in the access logs for all virtual hosts on my web server. I put all of my logs for a given application (in this case WordPress) in one place (/srv/www/wordpress/logs/$host-access.log). Logrotate kicks in to keep them segmented and compress them by day. A bunch of Unix magic later…

zgrep " 404 " *-access.log* | \
cut -d " " -f 1,7 | \
sed 's/:.* / /' | \
sed 's/-access.* / /' | \
sort | \
uniq -c | \
sort -n -r | \
head -20

zgrep is just grep that handles both plain and gzipped files. Pipe that into cut to pull out just the fields we want. The two sed commands strip data that would mess up the aggregation (the requester's IP address and part of the filename). sort prepares the stream for uniq -c to do the counting. Then a reverse numeric sort and head show the top 20 404's across all log files.
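Here is the same pipeline run end-to-end on synthetic data, so you can see each stage do its job. The hostnames, IPs, and paths below are made up for illustration; the pipeline itself is unchanged (I've just quoted the sed scripts).

```shell
# Build two tiny fake access logs in common log format.
mkdir -p /tmp/404demo && cd /tmp/404demo

printf '%s\n' \
  '203.0.113.5 - - [01/Oct/2011:10:00:00 -0500] "GET /missing.jpg HTTP/1.1" 404 209' \
  '203.0.113.9 - - [01/Oct/2011:10:05:00 -0500] "GET /missing.jpg HTTP/1.1" 404 209' \
  > example.com-access.log

printf '%s\n' \
  '198.51.100.7 - - [01/Oct/2011:11:00:00 -0500] "GET /other.png HTTP/1.1" 404 196' \
  > example.net-access.log

# With more than one input file, (z)grep prefixes each match with its
# filename, which is what the second sed trims down to a bare hostname.
zgrep " 404 " *-access.log* | \
cut -d " " -f 1,7 | \
sed 's/:.* / /' | \
sed 's/-access.* / /' | \
sort | \
uniq -c | \
sort -n -r | \
head -20
```

This prints `2 example.com /missing.jpg` on the first line (with uniq's leading padding) and `1 example.net /other.png` on the second.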

The output looks like this:

    380 /wp-content/uploads/2011/09/cropped-20090816-101826-0200.jpg
    301 /wp-content/uploads/2009/06/Peppa-Pig-Cold-Winter-Day-DVD-Cover.jpg
    300 /wp-content/thingelstad/uploads/2011/10/Halloween-2011-1000x750.jpg
    264 /wp-content/uploads/2007/12/guitar-hero-iii-cover-image.jpg
    130 /apple-touch-icon.png
    129 /apple-touch-icon-precomposed.png
    121 /wp-content/uploads/import/o_nintendo-ds-lite.jpg
    114 /wp-content/thingelstad/uploads/2011/10/Crusty-Tofu-1000x750.jpg

Of course, the next step would be to extend the pipe into a curl --head command to see which 404's are still problematic. That just makes me smile. :-)
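That follow-up might be sketched like this. It's a hedged sketch, not something from the pipeline above: the `check_404s` helper name and the `BASE` URL are mine, so point `BASE` at your own site before trusting the results.

```shell
# Hypothetical helper: reads "count path" lines (what the pipeline
# above emits) and re-checks each path with a HEAD request.
BASE="https://example.com"   # placeholder; substitute your real host

check_404s() {
  while read -r count path; do
    # -s: quiet, -o /dev/null: discard the body, -I: HEAD request,
    # -w '%{http_code}': print only the status curl gets back today
    status=$(curl -s -o /dev/null -I -w '%{http_code}' "$BASE$path")
    printf '%s %s (%s hits)\n' "$status" "$path" "$count"
  done
}

# Usage sketch:
#   zgrep " 404 " *-access.log* | cut -d " " -f 1,7 | sed 's/:.* / /' | \
#   sed 's/-access.* / /' | sort | uniq -c | sort -n -r | head -20 | check_404s
```

Anything that still prints a 404 is worth a redirect or a fixed link; anything that now prints 200 or 301 has already healed itself.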

As an aside, sort combined with uniq -c has to be one of the most deceptively simple yet powerful pairs of commands out there. I'm amazed at how often they give me exactly what I'm looking for.
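The idiom in miniature (the fruit names are just placeholder data):

```shell
# uniq -c only collapses *adjacent* duplicate lines, so the leading
# sort is exactly what turns it into a frequency counter.
printf '%s\n' apple banana apple cherry apple | sort | uniq -c | sort -n -r
```

The first line of output is `3 apple` (with uniq's leading padding), followed by the two singletons.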

| 2011 |