I found myself needing to see all of the 404 errors in the access logs for all virtual hosts on my web server. I put all of my logs for a given application (in this case WordPress) in one place (/srv/www/wordpress/logs/$host-access.log). Logrotate kicks in to keep them segmented and compressed by day.
A bunch of Unix magic later...
zgrep " 404 " *-access.log* | \
    cut -d " " -f 1,7 | \
    sed s/\:.*\ /\ / | \
    sed s/\-access.*\ /\ / | \
    sort | \
    uniq -c | \
    sort -n -r | \
    head -20
zgrep is just grep that handles both plain and gzipped files. Pipe that into cut to pull out just the fields we want. The two sed commands strip out data that would mess up the aggregation (the IP address of the requester and part of the filename). sort prepares the stream for uniq to do the counting. Then a reverse numeric sort and head show the top 20 404s across all log files.
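To see what each stage actually does, here is one made-up zgrep output line (filename prefix plus a combined-format request) pushed through the same cut and sed steps. The log line, IP, and path are all hypothetical; the sed expressions are quoted equivalents of the escaped forms above.

```shell
# Hypothetical zgrep match: "filename:IP - - [date] \"GET path ...\" 404 size"
line='thingelstad.com-access.log.2.gz:203.0.113.9 - - [10/Oct/2011:13:55:36 -0700] "GET /foo.jpg HTTP/1.1" 404 209'

# cut keeps field 1 (filename:IP) and field 7 (the requested path),
# the first sed drops everything from the colon through the IP,
# and the second sed drops the "-access.log.2.gz" tail of the filename.
echo "$line" | cut -d " " -f 1,7 | sed 's/:.* / /' | sed 's/-access.* / /'
# -> thingelstad.com /foo.jpg
```

What's left is just "host path", which is exactly the shape you want before handing the stream to sort and uniq.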
Output looks like this:

380 thingelstad.com /wp-content/uploads/2011/09/cropped-20090816-101826-0200.jpg
301 thingelstad.com /wp-content/uploads/2009/06/Peppa-Pig-Cold-Winter-Day-DVD-Cover.jpg
300 thingelstad.com /wp-content/thingelstad/uploads/2011/10/Halloween-2011-1000x750.jpg
264 thingelstad.com /wp-content/uploads/2007/12/guitar-hero-iii-cover-image.jpg
130 thingelstad.com /apple-touch-icon.png
129 thingelstad.com /apple-touch-icon-precomposed.png
121 thingelstad.com /wp-content/uploads/import/o_nintendo-ds-lite.jpg
114 thingelstad.com /wp-content/thingelstad/uploads/2011/10/Crusty-Tofu-1000x750.jpg
Of course the next step would be to extend the pipeline into a curl --head command to see which of these 404s are still problematic. That just makes me smile. :-)
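A sketch of what that could look like: turn the aggregated "count host path" lines into URLs with awk and ask each one for just its headers. This assumes the hosts answer on plain HTTP and that the output has the three-column shape shown above; adjust to taste.

```shell
# Sketch: re-check the top offenders with HEAD requests.
# (awk fields: $1 = count, $2 = host, $3 = path.)
zgrep " 404 " *-access.log* | \
    cut -d " " -f 1,7 | \
    sed 's/:.* / /' | \
    sed 's/-access.* / /' | \
    sort | uniq -c | sort -n -r | head -20 | \
    awk '{ print "http://" $2 $3 }' | \
    while read -r url; do
        # --head sends a HEAD request; -s silences progress,
        # -w '%{http_code}' prints just the status code
        status=$(curl -s --head -o /dev/null -w '%{http_code}' "$url")
        echo "$status $url"
    done
```

Anything that still comes back 404 is worth a redirect or a fix; a 200 means the problem has since been cleaned up.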
As an aside, sort combined with uniq -c has to be one of the most deceptively powerful yet simple set of commands out there. I'm amazed at how often they give me exactly what I'm looking for.
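The pattern at its smallest, on made-up input:

```shell
# uniq -c only collapses adjacent duplicates, which is why the
# first sort matters; the second sort ranks by count, biggest first.
printf 'apple\nbanana\napple\napple\nbanana\ncherry\n' | \
    sort | uniq -c | sort -n -r
# apple 3, banana 2, cherry 1 -- a frequency table in one line
```

That one-liner is a poor man's GROUP BY ... ORDER BY COUNT(*) DESC, and it works on anything you can reduce to one item per line.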