Stopping MediaWiki Spam with Dynamic Questy Captchas

MediaWiki websites are often plagued by spammers, and it’s annoying in the extreme. If you set up a blank MediaWiki website and do nothing, it is likely that within a couple of weeks your site will be found, and in a matter of days you will have thousands of spam user accounts and tens of thousands of pages of spam. There are a number of ways to stop wikispam. I tried using Recaptcha to little benefit; I still got a large number of spam registrations on my publicly available wikis. I’ve found the combination below to be incredibly effective.

I started to use QuestyCaptcha, part of the ConfirmEdit extension (WikiApiary), which uses a simple question/answer paradigm, and that worked well. However, the hard part with Questy is figuring out what questions to use. Particularly if your wiki is global, you need to avoid questions that are specific to one culture, location, or even language. What color is the sky? Well, in what language? For WikiApiary I want to make sure that people from anywhere are able to register. I started with simple questions, like “What is the name of this website?”, but that was quickly defeated and spam registrations started showing up. What to do?

I decided to see if QuestyCaptcha could accommodate dynamic questions, and it can! The first thing I did was pick a question that required someone to know something generic, but that could also be easily answered with a single click on a URL. My choice was to ask about the current GMT time, because if you search for “gmt time” on Google, it tells you the answer. The PHP function gmdate will also give the answer. So I created two questions that ask for the day of the week and the hour (24-hour time) at GMT, and provided hyperlinks to the answer. This worked great!

Then I decided to go a little further and ask a truly dynamic question. This time I generated an 8-character random string and asked the user to identify one of the characters in it. No language issues! No culture challenges! Simple. Here is the code for both of these solutions as it would appear in your LocalSettings.php file.

# Let's stop MediaWiki registration spam
require_once( "$IP/extensions/ConfirmEdit/ConfirmEdit.php" );
require_once("$IP/extensions/ConfirmEdit/QuestyCaptcha.php");
$wgCaptchaClass = 'QuestyCaptcha';
 
# Set questions for Questy
# First, a couple that can be answered with a linked Google search
$wgCaptchaQuestions[] = array (
    'question' => "What day of the week is it at <a href='http://google.com/search?q=gmt+time'>Greenwich Mean Time</a> (GMT) right now?",
    'answer' => gmdate("l")
);
$wgCaptchaQuestions[] = array (
    'question' => "In 24-hour format, what hour is it in <a href='http://google.com/search?q=gmt+time'>Greenwich Mean Time</a> (GMT) right now?",
    'answer' => gmdate("G")
);
 
# Now a more complicated one
# Generate a random string 8 characters long
$myChallengeString = substr(md5(uniqid(mt_rand(), true)), 0, 8);
# Pick a random position in that string (0-based index)
$myChallengeIndex = mt_rand(0, 7);
# Let's use words to describe the position, just to make it a bit more complicated
$myChallengePositions = array ('first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth');
$myChallengePositionName = $myChallengePositions[$myChallengeIndex];
# Build the question/answer
$wgCaptchaQuestions[] = array (
    'question' => "Please provide the $myChallengePositionName character from the sequence <code>$myChallengeString</code>:",
    'answer' => $myChallengeString[$myChallengeIndex]
);
 
# Skip CAPTCHA for people who have confirmed emails
$wgGroupPermissions['emailconfirmed']['skipcaptcha'] = true;
$ceAllowConfirmedEmail = true;

After putting these in place I’ve had nearly zero spam registrations (one or two were clearly done by a human testing it). Now, can this be broken? Sure, easily. But not nearly as easily as static questions, which can be harvested by a person and then fed into a tool that automatically creates accounts. To attack my wikis, spammers would have to write a special handler that deals with the randomness of these questions, which is very unlikely.

Feel free to use these examples, or use other dynamic question/answer combinations. It’s not obvious that this type of configuration works with QuestyCaptcha, but it does, and it allows for very powerful spam blocking.

Display MediaWiki job queue size inside your wiki

I wanted to find an easy way to show the current size of the MediaWiki job queue on WikiApiary. When you make changes to templates that are used on thousands of pages the queue can get backed up and it’s nice to have an easy way to keep an eye on this. The job queue is even one of the data points WikiApiary tracks and graphs. But I wanted something that was as close to realtime as possible. It wasn’t hard to do. This solution uses the External Data (see WikiApiary usage page) extension.

First we need to get the data. Let’s start by calling the siteinfo API method. The magic words are used to make this generic, and this should result in the right URL. If you are using a protocol-relative server setting you will have to modify this.

{{#get_web_data: url={{SERVER}}{{SCRIPTPATH}}/api.php?action=query&meta=siteinfo&siprop=statistics&format=json
  | format=JSON
  | data=jobs=jobs}}

External Data has now done the work and stored the value for us. We retrieve it by simply calling:

{{#external_value:jobs}}

I like to put that in a style so it’s big and obvious.

If you’re thinking ahead you’ll now be saying, “Yeah, that’s neat, but it will be cached in MediaWiki for hours!” Yes, it will be, unless you add a __NOCACHE__ directive to the page and use the MagicNoCache extension. This extension allows you to disable the MediaWiki cache on a page-by-page basis, which is very handy.

If you wanted to use this in multiple places you could even put an <onlyinclude> around it and transclude the job queue size in other pages, however I would be cautious about that if using the __NOCACHE__ directive as well.
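The same statistics are available to any API client, not just to External Data inside the wiki. As a rough sketch (using only Python’s standard library; the URL and sample response here are assumptions for illustration), pulling the job count out of the siteinfo response looks like this:

```python
import json
from urllib.request import urlopen

def parse_job_queue_size(body):
    """Pull the 'jobs' statistic out of a siteinfo JSON response body."""
    data = json.loads(body)
    return data["query"]["statistics"]["jobs"]

def fetch_job_queue_size(api_url):
    """Fetch the current job queue size from a MediaWiki api.php endpoint."""
    url = api_url + "?action=query&meta=siteinfo&siprop=statistics&format=json"
    with urlopen(url) as response:
        return parse_job_queue_size(response.read().decode("utf-8"))

# Example response body, trimmed to the fields this sketch uses:
sample = '{"query": {"statistics": {"pages": 12345, "jobs": 42}}}'
print(parse_job_queue_size(sample))  # prints 42
```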

MediaWiki Template Filter Title

I was recently doing some cleaning on our Read/Write Book Club website and ran into an interesting challenge. All of the books in the wiki are in a couple of categories, but I wanted them sorted correctly, ignoring A, An and The at the beginning of the title. MediaWiki supports this in the category tag by allowing you to specify [[Category:Book|Sort Title]], and early on in the wiki I had a second field in the form for Sort Title, asking the person editing the book to provide this.

The result was that nobody did it, and all the books with “The” at the beginning of the title ended up under T. Shouldn’t this be easy to deal with in the wiki itself?

Well, it turned out to be much harder than you would think, in large part because MediaWiki doesn’t honor spaces in template arguments. My first attempt was rather brute force: simply look for the three prefixes I want to get rid of in the title and chop them off.

<includeonly>
{{#if:{{{1|}}} | {{#vardefine:title_filter_temp|{{{1}}} }}
{{#if: {{#pos:{{#var:title_filter_temp}}|The }} | {{#ifexpr: {{#pos:{{#var:title_filter_temp}}|The }} = 0 | {{#vardefine:title_filter_temp| {{#sub:{{#var:title_filter_temp}}|4}} }} }} }}
{{#if: {{#pos:{{#var:title_filter_temp}}|A }} | {{#ifexpr: {{#pos:{{#var:title_filter_temp}}|A }} = 0 | {{#vardefine:title_filter_temp| {{#sub:{{#var:title_filter_temp}}|2}} }} }} }}
{{#if: {{#pos:{{#var:title_filter_temp}}|An }} | {{#ifexpr: {{#pos:{{#var:title_filter_temp}}|An }} = 0 | {{#vardefine:title_filter_temp| {{#sub:{{#var:title_filter_temp}}|3}} }} }} }}
{{#var:title_filter_temp}}
| No parameter passed to [[Template:Filter title]]. }}</includeonly>

This worked in many cases, but not all. A book like Antifragile got in trouble with this approach: since the “An” matched, the prefix was chopped off and the book got sorted under “T”. You would think this would be an easy fix, right? Don’t look for “An” but instead for “An ”, including the space in the match. Unfortunately, it is nearly impossible to pass a space into a MediaWiki template. MediaWiki effectively trims all template inputs, so a space by itself becomes, effectively, null. A different approach was needed.

After some consideration I came up with this approach that uses the Arrays extension. I like it a lot more than the first attempt! The basic idea is to break the title into an array of strings on the space (note that #arraydefine allowed me to use a regex pattern to avoid the problem of not being able to pass in a space). I then check if the first element in that array matches a set of targets (in the #switch statement). If it does, set the index to 1, otherwise 0, and build a new array from that index offset. Like this:

<includeonly>{{
#arraydefine:filter_title_temp|{{{1|No title was provided}}}|/\s/}}{{
#switch: {{#arrayindex:filter_title_temp|0}}
 | A | An | The = {{#vardefine:filter_title_i|1}}
 | #default = {{#vardefine:filter_title_i|0}}
}}{{
#arrayslice: filter_title_new | filter_title_temp | {{#var:filter_title_i}} }}{{
#arrayprint: filter_title_new | _ | @@@@ | @@@@ }}{{
#arrayreset:filter_title_temp|filter_title_new}}</includeonly>

This works great, with one exception: I still ran into the space problem when assembling the new title in the #arrayprint call. I decided to print the new title with underscores where the spaces were. Since this is only used as the sort key, that is fine; the end user never sees it, and the wiki will sort correctly if given Title_of_the_Book.

Now the sortable titles are all generated and the Book Category page looks awesome.

Bookmarking with Semantic MediaWiki

I have been doing a lot of exploration using MediaWiki and the Semantic MediaWiki suite of extensions, and I’ve deployed a number of wikis doing a wide variety of things. For a few months I had been pondering the idea of hosting my own bookmarking site using Semantic MediaWiki. I decided to give it a try and put together links_thing.

First, a quick primer. Semantic MediaWiki is an extension that lets you store and query data in wiki pages. Wikis have been awesome at dealing with documents and text for a long time, but if you wanted to put a table of data in a wiki, that didn’t work very well. And if you wanted to query that table of data? Well, that was just crazy. Semantic MediaWiki gives you the ability to associate properties with pages and then query them. So, for my bookmarks, each wiki page in the Bookmark category is a bookmark and has a number of properties, including things like Has URL, Created at and Has excerpt. You get the idea. You put all this logic into the templates that the wiki uses, making them into Semantic Templates. Even the data entry can be made user friendly by using Semantic Forms to create fancy forms with a variety of standard controls.
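To make that concrete, an inline query over those properties might look something like this (a sketch only; the property names follow the examples above, and your own names and printouts may differ):

```wikitext
{{#ask: [[Category:Bookmark]] [[Has URL::+]]
 | ?Has URL
 | ?Created at
 | ?Has excerpt
 | sort=Created at
 | order=descending
 | limit=20
}}
```

This would list the twenty most recently created bookmarks along with their URL and excerpt.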

Making it

Building this wiki was pretty easy. I mostly just thought for a while about the properties that a bookmark has. I wanted to get this right, since I could then use the scaffolding that Semantic MediaWiki has to create a “class” and template out all the basic stuff. It’s easy enough to add properties after the fact too. After making the class for bookmarks there was only one real thing I needed to prove: I had to be able to have a bookmarklet that would automatically populate the URL, title and excerpt for a bookmark. Of course the timestamp needed to be set too, but I knew how to do that.

After some digging I figured out how to pass parameters into the forms to pre-populate fields, and also how to tell Semantic Forms to name the page based on a field in the form (namely, the Title of the bookmark). After proving that out I was ready to go.

Importing

I wasn’t willing to lose any data, and I knew it was just a matter of shoveling. I used Pinboard’s JSON export and then whipped up a little Python program to turn it into a CSV that could be imported using the Data Transfer extension. I easily imported just under 4,000 links and had all my data there.

Fun Stuff

I’ve been using my new Bookmark wiki exclusively for the last few weeks and I’m absolutely loving it. Here are just some of the reasons why:

  1. It is mine. Put simply, I don’t need to rely on anyone else to keep it working for me. For an archive, this feels reassuring.
  2. This seems simple, but it’s so helpful to be able to do regular-expression-driven find and replace across all my bookmarks. I’ve probably done 50 of these cleanups. For example, I didn’t like that a lot of bookmarks had titles that ended in ” – Home” or ” – My Super Cool Blog”. A quick search and replace and they are gone.
  3. I thought it would be interesting to see my bookmarks on a calendar. Seems like a simple thing but I don’t think any bookmarking service does it. So I made a calendar view.
  4. Wouldn’t it be nice to be able to see YouTube and Vimeo videos I bookmark without having to go to the video pages one at a time? I made a video view.
  5. I really want my bookmarking tool to have URL checking. I hate short URLs because I suspect they will go away. I also don’t like analytics tags being embedded in my URLs. I have a Check URL template that checks for these in my wiki, and a bot that cleans them up.
  6. I thought it would be cool to see statistics on my bookmarks so I created a Bookmark Statistics view.
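As a sketch of the kind of cleanup the bot in item 5 does (assuming here that the analytics tags in question are the common utm_* query parameters):

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def strip_analytics_tags(url):
    """Remove utm_* analytics query parameters from a URL."""
    parts = urlparse(url)
    # Keep only the query parameters that are not analytics tags
    clean = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    return urlunparse(parts._replace(query=urlencode(clean)))

print(strip_analytics_tags("http://example.org/post?id=7&utm_source=feed"))
# prints http://example.org/post?id=7
```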

This is just the beginning. I’m sure I’ll be adding a lot of other tweaks over time.

What’s next?

I’m now building a little Python application called LinkBot. LinkBot runs on a schedule and validates URLs for me. I’ll write about that application separately.

I would love to share this suite of templates and properties with anyone else; it’s easy to export the pages from my wiki and import them into your own. If you are interested in doing so, I have cloning information available; feel free to comment here and we’ll connect.

MediaWiki Template Get Hostname

I was working on a template for one of my personal wikis and needed to get the hostname for a given URL. Using the capabilities of the ParserFunctions extension for MediaWiki, I whipped up this template. I figured others might find it useful, so here it is. This first version has a bunch of spaces and newlines added to make it more readable.

{{#vardefine: hoststart | {{#expr: {{#pos: {{{1|}}} | // }} + 2 }} }}
{{#vardefine: hostend | {{#pos: {{{1|}}} | / | {{#expr: {{#pos: {{{1|}}} | // }} + 2 }} }} }}
{{#vardefine: hostlen | {{#expr: {{#var: hostend }} - {{#var: hoststart }} }} }}
{{#sub: {{{1|}}} | {{#var: hoststart}} | {{#var: hostlen}} }}

To put it in your own MediaWiki, copy this version that removes the spaces and newlines.

To use this template, put it on a page like Template:Get hostname and then call it in your pages as

{{Get hostname|http://thingelstad.com/another-reason-you-need-to-use-a-password-manager/}}

which will return thingelstad.com. You can also find this template on MediaWiki Cookbook.
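The template’s substring arithmetic is easier to follow as ordinary string code. Here is the same logic in Python, for comparison only, and with the same limitation as the template: it expects a `/` after the hostname.

```python
def get_hostname(url):
    """Extract the hostname, mirroring the wiki template's substring arithmetic."""
    host_start = url.find("//") + 2        # first character after "//"
    host_end = url.find("/", host_start)   # the next "/" ends the hostname
    return url[host_start:host_end]

print(get_hostname("http://thingelstad.com/another-reason-you-need-to-use-a-password-manager/"))
# prints thingelstad.com
```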

WeSolver

I’ve been having a lot of fun working with MediaWiki, and particularly the Semantic MediaWiki extensions. A few months ago a friend from my days at Dow Jones, Armistral, asked me for input on how he could build a website he was working on. He wanted to create a site where people could work together to solve problems. Thus WeSolver was born. I strongly recommended that he use MediaWiki and he ran with it. The site is now live and he’s done a nice job setting it all up. Check it out, and if you’re so inspired, see what you can do to help with a solution!