thingelstad

Jamie Thingelstad's personal website

Huge Impact of Linode Cloud Updates

I host all of my personal projects on two servers at Linode. Last week Linode announced new “cloud servers” with SSDs, double the RAM and a new chip architecture. I migrated both of my hosts over to the new servers that evening and the performance impact was immediately noticeable. WikiApiary is the most taxing project I run, and it was clearly much faster. This graph of the WikiApiary API response time, though, is the most telling. So much better!

[Graph: WikiApiary API response time]

“Walk into a room of people just like you.”

Over recent years I’ve been growing increasingly concerned about the lack of women in technology careers. Perhaps it’s being a dad, or just getting older. Either way, I think this is bad for our industry. I believe we would have healthier cultures, better teams and make better software and products if we had more diversity.

I recently got an email invite to an event in town for tech entrepreneurs. The headline of the email exclaimed in large type…

Walk into a room of people just like you.

In the email were three photos to highlight the people just like you, all set in the gorgeous CoCo Minneapolis space.

Notably a room of people just like you, if you are a young, white man.

I’m not calling this out because I think there was anything intentional about these images, but to highlight something that I don’t think many in tech even see. We rarely notice the absence of women in these scenes. We need to work harder to create an inclusive environment that draws great women technologists into our events too.

On a related note, many know I’m on the board of minne✱ which hosts minnedemo and minnebar. We are continuing to work hard to make sure we get all technologists to our events. We have a lot of work to do, but making sure that our imagery displays an open and accepting event is an important start.

Dropbox Arbitration Opt-Out

If you missed the news that Dropbox is now automatically putting all users into an arbitration agreement, you should take a moment to opt out of this change. You can go to

https://www.dropbox.com/arbitration_optout

and easily opt out.

Opt-Out of Google Plus Gmail integration

It’s lame when the only time I need to log into a service is to disable privacy-invading features I have no interest in.

[Screenshot: Google+ Gmail integration opt-out setting]

One Year for WikiApiary

Yesterday WikiApiary had a very meta tweet when it wished itself Happy Birthday! A while back I realized that if you looked at the edit history of the “Main Page” of a MediaWiki website you could infer the date the wiki was started, its birthday. WikiApiary then wishes wikis a happy birthday on that date. Yesterday was WikiApiary’s big day.
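
The lookup itself is easy to script. Here is a minimal sketch of pulling that date from the MediaWiki API (the endpoint is a hypothetical placeholder, and WikiApiary’s actual bot code may differ):

<?php
# Infer a wiki's birthday from the first revision of its Main Page.
# The endpoint is a placeholder; point it at any MediaWiki api.php.
$endpoint = "https://example-wiki.org/api.php";
$url = $endpoint . '?' . http_build_query( array (
    'action'  => 'query',
    'prop'    => 'revisions',
    'titles'  => 'Main Page',
    'rvdir'   => 'newer',   # oldest revision first
    'rvlimit' => 1,
    'format'  => 'json'
) );
$data = json_decode( file_get_contents( $url ), true );
$page = current( $data['query']['pages'] );
# The timestamp of the very first edit is the wiki's birthday
echo "Birthday: " . $page['revisions'][0]['timestamp'] . "\n";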

The first year of WikiApiary has been great! The comments people make about it and the great contributions that many people have made to the wiki reflect both the utility of the data and the interest in it. WikiApiary was a holiday break project for me in 2012 and it’s continued to get additions and modifications from a number of people throughout the world. It is the first project I’ve started that I feel has a true community around it, with people moving it forward independently of what I’m doing. That is really great! This idea of a “wiki to track other wikis” clearly caught on with some people.

In WikiApiary’s first year it has collected 1,855,979,520 statistics samples in its database, just 2.6GB of data. As of today, WikiApiary is collecting data from 9,555 active wikis. It shows 2,478,637 active users over 384,870,041 pages with 2,894,060,197 edits in the part of the wikiverse that it monitors.

Looking at visitor activity during this first year, WikiApiary had 32,416 visits with 105,895 page views. 9,346 of those visits were from MediaWiki.org. The top 10 countries visiting the site were:

United States 10,243 31.6%
Germany 3,422 10.6%
United Kingdom 2,095 6.5%
Russian Federation 1,853 5.7%
France 1,084 3.3%
Canada 1,000 3.1%
Netherlands 814 2.5%
India 755 2.3%
Spain 714 2.2%
China 687 2.1%

WikiApiary visitors skew more heavily toward Linux than the average website.

Windows 7 14,989 46.2%
Linux 4,029 12.4%
Windows XP 3,876 12%
Mac OS 3,473 10.7%
Windows 8 1,707 5.3%

Chrome dominates the browser choice for WikiApiary visitors.

Chrome 26.0 2,377 7.3%
Safari 6.0 2,319 7.2%
Chrome 30.0 2,087 6.4%
Chrome 28.0 1,976 6.1%
Chrome 27.0 1,756 5.4%

2,455 of the 32,416 (7.6%) visits to WikiApiary were from logged-in users. All website statistics are from the amazing Piwik project. No data is shared with Google or other search engines.

WikiApiary is largely about graphs, so that seems like a logical way to explore its first year. The number of active users on WikiApiary has hovered around 30 for most of the year, peaking above 50. This doesn’t seem like a ton, but most wikis that are monitored actually have fewer than 5 active users. The total number of users is over 250 and grows steadily. Those are all real accounts too, no spam accounts. Registration is required to edit, so this is a good reflection of engagement.

[Graph: WikiApiary active users]

Edit activity on WikiApiary is mostly robotic. The bots are constantly tending to the data set and they do this with edits. You can see the edit rate jumped in October after I added tracking for MaxMind geo data as well as Whois records for wikis. Over 5 million edits in the first year.

[Graph: WikiApiary edit activity]

Total pages of content largely reflect the number of wikis being tracked, plus the number of extensions and skins that exist. Notably you can see the initial load of sites in February and March. Additional farmer bots added some reasonably sized farms in June. In October the pages spike again with the addition of more datasets.

[Graph: WikiApiary content pages]

WikiApiary is itself the 11th largest Semantic MediaWiki installation that it tracks. The largest is Gyvosios gamtos enciklopedija with over 16 million properties (think of a property as a data value). WikiApiary has over 3.3 million property values.

[Graph: WikiApiary property count]

These 3.3 million properties are queried in MediaWiki templates so you can see the data. There are nearly 140,000 queries in WikiApiary.

[Graph: WikiApiary query count]

Special Thanks

WikiApiary has had a lot of contributions, with additions of wikis and help with templates and bots. Karsten Hoffmeyer has been a huge part of WikiApiary and is also an administrator on the site. Karsten helps with adding wikis and fending off occasional bad edits. WikiApiary also has a very distinctive look from the Foreground skin, which was built by my friend Garrick van Buren. Mark Hershberger has also been an active part of WikiApiary and is exploring ways that MediaWiki installs can automatically add themselves to WikiApiary. Huge thanks to Federico Leva (Nemo) for linking extension pages on MediaWiki.org to their respective pages on WikiApiary. This drives a lot of exposure for WikiApiary and provides great value to visitors of MediaWiki.org. A big thank you also to Paul DeCoursey, who rewrote the JavaScript code to embed the charts into the pages and to support multiple charts with usable controls.

Also, I think one of the things that makes WikiApiary unique is that it is built with MediaWiki and Semantic MediaWiki and the related suite of extensions. This is such a wonderful set of software and a special thanks to James HK, Jeroen De Dauw and Yaron Koren. All of them have helped out and provided input on WikiApiary at times in the first year.

Future Plans

I’ve got a ton of plans for WikiApiary, and I keep picking them off slowly. I’ve not had much time for the project the last couple of months, but whatever time I have had has been going into rewriting the bots. The first versions were just hacked up and difficult to understand. I’m working on a rewrite that includes unit tests, a good object model and code that is easy enough to understand that I hope to get some more contributors involved. The other huge thing being added is parallel requests. Right now WikiApiary is limited in how many sites it can collect from because it collects in serial. The new bots will collect in parallel, which will dramatically change the cost of running a collection sequence. There should be no problem going from 10,000 to 100,000 or more monitored wikis.
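
To give a feel for the idea, here is a purely illustrative sketch of parallel collection using PHP’s curl_multi functions (the real bots are separate programs, and the URLs below are placeholders):

<?php
# Illustrative only: fetch several wiki APIs concurrently with curl_multi.
$urls = array (
    "https://wiki-one.example.org/api.php?action=query&meta=siteinfo&format=json",
    "https://wiki-two.example.org/api.php?action=query&meta=siteinfo&format=json"
);
$mh = curl_multi_init();
$handles = array();
foreach ( $urls as $url ) {
    $ch = curl_init( $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_multi_add_handle( $mh, $ch );
    $handles[] = $ch;
}
# All requests run concurrently instead of one after another
do {
    curl_multi_exec( $mh, $running );
    curl_multi_select( $mh );
} while ( $running > 0 );
foreach ( $handles as $ch ) {
    $response = curl_multi_getcontent( $ch );
    # ... parse and store the collected statistics ...
    curl_multi_remove_handle( $mh, $ch );
}
curl_multi_close( $mh );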

I would also like to see the Honey Bee MediaWiki extension get going, which will be the first step toward an extension that leverages WikiApiary inside the wiki it’s running in.

Additionally, I’d like to do a deeper level of analysis of MediaWiki websites. I’ve been contacted by two groups who have written algorithms for this and are interested in adding their code to WikiApiary. I hope to make that easier with the bot rewrite mentioned above.

I also want to provide a base Farmer class that can easily be extended, so that bots that farm new wikis into WikiApiary are easier to write. My big objective is to finally pull in Wikia.

I’m proud of WikiApiary and plan on continuing to host it (no small feat actually, given its scale) and work on it. I see WikiApiary as one of my “decade projects,” so I don’t have to move too fast. I just keep things rolling the right way.

Here’s Everywhere You Should Enable Two-Factor Authentication Right Now

Two-factor authentication is one of the best things you can do to make sure your accounts don’t get hacked. We’ve talked about it a bit before, but here’s a list of all the popular services that offer it, and where you should go to turn it on right now.

This is a great list. I highly recommend going and enabling two-factor on all of these that are relevant to you.

Productivity Apps for iPad

A bunch of people at work got new iPads recently and I figured some may find it useful to know what productivity apps I’ve been using on mine. Over the last couple of months the iPad has really become a critical tool for me.

OmniFocus for iPad

I use OmniFocus all day, constantly on my iPhone, iPad and Mac. Everything I need to do gets put in it. It’s a GTD-style task tool. YMMV, but I find it very useful. OmniFocus sits in my home row on all iOS devices.

OmniOutliner 2

If I’m taking notes in a meeting I’m always keeping them in an outline, and a proper outliner is a huge win for that. I also use OmniOutliner 2 to frame out any presentations I’m giving. One thing I particularly like to do is have an outline document for each recurring meeting, and then a “node” in the outline for each week where I put my notes. Makes it super easy to see the last meeting’s items.

HipChat

I’ve been using HipChat for team chat. It’s awesome, and the iPad version lets me stay connected all the time. Great tool.

MindNode

If you are at all into mind mapping this is the best one by a mile for the iPad. Great tool for brainstorming a topic.

1Password

I use 1Password to manage 400+ unique passwords for everything I use. Critical to have it on iOS as well.

Skype

An obvious one.

Keynote

I know most folks are PowerPoint junkies. Personally, I prefer Keynote when I can use it. The iOS version is on par with the desktop one. It’s enjoyable to work on slides on a plane with it.

PCalc

The built-in calculator is no fun at all. This is a serious, heavy-hitter calculator that supports RPN for folks that are into that sort of thing. I don’t know if it does the beloved 12C functions for all the finance folks, but this is my preferred calculator by a mile. A nice bonus is that you can email the “tape” after you’ve done a set of calculations.

OmniPlan 2

Like Microsoft Project but for the iPad. If you need to do any standalone project planning this is a great tool.

iA Writer

If you are brave enough to write long-form text on your iPad, this is my preferred tool for that. It has a great keyboard setup that makes writing longer pieces much better.

Dropbox

Pretty much required.

Tweetbot

If you’re gonna Twitter, ditch the official Twitter client for Tweetbot. It’s about a million times better. At some point a new iOS 7-optimized Tweetbot will come out for the iPad (it’s already out for the iPhone), but until then the current version is the best out there.

Password v. Passphrase

Today’s xkcd highlights what might be one of the biggest security mistakes of our digital age: using passwords over passphrases. The comic does a great job of illustrating it: four random common words carry roughly 44 bits of entropy, while a typical letter-substitution password manages only about 28. I’ve mentioned before that I use 1Password to store over 400 unique, random passwords. But how do I unlock 1Password? A passphrase that is long. Very long.

Sadly, as an active 1Password user I can report that many websites fail if you use a password longer than 20 characters, much less a 40- or 50-character passphrase.

Paging Large Datasets in Semantic MediaWiki

I’ve been working with Yaron Koren’s new Miga project to provide an offline, mobile-optimized way of accessing the nutrition data in Wikinosh. It works pretty well! However, we hit a couple of walls trying to get the 20,000+ food items in Wikinosh exported for use in Miga. If you are trying to page through large data sets in Semantic MediaWiki, read on.

Large Offsets

The best way to do this is to run queries of 500 items at a time, using incrementing offsets to retrieve each set of rows. This works really well and is easy to code (see the sketch below). However, the first time you run this with over 5,000 records you will notice that your while loop never finishes. Very odd. When you look at the data you will also see that after a few thousand rows, the first set of data just repeats over and over and you never reach the end.
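
For reference, here is a minimal sketch of that paging loop using the Semantic MediaWiki ask API (the endpoint and query are hypothetical stand-ins for the Wikinosh export):

<?php
# Page through an SMW ask query 500 rows at a time.
$endpoint = "https://wikinosh.example.org/api.php";
$limit  = 500;
$offset = 0;
do {
    $ask = "[[Category:Food]]|?Calories|limit=$limit|offset=$offset";
    $url = $endpoint . '?' . http_build_query( array (
        'action' => 'ask',
        'query'  => $ask,
        'format' => 'json'
    ) );
    $data = json_decode( file_get_contents( $url ), true );
    $results = $data['query']['results'];
    foreach ( $results as $name => $row ) {
        # ... export each item ...
    }
    # Advance to the next page of rows
    $offset += $limit;
} while ( count( $results ) == $limit );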

A quick look in SMW_QueryProcessor.php on line 611 shows this:

$params['offset'] = array(
    'type' => 'integer',
    'default' => 0,
    'negatives' => false,
    'upperbound' => 5000 // TODO: make setting
 );

Now that makes sense. It turns out that if your offset value is greater than 5,000 it won’t be used. To make matters worse, this doesn’t generate an error; it simply ignores the offset. Ugh! The “TODO” is still a to-do; there is no setting for this. If you’re working with large data, you will want to increase the upper bound to something large enough to accommodate your largest set. Since Wikinosh has 21,000+ items, I set it to 30000.
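
With that change, the block in SMW_QueryProcessor.php ends up looking like this (30000 is simply the value I chose for Wikinosh; use whatever covers your largest set):

$params['offset'] = array(
    'type' => 'integer',
    'default' => 0,
    'negatives' => false,
    'upperbound' => 30000 // raised from 5000 to cover Wikinosh's 21,000+ items
);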

Query Max

Now that you have your offset changed you can go ahead and run your export again. It will chug along, hit 10,000 records and stop. What? Sure, you increased the offset bound, but you probably still have the default $smwgQMaxLimit value of 10000. If your offset is greater than the query max, you get 0 rows and you’re done. This setting can be overridden: set $smwgQMaxLimit = 30000; in your LocalSettings.php and you will be ready to go.

With these two changes in place, I can now easily page through 500 records at a time and get everything out exactly as I expect.

There are currently two bugs open on these behaviors. Check those for current status of these settings: SMW Ask query offset has a hardcoded limit and SMW Ask query offset should error if maximum offset is exceeded.

Stopping MediaWiki Spam with Dynamic Questy Captchas

MediaWiki websites are often plagued by spammers. It’s annoying in the extreme. If you set up a blank MediaWiki website and do nothing, it is likely that within a couple of weeks your site will be found, and in a matter of days you will have thousands of spam user accounts and tens of thousands of pages of spam. There are a number of ways to stop wikispam. I tried using Recaptcha to little benefit; I still got a large number of spam registrations on my publicly available wikis. I’ve found the combination below to be incredibly effective.

I started to use QuestyCaptcha, a plugin for the ConfirmEdit extension (WikiApiary) that uses a simple question/answer paradigm, and that worked well. However, the hard part with Questy is figuring out what questions to use. Particularly if your wiki is global, you need to avoid questions that are specific to one culture or location, or even language. What color is the sky? Well, in what language? For WikiApiary I want to make sure that people from anywhere are able to register. I started with simple questions, like “What is the name of this website?” but that was quickly defeated and spam registrations started showing up. What to do?

I decided to see if QuestyCaptcha could accommodate dynamic questions, and it can! The first thing I did was pick a question that required someone to know something generic, but that could also be easily answered with a single click on a URL. My choice was to ask about the GMT time, because if you search for “gmt time” on Google, it tells you the answer. The PHP function gmdate will also give the answer. So I created two questions that ask for the day of the week and the hour (24-hour time) at GMT, and provide hyperlinks to the answer. This worked great!

Then I decided to go a little further and ask a very dynamic question. This time I generate an 8-character random string and ask the user to identify one of the characters in the string. No language issue! No culture challenges! Simple. Here is the code for both of these solutions as it would appear in your LocalSettings.php file.

# Let's stop MediaWiki registration spam
require_once( "$IP/extensions/ConfirmEdit/ConfirmEdit.php" );
require_once("$IP/extensions/ConfirmEdit/QuestyCaptcha.php");
$wgCaptchaClass = 'QuestyCaptcha';
 
# Set questions for Questy
# First a couple that can be answered with a linked to Google search
$wgCaptchaQuestions[] = array (
    'question' => "What day of the week is it at <a href='http://google.com/search?q=gmt+time'>Greenwich Mean Time</a> (GMT) right now?",
    'answer' => gmdate("l")
);
$wgCaptchaQuestions[] = array (
    'question' => "In 24-hour format, what hour is it in <a href='http://google.com/search?q=gmt+time'>Greenwich Mean Time</a> (GMT) right now?",
    'answer' => gmdate("G")
);
 
# Now a more complicated one
# Generate a random string 8 characters long
$myChallengeString = substr(md5(uniqid(mt_rand(), true)), 0, 8);
# Pick a random position in those 8 characters
$myChallengeIndex = rand(0, 7) + 1;
# Let's use words to describe the position, just to make it a bit more complicated
$myChallengePositions = array ('first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth');
$myChallengePositionName = $myChallengePositions[$myChallengeIndex - 1];
# Build the question/answer
$wgCaptchaQuestions[] = array (
    'question' => "Please provide the $myChallengePositionName character from the sequence <code>$myChallengeString</code>:",
    'answer' => $myChallengeString[$myChallengeIndex - 1]
);
 
# Skip CAPTCHA for people who have confirmed emails
$wgGroupPermissions['emailconfirmed']['skipcaptcha'] = true;
$ceAllowConfirmedEmail = true;

After putting these in place I’ve had nearly zero spam registrations (1 or 2 were clearly done by a human testing it). Now, can this be broken? Sure, easily. But not nearly as easily as static questions, which can be harvested by a person and then put into a tool that automatically creates accounts. In order to attack me, spammers would have to write a special handler that dealt with the randomness of the questions. That is very unlikely.

Feel free to use these examples, or, use other dynamic question/answer combinations. It’s not obvious that this type of configuration works with QuestyCaptcha, but it does and it allows for very powerful spam blocking.
