Why Wikipedia Should Go P2P
One of the nicest things about Firefox is its support for search engine plugins. They let you search on different websites from a single edit box in the top-right corner of the browser. I have seven different plugins installed: Dictionary.com, Cambridge French <-> English dictionary, Internet Movie Database, Google, Wikipedia, Slovnik seznam (Czech <-> English) and Quote-O-Matic. I use all of them frequently. (Well okay, not Quote-O-Matic, but it seems that once you install search plugins you can’t easily remove them.)
The whole point of this feature is that you save a whole extra step when you need to search for something. Instead of having to lumber over to the IMDB website and locate the search form, you just choose IMDB in a handy list and off you go. But lately when I search Wikipedia, a webpage comes up exclaiming “Sorry! Full text search has been disabled for performance reasons” and offers to run the search through Google or Yahoo’s search engine. Kind of defeats the purpose, doesn’t it?
What’s going on here says something intriguing about free software and modern networked architectures. Why is it that Google and Yahoo are able to handle Wikipedia and the rest of the World Wide-friggin’-Web without working up a sweat, while Wikipedia can’t even handle its own traffic? Could it be the fact that it doesn’t make any money?
This is where the ideals of the information-to-the-masses set collide with the cold hard realities of capitalism. Open source software sometimes benefits from corporate sponsorships, but it runs primarily on the donated time and energy of legions of restless hackers. In the same way, through the work of thousands of volunteers Wikipedia has managed since its inception in January 2001 to eclipse print-based encyclopedias that have been around for hundreds of years. Unfortunately, offering your creative talents is not the same thing as giving away a full-blown server farm, so the hardware capabilities of Wikipedia have trailed the richness of its content and its increasing popularity.
It’s kind of sad that Wikipedia can get so many people to invest hours crafting articles purely to bask in the glory of seeing them shared with the world, but can’t afford to power its own site. One solution might be to abandon the “purer than the driven snow” attitude that leads it to shun any kind of commercial sponsorship or advertisements. I’m not crazy about irritating ads flitting across the screen, but if it means I can search properly then I might just put up with it.
Much truer to the Wikipedia ethos would be an architecture that lets people donate their processing power in the same way that they donate their writing skills: in small doses. The perfect way to achieve this is through a peer-to-peer architecture. Though normally thought of as a distribution mechanism for very large files, the approach is just as applicable to very popular files (including webpages). Wikipedia has more semantic information available to it than, say, Google, to let it index its contents intelligently, so you get much more precise results when you search it directly. With so many devoted contributors and readers to draw on, this index could be spread around the network so that it scales to handle the increased load as more users come on board.
One clever way to do this is to use a network architecture known as a Distributed Hash Table (DHT). The basic idea is that any piece of content can be associated with a number by using a so-called hash function. Hash functions turn any file, no matter how large, into a single, relatively short number (16 bytes would be typical) that is pretty near certain to be unique to that file. Every computer in the network holds the files associated with some range of these numbers. Each one also has a list of pointers to other machines and their associated hash numbers. The clever bit is that they don’t point to, say, the next ten computers in the sequence. Instead each points to the 1st, 2nd, 4th, 8th, 16th and so on computer following it. This means that a search takes a number of steps proportional to log(n), where n is the number of computers, the same as a lookup in a balanced binary tree.
In English: to find a file in a network of 30 computers I would have to check around 5 computers. With 1,000 computers I would have to check around 10. With 1,000,000 computers, around 20.
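For the curious, here is a rough Python sketch of that pointer scheme, just to show where those numbers come from. It is a toy simulation and not how any real DHT is implemented. The function names (content_key, owner_index, fingers, route, lookup, hop_bound) are all made up for this example, MD5 is picked only because it happens to produce a 16-byte number, and splitting the key space into equal ranges per computer is a simplifying assumption of mine.

```python
import hashlib

KEY_SPACE = 2 ** 128  # MD5 digests are 16 bytes, i.e. 128 bits


def content_key(data: bytes) -> int:
    """Turn any content into a 16-byte number (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(data).digest(), "big")


def owner_index(key: int, n_nodes: int) -> int:
    """Assumption: the key space is cut into n equal ranges, one per computer."""
    return key * n_nodes // KEY_SPACE


def fingers(i: int, n_nodes: int) -> list[int]:
    """Pointers held by computer i: the 1st, 2nd, 4th, 8th, ... one after it."""
    out = []
    step = 1
    while step < n_nodes:
        out.append((i + step) % n_nodes)
        step *= 2
    return out


def route(start: int, target: int, n_nodes: int) -> list[int]:
    """Computers visited on the way from start to target, following pointers."""
    path = [start]
    current = start
    while current != target:
        distance = (target - current) % n_nodes
        # Follow the farthest pointer that does not overshoot the target.
        step = 1
        while step * 2 <= distance:
            step *= 2
        current = (current + step) % n_nodes
        path.append(current)
    return path


def lookup(start: int, data: bytes, n_nodes: int) -> list[int]:
    """Route from start to whichever computer owns the hash of data."""
    return route(start, owner_index(content_key(data), n_nodes), n_nodes)


def hop_bound(n_nodes: int) -> int:
    """Upper bound on hops: the number of bits needed to write n - 1."""
    return (n_nodes - 1).bit_length()


for n in (30, 1_000, 1_000_000):
    hops = len(lookup(0, b"Distributed hash table", n)) - 1
    print(f"{n} computers: {hops} hops this time, at most {hop_bound(n)}")
```

Each hop at least halves the remaining distance around the ring, which is why the worst case grows with the number of binary digits in n and not with n itself.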
So in the end you have a totally scalable, totally free network infrastructure dependent on contributions from its users. Sounds pretty Wikipedian, doesn’t it? The only problem is that it’s tricky to make it work properly. More on that later.
3 Comments »
What I have to say is that P2P filesystem technology does not exist yet. Only in poor prototypes.
Comment by Lucas — 3/12/2005 @ 5:56 pm
Engineering Document Management Software
Not all electronic documents are the same. Remember that skit from Sesame Street called “which one of these things is not like the other”? Well, engineering documents are the pink apple and they stand out from all other electronic files. CAD files ca…
Trackback by Managed By Cambridge Search Engine — 3/15/2008 @ 11:21 pm
Hi everyone, thought this might be the appropriate place to mention FilesWire.com. It’s an online web-based p2p client. There’s nothing to download or install so you can share instantly. Visit to begin
Comment by james — 4/29/2008 @ 3:12 pm