The House of Commons: A Challenge to Search Engines

Wednesday, November 29, 2006

The spark

I've recently been redrafting my Unlocking IP conference paper for publication, and it got me thinking. Google has a feature for searching for Creative Commons works as part of Advanced Search (as did Nutch, when it was online at creativecommons.org). Graham Greenleaf, my supervisor, asked me the other day how they keep up with all the various licences: there are literally hundreds of them when you consider all the combinations of features, the many version numbers, and the multitude of jurisdictions. Yahoo, for example, only searches for things licensed under the American 'generic' licences, and no others. But Google seems to be able to find all sorts.

Now I'm not one to doubt Google's resources, and it could well be that as soon as a new licence comes out they're all over it and it's reflected in their Advanced Search instantaneously. Or, more likely, they have good communication channels open with Creative Commons.

But it did occur to me that if I were doing it, just little old me all by myself, I'd try to develop a method that didn't need to know about the new licences to be able to find them.

As detailed in my paper, it appears that Google's Creative Commons search is based on embedded metadata. As I have said previously, I understand this standpoint, because embedded metadata is, if nothing else, unambiguous (compared with, for example, treating links to licences as evidence of licensing, which generates many false positives).

So if I were doing it, I'd pick out the metadata that's needed to decide if something is available for commercial use, or allows modification, or whatever, and I'd ignore the bit about which licence version was being used, and its jurisdiction, and those sorts of details that the person doing the search hasn't even been given the option to specify.
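As a rough sketch of that idea (not a description of how Google or any other engine actually does it), the Python below reads the RDF block that Creative Commons tooling of this era embedded in HTML comments and keeps only the permits/requires/prohibits terms. The function names, the sample page, and the example.org licence URL are made up for illustration; the sketch also assumes the RDF is well-formed XML and lumps all License blocks on a page together rather than matching each one to its Work.

```python
import re
import xml.etree.ElementTree as ET

CC = "http://web.resource.org/cc/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Grab each <rdf:RDF>...</rdf:RDF> block, whether or not it sits in a comment.
RDF_BLOCK = re.compile(r"<rdf:RDF.*?</rdf:RDF>", re.DOTALL)


def rights_from_html(html):
    """Collect the permits/requires/prohibits terms from embedded RDF.

    The licence URL (and so its version and jurisdiction) is never read.
    """
    rights = {"permits": set(), "requires": set(), "prohibits": set()}
    for block in RDF_BLOCK.findall(html):
        root = ET.fromstring(block)
        for licence in root.iter("{%s}License" % CC):
            for kind in rights:
                for element in licence.findall("{%s}%s" % (CC, kind)):
                    uri = element.get("{%s}resource" % RDF, "")
                    if uri.startswith(CC):
                        rights[kind].add(uri[len(CC):])
    return rights


def allows_commercial_use(rights):
    # Simplification: commercial use is allowed unless explicitly prohibited.
    return "CommercialUse" not in rights["prohibits"]


def allows_modification(rights):
    return "DerivativeWorks" in rights["permits"]


# Hypothetical page whose licence is NOT a creativecommons.org URL.
SAMPLE_PAGE = """<html><body><p>Hello</p>
<!--
<rdf:RDF xmlns="http://web.resource.org/cc/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<Work rdf:about="">
  <license rdf:resource="http://example.org/my-own-licence" />
</Work>
<License rdf:about="http://example.org/my-own-licence">
  <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
  <permits rdf:resource="http://web.resource.org/cc/Distribution" />
  <permits rdf:resource="http://web.resource.org/cc/DerivativeWorks" />
  <requires rdf:resource="http://web.resource.org/cc/Attribution" />
  <requires rdf:resource="http://web.resource.org/cc/Notice" />
</License>
</rdf:RDF>
-->
</body></html>"""

if __name__ == "__main__":
    found = rights_from_html(SAMPLE_PAGE)
    print(sorted(found["permits"]), sorted(found["requires"]))
    print(allows_commercial_use(found), allows_modification(found))
```

Because the two filter functions only look at the rights terms, a licence published tomorrow, or one hosted somewhere other than creativecommons.org, would be classified the same way as long as it describes itself with the same vocabulary.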

The challenge

Anyway, the challenge. I have created a web page that I hereby make available for everyone to reproduce, distribute, and modify. In keeping with the Creative Commons framework I require attribution and licence notification, but that's just a formality I'm not really interested in. I've put metadata in it that describes these rights, and it refers to this post's permalink as its licence. The web page is up, and by linking to it from this post, it's now part of the Web.
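For readers who want to try something similar, here is a minimal sketch of how such a metadata block could be generated. It is not the actual markup of the test page: the function name and the example.org URL are hypothetical, and the layout assumes the convention of the time of placing RDF that uses the http://web.resource.org/cc/ vocabulary inside an HTML comment.

```python
from xml.sax.saxutils import quoteattr

CC_NS = "http://web.resource.org/cc/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def licence_metadata(licence_url, permits, requires, prohibits=()):
    """Build an HTML comment holding RDF that describes a page's rights.

    licence_url can be any URL (a blog post, say), not only a
    creativecommons.org licence. It should not contain "--", which
    would end the surrounding HTML comment early.
    """
    lines = [
        "<!--",
        '<rdf:RDF xmlns="%s"' % CC_NS,
        '    xmlns:rdf="%s">' % RDF_NS,
        '<Work rdf:about="">',
        "  <license rdf:resource=%s />" % quoteattr(licence_url),
        "</Work>",
        "<License rdf:about=%s>" % quoteattr(licence_url),
    ]
    groups = (("permits", permits), ("requires", requires), ("prohibits", prohibits))
    for kind, terms in groups:
        for term in terms:
            lines.append("  <%s rdf:resource=%s />" % (kind, quoteattr(CC_NS + term)))
    lines += ["</License>", "</rdf:RDF>", "-->"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Rights as described in the post: reproduce, distribute, modify;
    # attribution and licence notice required. The URL is a placeholder.
    print(licence_metadata(
        "http://example.org/blog/a-challenge-to-search-engines",
        permits=["Reproduction", "Distribution", "DerivativeWorks"],
        requires=["Attribution", "Notice"],
    ))
```

The output can be pasted into the page's HTML; the only thing that ties it to a particular licence is the URL passed in.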

The challenge is simply this: which search engines will find it and classify it as commons, creative or otherwise? Will I fool search engines into thinking it's Creative Commons content? Or will they look straight past it? Or will they rise to the challenge and see that what Creative Commons has started is bigger than just their licences, and that the Semantic Web may as well be for every licence?

Let's give them a little time to index these pages, and we'll find out.

[Postscript: I've added a properly Creative Commons licensed page for comparison, which we can use to see when the search engines have come by (this page, at least, should turn up in search results).]

Labels: ben, search

Posted by Ben Bildstein, Wednesday, November 29, 2006

Comments:

Ben Bildstein said:
While I'm on the subject, the mozCC extension for Firefox (which I use) recognises my licensed page as CC-by, i.e. available under an attribution-only licence. It's also smart enough to recognise my blog post (this post) as the licence, as I intended. The only hiccup is that it didn't quite understand what I meant when I said that the HTML source of the page was what I was licensing. But I forgive.

So, if by some unexpected happenstance some search engine out there is using the same code as mozCC, we will already have a winner.

Posted by Ben Bildstein, 7:11 PM, November 29, 2006
