Tuesday, October 17, 2006
In my last three posts (see part 1, part 2 and part 3), I have been exploring the potential for a search engine that focuses solely on commons content (i.e. works with expanded public rights). Now I’d like to touch on where this idea might fit into the broader scheme of things, including what its shortcomings might be.
At present, Google web search is the benchmark in Internet search. I consider Google to have almost unlimited resources, so for any feature that Google web search lacks, it is worth asking why. In general, I think the answer is one of three things: the feature would clutter the user interface, it would not be of significant use to the public, or it is computationally intractable. However, given that Google offers many services, many of them in beta and some still in Google Labs, the first problem of cluttering the user interface can clearly be solved by launching a new kind of search as a separate service (for example, Google Scholar). This leaves only two possibilities: the feature is not of significant use, or it is computationally intractable.
As this series has demonstrated, there is no single or simple way to establish that works are part of the commons. This may go some way towards answering the question of why Google hasn't done this yet. However, Google has implemented at least three related ideas: in the advanced search, Creative Commons-based usage rights can be specified; in the new Code Search (beta), various text-based licences are recognised by the system; and in Google Book Search (beta), authorship is used to establish public-domain status (for a small subset of Google's scanned books). Google hasn't done anything all-encompassing yet; it may be that such a service is just around the corner, or it may be that Google has figured out that it isn't really what the public want. Depending on how the big players (Google, but also Yahoo and others) proceed, my research may move more towards an analysis of either why the big search engines did so much (and what that means for the Unlocking IP project) or why they did so little (and what that means).
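To give a feel for the kind of text-based licence recognition mentioned above (in spirit only; I have no insight into how Code Search actually does it), here is a toy Python sketch. The LICENCE_PHRASES table and the recognise_licence function are hypothetical names of my own, and the characteristic phrases are just illustrative examples of how a licence might be spotted in a source file's header.

from typing import Optional

# Toy illustration of text-based licence recognition: each licence is
# associated with a characteristic phrase, and a source file is labelled
# with the first licence whose phrase appears near the top of the file.
LICENCE_PHRASES = {
    "GPL": "GNU General Public License",
    "BSD": "Redistribution and use in source and binary forms",
    "MIT": "Permission is hereby granted, free of charge",
}

def recognise_licence(source_text: str) -> Optional[str]:
    """Return a licence label if a characteristic phrase is found in the header."""
    header = source_text[:2000].lower()  # licence notices usually sit near the top
    for label, phrase in LICENCE_PHRASES.items():
        if phrase.lower() in header:
            return label
    return None

if __name__ == "__main__":
    sample = ("/* This program is free software; you can redistribute it\n"
              " * under the terms of the GNU General Public License. */\n")
    print(recognise_licence(sample))  # prints: GPL

Of course, real source files paraphrase, truncate and translate their licence notices, which is exactly why simple matching like this can only be part of the story.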
Lastly…
Lastly, I should consider what this idea of a semantic web search engine, as presented, is lacking. First, there is the issue that licence metadata (URLs, if nothing else) needs to be entered by someone with the relevant knowledge; the system cannot discover it on its own. Second, there are the issues of false positives (web pages erroneously included in search results) and false negatives (suitable web pages erroneously excluded from them). The former is the more obvious problem from a user's perspective, and my description here focuses mostly on avoiding false positives. False negatives are much harder to avoid, and are exacerbated both by the system's incomplete domain knowledge and by the inevitable non-standard licensing that will be seen on the web.
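To make the first point concrete, here is a minimal Python sketch of how such a system might classify a page once someone has supplied the licence URLs. The KNOWN_LICENCE_URLS list, the LicenceLinkParser class and the classify_page function are my own illustrative names, and looking for rel="license" links is just one of several ways licence metadata shows up in practice; this is a sketch of the idea, not part of any existing system.

from html.parser import HTMLParser

# Hand-entered seed list of known licence URLs (a few Creative Commons
# licences shown here for illustration). The system cannot discover
# these on its own; someone with licensing knowledge must supply them.
KNOWN_LICENCE_URLS = {
    "http://creativecommons.org/licenses/by/2.5/",
    "http://creativecommons.org/licenses/by-sa/2.5/",
    "http://creativecommons.org/licenses/publicdomain/",
}

class LicenceLinkParser(HTMLParser):
    """Collect the href of every <a rel="license"> or <link rel="license">."""

    def __init__(self):
        super().__init__()
        self.licence_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if tag in ("a", "link") and "license" in rels and attrs.get("href"):
            self.licence_links.append(attrs["href"])

def classify_page(html: str) -> bool:
    """Return True if the page declares a licence from the known list."""
    parser = LicenceLinkParser()
    parser.feed(html)
    return any(href in KNOWN_LICENCE_URLS for href in parser.licence_links)

if __name__ == "__main__":
    sample = ('<html><body><a rel="license" '
              'href="http://creativecommons.org/licenses/by-sa/2.5/">CC BY-SA</a></body></html>')
    print(classify_page(sample))  # True: matches a seeded licence URL

The trade-off described above falls straight out of this sketch: any licence or wording missing from the seed list produces a false negative, while a page that links to a licence it does not actually grant produces a false positive.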
Thanks for reading all the way to the end of this final post in the series. If you got this far, stay tuned – I'm sure there's more to come.