Tuesday, October 17, 2006
(This is part 3 of a series of posts on semantic web search for commons. For the previous posts, see here: part 1, part 2)
Another possible expansion to the proposed semantic web search engine is to include non-web documents hosted on the web. Google, for example, already provides this kind of search with Google Image Search: although Google cannot distil meaning from the images themselves, it can discover enough from their textual context to make them meaningfully searchable.
Actually, from the perspective of semantic web search as discussed in this series, images need not be considered separately, because if an image is included in a licensed web page, then that image can be considered licensed. The real problem is non-HTML documents that are linked to from HTML web pages: the search engine has direct access to them but prima facie has no knowledge of their usage rights.
This can be tackled from multiple perspectives. If the document can be rendered as text (i.e. it is a text document in some format), then the other licence-detection mechanisms can be applied to it (for example, examining similarity to known text-based licences). If the document is an archive file (such as a zip or tar), then any file in the archive that is an impromptu licence could indicate licensing for the whole archive. Also, any known RDF that makes usage-rights statements about the document, especially RDF embedded in pages that link to it, can be considered to indicate licensing.
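As a rough illustration of the archive case, here is a minimal Python sketch that opens a zip file and compares each member against a set of known licence texts. The names `KNOWN_LICENCES`, `best_licence_match` and `licences_in_zip`, the sample licence text and the 0.8 similarity threshold are all assumptions made up for this example; a real system would need a proper licence corpus and a more robust similarity measure than `difflib`.

```python
import difflib
import io
import zipfile

# Hypothetical stand-in for a real corpus of known licence texts.
KNOWN_LICENCES = {
    "example-permissive": (
        "Permission is granted to copy, distribute and modify this work "
        "provided that this notice is preserved."
    ),
}


def best_licence_match(text, threshold=0.8):
    """Return (name, score) for the closest known licence, or None."""
    normalised = " ".join(text.lower().split())
    best = None
    for name, licence in KNOWN_LICENCES.items():
        reference = " ".join(licence.lower().split())
        score = difflib.SequenceMatcher(None, normalised, reference).ratio()
        if best is None or score > best[1]:
            best = (name, score)
    # The threshold is an arbitrary assumption, not a tuned value.
    if best is not None and best[1] >= threshold:
        return best
    return None


def licences_in_zip(data):
    """Map each archive member that resembles a known licence to its match."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Treat every member as text; undecodable bytes are replaced.
            text = archive.read(info).decode("utf-8", errors="replace")
            match = best_licence_match(text)
            if match is not None:
                found[info.filename] = match
    return found
```

Any member that matches could then be treated as a hint about the licensing of the whole archive, subject to whatever confidence rules the search engine applies elsewhere.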
Public domain works
The last area I will consider, which is also the hardest, is that of public domain works. These works are hard to identify because they need no technical mechanism to demonstrate that they are available for public use: simply having been created sufficiently long ago is all that is needed. Because the web is so much younger than the current copyright term, only a small portion of the actual public domain is available online, although significant effort has been, and continues to be, made to make public domain books available in electronic form.
The simplest starting point for tackling this issue is Creative Commons, where there is a so-called 'public domain dedication' statement that can be linked to and referred to in RDF to indicate either that the author dedicates the work to the public domain, or that it is already a public domain work (the latter is a de facto standard, although not officially promoted by Creative Commons). Both of these fit easily into the framework so far discussed.
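A minimal sketch of the link-based check might look like the following. The URL in `PUBLIC_DOMAIN_URL` is my assumption about where the Creative Commons public domain dedication lives and should be verified; the class and function names are invented for this example. Note that a link alone cannot distinguish "dedicated by the author" from "already in the public domain".

```python
from html.parser import HTMLParser

# Assumed URL of the Creative Commons public domain dedication.
PUBLIC_DOMAIN_URL = "http://creativecommons.org/licenses/publicdomain/"


class PublicDomainLinkFinder(HTMLParser):
    """Record whether a page links to the public domain dedication."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        href = dict(attrs).get("href") or ""
        # Ignore a trailing slash when comparing the two URLs.
        if href.rstrip("/") == PUBLIC_DOMAIN_URL.rstrip("/"):
            self.found = True


def links_to_public_domain_dedication(html):
    """Return True if the HTML contains a link to the dedication."""
    finder = PublicDomainLinkFinder()
    finder.feed(html)
    return finder.found
```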
Beyond this, it gets very difficult to establish that a web page or other document is not under copyright. Because there is no other (known) standard way of stating that a work is in the public domain, the best strategy is likely to be to establish the copyright date (date of authorship) in some way, and then to infer public domain status. This may be possible by identifying copyright notices such as "Copyright (c) 2006 Ben Bildstein. All Rights Reserved." Another possibility may be to identify the authors of works, if possible, and then compare those authors with a database of known authors' death dates, since in many jurisdictions the copyright term runs from the author's death.
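Both ideas can be sketched in a few lines of Python. Everything here is an assumption for illustration: the regular expression handles only notices shaped like the one above (and will truncate names containing a full stop, such as initials), `HYPOTHETICAL_DEATH_YEARS` stands in for a real author database, and the default term of 70 years after death is one common rule, not a universal one.

```python
import re

# Matches notices such as "Copyright (c) 2006 Ben Bildstein."
COPYRIGHT_NOTICE = re.compile(
    r"copyright\s*(?:\(c\)|©)?\s*(\d{4})\s+(.+?)\.",
    re.IGNORECASE,
)

# Hypothetical stand-in for a database of known authors' death years.
HYPOTHETICAL_DEATH_YEARS = {"Jane Example": 1890}


def parse_notice(text):
    """Return (year, holder) from the first copyright notice, or None."""
    match = COPYRIGHT_NOTICE.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def may_be_public_domain(author, current_year, term_after_death=70):
    """Return True or False if the author's death year is known, else None.

    Assumes protection lasts until the end of the calendar year that falls
    term_after_death years after the author's death.
    """
    death_year = HYPOTHETICAL_DEATH_YEARS.get(author)
    if death_year is None:
        return None
    return current_year > death_year + term_after_death
```

A result of `None` simply means "unknown", which is likely to be the most common outcome in practice.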
But wait, there’s more
In my next and last post for this series, I will consider how far Google has got in tackling these issues, and consider the problems in the framework set out here. Stay tuned.