Tuesday, February 19, 2008
(following on from this post)
http://www.archive.org/web/researcher/intended_users.php
I'll certainly be looking into this further.
(update: On further investigation, it doesn't look so good. http://www.archive.org/web/researcher/researcher.php says:
"We are in the process of redesigning our researcher web interface. During this time we regret that we will not be able to process any new researcher requests. Please see if existing tools such as the Wayback Machine can accommodate your needs. Otherwise, check back with us in 3 months for an update."
This seems understandable, except for this, on the same page:
"This material has been retained for reference and was current information as of late 2002."
That's over 5 years. And in Internet time, that seems like a lifetime. I'll keep investigating.)
Labels: ben, open access, quantification
The Government is conducting a review of the recently introduced format-shifting exceptions in the Copyright Act (sections 47J and 110AA). The review is required by the Copyright Amendment Act 2006. The Attorney-General's Department has released an issues paper inviting submissions on the operation of these provisions. More information is available here.
The deadline is just around the corner (29 February) - so get submitting!
Labels: abi, exceptions, review
Earlier in the week, the SMH reported that the Government is considering forcing ISPs to disconnect users who access pirated material (three strikes and you're out, UK-style).
Kim Weatherall has done an excellent overview of the problems with this approach.
Labels: abi, enforcement, piracy
There is a problem with search engines at the moment. Not any one in particular - I'm not saying Google has a problem. Google seems to be doing what they do really well. Actually, the problem is not so much something that is being done wrong, but something that is just not being done. Now, if you'll bear with me for a moment...
The very basics of web search
Web search engines, like Google, Yahoo, Live, etc., are made up of a few technologies:
- Web crawling - downloading web pages; discovering new web pages
- Indexing - like the index in a book: working out which pages have which features (mostly keywords, though there can be others) and storing them in separate lists for later access
- Performing searches - when someone wants to do a keyword search, for example, the search engine can look up the keywords in the index, and find out which pages are relevant
But what I'm interested in here is web crawling. Perhaps that has something to do with the fact that online commons quantification doesn't require indexing or performing searches. But bear with me - I think it's more than that.
A bit more about the web crawler
There are lots of tricky technical issues about how to do the best crawl - to cover as many pages as possible, to have the most relevant pages possible, to maintain the latest version of the pages. But I'm not worried about this now. I'm just talking about the fundamental problem of downloading web pages for later use.
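To make that fundamental problem concrete, here's a very rough sketch of what a crawler does at its core - fetch a page, pull out the links, queue the new ones, repeat. It's an illustration only (standard-library Python, with no robots.txt handling, politeness delays or real error recovery), not how Google or anyone else actually does it:

```python
# A bare-bones breadth-first crawler: fetch a page, extract links, queue new ones.
# Illustrative only - a real crawler needs robots.txt handling, politeness delays,
# duplicate detection beyond exact URLs, and far more robust error handling.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50):
    """Download up to max_pages pages reachable from seed_url, returning {url: html}."""
    seen = {seed_url}
    queue = deque([seed_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


if __name__ == "__main__":
    results = crawl("http://example.com/")
    print(len(results), "pages downloaded")
```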
Anyone reading this who hasn't thought about the insides of search engines before is probably marvelling at the sheer volume of web pages that have to be downloaded and stored. And you should be.
They're all downloading the same data
So does a single search engine basically have to download the whole web? Well, some certainly have to try. Google, Yahoo and Live are trying. I don't know how many others are trying, and many of them may not be using their data publicly, so we may never see them. There are clearly more of them than I've ever heard of, at least - take a look at Wikipedia's robots.txt file: http://en.wikipedia.org/robots.txt.
My point is: why does everyone have to download the same data? Why isn't there some open crawler somewhere doing it once for everyone, and presenting that data through some simple interface? I have a personal belief that when someone says 'should', you should* listen to them critically. I'm not saying Google should give away their data - it must be worth millions of dollars to them. I'm not saying anyone else should give away all their data either. But I am saying that, from an economic point of view, someone should be doing this: everyone is downloading the same data, there's a cost to doing that, and the cost would be smaller if they could get together and share it.
Here's what I'd like to see specifically (I've sketched what such an interface might look like below):
- A good web crawler, crawling the web and thus keeping an up-to-date cache of the best parts of the web
- An interface that lets you download this data, or diffs from a previous time
- An interface that lets you download just some. E.g. "give me everything you've got from cyberlawcentre.org/unlocking-ip" or "give me everything you've got from *.au (Australian registered domains)" or even "give me everything you've got that links to http://labs.creativecommons.org/licenses/zero-assert/1.0/us/"
- Note that in these 'interface' points, I'm talking about downloading data in some raw format that you can then use to, say, index and search with your own search engine.
* I know.
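To be clearer about what I mean by 'interface', here's a rough sketch of the kind of thing I'm imagining. Nothing like this exists as far as I know - the service, its URL and the method names below are all made up - it's just the shape of the thing I'd like someone to offer:

```python
# Purely hypothetical sketch of an "open crawl" client - the service, its URL and
# these method names are all invented to illustrate the kind of interface I mean.
from datetime import datetime
from typing import Iterator, Tuple


class OpenCrawlClient:
    """Imaginary client for a shared, openly accessible web crawl."""

    def __init__(self, base_url="http://open-crawl.example.org/api"):  # made-up URL
        self.base_url = base_url

    def full_dump(self) -> Iterator[Tuple[str, bytes]]:
        """Yield (url, raw_page) pairs for the whole current crawl."""
        raise NotImplementedError("sketch only")

    def diff_since(self, when: datetime) -> Iterator[Tuple[str, bytes]]:
        """Yield only pages added or changed since a previous download."""
        raise NotImplementedError("sketch only")

    def pages_matching(self, domain_pattern: str) -> Iterator[Tuple[str, bytes]]:
        """Yield pages whose URL matches a pattern, e.g. '*.au' or
        'cyberlawcentre.org/unlocking-ip/*'."""
        raise NotImplementedError("sketch only")

    def pages_linking_to(self, target_url: str) -> Iterator[Tuple[str, bytes]]:
        """Yield pages that contain a link to target_url, e.g. a licence URL."""
        raise NotImplementedError("sketch only")


# With something like this, you could feed the raw pages straight into your own
# indexer, or count licence links for commons quantification, without running
# your own crawl:
#
#   client = OpenCrawlClient()
#   for url, page in client.pages_linking_to(
#           "http://labs.creativecommons.org/licenses/zero-assert/1.0/us/"):
#       ...  # index it, count it, whatever you need
```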
Labels: ben, open access, quantification, search
Monday, February 18, 2008
As you aren't aware*, Creative Commons has some data on the quantification of Creative Commons licence usage (collected using search engine queries). It's great that they are a) collecting this data, and b) sharing it freely.
If you look around, you can probably find some graphs based on this data, and that's probably interesting in itself. Tomorrow I'll see about dusting off my Perl skills, and hopefully come up with a graph of the growth of Australian CC licence usage. Stay tuned.
* If you knew about this, why didn't you tell me!
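In the meantime, here's a rough sketch of the sort of script I have in mind - in Python rather than Perl, and assuming the licence counts end up in a simple CSV of month and count, which may well not match the format CC actually publishes:

```python
# Rough sketch of graphing CC licence usage growth over time.
# Assumes a CSV file with lines like "2005-06,12345" - the real CC data
# almost certainly needs massaging into this shape first.
import csv
import matplotlib.pyplot as plt


def load_counts(path):
    """Read (month, count) rows from a CSV file."""
    months, counts = [], []
    with open(path, newline="") as f:
        for month, count in csv.reader(f):
            months.append(month)
            counts.append(int(count))
    return months, counts


if __name__ == "__main__":
    months, counts = load_counts("au_cc_licence_counts.csv")  # hypothetical file
    plt.plot(range(len(counts)), counts, marker="o")
    plt.xticks(range(len(months)), months, rotation=45)
    plt.title("Australian CC licence usage (estimated from search engine queries)")
    plt.ylabel("Linking pages found")
    plt.tight_layout()
    plt.savefig("au_cc_growth.png")
```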
Labels: ben, Creative Commons, licensing, quantification
Thursday, February 14, 2008
Creative Commons have come up with a better way for people to mark works as copyright-free, or part of the public domain. It's called CC0 (CC Zero). The page for using it is here: http://labs.creativecommons.org/license/zero.
There are two options here. The first is a waiver, where you can "waive all copyrights and related or neighboring interests that you have over a work". The second is an assertion, where you can "assert that a work is free of copyright as a matter of fact, for example, because the work is old enough to be in the public domain, or because the work is in a class not protected by copyright such as U.S. government works."
It's pretty neat. I've long thought that asserting a work's copyright status as a matter of fact was a good idea - and not just for the public domain, but for other classes of usage rights too.
Okay, so that's basically the CC0 story. I've tried it out with a trivial web page I think I would otherwise have copyright in - the result is at the bottom of this post. But I must say I'm slightly disappointed in the lack of embedded metadata. Where's the RDF? As I've talked about before, when you do things with RDF, you allow sufficiently cool search engines to understand your new technology (or licence) simply by seeing it, without first having to be told about it.
Here's my example waiver:
To the extent possible under law,
Ben Bildstein has waived all copyright, moral rights, database rights, and any other rights that might be asserted over Sensei's Library: Bildstein/Votes.
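And since I was grumbling about the missing RDF above, here's a rough sketch of the kind of machine-readable statement I'd like to see generated alongside a waiver like that, written with Python's rdflib. The property choices and URLs are my own guesses - whatever CC settles on for CC0 may well look different:

```python
# My guess at the kind of embedded metadata a CC0 waiver could carry, built with
# rdflib. The property choices and URLs below are assumptions, not CC's actual
# CC0 vocabulary.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

CC = Namespace("http://creativecommons.org/ns#")  # the ccREL namespace

# Placeholder identifier for the work - in practice this would be the page's URL.
work = URIRef("http://example.org/bildstein-votes")
# Guessing the waiver deed has a URL along the lines of the zero-assert one above.
waiver = URIRef("http://labs.creativecommons.org/licenses/zero-waive/1.0/us/")

g = Graph()
g.bind("cc", CC)
g.bind("dc", DC)
g.add((work, DC.title, Literal("Sensei's Library: Bildstein/Votes")))
g.add((work, DC.creator, Literal("Ben Bildstein")))
g.add((work, CC.license, waiver))  # using cc:license loosely to point at the waiver

print(g.serialize(format="xml"))
```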
Labels: ben, Creative Commons, search