Tuesday, February 19, 2008
(following on from this post)
http://www.archive.org/web/researcher/intended_users.php
I'll certainly be looking into this further.
(update: On further investigation, it doesn't look so good. http://www.archive.org/web/researcher/researcher.php says:
We are in the process of redesigning our researcher web interface. During this time we regret that we will not be able to process any new researcher requests. Please see if existing tools such as the Wayback Machine can accommodate your needs. Otherwise, check back with us in 3 months for an update.

This seems understandable, except for this, on the same page:
This material has been retained for reference and was current information as of late 2002.

That's over 5 years. And in Internet time, that seems like a lifetime. I'll keep investigating.)
Labels: ben, open access, quantification
There is a problem with search engines at the moment. Not any one in particular - I'm not saying Google has a problem. Google seems to be doing what they do really well. Actually, the problem is not so much something that is being done wrong, but something that is just not being done. Now, if you'll bear with me for a moment...
The very basics of web search
Web search engines, like Google, Yahoo, Live, etc., are made up of a few technologies:
- Web crawling - downloading web pages; discovering new web pages
- Indexing - like the index in a book: figure out which pages have which features (meaning keywords, though there may be others), and store them in separate lists for later access
- Performing searches - when someone wants to do a keyword search, for example, the search engine can look up the keywords in the index, and find out which pages are relevant
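As a toy illustration of the indexing and search steps (my own sketch, nothing like what a real engine does at scale), here's a minimal inverted index in Python:

```python
# Build a toy inverted index: map each keyword to the set of pages containing it.
pages = {
    "page1": "open access to web data",
    "page2": "web crawling downloads web pages",
    "page3": "open source search engines",
}

index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

def search(*keywords):
    """Return pages containing every keyword (a simple AND query)."""
    results = [index.get(k, set()) for k in keywords]
    return set.intersection(*results) if results else set()

print(sorted(search("open")))          # ['page1', 'page3']
print(sorted(search("web", "pages")))  # ['page2']
```

The point of the index is that answering a query is a cheap lookup: the expensive work (reading every page) was done once, up front.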
But what I'm interested in here is web crawling. Perhaps that has something to do with the fact that online commons quantification doesn't require indexing or performing searches. But bear with me - I think it's more than that.
A bit more about the web crawler
There are lots of tricky technical issues about how to do the best crawl - to cover as many pages as possible, to have the most relevant pages possible, to maintain the latest version of the pages. But I'm not worried about this now. I'm just talking about the fundamental problem of downloading web pages for later use.
Anyone who is reading this and hasn't thought about the insides of search engines before is probably wondering at the sheer amount of downloading of web pages required, and storing them. And you should be.
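Stripped of all the tricky issues, the core of a crawler is just a frontier of URLs to visit and a record of what's been seen. Here's a sketch; `fetch_links` is a hypothetical stand-in for the real work of downloading a page and extracting its outgoing links:

```python
from collections import deque

# Hypothetical stand-in for an HTTP fetch: a tiny fake web, mapping each
# URL to the links found on that page.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/"],
}

def fetch_links(url):
    return LINKS.get(url, [])

def crawl(seed, limit=100):
    """Breadth-first crawl from a seed URL, visiting each page at most once."""
    frontier = deque([seed])
    visited = set()
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        frontier.extend(fetch_links(url))
    return visited

print(sorted(crawl("a.example/")))  # ['a.example/', 'a.example/about', 'b.example/']
```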
They're all downloading the same data
So a single search engine basically has to download the whole web? Well, some certainly have to try. Google, Yahoo and Live are trying. I don't know how many others are trying, and many of them may not be using their data publicly, so we may not see them. There are clearly more than I've ever heard of - take a look at Wikipedia's robots.txt file: http://en.wikipedia.org/robots.txt.
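Crawlers are expected to honour robots.txt, and Python's standard library can parse one. The rules below are made up for illustration (they're not Wikipedia's actual file), but a real crawler would fetch the site's robots.txt and apply it the same way:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from memory rather than fetched over HTTP.
rules = """\
User-agent: GreedyBot
Disallow: /

User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GreedyBot", "http://example.org/page"))       # False
print(rp.can_fetch("PoliteBot", "http://example.org/page"))       # True
print(rp.can_fetch("PoliteBot", "http://example.org/private/x"))  # False
```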
My point is: why does everyone have to download the same data? Why isn't there some open crawler somewhere that's doing it all for everyone, and then presenting that data through some simple interface? I have a personal belief that when someone says 'should', you should* be critical in listening to them. I'm not saying here that Google should give away their data - it would have to be worth millions of dollars to them. I'm not saying anyone else should be giving away all their data. But I am saying that there should be someone doing this, from an economic point of view - everyone is downloading the same data, there's a cost to doing that, and the cost would be smaller if they could get together and share it.
Here's what I'd like to see specifically:
- A good web crawler, crawling the web and thus keeping an up-to-date cache of the best parts of the web
- An interface that lets you download this data, or diffs from a previous time
- An interface that lets you download just a subset. E.g. "give me everything you've got from cyberlawcentre.org/unlocking-ip" or "give me everything you've got from *.au (Australian registered domains)" or even "give me everything you've got that links to http://labs.creativecommons.org/licenses/zero-assert/1.0/us/"
- Note that in these 'interface' points, I'm talking about downloading data in some raw format, that you can then use to, say, index and search with your own search engine.
* I know.
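To make the wish-list above concrete, here's a rough sketch of what querying such a shared crawl might look like. It's entirely hypothetical - the cache, function names and timestamps are mine - but it shows the two access patterns I have in mind: fetch by URL pattern, and fetch everything crawled since a given time.

```python
from fnmatch import fnmatch

# Hypothetical shared crawl cache: URL -> (last-crawled timestamp, raw content).
cache = {
    "http://cyberlawcentre.org/unlocking-ip/": (1200000000, "<html>UIP</html>"),
    "http://example.com.au/": (1203000000, "<html>AU site</html>"),
    "http://example.org/": (1100000000, "<html>other</html>"),
}

def fetch_matching(pattern):
    """Give me everything you've got matching a URL glob pattern."""
    return {u: c for u, (t, c) in cache.items() if fnmatch(u, pattern)}

def fetch_since(timestamp):
    """Give me everything crawled since a given time (a crude 'diff')."""
    return {u: c for u, (t, c) in cache.items() if t > timestamp}

print(sorted(fetch_matching("*.au/*")))   # just the Australian domain
print(sorted(fetch_since(1150000000)))    # the two most recently crawled pages
```

Anyone could then feed the raw pages returned by an interface like this into their own indexer, instead of re-crawling the web themselves.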
Labels: ben, open access, quantification, search
Thursday, May 10, 2007
The OAK Law Project, based at the Queensland University of Technology, has followed up its thought-provoking first OAK Law report with two new guides focusing on digital copyright issues.
The first, ‘A Guide to Developing Open Access Through Your Digital Repository’, is aimed at helping users understand the issues in developing and building open access digital repositories.
The second hits quite close to home for both me and my fellow housemate Ben. Titled 'Copyright Guide for Research Students: What you need to know about copyright before depositing your electronic thesis in an online repository', its aim is to assist research students in understanding their copyright rights, obligations and responsibilities when adding their theses to digital repositories.
Both guides are licensed under Creative Commons Australia licences. I'm thrilled at the release of these reports. Even as someone whose thesis focuses on aspects of copyright law, there are so many copyright questions that arise in relation to your thesis, publishing, digital repositories, etc. that it's hard to keep track of them all!
Labels: catherine, open access