Monday, May 19, 2008
I said recently that defining the Australian web is an issue in itself. I thought I'd say a little more about how the National Library's crawls handled the issue.
First, the National Library's crawls were outsourced to the Internet Archive, which is a good thing: the crawling has been done well, the data is in a well-defined format (a few sharp edges, but pretty good), and there's already a decent knowledge base out there for accessing this data.
Now, there are two ways that IA chooses to include a page as Australian:
- domain name ends in '.au' (e.g. all web pages on the unsw.edu.au domain)
- IP address is registered as Australian in a geolocation database
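To make those two rules concrete, here is a minimal sketch in Python. It is only my reading of the criteria, not the Internet Archive's actual code. The function names are made up for illustration, and `ip_country` is assumed to be a two-letter country code that has already been looked up in whatever geolocation database is at hand.

```python
from urllib.parse import urlsplit


def has_au_domain(url):
    """Return True if the URL's hostname ends in '.au'.

    URLs without a scheme (e.g. 'unsw.edu.au/about') have no parsed
    hostname and are treated as non-matching in this sketch.
    """
    host = urlsplit(url).hostname or ""
    # Normalise case and drop a trailing dot from fully qualified names.
    return host.lower().rstrip(".").endswith(".au")


def is_australian(url, ip_country=None):
    """Apply the two rules: a .au domain, or an IP geolocated to Australia.

    ip_country is a hypothetical, pre-computed country code for the
    server's IP address; the geolocation lookup itself is out of scope.
    """
    return has_au_domain(url) or ip_country == "AU"
```

For example, `is_australian("http://www.unsw.edu.au/about")` passes on the domain rule alone, while a page on a `.com` host only passes if its `ip_country` comes back as `"AU"`.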
Actually, there is a third kind of page in the crawls. The crawls were done with a setting that included some pages linked directly from Australian pages (example: slashdot.org), though not sub-pages of these. I'll have to address this, and I can think of a few ways:
- Do a bit of geolocation myself
- Exclude pages where sibling pages aren't in the crawl
- Don't draw nationally oriented conclusions, or when I do, restrict them to the .au domains
- Argue that it's a small portion of the crawl, so not worth worrying about
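Two of these options are easy to sketch in Python. This is a rough illustration rather than what I've actually run: the function names are hypothetical, the input is assumed to be a plain list of crawled URLs, and the `min_pages=2` threshold for "has siblings" is my own guess at a reasonable cut-off.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased hostname of a URL, or '' if it cannot be parsed."""
    return (urlsplit(url).hostname or "").lower()


def only_au(urls):
    """Restrict to pages whose hostname ends in '.au'."""
    return [u for u in urls if host_of(u).endswith(".au")]


def drop_lone_pages(urls, min_pages=2):
    """Exclude pages whose host has no sibling pages in the crawl.

    A page pulled in only because an Australian page linked to it has
    no sub-pages crawled, so its host tends to appear just once.
    """
    # Count distinct URLs per host so repeated captures don't inflate it.
    pages_per_host = Counter(host_of(u) for u in set(urls))
    return [u for u in urls if pages_per_host[host_of(u)] >= min_pages]
```

The sibling heuristic will also throw away genuinely Australian sites that only have one page captured, so it trades some recall for a cleaner set.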
(Thanks to Alex Osborne and Paul Koerbin from the National Library for detailing the specifics for me.)
Labels: ben