Thursday, July 12, 2007
Today I just wanted to take a brief look at some of the problems I'm finding myself tackling with the deep web commons question. There are two main ones, from quite different perspectives, but for both I can briefly describe my current thoughts.
Flattening a web form
The first problem is how to represent a web form in such a way that it can be used as input to an automated system that evaluates it. Ideally, in machine learning, you have a set of attributes that form a vector, and you use that vector as the input to your algorithm. In tic-tac-toe, for example, you might represent a cross by -1, a nought by +1, and an empty space by 0, and then the whole game state can be represented by 9 of these 'attributes'.
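To make the tic-tac-toe example concrete, here is a minimal sketch of that encoding. The board layout (three strings of three characters) and the names `ENCODING` and `board_to_vector` are my own assumptions for illustration.

```python
# Encoding from the text: cross -> -1, nought -> +1, empty space -> 0.
# The characters "X", "O" and " " are an assumed way of writing the board.
ENCODING = {"X": -1, "O": 1, " ": 0}


def board_to_vector(board):
    """Flatten a 3x3 board (three strings of length 3) into 9 numbers."""
    if len(board) != 3 or any(len(row) != 3 for row in board):
        raise ValueError("expected a 3x3 board")
    return [ENCODING[cell] for row in board for cell in row]


example_board = [
    "XO ",
    " X ",
    "O  ",
]
print(board_to_vector(example_board))
# prints [-1, 1, 0, 0, -1, 0, 1, 0, 0]
```

Every board, whatever its state, comes out as a vector of exactly 9 numbers, which is what makes it easy to hand to a learning algorithm.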
But for web forms it's not that simple. A web form is made up of several different kinds of parts. I've identified the following potentially useful places, each of which may occur one or more times, and all of which take the form of text. These are just the ones I needed when considering Advanced Google Code Search:
- Form text. The actual text of the web form. E.g. "Advanced Code Search About Google Code Search Find results with the regular..."
- Select options. Options in drop-down boxes. E.g. "any language", "Ada", "AppleScript", etc.
- Field names. Underlying names of the various fields. E.g. "as_license_restrict", "as_license", "as_package".
- Result text. The text of each search result. E.g. (if you search for "commons"): "shibboleth-1.3.2-install/.../WrappedLog.java - 8 identical 26: package..."
- Result link name. Hyperlinks in the search results. E.g. "8 identical", "Apache"
But as far as I can tell, text makes for bad attributes, and numerical attributes are much better. I'll talk about that more when I talk about ripple down rules.
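Just to illustrate what going from text to numbers could look like, here is a rough sketch that holds the five places listed above in one record and turns them into word counts over a fixed vocabulary. To be clear, word counting is a stand-in I've picked for illustration, not the approach I've settled on, and the names `WebForm`, `tokenize`, `flatten` and the vocabulary are all hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WebForm:
    """The five text 'places' listed above, each holding zero or more strings."""
    form_text: list = field(default_factory=list)
    select_options: list = field(default_factory=list)
    field_names: list = field(default_factory=list)
    result_text: list = field(default_factory=list)
    result_link_names: list = field(default_factory=list)


def tokenize(text):
    """Lower-case the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def flatten(form, vocabulary):
    """Count how often each vocabulary word occurs anywhere in the form."""
    counts = Counter()
    places = (form.form_text, form.select_options, form.field_names,
              form.result_text, form.result_link_names)
    for place in places:
        for text in place:
            counts.update(tokenize(text))
    return [counts[word] for word in vocabulary]


form = WebForm(
    form_text=["Advanced Code Search About Google Code Search"],
    select_options=["any language", "Ada", "AppleScript"],
    field_names=["as_license_restrict", "as_license", "as_package"],
)
# An assumed, hand-picked vocabulary; a real one would have to be chosen or learned.
vocabulary = ["search", "code", "license", "commons"]
print(flatten(form, vocabulary))
# prints [2, 2, 2, 0]
```

Note that this throws away which place each word came from. Keeping one count vector per place would preserve that, at the cost of a vector five times as long.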
A handful of needles in a field of haystacks
The other problem is more about what we're actually looking for. We're talking about web forms that hide commons content. Well, the interesting thing about that is that there are bound to be very few of them compared to the rest of the web forms on the Internet. Heck, web forms are not even all for searching. Some are for buying things. Some are polls.
And so if, as seems likely, most web forms are uninteresting, and we need to enlist an expert to train the system, the expert is going to spend most of the time looking at uninteresting examples.
This makes it harder, but in an interesting way: if I can find some way to have the system, while it's in training, find the most likely candidate of all the possible candidates, it could solve this problem. And that would be pretty neat.
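As a very rough sketch of that idea, assuming the system in training can give each unlabelled form some kind of score, one training round might look like the following. The scoring function here is a toy keyword count that I made up for the example, and `pick_most_likely`, `training_round`, `toy_score` and `ask_expert` are hypothetical names, not part of any existing system.

```python
def pick_most_likely(pool, score):
    """Return the unlabelled candidate that the current scorer rates highest."""
    return max(pool, key=score)


def training_round(pool, score, ask_expert):
    """Show the expert only the most promising candidate and record the answer."""
    candidate = pick_most_likely(pool, score)
    label = ask_expert(candidate)
    remaining = [c for c in pool if c is not candidate]
    return candidate, label, remaining


# Toy stand-in for whatever the partly trained system believes so far.
HINT_WORDS = {"license", "commons", "creative"}


def toy_score(form_text):
    """Count how many words of the form text are hint words."""
    return sum(word in HINT_WORDS for word in form_text.lower().split())


pool = [
    "Buy now: enter your credit card",
    "Vote in our poll",
    "Search by license: creative commons",
]
# Stand-in for the human expert: here, simply a check for the word "commons".
candidate, label, pool = training_round(
    pool, toy_score, ask_expert=lambda text: "commons" in text.lower()
)
print(candidate, label, len(pool))
```

The expert only ever sees the top-ranked form in each round, so the shopping form and the poll never reach them unless nothing better is left in the pool.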
Labels: ben, deep web, research