Tuesday, July 17, 2007
In a previous post, I talked about the problems of finding commons in the deep web; now I want to talk about some possible solutions to these problems.
Ripple Down Rules
Ripple down rules (RDR) is a knowledge acquisition methodology developed at the University of New South Wales. The idea is really simple: you incrementally build a kind of decision tree by having an expert point out what's wrong with the current tree's conclusion on a particular case. It works because the expert only needs to justify why the current system is wrong on that case, rather than come up with a universal correction, and because the resulting system is guaranteed to stay consistent with the expert's evaluation of all previously seen cases (though overfitting can obviously still be a problem).
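To make that concrete, here's a minimal sketch in Python of the kind of exception structure involved. The class, method names and the 'commons' example are my own invention, not any particular RDR implementation, and real RDR systems also record the "cornerstone" case that justified each rule (omitted here for brevity):

```python
class Rule:
    """One node in a minimal, hypothetical ripple-down-rules tree."""

    def __init__(self, condition, conclusion):
        self.condition = condition    # predicate: case -> bool
        self.conclusion = conclusion  # e.g. 'commons' or 'not commons'
        self.except_rule = None       # tried when this rule fires (refines it)
        self.else_rule = None         # tried when this rule does not fire

    def classify(self, case):
        """Return (conclusion, rule that gave it), or None if nothing fires."""
        if self.condition(case):
            if self.except_rule:
                refined = self.except_rule.classify(case)
                if refined is not None:
                    return refined
            return self.conclusion, self
        if self.else_rule:
            return self.else_rule.classify(case)
        return None


# The root is a default rule that always fires, so every case gets some answer.
root = Rule(lambda case: True, 'not commons')

# When the expert disagrees with a conclusion, they justify the disagreement with
# a condition that holds for the offending case, and that condition becomes an
# exception to the rule that gave the wrong answer:
case = {'strings': ['Search', 'license', 'GPL']}
conclusion, culprit = root.classify(case)
if conclusion == 'not commons':   # expert says this form *does* hide commons
    culprit.except_rule = Rule(lambda c: 'license' in c['strings'], 'commons')

print(root.classify(case)[0])     # now 'commons'
```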
The application of ripple down rules to deep web commons is simply this: once you have a general method for flattening web forms, you can use the flattened web form as input to the ripple down rules system and have the system decide whether the web form hides commons.
But how do you create rules from a list of text strings whose size isn't even known in advance? For example, there could be any number of options in a select input (dropdown list), and any number of select inputs in a form. The old "IF weather = 'sunny' THEN play = 'tennis'" type of rule doesn't work. One solution is to make the rule conditions more like questions, with rules like "IF select-option contains-word 'license' THEN form = 'commons'" (a suitable rule for Advanced Google Code Search). Still, I'm not sure this is the best way to express conditions. To put it another way, I'm still not sure that extracting a list of strings of indefinite length is the right way to flatten the form (see this post). Contact me if you know of a better way.
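As a rough illustration of the question-like conditions I mean, over a flattened form: the dictionary layout, the helper, and the option strings below are all made up for the sketch, not taken from any real form.

```python
# A flattened web form: just the extractable text strings, of no fixed size
# (field labels, button text, options from each dropdown, and so on).
flattened_form = {
    'strings': [
        'Search', 'with all of the words', 'Return results where my search',
        'license', 'GNU General Public License', 'MIT License', 'Apache',
    ],
}

def contains_word(case, word):
    """Question-style condition: does any extracted string contain the word?"""
    word = word.lower()
    return any(word in s.lower().split() for s in case['strings'])

# "IF select-option contains-word 'license' THEN form = 'commons'"
if contains_word(flattened_form, 'license'):
    conclusion = 'commons'
```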
A probabilistic approach?
As I have said, one of the most interesting issues I'm facing is the needle in a haystack problem: we're searching for (probably) very few web forms that hide commons in a very, very big World Wide Web full of all kinds of web forms.
Of course computers are good at searching through lots of data, but here's the problem: while you're training your system, you need examples of the system being wrong, so you can correct it. But how do you know when it's wrong? Basically, you have to look at examples and see if you (or the expert) agree with the system. Now in this case we probably want to look through all the positives (forms classified as interesting), so we can use any false positives (actually uninteresting forms) to train the system, but that will quickly train the system to be conservative, which has two drawbacks. Firstly, we'd rather it wasn't conservative, because a less conservative system would be more likely to find more interesting forms. Secondly, because we'll be seeing fewer errors among the forms classified as interesting, we have fewer examples to use to train the system. And to find false negatives (interesting forms incorrectly classified as uninteresting), the expert has to search through all the examples the system doesn't currently think are interesting, which is about as bad as having no system at all and just browsing the web.
So the solution, it seems to me, is to change the system so that it can identify the web form it is most likely to be wrong about. Then we can get the most bang (corrections) for our buck (our expert's time). But how can anything like ripple down rules do that?
Probabilistic Ripple Down Rules
This is where I think the needle in a haystack problem can actually be an asset. I don't know how to make a system that can tell how close an example is to the boundary between interesting and uninteresting (the boundary doesn't really exist, even). But it will be a lot easier to make a system that predicts how likely an example is to be an interesting web form.
This way, the available example rated most likely to be interesting is always the one worth looking at: if it really is interesting, the expert's time was well spent (of course), and if the system has classified it as not interesting, it's the example most likely to have been incorrectly classified, and so it provides a useful training example.
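Here's a minimal sketch of that selection loop, assuming we somehow have a function giving the estimated probability that a form is interesting (how to get such probabilities out of a ripple down rules system is exactly what I'm deferring to the next post); the function and parameter names are mine:

```python
def next_form_for_expert(forms, classify, probability_interesting):
    """Pick the unreviewed form whose expert review buys the most.

    forms: flattened web forms not yet reviewed by the expert
    classify(form): the current rule system's decision ('commons' or not)
    probability_interesting(form): estimated P(form hides commons), in [0, 1]
    """
    # Rank every candidate by how likely it is to be interesting.
    best = max(forms, key=probability_interesting)
    decision = classify(best)
    # If the rules already say "commons", the expert checks a probable hit;
    # if they say "not commons", this is the most probable false negative,
    # so reviewing it gives the most useful correction per unit of expert time.
    return best, decision
```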
I want to talk about how it might be possible to extract probabilities from a ripple down rules system, but this post is long enough already, so I'll leave that for another post.
Labels: artificial intelligence, ben, research
Comments:
What you describe about identifying how close a particular example is to the decision boundary (i.e., finding the interesting ones, where the decision boundary is where an example goes from "yes" to "no") has been formalised in the field of machine learning. There, they call the distance from the decision boundary the "margin". There are certain classifiers that explicitly calculate the distance from the decision boundary and from which you can extract this information. For example, Support Vector Machines select the subset of the examples that are close to the boundary, and most implementations will allow you to dump out your initial dataset with the distance from the boundary attached. AdaBoost and related algorithms keep track of a weight for each example, which tells you how hard it was to classify a particular example with respect to the rest of them. By selecting only the examples with a low distance (for SVMs) or a high weight (for AdaBoost), you can identify the subset of "interesting" examples amongst the majority of easy-to-classify-but-uninteresting examples. We use this technique in our work on computational linguistics to identify the small subset of training examples out of millions that might have been misidentified by the human experts.
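A small sketch of the kind of margin extraction described above, using scikit-learn with toy data; the library choice, the documents, and the labels are purely illustrative, not the commenter's:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy data: each "document" stands in for a flattened web form,
# labelled 1 = hides commons, 0 = not interesting.
texts = [
    "search code license gpl bsd apache",
    "enter your username and password to log in",
    "advanced search by author title publisher license",
    "subscribe to our newsletter enter email",
]
labels = np.array([1, 0, 1, 0])

vectoriser = CountVectorizer(binary=True)      # there-or-not boolean features
X = vectoriser.fit_transform(texts)

svm = LinearSVC()
svm.fit(X, labels)

# Signed distance from the decision boundary; small magnitude = near the margin.
margins = svm.decision_function(X)
order = np.argsort(np.abs(margins))
print("examples closest to the boundary first:")
for i in order:
    print(f"  margin={margins[i]:+.2f}  {texts[i]}")
```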
Ben Bildstein said:
First, thanks for your interest.
There are two aspects of the problem I'm working on that I think make it hard to use distance-from-decision-boundary methods. The first is that they seem (correct me if I'm wrong) to require very well represented data. The second is that this problem appears not to have a specific boundary: there are examples that are clearly interesting, others that are clearly not, and also some where it's just not clear. Google Books Advanced Search is an example.
On the first problem, in an earlier post (http://www.cyberlawcentre.org/unlocking-ip/blog/2007/07/problems.html), I talked about the problem of flattening a web form. Ideally we'd extract a constant set of numerical attributes. Instead, I'm extracting a variable-sized set of text strings, and then allowing the creation of rules expressing existential qualities of these strings.
To me, this looks like it really gets in the way of SVMs, and I guess AdaBoost too, though that's new to me. But I definitely feel like there may be a much better way of extracting attributes that I'm just not seeing. Then again, it also seems that maybe there isn't.
The second problem, that of the decision boundary being so vague, is probably not so bad. I note that Wikipedia says AdaBoost is sensitive to noisy data and outliers (outliers being the more significant issue in this problem domain), but my understanding is that SVMs handle this a bit better. I feel (though again I might be wrong) that if the representation problem had a solution, SVM's support vector lengths might suit my needs.
As this is feeling like an NLP problem in part, I'd love to hear your thoughts.
Extracting a constant set of numerical attributes is more a requirement of implementations of the ML algorithms than a hard-and-fast rule. Theoretically it doesn't matter: you could simply consider your variable-sized set of there-or-not attributes as a fixed-size set of boolean attributes (over the universe of all seen) and then you can use whichever algorithm you want (this appears to be what you are doing). There are implementations of ML algorithms around that don't require that this be done (e.g., TiMBL).
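A minimal sketch of that there-or-not encoding over the universe of strings seen so far; the example forms are made up:

```python
# Turn variable-sized sets of extracted strings into fixed-size boolean vectors
# over the universe of every string seen so far.
def build_vocabulary(all_forms):
    vocab = sorted({s for form in all_forms for s in form})
    return {s: i for i, s in enumerate(vocab)}

def to_boolean_vector(form, vocab):
    vec = [0] * len(vocab)
    for s in form:
        if s in vocab:          # strings never seen in training are simply dropped
            vec[vocab[s]] = 1
    return vec

forms = [
    ["license", "GPL", "search"],
    ["username", "password"],
    ["license", "author", "title"],
]
vocab = build_vocabulary(forms)
X = [to_boolean_vector(f, vocab) for f in forms]   # now any ML algorithm applies
```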
The problem is more that you will end up with a very large number of attributes and you risk overfitting. This is a very frequent problem in NLP, with some potential solutions.
In NLP, people will try to reduce the size of feature sets using various schemes:
- You can filter out words on a stoplist which are known not to have any useful information (e.g., "and", "but", etc.);
- You can keep only words that occur with at least a certain frequency (a short sketch of this and the previous scheme follows below);
- You can use a smoothing function to map different words to the same feature. For example, a morpher or stemmer will map "running", "run" and "ran" to the same form ("run");
- You could use domain-based smoothing. For example, look for any occurrence of { GPL, BSD, ... } and use this as a feature;
- You could use other, trickier kinds of smoothing. For example, if you look at all of the words found in the same page as interesting examples, you might find "license download FAQ copyleft freedom open source...". You could then look (with the aid of a search engine) at the number of pages that contain each of these "seed" words as well as each other word on your page. For example, if you had a page containing "free as in speech", you would find that there are a high number of hits for the search "free as in speech" "copyleft" and be able to infer that this page is related by that means.
I could go on, but this selection of features is *the* bread and butter of NLP with machine learning, and it is by far the most important part.
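The first two of those schemes map directly onto common tooling; here's a sketch with scikit-learn's CountVectorizer, where the library choice, the documents, and the thresholds are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer

# One "document" per flattened form: just its extracted strings joined together.
docs = [
    "search code and license gpl bsd apache",
    "enter your username and password",
    "advanced search by author title publisher and license",
]

vectoriser = CountVectorizer(
    stop_words="english",  # scheme 1: drop stoplist words ("and", "but", ...)
    min_df=2,              # scheme 2: keep only words appearing in >= 2 documents
    binary=True,           # we only care whether a word is there or not
)
X = vectoriser.fit_transform(docs)
print(vectoriser.get_feature_names_out())   # the surviving, reduced feature set
```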
As for SVMs versus AdaBoost, I should first point out that I'm not an expert in SVMs, but in my experience they are useful when:
1. You have a small number of data points compared to the number of features;
2. You can find a domain-specific kernel function that allows you to map your domain to that of the SVM. (Probably not for you, although I believe that some people have done work on string matching kernels.)
I would say that SVMs may in fact be well suited to your domain, even if you are required to create large sparse feature vectors.
As for AdaBoost, it works by identifying which examples are difficult to classify with respect to the other examples and by boosting the effort expended to classify these. So it does tend to focus on outliers.
That being said, what is considered an "outlier" is normally either:
1. An example that was misclassified by the expert (noise), in which case you want to identify and fix it, or
2. An interesting example, because it is a borderline case which serves to illuminate the characteristics of the decision boundary.
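A rough sketch of the per-example weight bookkeeping described above, written as a hand-rolled AdaBoost-style loop over decision stumps just to show where the weights come from; real implementations differ in details, and the function name and round count are mine:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_example_weights(X, y, rounds=20):
    """Run a simple AdaBoost loop and return the final per-example weights.

    y must be in {-1, +1}. Examples that end up with high weight were hard
    to classify: either mislabelled (noise) or genuinely borderline.
    """
    n = X.shape[0]
    w = np.full(n, 1.0 / n)
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = w[miss].sum()
        if err <= 0 or err >= 0.5:      # perfect or useless weak learner: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)
        # Up-weight the examples this stump got wrong, down-weight the rest.
        w *= np.exp(alpha * np.where(miss, 1.0, -1.0))
        w /= w.sum()
    return w
```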
I would say that, in your case, I would choose between AdaBoost and SVMs based on which you can more easily get your hands on, and then massage the data to fit its requirements. A process of identifying the interesting examples is almost always useful in NLP, be it via SVMs or AdaBoost. But most importantly, in nearly any NLP problem where you are using machine learning, features are king, and this is where I would spend my effort.
(If you want to contact me off-line, my address is myfirstname.mylastname at zdzlza.com, with all of the zs replaced with i. If you have any datasets already generated, I could also have a play around and tell you how I see them.)
(Also, what you are trying to do is very difficult... WhizBang tried to do a similar thing during the dot-com boom and famously flamed out, and they had tens of millions of dollars and had hired some of the best minds. Even when they tried to restrict their domain to job postings only, they still weren't able to make their automatic technology good enough to be profitable. If I were you, I would try to think small.)