The attribute Class has just 2 values, poisonous or edible, but in Emping the user can select which attribute is to be predicted. Selecting Habitat, which has 7 values, immediately highlights a difference with Class.
Emping can test for ambiguous rules: rows in the table that are identical except for the consequent value. There were no such rows for Class, but for Habitat there are 4360. This is an OO Calc screenshot. The clusters look equal, but one or more differences show up further to the left. Within the groups, which are separated by a blank line, the only difference is the habitat.
After the blank lines had been removed, the file was imported into the SQLite database as a separate table. From this table the number of ambiguous rules per habitat, shown below, was determined through SQL queries. For each habitat, the number of unambiguous rules is the number of rules in the source table minus the number of ambiguous rules.
Habitat | Original | Ambiguous | Unambiguous |
---|---|---|---|
urban | 368 | 272 | 96 |
grasses | 2148 | 1092 | 1056 |
woods | 3148 | 1016 | 2132 |
paths | 1144 | 1112 | 32 |
waste | 192 | 0 | 192 |
leaves | 832 | 576 | 256 |
meadows | 292 | 292 | 0 |
Total | 8124 | 4360 | 3764 |
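The counting step can be sketched in Python with the standard `sqlite3` module. This is only an illustration, not the queries that were actually used: the table name `rules`, the two antecedent columns and the five rows are invented, and instead of importing Emping's ambiguity file the sketch finds the ambiguous rows directly by grouping on the antecedent columns.

```python
import sqlite3

# Toy stand-in for the source table; columns and rows are invented.
rows = [
    ("convex", "white", "grasses"),
    ("convex", "white", "woods"),    # same antecedent, other habitat: ambiguous
    ("flat",   "brown", "grasses"),
    ("flat",   "brown", "grasses"),  # duplicate with the same habitat: not ambiguous
    ("bell",   "white", "waste"),
]

def ambiguity_counts(rows):
    """Return {habitat: (original, ambiguous, unambiguous)} for the toy table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE rules (cap_shape TEXT, stalk_color TEXT, habitat TEXT)"
    )
    con.executemany("INSERT INTO rules VALUES (?, ?, ?)", rows)
    # The subquery lists every antecedent that occurs with more than one
    # habitat; the LEFT JOIN marks the rows that share such an antecedent.
    query = """
        SELECT r.habitat,
               COUNT(*) AS original,
               SUM(a.cap_shape IS NOT NULL) AS ambiguous
        FROM rules r
        LEFT JOIN (
            SELECT cap_shape, stalk_color
            FROM rules
            GROUP BY cap_shape, stalk_color
            HAVING COUNT(DISTINCT habitat) > 1
        ) a ON a.cap_shape = r.cap_shape AND a.stalk_color = r.stalk_color
        GROUP BY r.habitat
        ORDER BY r.habitat
    """
    result = {h: (orig, amb, orig - amb) for h, orig, amb in con.execute(query)}
    con.close()
    return result

counts = ambiguity_counts(rows)
```

The third number in each tuple is the difference described above: the rules in the source table minus the ambiguous ones.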
In Emping ambiguous rules cancel each other out. If some ravens are black and some (even if only one) are white, then the fact that something is a raven cannot be used to predict its color.
Emping finds those conjunctions of factors which uniquely determine the consequent value. Because there are none for meadows, as follows directly from checking the ambiguities, there are no reductions for it either. For waste all rules are unambiguous. For the others, the reductions still determine the habitat, but the ambiguities limit the number of possibilities. The ambiguous rows therefore do have an effect, and cannot be omitted from the source table.
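What "uniquely determine" means can be made concrete with a small check. The sketch below is not Emping's reduction algorithm; it only tests a given conjunction against a table. The function names and the raven data are invented for the illustration.

```python
def consequents(rows, conjunction, target):
    """Set of target values among the rows that match every factor."""
    return {
        row[target]
        for row in rows
        if all(row.get(attr) == value for attr, value in conjunction.items())
    }

def determines(rows, conjunction, target):
    """True if the matching rows exist and all share one target value."""
    return len(consequents(rows, conjunction, target)) == 1

# Invented toy data in the spirit of the raven example.
birds = [
    {"kind": "raven", "size": "large", "color": "black"},
    {"kind": "raven", "size": "large", "color": "black"},
    {"kind": "raven", "size": "small", "color": "white"},
    {"kind": "swan",  "size": "large", "color": "white"},
]

raven_only = determines(birds, {"kind": "raven"}, "color")
large_raven = determines(birds, {"kind": "raven", "size": "large"}, "color")
```

Here `raven_only` is false, because one raven is white, while `large_raven` is true: adding a factor can restore a unique consequent.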
The reduction with Habitat as the consequent attribute took 48 minutes, about 7 minutes per value. Each reduction compares the positive values with the rows of negations, so more values mean not only more reductions, but also more comparisons for each reduction. There are 6055 reductions for Habitat. The screenshot shows the start, from the left, with all consequent values urban.
Because Emping orders all reductions by length, it can be seen immediately that a sunken cap shape implies an urban habitat. As in the first example, it is easy to test the singletons. The screenshot shows the results for woods. Each commented number to the right is the total number of rows for that partial query. The fact that some numbers do not change indicates that the new attribute value is already implied by the rows selected by the prior attribute values.
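The partial queries can be reproduced with a short script. This is a sketch under assumptions: the table `mushrooms`, its columns and its five rows are invented, and `partial_counts` is a hypothetical helper, not part of Emping or Sqliteman.

```python
import sqlite3

def partial_counts(con, table, factors):
    """Count the matching rows after adding each factor in turn."""
    counts = []
    for i in range(1, len(factors) + 1):
        active = factors[:i]
        # Table and column names are interpolated, so use trusted names only;
        # the values themselves are passed as bound parameters.
        where = " AND ".join(f'"{col}" = ?' for col, _ in active)
        sql = f'SELECT COUNT(*) FROM "{table}" WHERE {where}'
        (n,) = con.execute(sql, [val for _, val in active]).fetchone()
        counts.append(n)
    return counts

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mushrooms (cap_shape TEXT, odor TEXT, habitat TEXT)")
con.executemany(
    "INSERT INTO mushrooms VALUES (?, ?, ?)",
    [
        ("sunken", "none", "urban"),
        ("sunken", "none", "urban"),
        ("convex", "none", "urban"),
        ("convex", "foul", "woods"),
        ("flat",   "none", "woods"),
    ],
)

factors = [("odor", "none"), ("cap_shape", "sunken"), ("habitat", "urban")]
counts = partial_counts(con, "mushrooms", factors)
con.close()
```

In this toy table the count stays the same when the habitat is added as the last factor, which is the pattern described above: the rows already selected all have that value.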
Next the abductions were performed for all attribute values in the reduction table. This did not include meadows, as indicated by the following screenshot.
Note: due to a bug in Emping-0.6, the program unfortunately hangs at this point. To avoid this, first check the reduction spreadsheet for any missing values. The results for the other habitat values were:
Habitat | Reductions | Equivalences | Least General | Most General |
---|---|---|---|---|
urban | 775 | 89 | 32 | 1 |
grasses | 1021 | 355 | 155 | 1 |
woods | 2118 | 653 | 232 | 20 |
paths | 666 | 46 | 14 | 1 |
waste | 486 | 81 | 21 | 1 |
leaves | 989 | 149 | 47 | 8 |
Totals | 6055 | 1373 | 501 | 32 |
The interesting ones are those with only 1 most general group of rules. The graph for grasses, for example, as shown with dotty, looks like this.
The 155 least general rules or equivalence classes of rules, which all imply a habitat of grasses, all factor through just 1 most general equivalence class. The zoomed-out graph shows that this most general class is node 32, and that it contains 6 equivalent rules.
The OO Calc spreadsheet shows
The six rows are
Gill Spacing = crowded | Gill Size = broad | ||
Bruises? = no | Gill Size = broad | Stalk Color above Ring = white | |
Bruises? = no | Gill Size = broad | Stalk color below Ring = white | |
Bruises? = no | Odor = none | Gill Spacing = crowded | Stalk color below Ring = white |
Class = edible | Odor = none | Gill Spacing = crowded | Stalk color below Ring = white |
Class = edible | Bruises? = no | Gill spacing = crowded | Stalk color below Ring = white |
The first thing is to test whether these conjunctions are indeed equivalent. The first query is shown here.
Testing the reverse showed that no bruises, a broad gill size and a white stalk color above the ring indeed imply a crowded gill spacing (and a habitat of grasses). The number of rows was 1056, the same as for the first query.
Because equivalence is transitive, we can test all equivalences by starting with the first case, testing each case against the properties of the next, and finally testing case 1 against case 6. Here is another screenshot.
Again, as in all 6 tests, the number of rows was 1056. This is exactly the number of unambiguous rules for grasses we calculated before.
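The equivalence tests compare row counts of SQL queries. The same idea can be sketched in Python by comparing the sets of matching rows directly, which is a slightly stricter check than comparing counts. The helper names, the simplified attribute names and the four rows are invented for the illustration.

```python
def matching(rows, conjunction):
    """Indices of the rows that satisfy every factor of the conjunction."""
    return {
        i
        for i, row in enumerate(rows)
        if all(row.get(attr) == value for attr, value in conjunction.items())
    }

def all_equivalent(rows, conjunctions):
    """True if every conjunction selects exactly the same rows as the first."""
    first = matching(rows, conjunctions[0])
    return all(matching(rows, c) == first for c in conjunctions[1:])

# Invented toy rows; the real table has many more attributes and rows.
table = [
    {"gill_spacing": "crowded", "gill_size": "broad",  "bruises": "no",  "habitat": "grasses"},
    {"gill_spacing": "crowded", "gill_size": "broad",  "bruises": "no",  "habitat": "grasses"},
    {"gill_spacing": "close",   "gill_size": "broad",  "bruises": "yes", "habitat": "woods"},
    {"gill_spacing": "close",   "gill_size": "narrow", "bruises": "no",  "habitat": "woods"},
]

a = {"gill_spacing": "crowded", "gill_size": "broad"}
b = {"bruises": "no", "gill_size": "broad"}
c = {"bruises": "no"}

same_ab = all_equivalent(table, [a, b])
same_abc = all_equivalent(table, [a, b, c])
```

Comparing everything against the first conjunction is enough because of transitivity; in the toy data `a` and `b` select the same two rows, while `c` also matches a woods row and so is not equivalent.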
Of particular interest is the occurrence of edible as a factor. We would expect the 4 conjunctions which do not contain edible to turn up in the reduction for the Class attribute from example 1. Using the AutoFilter tool from OO Calc these reductions can easily be found, if they exist.
Examining that spreadsheet showed the first 3 in lines 2064, 2857 and 2962, respectively, but not the 4th.
The reason is that there exists a shorter combination, namely Bruises? = no, Odor = none, Stalk color below Ring = white, which does show up in the reductions for Class. Testing with Sqliteman reveals exactly 1200 rows, all edible, but not all with a grasses habitat (some were urban or woods). Only the additional condition that the gill spacing is crowded returns the 1056 rows with a grasses habitat.
Emping has discovered that all the shortest rules for a grasses habitat can be grouped into 355 equivalence classes, and that each of these implies just one top equivalence class, the one in the above table. Testing this with node 221 (taken at random) results in the following.
In the second equivalent representation the no bruises factor is replaced by a broad gill size (the other three factors are the same). The result was the same. Node 1 consists of just 1 rule, and is a singleton too.
Now there are 384 rows, but the displayed factors are as Emping discovered. Node 351 consists of 4 rules, with 2 common factors and a different remaining one. Each case showed the same results, with 72 rows, as did the accumulated query shown next.
In all three tests the attribute Class was included in the display, to show that, because edible is a factor in the most general equivalence for grasses, each rule which implies grasses also implies edible. However, the rule is not that all mushrooms that grow in a grasses habitat are edible, and surely not that all edible mushrooms have a grasses habitat.
What has been discovered, indirectly, is that all 1056 mushrooms which have a habitat of grasses only are edible. Those among the 1092 row descriptions which also grow elsewhere could be poisonous.