Example 2: Determining the habitat of mushrooms from a table

The Class attribute has just 2 values, poisonous or edible, but in Emping the user can select which attribute is to be predicted. Selecting Habitat, which has 7 values, immediately reveals a difference from Class.

Emping can test for ambiguous rules: rows in the table that are identical except for the consequent value. There were no such rows for Class, but for Habitat there are 4360. The OO Calc screenshot below shows some of them. It looks like the clusters are equal, but one or more differences show up further to the left. Within each group (the groups are separated by a blank line), however, the only difference is the habitat.

mushroom ambiguous for habitat

After the blank lines had been removed, the file was imported into the SQLite database as a separate table. From this table the number of ambiguities, as shown below, was determined through SQL queries. For each habitat, the number of unambiguous rules is the number of rules in the source table minus the number of ambiguous rules.

Habitat   Original  Ambiguous  Unambiguous
urban          368        272           96
grasses       2148       1092         1056
woods         3148       1016         2132
paths         1144       1112           32
waste          192          0          192
leaves         832        576          256
meadows        292        292            0
Total         8124       4360         3764
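
The counts in the table above could be obtained with a query along the following lines. This is a minimal sketch, not the queries actually used: the table name mushroom, the column names and the four rows are invented for illustration, and only two antecedent columns stand in for the full attribute list. A row counts as ambiguous when its antecedent values occur with more than one habitat.

```python
import sqlite3

# Toy stand-in for the mushroom table. Column names and rows are invented
# for illustration and are not taken from the real data set.
ROWS = [
    ("convex", "white", "grasses"),
    ("convex", "white", "woods"),    # same antecedents, other habitat: ambiguous
    ("flat",   "brown", "grasses"),  # unique antecedents: unambiguous
    ("sunken", "white", "urban"),
]

def ambiguity_counts(rows):
    """Return {habitat: (original, ambiguous, unambiguous)} for the toy table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE mushroom (cap_shape TEXT, stalk_color TEXT, habitat TEXT)"
    )
    con.executemany("INSERT INTO mushroom VALUES (?, ?, ?)", rows)
    # The subquery counts the distinct habitats per antecedent combination;
    # a row is ambiguous when its combination occurs with more than one.
    query = """
        SELECT m.habitat,
               COUNT(*) AS original,
               SUM(CASE WHEN g.n_habitats > 1 THEN 1 ELSE 0 END) AS ambiguous
        FROM mushroom AS m
        JOIN (SELECT cap_shape, stalk_color,
                     COUNT(DISTINCT habitat) AS n_habitats
              FROM mushroom
              GROUP BY cap_shape, stalk_color) AS g
          ON m.cap_shape = g.cap_shape AND m.stalk_color = g.stalk_color
        GROUP BY m.habitat
    """
    result = {h: (orig, amb, orig - amb) for h, orig, amb in con.execute(query)}
    con.close()
    return result

counts = ambiguity_counts(ROWS)
for habitat, (orig, amb, unamb) in sorted(counts.items()):
    print(habitat, orig, amb, unamb)
```

For the real table the GROUP BY and the join condition would have to list every attribute except Habitat.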

In Emping ambiguous rules cancel each other. If some ravens are black and some (even if only one) ravens are white, then the fact that something is a raven cannot be used to predict its color.

Emping finds those conjunctions of factors which uniquely determine the consequent value. Because there are none for meadows, as follows directly from checking the ambiguities, there are no reductions either. For waste all rules are unambiguous. For the others, the reductions still determine the habitat, but the ambiguities limit the number of possibilities. They do have an effect, and cannot be omitted from the source table.
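
The property that a reduction must satisfy can be checked directly. The sketch below is not Emping's algorithm, only a brute-force test of the definition: a conjunction determines the consequent when every row it matches carries the same consequent value. The rows, attribute names and the function determined_value are hypothetical.

```python
# Invented toy rows; attribute names and values are illustrative only.
ROWS = [
    {"cap_shape": "sunken", "odor": "none", "habitat": "urban"},
    {"cap_shape": "convex", "odor": "none", "habitat": "grasses"},
    {"cap_shape": "convex", "odor": "foul", "habitat": "woods"},
    {"cap_shape": "convex", "odor": "none", "habitat": "grasses"},
]

def determined_value(rows, conjunction, consequent="habitat"):
    """Return the single consequent value implied by `conjunction`.

    Returns None when no row matches or when the matching rows disagree,
    i.e. when the conjunction does not uniquely determine the consequent.
    """
    values = {
        row[consequent]
        for row in rows
        if all(row[attr] == val for attr, val in conjunction.items())
    }
    return values.pop() if len(values) == 1 else None

print(determined_value(ROWS, {"cap_shape": "sunken"}))                  # one habitat
print(determined_value(ROWS, {"cap_shape": "convex"}))                  # mixed habitats
print(determined_value(ROWS, {"cap_shape": "convex", "odor": "none"}))  # one habitat
```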

The reduction with Habitat as the consequent attribute took 48 minutes, about 7 minutes per value. Each reduction compares the positive values with the rows of negations, so more values not only require more reductions, but also more comparisons for each reduction. There are 6055 reductions for Habitat. The screenshot shows the start, from the left, with all consequent values urban.

mushroom reduction for habitat

Because Emping orders all reductions by length, it can be seen immediately that a sunken cap shape implies an urban habitat. As in the first example, it is easy to test the singletons. The screenshot shows the results for woods. Each commented number to the right is the total number of rows for that partial query. The fact that some numbers do not change indicates that the new attribute value is already included in the rows denoted by the prior attribute values.

mushroom sql query for woods
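
The row totals of such partial queries can be reproduced by adding one condition at a time and counting after each step. This is a sketch under assumptions: the table name, the column names and the four rows are invented, and partial_counts is a hypothetical helper, not part of Emping or Sqliteman.

```python
import sqlite3

# Invented rows and hypothetical column names, for illustration only.
ROWS = [
    ("none", "pendant", "woods"),
    ("none", "pendant", "woods"),
    ("none", "pendant", "grasses"),
    ("foul", "evanescent", "woods"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mushroom (odor TEXT, ring_type TEXT, habitat TEXT)")
con.executemany("INSERT INTO mushroom VALUES (?, ?, ?)", ROWS)

def partial_counts(con, factors):
    """Count the matching rows after adding each factor in turn."""
    counts = []
    for k in range(1, len(factors) + 1):
        prefix = factors[:k]
        # Column names come from our own list; values are bound as parameters.
        where = " AND ".join(f"{col} = ?" for col, _ in prefix)
        values = [val for _, val in prefix]
        (n,) = con.execute(
            f"SELECT COUNT(*) FROM mushroom WHERE {where}", values
        ).fetchone()
        counts.append(n)
    return counts

steps = partial_counts(
    con, [("odor", "none"), ("ring_type", "pendant"), ("habitat", "woods")]
)
print(steps)
```

In this toy table the count does not change when the second condition is added, because every row with no odor already has a pendant ring.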

Next the abductions were performed for all attribute values in the reduction table. This did not include meadows, as indicated by the following screenshot.

mushroom sql query for woods

Note: due to a bug in Emping-0.6, the program unfortunately hangs at this point. To avoid this, first check the reduction spreadsheet for any missing values. The results for the other habitat values were:

Habitat   Reductions  Equivalences  Least General  Most General
urban            775            89             32             1
grasses         1021           355            155             1
woods           2118           653            232            20
paths            666            46             14             1
waste            486            81             21             1
leaves           989           149             47             8
Totals          6055          1373            501            32

The interesting ones are those with only 1 most general group of rules. The graph for grasses, for example, as shown with dotty, looks like this.

graph for abduction grasses

The 155 least general rules or equivalence classes of rules, which all imply a habitat of grasses, all factor through just 1 most general equivalence class. The zoomed-out graph shows that this most general class is node 32, and that it contains 6 equivalent rules.

single msg in graph for abduction grasses

The OO Calc spreadsheet shows

msg spreadsheet in abduction grasses

The six rows are

Gill Spacing = crowded, Gill Size = broad
Bruises? = no, Gill Size = broad, Stalk Color above Ring = white
Bruises? = no, Gill Size = broad, Stalk color below Ring = white
Bruises? = no, Odor = none, Gill Spacing = crowded, Stalk color below Ring = white
Class = edible, Odor = none, Gill Spacing = crowded, Stalk color below Ring = white
Class = edible, Bruises? = no, Gill Spacing = crowded, Stalk color below Ring = white

The first thing is to test whether these conjunctions are indeed equivalent. The first query is shown here:

sqliteman msg grasses test 1

Testing the reverse showed that no bruises, a broad gill size and a white stalk color above ring indeed imply a crowded gill spacing (and a habitat of grasses). The number of rows was 1056, the same as for the first query.

Because equivalence is transitive, not every pair needs to be tested: it is enough to test each case against the next one, and finally case 1 against case 6. Here is another screenshot.

sqliteman msg grasses test 2

Again, as in all 6 tests, the number of rows was 1056. This is exactly the number of unambiguous rules for grasses we calculated before.
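
The equivalence tests above compare row counts. One way to sketch the same check in code: two conjunctions are equivalent on a table when each selects exactly as many rows as both together. The table, the column names, the rows and the helpers count_where and equivalent are all assumptions made for illustration.

```python
import sqlite3

# Invented rows and hypothetical column names, for illustration only.
ROWS = [
    ("crowded", "broad",  "no",  "white", "grasses"),
    ("crowded", "broad",  "no",  "white", "grasses"),
    ("close",   "broad",  "yes", "white", "woods"),
    ("close",   "narrow", "no",  "white", "woods"),
    ("crowded", "narrow", "no",  "brown", "paths"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE mushroom (gill_spacing TEXT, gill_size TEXT, "
    "bruises TEXT, stalk_color TEXT, habitat TEXT)"
)
con.executemany("INSERT INTO mushroom VALUES (?, ?, ?, ?, ?)", ROWS)

def count_where(con, conditions):
    """Count the rows matching a list of (column, value) conditions."""
    where = " AND ".join(f"{col} = ?" for col, _ in conditions)
    values = [val for _, val in conditions]
    (n,) = con.execute(
        f"SELECT COUNT(*) FROM mushroom WHERE {where}", values
    ).fetchone()
    return n

def equivalent(con, a, b):
    """True when a and b select exactly the same rows.

    If a, b, and (a AND b) all match the same number of rows, every row
    matching one conjunction also matches the other.
    """
    return count_where(con, a) == count_where(con, a + b) == count_where(con, b)

A = [("gill_spacing", "crowded"), ("gill_size", "broad")]
B = [("bruises", "no"), ("gill_size", "broad"), ("stalk_color", "white")]
C = [("bruises", "no"), ("stalk_color", "white")]

print(equivalent(con, A, B), equivalent(con, A, C))
```

Comparing the two counts alone would not be enough, since two different sets of rows can have the same size; the combined query rules that out.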

Of particular interest is the occurrence of edible as a factor. We would expect the 4 conjunctions which do not contain edible to turn up in the reduction for the Class attribute from example 1. Using the AutoFilter tool from OO Calc these reductions can easily be found, if they exist.

Examining that spreadsheet showed the first 3 in lines 2064, 2857 and 2962, respectively, but not the 4th.

The reason is that there exists a shorter combination, namely Bruises? = no, Odor = none, Stalk color below Ring = white, which does show up in the reductions for Class. Testing with Sqliteman reveals exactly 1200 rows, all edible, but not all with a grasses habitat (some were urban or woods). Only the additional condition that the gill spacing is crowded returns the 1056 rows with a grasses habitat.
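
A grouped query makes this kind of difference visible at a glance: it lists which Class and Habitat values occur among the rows selected by a conjunction. The sketch below uses an invented five-row table with hypothetical column names and fewer factors than the real rule, so the counts are not those of the mushroom data.

```python
import sqlite3

# Invented rows and hypothetical column names, for illustration only.
ROWS = [
    ("edible",    "no", "none", "crowded", "grasses"),
    ("edible",    "no", "none", "crowded", "grasses"),
    ("edible",    "no", "none", "close",   "urban"),
    ("edible",    "no", "none", "close",   "woods"),
    ("poisonous", "no", "foul", "close",   "woods"),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE mushroom (mushroom_class TEXT, bruises TEXT, odor TEXT, "
    "gill_spacing TEXT, habitat TEXT)"
)
con.executemany("INSERT INTO mushroom VALUES (?, ?, ?, ?, ?)", ROWS)

def breakdown(con, conditions):
    """Return (class, habitat, row count) triples for the matching rows."""
    where = " AND ".join(f"{col} = ?" for col, _ in conditions)
    values = [val for _, val in conditions]
    query = (
        "SELECT mushroom_class, habitat, COUNT(*) FROM mushroom "
        f"WHERE {where} GROUP BY mushroom_class, habitat "
        "ORDER BY mushroom_class, habitat"
    )
    return con.execute(query, values).fetchall()

short_rule = [("bruises", "no"), ("odor", "none")]
long_rule = short_rule + [("gill_spacing", "crowded")]

print(breakdown(con, short_rule))  # one class, several habitats
print(breakdown(con, long_rule))   # one class, one habitat
```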

Emping has discovered that all the shortest rules for a grasses habitat can be grouped into 355 equivalence classes, and that each of these implies just one top equivalence class, the one in the above table. Testing this with node 221 (taken at random) results in the following.

sqliteman dependency test grasses node 221

In the second equivalent representation the no bruises factor is replaced by a broad gill size (the other three factors are the same). The result was the same. Node 1 consists of just 1 rule, and is a singleton too.

sqliteman dependency test grasses node 1

Now there are 384 rows, but the displayed factors are as Emping discovered. Node 351 consists of 4 rules, with 2 common factors and a different remaining one. Each case showed the same results, with 72 rows, as did the accumulated query shown next.

sqliteman dependency test grasses node 351

In all three tests the attribute Class was included in the display, to show that, because edible is a factor in the most general equivalence for grasses, each rule which implies grasses also implies edible. However, the rule is not that all mushrooms that grow in a grasses habitat are edible, and surely not that all edible mushrooms have a grasses habitat.

What has been discovered, indirectly, is that all 1056 mushrooms which have a habitat of grasses only are edible. Those among the 1092 row descriptions which also grow elsewhere could be poisonous.