This file concerns practical issues in representing instances and terms in Tabata.

As told in the Read-Me file of Tabata, instances and terms are represented as lists of atoms.

But an instance i is often first described as  the values of a fixed set of  typed attributes:
d(i) = A1(i), A2(i), ... , An(i). In this case we first define a set of atoms corresponding to each attribute, and then join all these lists to obain the set of atoms allowing to describe the instances and terms.

Let us consider an attribute A and let D  be the corresponding set of values. Then, for any instance i, A(i) belongs to D. An atom is then defined as a constraint "A(i) belongs to s" where s is a particular subset of D.  When we associate a set of atoms to an attribute A, we define the space of all constraints that can be expressed on terms and instances concerning the value of the attribute A. Definition of atoms depends on the type of the attribute and expresses a particular bias about  constraints that will be allowed in terms to learn. We discuss hereafter some usual types and bias:

Nominal : D is a set of non structured values.
1)A first frequent bias is then that any subset of D is allowed as a constraint in terms.
In this case we relate to each value x of D, the constraint "A(i) belongs to D-{x}".
Let us take as an exemple the attribute A representing the first nucleotide of a sequence i.
D={a, t, g, c}.
Atoms: a1="A(i) belongs to D-{a}"
               ="A(i) belongs to {t, g, c}"
           a2="A(i) belongs to {a, g, c}"
           a3="A(i) belongs to {a, t, c}"
           a4="A(i) belongs to {a, t, g}"

if for instance A(i) =a, then i will be described as (a2 & a3 & a4) that corresponds to:
       "A(i) belongs to the intersection of  {a, g, c}, {a, t, c}, and {a, t, g}"
i.e. "A(i) belongs to {a}"

Any subset of {a, t, g, c} can be expressed as a conjunction of such atoms.

2) A second usual bias is that only single values are allowed as constraints in terms.
In this case we simply relate to each value x of D, the constraint "A(i) belongs to {x}".
In the above example:
D={a, t, g, c}.
Atoms: a1="A(i) belongs to {a}
           a2="A(i) belongs to {t}"
           a3="A(i) belongs to {g}"
           a4="A(i) belongs to {c}"

if for instance A(i) =a, then i will be simply described as (a1)

Only singletons of {a,t,g,c} can be expressed as a conjunction of such atoms.(more precisely only one such atom can appear in a given term). If no such atom appears within a term, this corresponds to the null constraint regarding the value of A.

Boolean attribute
This is a particular nominal attribute with values True and False..
1) In boolean function learning the bias 1) and 2)  above are equivalent and allow both positive and negative litterals within terms:
 D={True, False}
b      =  "A(i) = True"
not-b = "A(i) = False"

2) But when negative litterals should not be allowed in terms, then we consider only one atom:
D={True, False}
b      =  "A(i) = True"
So when A(i) is False no atom corresponding to A appears in the description of the instance i.

Consider for example two instances defined using two boolean attributes b1 and b2:
i1:  (b1=True & b2 = False)
i2: (b1=False & b2 = False)

Then the most specific generalisation of these two instances is the intersection of their representations, namely here:
case 1):  (b1 not-b2) Inter (not-b2) = not-b2
case 2):  (b1) Inter () = ()
In this last case, as negative litterals are not allowed the most specific generalisation is
() the TOP of the lattice, i.e. the "Anything" term that covers any possible instance.

Hierarchical :

When considering hierarchichal attributes D is structured as a hierarchy. More precisely the values are the leaves of the hierarchy. Internal nodes represent particular subsets of leaves that should be allowed to appear in terms. Let us consider the attribute Shape associated to the following hierarchy:

                      /                     \
             quadrilateral            triangle
            /                   \
       square        lozenge

The usual bias consists in generalizing terms by climbing-up the hierarchy.
In this case the atoms simply are the  nodes of the hierarchy.
Let us then consider two  instances i1 and i2 such that shape(i1)=square, and shape(i2)=lozenge. First as an instance is represented as the list of all the atoms whose value is True, i1 and i2 are represented as follows:
i1=   {Polygon Quadrilateral  Square}
i2 =  {Polygon Quadrilateral  Lozenge}

Then most specific generalization of i1 and i2 will be:
{Polygone Quadrilateral} that corresponds to "Shape = quadrilateral".

As a consequence the general rule is that when describing an instance we should add to its description all the nodes on the path starting from the root of the hierarchy and ending to the leaf corresponding to its value.

Note that if we define as a distance between terms (as used in the k-nearest neighbor decision procedure) the minimum number of atoms to substract to one term in order to make it cover the other one, we obtain  the  minimum upside steps to the common parent node. This property is true not only when considering instances but as far as we consider "closed" terms, i.e. most specific generalisations of instances.

Ordered : D is a set of totally ordered values.
A frequent bias is that only intervals of D are allowed as constraints within terms.

Let us take as an exemple the attribute A representing the age of the child i.
D={0, 1, 2, 3, ... , 10}.
We first propose here a "naive" coding: All intervals of D are allowed, i.e.
all {j, j+1, ..., k} with j less or equal to k. As a consequence we define as atoms all the constraints:
   <=j  representing "A(i) >=  j"
together with 
   >k   representing   "A(i) < k ".

the instances i1 and i2 such that A(i1) =1 and A(i2)=3 are then represented as:
i1 : {>0     <=1    <=2     <=3    <=4    ...        <=10}
i2 : {>0       >1      >2     <=3    <=4    ...        <=10}

Then most specific generalization of i1 and i2 will be:
       {>0                        <=3    <=4    ...        <=10} that corresponds to the interval

Remark that the distance defined above represents here the length of the interval:
d(i1, i2) = 2 (we substract  <=1  and   <=2 to i1 in order to cover i2).

This coding is "naive" since an ordered attribute with D ={0, ..., 1000000} (e.g. the number of persons living in a town) would result in 20000000 atoms. A clever coding consists in considering  only intervals whose bounds are values of instances (this does not change the reduced space of most specific generalisations). As we will see in the next case (numeric attributes) it may be necessary to further reduce the set of intervals  allowed as constraints within terms. 

Numeric attribute
D is an infinite or continuous set of numbers. Here again only intervals of D are allowed as constraints within terms. We consider then the intervals ]xr..xs] where xr and xs are thresholds belonging to a user-defined subset {x1, ..., xk} of D. As in the previous case this lattice of intervals is obtained by considering the atoms {<=x1, >x1, <=x2, >x2, ..., <=xk, >xk} which respectively express the constraints "A(i) <=x1", "A(i) >x1", ..., "A(i) <=xk", "A(i) >xk".

To define the thresholds (i.e. to discretize the numerical attribute) a convenient (and usual) method consists in considering the set of all known instances (without considering wether they are positive or negative instances of the target concept). Then the thresholds are computed such that in each resulting disjoint interval of the domain D of the numeric attribute, the number of instances is roughly constant (put otherwise, the distribution is flat). This is the method we used to discretize the waveform problem used as a Trial in the distribution tar file. The method has the advantage to avoid overfitting thresholds to particular distributions of positive and negative instances. Furthermore  such a coding generates many empirical implications, i.e. redundancies inside instances representations. However working in the reduced space results in avoiding many computations in cases of redundancies. More precisely the CPU time in Tabata is roughly linear with respect to the number of thresholds (i.e. the number of atoms). Our experience is that typically, representing  a numeric attribute by defining 10 thresholds in most cases is enough (no gain is to be expected with a more accurate description).

An example

Each instance represents an individual whose age is below 23, working in a city belonging to {Paris, New-York, Rome}, and whose favorite shape belongs to (square, lozenge, triangle). The target concept is here "will buy a motocycle this year".
Positive instances are
John (21,  New-York, square)
Jean (18,  Paris, lozenge)
Joanna(15,  Rome, triangle)

the negative instance is
Paul ( 16,Paris, triangle )

Atoms are
{Age>15, Age<=15, Age>16, Age<=16, Age>18, Age<=18, Age>21, Age<=21,
New-York, Paris , Rome,
Polygon, Quadrilateral,  Lozenge, Square, Triangle}

John is described as:
{Age>15, , Age>16, , Age>18, , , Age<=21, New-York, , ,Polygon, Quadrilateral,  , Square,}
and represented in the file "" as the line:
1010100110011010 Positive
(the string is a 16-length binary string, each bit corresponds to one atom that is set to 1 if the atom belongs to the description of John)

Jean is described as:
{Age>15, , Age>16, , ,Age<=18, , Age<=21, ,Paris , ,Polygon, Quadrilateral, Lozenge , ,}
and leads to:
1010010101011100 Positive

The most specific generalisation of John and Jean is:


i.e. {Age>15, , Age>16, , , , , Age<=21, , , ,Polygon, Quadrilateral,  , ,}

Remark that this term is logically equivalent to
{ , ,  Age>16, , , , , Age<=21, , , , , Quadrilateral,  , ,} but this is unknown to Tabata.

However  on the four instances problem proposed here, this term covers only Jean and John, and as a consequence a most general term covering exactly the same positive and negative instances is
{ , ,  Age>16, , , , , Age<=21, , , , , ,  , ,}.