As told in the Read-Me file of Tabata, instances and terms are represented as lists of atoms.

But an instance i is often first described as the values of a
fixed set of typed attributes:

d(i) = A1(i), A2(i), ... , An(i). In this case we first define a set
of atoms corresponding to each attribute, and then join all these lists
to obain the set of atoms allowing to describe the instances and terms.

Let us consider an attribute A and let D be the corresponding set of values. Then, for any instance i, A(i) belongs to D. An atom is then defined as a constraint "A(i) belongs to s" where s is a particular subset of D. When we associate a set of atoms to an attribute A, we define the space of all constraints that can be expressed on terms and instances concerning the value of the attribute A. Definition of atoms depends on the type of the attribute and expresses a particular bias about constraints that will be allowed in terms to learn. We discuss hereafter some usual types and bias:

-----------

Nominal : D is a set of non structured values.

-----------

1)A first frequent bias is then that any subset of D is allowed as
a constraint in terms.

In this case we relate to each value x of D, the constraint "A(i) belongs
to D-{x}".

Let us take as an exemple the attribute A representing the first nucleotide
of a sequence i.

D={a, t, g, c}.

Atoms: a1="A(i) belongs to D-{a}"

="A(i) belongs to {t, g, c}"

a2="A(i)
belongs to {a, g, c}"

a3="A(i)
belongs to {a, t, c}"

a4="A(i)
belongs to {a, t, g}"

if for instance A(i) =a, then i will be described as (a2 & a3 &
a4) that corresponds to:

"A(i) belongs to the intersection
of {a, g, c}, {a, t, c}, and {a, t, g}"

i.e. "A(i) belongs to {a}"

Any subset of {a, t, g, c} can be expressed as a conjunction of such atoms.

2) A second usual bias is that only single values are allowed as constraints
in terms.

In this case we simply relate to each value x of D, the constraint
"A(i) belongs to {x}".

In the above example:

D={a, t, g, c}.

Atoms: a1="A(i) belongs to {a}

a2="A(i)
belongs to {t}"

a3="A(i)
belongs to {g}"

a4="A(i)
belongs to {c}"

if for instance A(i) =a, then i will be simply described as (a1)

Only singletons of {a,t,g,c} can be expressed as a conjunction of such atoms.(more precisely only one such atom can appear in a given term). If no such atom appears within a term, this corresponds to the null constraint regarding the value of A.

-------------------

Boolean attribute

-------------------

This is a particular nominal attribute with values True and False..

1) In boolean function learning the bias 1) and 2) above are
equivalent and allow both positive and negative litterals within terms:

D={True, False}

b = "A(i) = True"

not-b = "A(i) = False"

2) But when negative litterals should not be allowed in terms, then
we consider only one atom:

D={True, False}

b = "A(i) = True"

So when A(i) is False no atom corresponding to A appears in the description
of the instance i.

Consider for example two instances defined using two boolean attributes
b1 and b2:

i1: (b1=True & b2 = False)

i2: (b1=False & b2 = False)

Then the most specific generalisation of these two instances is the
intersection of their representations, namely here:

case 1): (b1 not-b2) Inter (not-b2) = not-b2

case 2): (b1) Inter () = ()

In this last case, as negative litterals are not allowed the most specific
generalisation is

() the TOP of the lattice, i.e. the "Anything" term that covers any
possible instance.

-------------

Hierarchical :

-------------

When considering hierarchichal attributes D is structured as a hierarchy. More precisely the values are the leaves of the hierarchy. Internal nodes represent particular subsets of leaves that should be allowed to appear in terms. Let us consider the attribute Shape associated to the following hierarchy:

polygon

/
\

quadrilateral
triangle

/
\

square
lozenge

The usual bias consists in generalizing terms by climbing-up the hierarchy.

In this case the atoms simply are the nodes of the hierarchy.

Let us then consider two instances i1 and i2 such that shape(i1)=square,
and shape(i2)=lozenge. First as an instance is represented as the list
of all the atoms whose value is True, i1 and i2 are represented as follows:

i1= {Polygon Quadrilateral Square}

i2 = {Polygon Quadrilateral Lozenge}

Then most specific generalization of i1 and i2 will be:

{Polygone Quadrilateral} that corresponds to "Shape = quadrilateral".

As a consequence the general rule is that when describing an instance we should add to its description all the nodes on the path starting from the root of the hierarchy and ending to the leaf corresponding to its value.

Note that if we define as a distance between terms (as used in the k-nearest
neighbor decision procedure) the minimum number of atoms to substract to
one term in order to make it cover the other one, we obtain the
minimum upside steps to the common parent node. This property is true not
only when considering instances but as far as we consider "closed" terms,
i.e. most specific generalisations of instances.

---------

Ordered : D is a set of totally ordered values.

---------

A frequent bias is that only intervals of D are allowed as constraints
within terms.

Let us take as an exemple the attribute A representing the age of the
child i.

D={0, 1, 2, 3, ... , 10}.

We first propose here a "naive" coding: All intervals of D are allowed,
i.e.

all {j, j+1, ..., k} with j less or equal to k. As a consequence we
define as atoms all the constraints:

<=j representing "A(i) >= j"

together with

>k representing "A(i) < k ".

the instances i1 and i2 such that A(i1) =1 and A(i2)=3 are then represented
as:

i1 : {>0 <=1 <=2
<=3 <=4 ...
<=10}

i2 : {>0 >1
>2 <=3 <=4
... <=10}

Then most specific generalization of i1 and i2 will be:

{>0
<=3 <=4 ...
<=10} that corresponds to the interval

{1,2,3}.

Remark that the distance defined above represents here the length of
the interval:

d(i1, i2) = 2 (we substract <=1 and <=2
to i1 in order to cover i2).

This coding is "naive" since an ordered attribute with D ={0, ..., 1000000} (e.g. the number of persons living in a town) would result in 20000000 atoms. A clever coding consists in considering only intervals whose bounds are values of instances (this does not change the reduced space of most specific generalisations). As we will see in the next case (numeric attributes) it may be necessary to further reduce the set of intervals allowed as constraints within terms.

---------------------

Numeric attribute

---------------------

D is an infinite or continuous set of numbers. Here again only intervals
of D are allowed as constraints within terms. We consider then the intervals
]xr..xs] where xr and xs are thresholds belonging to a user-defined subset
{x1, ..., xk} of D. As in the previous case this lattice of intervals is
obtained by considering the atoms {<=x1, >x1, <=x2, >x2, ..., <=xk,
>xk} which respectively express the constraints "A(i) <=x1", "A(i) >x1",
..., "A(i) <=xk", "A(i) >xk".

To define the thresholds (i.e. to discretize the numerical attribute)
a convenient (and usual) method consists in considering the set of all
known instances (without considering wether they are positive or negative
instances of the target concept). Then the thresholds are computed such
that in each resulting disjoint interval of the domain D of the numeric
attribute, the number of instances is roughly constant (put otherwise,
the distribution is flat). This is the method we used to discretize the
waveform problem used as a Trial in the distribution tar file. The method
has the advantage to avoid overfitting thresholds to particular distributions
of positive and negative instances. Furthermore such a coding generates
many empirical implications, i.e. redundancies inside instances representations.
However working in the reduced space results in avoiding many computations
in cases of redundancies. More precisely the CPU time in Tabata is roughly
linear with respect to the number of thresholds (i.e. the number of atoms).
Our experience is that typically, representing a numeric attribute
by defining 10 thresholds in most cases is enough (no gain is to be expected
with a more accurate description).

--------------

An example

--------------

Each instance represents an individual whose age is below 23, working
in a city belonging to {Paris, New-York, Rome}, and whose favorite shape
belongs to (square, lozenge, triangle). The target concept is here "will
buy a motocycle this year".

Positive instances are

John (21, New-York, square)

Jean (18, Paris, lozenge)

Joanna(15, Rome, triangle)

the negative instance is

Paul ( 16,Paris, triangle )

Atoms are

{Age>15, Age<=15, Age>16, Age<=16, Age>18, Age<=18, Age>21,
Age<=21,

New-York, Paris , Rome,

Polygon, Quadrilateral, Lozenge, Square, Triangle}

John is described as:

{Age>15, , Age>16, , Age>18, , , Age<=21, New-York, , ,Polygon,
Quadrilateral, , Square,}

and represented in the file "Tabata_buy_a_moto.app" as the line:

1010100110011010 Positive

(the string is a 16-length binary string, each bit corresponds to one
atom that is set to 1 if the atom belongs to the description of John)

Jean is described as:

{Age>15, , Age>16, , ,Age<=18, , Age<=21, ,Paris , ,Polygon,
Quadrilateral, Lozenge , ,}

and leads to:

1010010101011100 Positive

The most specific generalisation of John and Jean is:

1010100110011010

Inter

1010010101011100

=

1010000100011000

i.e. {Age>15, , Age>16, , , , , Age<=21, , , ,Polygon, Quadrilateral,
, ,}

Remark that this term is logically equivalent to

{ , , Age>16, , , , , Age<=21, , , , , Quadrilateral,
, ,} but this is unknown to Tabata.

However on the four instances problem proposed here, this term
covers only Jean and John, and as a consequence a most general term covering
exactly the same positive and negative instances is

{ , , Age>16, , , , , Age<=21, , , , , , , ,}.