PALINGOL, A BRIEF USER'S GUIDE

FOREWORD

We present here a summary of the complete syntax of Palingol, i.e. the words of the language and their usage. Users familiar with BNF may also wish to consult the Palingol grammar (and/or download the lex/yacc sources).

If you want to be notified of new releases of Palingol, or have questions, comments, or bugs to report, please send an e-mail to: public@abi.snv.jussieu.fr (this is an alias to the author's address).


OVERVIEW

Palingol is a descriptive programming language specialized in the description of nucleic acid secondary structures.

A complete structure description is made of:

Some of these constraints act on an individual helix, others on the relations between two or more helices.

Notes:

  1. Some terms can be abbreviated. In this case, the optional part of the word is parenthesized: "len(gth)" means that one can use either "len" or "length".
  2. In a Palingol program, all lines beginning with # are considered comments, and are thus not interpreted.

SECTIONS

A Palingol program is divided into sections. One or more "helix" sections describe the individual helices; constraints on assembling these helices to form the complete structure are given in the "span" and "cross" sections. The other sections are optional. All the sections are presented below, in the order in which they appear in a program.

start [optional]

%start {
 ...
}

Instructions in this section are performed sequentially, once, at the beginning of execution. For instance, this section can be used to print a general header, compile a consensus matrix, set a counter variable to its initial value, etc.

before [optional]

%before {
 ...
}

Executed before each sequence is tested. In this section, the user can print the sequence name, keep a record of its length, and so on.

program [mandatory]

%program {
 ...
}

helix (1 or more)

helix {
 ...
}

Each helix of the structure is described in its own "helix" section. The order of these sections is the same as that of their respective helices: from 5' to 3', by (primary criterion) increasing position of head start, and (secondary criterion) decreasing position of tail start.

span

span {
 ...
}

The "span" section specifies how far from each helix (say number i) the next one (number i+1) will be searched for. The reference point in helix i+1 is always the 1st nucleotide of its head; the reference point for helix i can be any of its six characteristic positions: start/end of head/loop/tail.

Syntax:

%s(tart)_h(ead) max_dist
%s(tart)_l(oop) max_dist
%s(tart)_t(ail) max_dist
%e(nd)_h(ead) max_dist
%e(nd)_l(oop) max_dist
%e(nd)_t(ail) max_dist

In all these instructions, max_dist represents the maximum number of nucleotides allowed between the two reference points.

Notes:

cross

cross {
 ...
}

The "cross" section contains global constraints. Constraints expressed in this section usually concern at least two helices. Printing instructions are usually written here.

after [optional]

%after {
 ...
}

Instructions of the "after" section are executed after each sequence has been examined.

end [optional]

%end {
 ...
}

This section is executed at the very end, i.e. after all sequences have been tested. A general counter can be evaluated and/or printed here, for example. It is a good idea to use the "end" section to print a footer, as this makes it possible to distinguish a normal ending from a run that did not finish.


HELIX DESCRIPTORS

Each individual "helix" of a structure is considered as a hairpin, made of a double strand and a loop. Note that in a complete structure, some nucleotides belonging to the "loop" of one helix may be involved in the double-stranded part of another helix. The double-stranded part of a "helix" is made of a "head" and a "tail", encountered in this order from 5' to 3'.

physical_entity

seqname, fullseq, head, loop, tail

seqname is the name of the sequence examined. fullseq is the nucleotide sequence itself. The other physical entities (head, loop and tail) are those of one individual helix.

Note:

In a "helix" section, head, loop and tail always refer to the current helix. In the "cross" section, they have to be followed by "#" and the helix number: head #1

numeric

start, end, len(gth)

Respectively return the start position, end position and length of a physical entity.

Syntax:

( start physical_entity )
( end physical_entity )
( len physical_entity )

Examples:

Note:

If a physical_entity is used in a numeric expression, it is interpreted as its length. Obviously, the only numeric value of the physical entities "seqname" and "fullseq" is their "length".

alphanumeric

seq

Returns the sequence of a physical entity.

Syntax:

( seq physical_entity )

Examples:

( seq loop ) returns the sequence of the current loop.

( seq ( tail #3 ) ) returns the sequence of the tail of the third helix.

( seq seqname ) returns the name of the sequence presently considered.

boolean

direct(strand)

Returns "true" if the strand currently examined is direct, "false" if it is the reverse-complement.

Syntax:

( direct )

VARIABLES

variable names and types

A variable name can be any alphanumeric string beginning with "$" and containing no space character. $var1 is thus a valid variable name.

The type of a variable is automatically set when the variable is used for the first time. Therefore, variables do not need to be declared.

Note:

All numeric variables and constants are integers.

operations

set, exist(s), get

"set" assigns a value to a variable. "exist" checks whether a variable is currently available. The "get" function gives the current value of a variable.

Syntax:

( set $variable expression )
( exist $variable )
( get $variable )

Examples:

( set $var1 4 ) stores the value 4 into the variable $var1, which is therefore set to the numeric type.

( set $vars "GAATTC" ) stores the string "GAATTC" into $vars, a string type variable.

( get $var1 ) returns the value of $var1.

Note:

The "get" function is in fact implicit when a variable name is used alone. For instance, to store the value of $var1 into $var2, the syntax:

( set $var2 $var1 )

is equivalent to:

( set $var2 ( get $var1 ) )

OPERATORS

In the Palingol descriptive language, an occurrence of a structure is a part of a sequence which meets a number of requirements. A Palingol program is thus a formal expression of a series of constraints which define the structure. Hence, the basic operators of the language are boolean.

Note:

Some operators do not seem to be boolean, but are seen by the compiler as booleans which always return "true". These are termed "side effect" operators. The "set" and "print" operators are two of them.

comparison (boolean) operators

eq, ne, ge, gt, le, lt, str(ing)c(o)mp(are), str(ing)(in)str(ing)

Respectively (for numbers) equal, not equal, greater or equal, greater, less or equal, less, and (for strings) equal and included.

Syntax:

( eq expr1 expr2 )
( ne expr1 expr2 )
( ge expr1 expr2 )
( gt expr1 expr2 )
( le expr1 expr2 )
( lt expr1 expr2 )
( strcmp string1 string2 )
( strstr string1 string2 )

All these operators read from left to right.

Examples:

( le ( len head ) 9 ) returns "true" if the head of the helix examined is at most 9 nucleotides long.

( strstr "CAGTG" ( seq loop #2 ) ) returns "true" if the loop sequence of the 2nd helix contains "CAGTG".

numeric operators

opp(osite), sq(uare), add, sub(stract), mul(tiply), div(ide), mod(ulo)

The classical arithmetic operators, for integers.

Syntax:

( opp expr )
( sqr expr )
( add expr1 expr2 )
( sub expr1 expr2 )
( mul expr1 expr2 )
( div expr1 expr2 )
( mod expr1 expr2 )

All these operators read from left to right. Priorities are always indicated by parentheses.

Examples:

( sub ( start head #2 ) ( end tail #1 ) ) gives the number of nucleotides separating the start of the head of the 2nd helix from the end of the tail of the first helix.

( set $scor ( add $scor ( div head 2 ) ) ) divides the length of the current helix head by 2, adds the result to the variable $scor, and replaces the old value of $scor by the resulting sum. In other words, this instruction increases $scor by half the length of the head of the current helix.

string operators

str(ing)beg(ins), str(ing)end(s), s(ub)str(ing), complem(ent), inver(se), complinv(erse), str(ing)cat

Respectively return the first and last nucleotide of a string, a substring of a string, the complement, inverse, and complement-inverse of a sequence. "strcat" concatenates two strings and returns the resulting string.

Syntax:

( strbeg str_expr )
( strend str_expr )
( sstr str_expr start_position length )
( complem str_expr )
( inver str_expr )
( complinv str_expr )
( strcat str_expr1 str_expr2 )
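
The three sequence transformations can be illustrated outside Palingol. The following Python sketch is not part of Palingol; the function names simply mirror the operators, and it assumes a plain DNA alphabet (A, C, G, T) with no ambiguous bases:

```python
# Hypothetical Python stand-ins for complem, inver and complinv.
# Assumption: sequences use only the four DNA letters A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complem(seq):
    """Replace every base by its complement, keeping the order."""
    return seq.translate(COMPLEMENT)


def inver(seq):
    """Read the sequence backwards, keeping the bases."""
    return seq[::-1]


def complinv(seq):
    """Complement the sequence, then read it backwards."""
    return complem(seq)[::-1]
```

With these definitions, "CAGTG" gives "GTCAC" (complement), "GTGAC" (inverse) and "CACTG" (complement-inverse).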

base pairing

bpcompile, scorebp

bpcompile stores a base pairing score, and scorebp uses it to compute the base pairing of two sequences.

Syntax:

( bpcompile scoring_string )
( scorebp sequence1 sequence2 )

The scoring_string is a series of base pairs, each followed by its score. scorebp computes the base pairing score of sequence1 with sequence2 using the scores previously compiled.

Example:

( bpcompile "AT 2 GC 3 GT 1" )
( ge ( scorebp head#1 loop#2 ) 4 )

sets the pairing score to 2 for an A-T pair, 3 for a G-C pair and 1 for a G-T pair, then computes the pairing score of the head of the first helix with the loop of the second one (this would result in an alternative structure), and checks that the result is at least 4.

Notes:

If the two sequences are not the same length, then scorebp stops at the end of the shortest.

If the sequences contain IUPAC-encoded bases, then the score considered is the maximum score of the possible pairs.

scorebp always uses the latest compiled set of scoring values. In most cases, the same set of scores is used throughout the program; it is then worthwhile to compile it in the "start" section. If two or more sets are to be used at different points, then the compilation should be performed before each score computation.
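
The scoring rules described in these notes can be sketched in Python. This is only an illustration, not the Palingol implementation. It rests on several assumptions that this guide does not state: a pair listed as "GT" also scores for T facing G, unlisted pairs score 0, and the i-th base of the first sequence is paired with the i-th base of the second one (the real operator may read one strand in the opposite direction).

```python
# Illustrative sketch of bpcompile/scorebp; the pairing orientation,
# the symmetry of pairs and the 0 score for unlisted pairs are assumptions.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def bpcompile(scoring_string):
    """Turn "AT 2 GC 3 GT 1" into a lookup table of pair scores."""
    tokens = scoring_string.split()
    table = {}
    for pair, score in zip(tokens[0::2], tokens[1::2]):
        table[(pair[0], pair[1])] = int(score)
        table[(pair[1], pair[0])] = int(score)  # assumed symmetric
    return table


def scorebp(table, sequence1, sequence2):
    """Sum the pair scores position by position."""
    total = 0
    # zip stops at the end of the shorter sequence
    for x, y in zip(sequence1, sequence2):
        # ambiguous bases: keep the best score among the possible pairs
        total += max(table.get((a, b), 0)
                     for a in IUPAC[x] for b in IUPAC[y])
    return total


table = bpcompile("AT 2 GC 3 GT 1")
print(scorebp(table, "GAT", "CTGAA"))  # G-C, A-T, T-G: 3 + 2 + 1
```

Here only the first three bases of "CTGAA" are used, since the first sequence is shorter.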

pattern matching

patsearch, matchpos, matcherr, matchfrag

patsearch searches for a pattern in a sequence, allowing for errors.

Syntax:

( patsearch sequence pattern position max_mismatch format )

sequence and pattern are both string expressions, respectively containing the (sub)sequence to scan and the searched pattern. This pattern is written using the IUPAC code for ambiguous bases (e.g. R = A or G) or the [] alternation form (e.g. [AG] means A or G).

If errors are to be allowed (see below), bases where no error is allowed must be followed by a # sign.

position and max_mismatch are numerical values respectively indicating where to start the scan, and the maximum number of mismatches allowed. format is a boolean value indicating the fragment output (see below).

patsearch returns "true" if the pattern is found at least once. When executed, patsearch fills the values of 3 internal variables:

matchpos: position of the first occurrence of pattern in sequence;

matcherr: number of errors in this occurrence;

matchfrag: actual sequence of this occurence. The format for matchfrag depends on the boolean value of format (last parameter in the patsearch instruction): if format is "true", then all nucleotides are displayed; if format is "false", then only matching nucleotides are displayed, the others being replaced by - signs.

Example:

( patsearch ( seq loop ) "CAG#TGH" 2 1 false )

searches for the pattern "CAGTGH" (H = C or A or T) in the current loop, starting at the second base, with 1 mismatch allowed anywhere except at the first G residue. If the loop sequence is ATCAGCGTC, then TCAGCG is examined first (not a hit), then CAGCGT, where the first occurrence is found. At this point:

matchpos is set to 2, as the pattern is found at the 2nd examined position;

matcherr is set to 1 (C instead of T);

matchfrag is set to "CAG-GT";

patsearch stops, returning "true".
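
The scan described in this example can be reproduced with a short Python sketch. It is an illustration only, not the Palingol source: the function returns a dictionary instead of filling internal variables, it returns None when nothing is found, and it handles IUPAC letters and the # sign but not the [] alternation form.

```python
# Illustrative re-implementation of the patsearch scan (assumptions above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def parse_pattern(pattern):
    """Return a list of (allowed_bases, strict) pairs; '#' marks the
    preceding base as one where no error is allowed."""
    items = []
    for ch in pattern:
        if ch == "#":
            allowed, _ = items[-1]
            items[-1] = (allowed, True)
        else:
            items.append((IUPAC[ch], False))
    return items


def patsearch(sequence, pattern, position, max_mismatch, fmt):
    items = parse_pattern(pattern)
    first = position - 1  # positions in the guide are 1-based
    for offset in range(first, len(sequence) - len(items) + 1):
        window = sequence[offset:offset + len(items)]
        errors = 0
        rejected = False
        for base, (allowed, strict) in zip(window, items):
            if base not in allowed:
                if strict:
                    rejected = True
                    break
                errors += 1
        if rejected or errors > max_mismatch:
            continue
        if fmt:
            frag = window
        else:
            frag = "".join(b if b in allowed else "-"
                           for b, (allowed, _) in zip(window, items))
        # matchpos counts examined positions, as in the example above
        return {"matchpos": offset - first + 1,
                "matcherr": errors,
                "matchfrag": frag}
    return None


print(patsearch("ATCAGCGTC", "CAG#TGH", 2, 1, False))
```

Run on the loop ATCAGCGTC, the sketch reports matchpos 2, matcherr 1 and matchfrag "CAG-GT", in agreement with the values listed above.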

matrix scoring

scoremat

scoremat returns the matching score of a sequence against a scoring matrix.

Syntax:

( scoremat matrix sequence )

The format for a matrix spanning over n nucleotides is:

"G scG1 ... scGn A scA1 .. scAn T scT1 .. scTn C scC1 .. scCn"

where scXn is the score for a base X at the n-th position.

The score is computed by adding the corresponding scores.

Example:

( ge ( scoremat "G 0 40 99 10 A 0 50 0 0 T 100 5 1 10 C 0 5 0 80" loop ) 250 )

checks that the score of the current loop against the matrix is at least 250. The matrix is more clearly seen as:

G    0  40  99  10
A    0  50   0   0
T  100   5   1  10
C    0   5   0  80

If the loop is AAGG, then it scores 0+50+99+10 = 159 and "ge" returns FALSE;

if the loop is TGGC, then it scores 100+40+99+80 = 319 and "ge" returns TRUE.
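
The computation can be checked with a few lines of Python. This sketch is not the Palingol implementation; it only assumes the matrix string format given above (a base letter followed by its n scores) and a sequence no longer than the matrix.

```python
# Illustrative scoremat: parse the matrix string, then add up the
# score of each base at its position.
def parse_matrix(matrix):
    """Map each base letter to its list of per-position scores."""
    rows = {}
    base = None
    for token in matrix.split():
        if token.isalpha():
            base = token
            rows[base] = []
        else:
            rows[base].append(int(token))
    return rows


def scoremat(matrix, sequence):
    rows = parse_matrix(matrix)
    return sum(rows[base][i] for i, base in enumerate(sequence))


M = "G 0 40 99 10 A 0 50 0 0 T 100 5 1 10 C 0 5 0 80"
print(scoremat(M, "TGGC"))  # 100 + 40 + 99 + 80
```

For the loop TGGC this prints 319, which passes the "ge ... 250" test of the example.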

display

print

prints to the standard output.

Syntax:

( print format expression )

The term "format" designates any alphanumeric string. In this string, the position of the value of "expression" is indicated by a special sign, depending on the type of the expression: "%s" stands for a string, "%d" for a number. The "new line" character is indicated by "\n".

Examples:

Let us assume we have a sequence called "TEST1", where the searched structure has been found. It is made of two successive helices, the 1st one starting at position 100, and the 2nd one ending at position 162. The three instructions:

( print "Sequence: %s - " ( seq seqname ) )
( print "Struct at %d " ( start head #1 ) )
( print "length %d bases.\n" ( sub ( end tail #2 ) ( start head #1 ) ) )

will result in the printing of the line:

Sequence: TEST1 - Struct at 100 length 62 bases.

Note:

Only 1 expression argument can currently be used per print instruction. Using more than one expression will probably cause Palingol to crash :(.


STRUCTURE

boolean operators

true, false, not, or, and, equiv

"true" and "false" are the two boolean constants. The other operators combine boolean expressions to produce a boolean result.

Syntax:

( not bool_expr )
( or bool_expr1 bool_expr2 )
( and bool_expr1 bool_expr2 )
( equiv bool_expr1 bool_expr2 )

Example:

( or ( ge head 6 ) ( le loop 5 ) ) returns "true" if the head of the current helix is at least 6 bases long, or if the current loop is at most 5 bases long.

Notes:

selection

if

performs instructions if a condition is verified.

Syntax:

( if ( ( bool_expr ) ) ( instruction(s) ) )

Example:

( if ( ( ge ( scorebp head #2 tail #2 ) 3 ) )
 ( set $score ( add $score 1 ) )
)

adds 1 to the variable $score if the pairing score of the second helix is >= 3.

loop

while

repeats instructions as long as a condition is verified.

Syntax:

( while ( ( bool_expr ) ) ( instruction(s) ) )

Example:

( patsearch ( seq loop ) "WTG" 1 0 false )
( set $start ( add ( start loop ) matchpos ) )
( set $orf 0 )
( set $stop FALSE )
( while ( ( not $stop ) )
 ( set $orf ( add $orf 3 ) )
 ( set $codon ( sstr ( seq fullseq ) ( add $start $orf ) 3 ) )
 ( set $stop ( or ( strcmp $codon "TAR" ) ( strcmp $codon "TGA" ) ) )
)

The first instruction searches for a start codon ATG or TTG (see patsearch) in the current loop (the program stops here if no start codon is found). Then the variable $start is set to the position of the first start encountered, $orf is set to 0, and $stop to FALSE. Now the "while" loop begins.
The loop will continue as long as $stop remains FALSE. At each turn, $orf is increased by 3, a trinucleotide fragment beginning at this position (relative to $start) is extracted and stored in $codon. If $codon is a stop codon (TAA or TAG or TGA), then $stop is set to TRUE and the loop stops. At the end of this loop, $orf contains the length of the open reading frame starting at the first ATG or TTG codon in the loop.
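
The same stepping logic can be written in Python for comparison. This is a sketch under assumptions, not a translation of Palingol internals: the start position is given as a 0-based index, and a guard (absent from the Palingol example) returns None when the end of the sequence is reached without meeting a stop codon.

```python
# Illustrative ORF scan: step codon by codon from the start codon
# until a stop codon is met.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_length(fullseq, start):
    """start: 0-based index of the first base of the start codon
    (an assumption of this sketch)."""
    orf = 0
    while True:
        orf += 3  # move to the next codon
        codon = fullseq[start + orf:start + orf + 3]
        if len(codon) < 3:
            return None  # ran off the sequence: no stop codon found
        if codon in STOP_CODONS:
            return orf


print(orf_length("ATGAAACCCTAAGG", 0))  # ATG AAA CCC then TAA
```

This prints 9, the number of bases from the start codon up to, but not including, the stop codon.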