		The CRM114 Quick Reference Card

     Copyright W.S. Yerazunis, 2002-2003.  All rights reserved.
     This software is released under V2.1 of the Gnu Public License.
     Go to www.fsf.org to get a complete copy of the license.

     This is the CRM114 Language Quick Reference.  For information
     on the mailfilter, see the CRM114_Mailfilter_HOWTO.

-----  THE COMMAND LINE -------------

Invoke as 'crm whatever' or use '#!/usr/bin/crm' as the first line
of a script file containing the program text.

  -d N   - run N cycles, then drop into debugger.  If no N, debug immediately
  -e     - no environment variables imported
  -h     - print help text
  -p     - generate an execution-time-spent profile on exit
  -P N   - max program lines
  -q m   - mathmode (0,1 = alg/RPN only in EVAL, 2,3 = alg/RPN everywhere)
  -s N   - new feature file (.css) size is N (default 1 meg+1 featureslots)
  -S N   - new feature file (.css) size is N rounded to 2^I+1 featureslots
  -t     - user trace output
  -T     - implementors trace output (only for the masochistic!)
  -u dir - chdir to directory dir before starting execution 
  -v     - print CRM114 version identification and exit.
  -w N   - max data window (bytes, default 16 megs)
  --     - signals the end of CRM114 flags; prior flags are not seen by
	   the user program; subsequent args are not processed by CRM114.
  --foo  - creates the user variable :foo: with the value SET
  --x=y  - creates the user variable :x: with the value y	   
  -{ stmts }  - execute the statements inside the {} brackets.

Absent the -{ program } flag, the first arg is taken to be the name of
a file containing a crm114 program; subsequent args are merely supplied
as :_argN: values.  Use single quotes around command-line programs
'-{ like this }' to prevent the shell from doing odd things to them.

CRM114 can be directly invoked by the shell if the first line of your
program file uses the shell standard, as in:

	#! /usr/bin/crm

You can use CRM114 flags on the shell-standard invocation line, and
hide them with '--' from the program itself; '--' incidentally prevents
the invoking user from changing any CRM114 invocation flags.

Flags should be located after any positional variables on the command
line.  Flags _are_ visible as :_argN: variables, so you can create
your own flags for your own programs (separate CRM114 and user flags
with '--').  

Examples:

   ./foo.crm bar mugga < baz  -t -w 150000      <--- Use this

   ./foo.crm -t -w 1500000 -- bar < baz mugga   <--- or this 

   ./foo.crm -t -w 150000 bar < baz mugga      <--- NOT like this


You can put a list of user-settable vars on the '#!/usr/bin/crm'
invocation line.  CRM114 will print these out when a program is
invoked directly (e.g. "./myprog.crm -h", not "crm myprog.crm -h")
with the -h (for help) flag.  (Note that this works ONLY with bash
on Linux; the *BSDs interpret the invocation line differently, so
it doesn't work there.)

Example:
 
#!/usr/bin/crm  -( var1 var2=A var2=B var2=C )

			- allows only var1 and var2 to be set on the
                          command line.  If a variable is not assigned
                          a value, the user can set any value desired.
                          If the variable is equated to a set of
                          values, those are the _only_ values allowed.

#!/usr/bin/crm  -( var1 var2=foo )  --    

		        - allows var1 to be set to any value, var2 may
                          only be set to either "foo" or not at all,
                          and no other variables may be set nor may
                          invocation flags be changed (because of the
                          trailing "--").  Since "--" also blocks '-h'
                          for help, such programs should provide their
                          own help facility.

----- VARIABLES ----------

Variable names and locations start with a : , end with a : , and may
contain only characters that have ink (i.e. the [:graph:] class) with
a few exceptions.

Examples: :here: , :ThErE: , :every-where_0123+45%6789: , 
:this_is_a_very_very_long_var_name_that_does_not_tell_us_much: .
  
Builtin variables: 
	  :_nl: - newline
	  :_ht: - horizontal tab
	  :_bs: - backspace
	  :_sl: - a slash
	  :_sc: - a semicolon
	  :_arg0: thru :_argN: - command-line args, including _all_ flags
	  :_argc: - how many command line arguments there were
	  :_pos0: thru :_posN: - positional args ('-' or '--' args deleted)
	  :_posc: - how many positional arguments there were
	  :_pos_str:  - all positional arguments concatenated
	  :_env_whatever: - environment value 'whatever'
	  :_env_string:  - all environmental arguments concatenated
	  :_crm_version: - the version of the CRM system
	  :_dw: - the current data window contents


----  VARIABLE EXPANSION  ----

Variables are expanded by the ':*:' var-expansion operator,
e.g. :*:_nl: expands to a newline character.  Uninitialized vars
evaluate to their text name (and the colons stay).

You can also use the standard constant C '\' characters, such as "\n"
for newline, as well as escaped hexadecimal and octal characters like
\xHH and \oOOO, but these are constants, not variables, and cannot be
redefined.

Depending on the value of "math mode" (flag -q), you can also use
:#:string_or_var: to get the length of a string, and :@:string_or_var:
to do basic mathematics and inequality testing, either only in EVALs
or for all var-expanded expressions.  See "Sequence of Evaluation"
below for more details.
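
The expansion operators above can be sketched as follows.  This is
only an illustration, assuming the default math mode (-q 0); the
variable names :name:, :len: and :never_set: are made up for the
example.

```crm114
{
    isolate (:name:) /world/
    isolate (:len:) /0/
    # A variable that has a value expands to that value.
    output /Hello, :*:name:!:*:_nl:/
    # never_set has no value, so its expansion is its own text name.
    output /:*:never_set::*:_nl:/
    # In math mode 0 or 1, string length is only computed inside EVAL.
    eval (:len:) /:#:name:/
    output /length of name is :*:len::*:_nl:/
}
```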


-----  PROGRAM BEHAVIOR  ----

Default behavior is to read all of standard input till EOF into the
default data window (named :_dw:), then execute the program (this is
overridden if the first executable statement is a WINDOW statement).

Variables don't get their own storage unless you ISOLATE them (see
below); instead, variables are start/length pairs indexing into the
default data window.  Thus, ALTERing an unISOLATEd variable changes
the value of the default data buffer itself.  This is a great power,
so use it only for good, and never for evil.



--- STATEMENTS AND STUFF (separate statements with a ';' or with a newline) --

 \      - '\' is the string-text escape character.  You only _need_ to
           escape the literal representation of closing delimiters
           inside var-expanded arguments.

           You can use the classic C/C++ \-escapes, such as \n, \r,
           \t, \a, \b, \v, \f, \0, and also \xHH and \oOOO for hex and
           octal characters, respectively.


           A '\' as the _last_ character of a line means the next line
           is just a continuation of this one.

           A \-escape that isn't recognized as something special isn't
           an error; you may _optionally_ escape any of these delimiters:

                         > ) ] } ; / # \

           and get just that character.

           A '\' anywhere else is just a literal backslash, so the regex
           ([abc])\1 is written just that way; there is no need to
           double-backslash the \1 (although it will work if you do).


# this is a comment

# and this too \#        - A comment is not a piece of preprocessor sugar-
                      it is a -statement- and ends at the newline or at "\#"


insert filename                   - inserts the file verbatim at this
				    line at compile time.


;                                - statement separator - must ALWAYS
				    be escaped as \; unless it's
				    inside delimiters or else it will
				    mark the end of the statement.


{ and }                           - start and end blocks of
                                    statements. Must always be '\'
                                    escaped or inside delimiters or
                                    these will mark the start/end of a
                                    block.


noop			          - no-op statement


:label:                           - define a GOTOable label 


accept			          - writes the current data window to standard 
                                    output; execution continues.


alius				  - if the last bracket-group succeeded, ALIUS
				    skips to end of {} block (a skip, not a 
                                    FAIL); if the prior group FAILed,
				    ALIUS does nothing.  Thus, ALIUS is both 
				    an ELSE clause and a CASE statement.  
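
A sketch of ALIUS used as an if / else-if / else chain.  The patterns
and messages are placeholders; the bracket layout is the part that
matters.

```crm114
{
    {
        match /foo/
        output /found foo:*:_nl:/
    }
    alius
    {
        match /bar/
        output /no foo, but found bar:*:_nl:/
    }
    alius
    {
        output /found neither:*:_nl:/
    }
}
```

If the first inner block runs to its end, the first ALIUS skips to
the end of the outer block; if its MATCH fails, the ALIUS does
nothing and the next inner block gets its turn.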


alter (:var:) /new-val/           - destructively change value of var to new-val
      (:var:)                       - var to change (var-expanded)
              /new-val/             - value to change to (var-expanded)


classify <flags> (:c1:...|...:cN:) (:stats:) [:in:] /word-pat/ - compare the 
					  statistics of the current data window
					  buffer with classfiles c1...cN
         <nocase>                       - ignore case in word-pat, does
					  not change case in hash (use tr()
					  to do that on :in: if you want it)   
	       (:c1: ...                  file or files to consider "success"
					  files.  The CLASSIFY succeeds if
					  these files as a group match best.
                                          If not, the CLASSIFY does a FAIL.
                      |                 - optional separator.  Spaces on each 
                                          side of the " | " are required.
                       .... :cN:)       - optional files to the right of " | "
					  are considered as a group to "fail".
					  If statement fails, execution skips 
					  to end of enclosing {..} block, 
                                          which exits with a FAIL status (see
                                          ALIUS for why this is useful).
                    (:stats:)		- optional var that will get a text
                                          formatted matching summary
			[:in:]	        - restrict statistical measure to
					  the string inside :in:
			  /word-pat/    -  regex to describe what a 
					  parseable word is.
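
A hedged sketch of CLASSIFY combined with ALIUS.  The file names
spam.css and nonspam.css are hypothetical and are assumed to have
been built already by earlier LEARN statements; the word pattern is
just one plausible choice.

```crm114
{
    isolate (:stats:) /none/
    {
        classify (spam.css | nonspam.css) (:stats:) /[[:graph:]]+/
        output /best match is the spam side:*:_nl:/
    }
    alius
    {
        output /best match is the nonspam side:*:_nl:/
    }
    output /:*:stats::*:_nl:/
}
```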


eval (:result:) /instring/       - repeatedly evaluates /instring/ until it 
                                   ceases to change, then places that result
                                   as the value of :result: .  EVAL uses 
                                   smart (but foolable) heuristics to avoid
                                   infinite loops, like evaluating a string
                                   that evaluates to a request to evaluate
                                   itself again.  The error rate is about 
                                   1 / 2^62 and will detect chain groups of
                                   length 255 or less.

				   If the instring uses math evaluation
				   (see section below on math operations)
				   and the evaluation has an inequality
				   test, (>, < or =) then if the inequality
				   fails, the EVAL will FAIL to the end of
				   block.  If the evaluation has a numeric
				   fault (e.g. divide-by-zero) the EVAL will
				   do a TRAPpable FAULT. 
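
A minimal sketch of EVAL used for arithmetic and for an inequality
test, assuming the default math mode (-q 0).  The names :count: and
:ok: are invented for the example.

```crm114
{
    isolate (:count:) /3/
    isolate (:ok:) /0/
    eval (:count:) /:@: :*:count: + 1 :/
    output /count is now :*:count::*:_nl:/
    {
        # An inequality that does not hold makes the EVAL do a FAIL.
        eval (:ok:) /:@: :*:count: > 10 :/
        output /count is above 10:*:_nl:/
    }
    alius
    {
        output /count is 10 or less:*:_nl:/
    }
}
```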
 

exit  /:retval:/		 - ends program execution.  If supplied, the
				   return value is converted to an integer 
				   and returned as the exit code of the 
				   crm114 program.  
      /:retval:/                 - variable to be converted to an integer
                                   and returned.  If no retval is supplied,
				   the return value is 0.


fail				 - skips down to end of the current { } block
                                   and causes that block to exit with a FAIL
                                   status (see ALIUS for why this is useful)


fault /faultstr/                 - forces a FAULT with the given string as
                                   the reason.  
      /faultstr/                    - the var-expanded fault reason string 


goto /:label:/                   - unconditional branch (you can use 
				   a variable as the goal, e.g. /:*:there:/ )


hash (:result:) /input/          - compute a fast 32-bit hash of the 
                                   /input/, and ALTER :result: to the 
                                   hexadecimal hash value.  HASH is
                                   _not_ warranted to be constant across
                                   major releases of CRM114, nor is it
				   cryptographically secure.
     (:result:)                     - value that gets result.
               /input/              - string to be hashed (can contain 
                                      expanded :*:vars: , defaults to 
				      the data window :_dw: )
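
A short sketch of HASH; the input string and the name :digest: are
arbitrary.  Because the hash is not warranted to be constant across
releases, no expected value is shown.

```crm114
{
    isolate (:digest:) /0/
    hash (:digest:) /hello world/
    output /hash is :*:digest::*:_nl:/
}
```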


intersect (:out:) [:var1: :var2: ...] - makes :out: contain the part of the
                                   data window that is the intersection of
                                   :var1: :var2: ...  ISOLATEd vars are ignored.
                                   This only resets the value of the captured
				   variable, and does NOT alter any text in 
                                   the data window.
				      				     

isolate (:var:) /initial-value/  - puts :var: into a data area outside of the 
				   data buffer; subsequent changes to this 
				   var don't change the data buffer (though 
                                   they may change the value of any var
				   subsequently set inside of this var).  
				   If the var already was ISOLATEd, this is 
				   a noop.
       (:var:)                      - name of ISOLATEd var (var-expanded)
                /initial-value/     - optional initial value for :var:
				      (var-expanded).  If no value is
				      supplied, the previous value is
				      retained/copied.

input <flags> (:result:) [:filename:] - read in the content of filename;
                                        if no filename, then read stdin
      <byline>                        - read one line only
               (:result:)             - var that gets the input value
		       [:filename:]   - the file to read


learn <flags> (:class:) [:in:] /word-pat/ - learn the statistics of the :in: 
				       var (or the input window if no var)
                                       as an example of class :class:
      <nocase>                       - ignore case in matching word-pat (does
				       not ignore case in hash- use tr() to
				       do that on :in: if you want it)
      <refute>                       - this is an anti-example of this
				       class- unlearn it!
      <microgroom>                   - enable the microgroomer to purge
                                       less-important information automatically
                                       whenever the statistics file gets too
                                       crowded.
              (:class:)              - name of file holding hashed results;
                                       nominal file extension is .css
                    [:in:]           - captured var containing the text
                                       to be learned (if omitted, the full
                                       contents of the data window is used)
                        /word-pat/   - regex that defines a "word".  Things
				       that aren't "words" are ignored.
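
A hedged sketch of LEARN.  The file name spam.css is hypothetical,
and the program is assumed to be run with one known example of that
class on standard input.

```crm114
{
    # No [:in:] var is given, so the whole data window is learned.
    learn <microgroom> (spam.css) /[[:graph:]]+/
}
```

Adding the <refute> flag to the same statement would instead unlearn
the text as an example of that class.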



liaf				 - skips UP to START of the current {} block
					 (LIAF is FAIL spelled backwards)
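
LIAF is the usual way to build a loop.  The sketch below prints each
alphabetic word of the data window on its own line; the name :word:
and the pattern are chosen only for illustration.

```crm114
{
    # Each pass resumes after the previous successful match.
    match <fromend> (:word:) /[[:alpha:]]+/
    output /:*:word::*:_nl:/
    # Jump back up to the opening bracket of this block.
    liaf
}
```

The loop ends when the MATCH finally fails, since a failed MATCH
skips to the closing bracket and so never reaches the LIAF.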


match <flags> (:var1: ...) [:in:] /regex/  - Attempt to match the given regex;
                                   if match succeeds, variables are bound;
                                   if match fails, program skips to the
                                   closing '}' of this block
      <absent>                     - statement succeeds if match not present
      <nocase>                     - ignore case when matching
      <literal>	                   - No special characters in regex (only
                                     supported with TREregex, not GNUregex.)
      <fromstart>                  - start match at start of the [:in:] var
      <fromcurrent>		   - start match at start of previous 
				     successful match on the [:in:] var
      <fromnext>                   - start match at one character past
                                     the start of the previous successful
                                     match on the [:in:] var
      <fromend>                    - start match at one character past
                                     the end of prev. match on this [:in:] var
      <newend>                     - require match to end after end of
                                     prev. match on this [:in:] var
      <backwards>                  - search backward in the [:in:] variable
				     from the last successful match.
      <nomultiline>                - don't allow this match to span lines
              (:var1: ...)         - optional variables to bind to regex
                                     result and '(' ')' subregexes
                     [:in:]        - search only in the variable specified;
                                     if omitted, :_dw: (the full input data
                                     window) is used
                         /regex/   - POSIX regex (with \ escapes as needed)
    
              If you build CRM114 to use the GNU regex library for MATCHing,
	      be warned that GNU REGEX has numerous issues.  See the 
	      KNOWN_BUGS file for a detailed listing.
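
A sketch of MATCH binding the whole match and two parenthesized
subregexes.  The input line and the names :all:, :user: and :host:
are assumptions made for the example.

```crm114
# Assumed standard input for this sketch:  From: alice@example.com
{
    match <nocase> (:all: :user: :host:) /from: *([^@ ]+)@([[:graph:]]+)/
    output /user is :*:user: and host is :*:host::*:_nl:/
}
```

The first variable is bound to the text matched by the whole regex,
and the following ones to the subregexes in order.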


output <flags> [filename] /output-text/ - output an arbitrary string 
			            with captured values expanded.
       <append>			  - append to the file (otherwise, overwrites)
       [filename]                 - filename to send output (var-expanded),
                                    default output is to stdout
              /output-text/       - string to output (var-expanded)


syscall <flags> (:in:) (:out:) (:status:) /command/ - execute a shell command
        <keep>                      - keep this process around; if kept,
                                      then a later syscall with the same
                                      :status: var will continue feeding to
                                      and reading from the kept proc.
        <async>                     - don't wait for process to send an
                                      EOF; just grab what's available in 
                                      the process's output pipe and proceed
				      (limit per syscall is 256 Kbytes)
               (:in:)               - var-expanded string to feed to command
                                      as input (can be null if you don't want
                                      to send the process something.)  You
				      _MUST_ specify this if you want to 
				      specify an :out: variable.
                (:out:)             - var-expanded varname to place results
                                      into (MUST pre-exist; can be null if
                                      you don't want to read the process's
                                      output yet, or at all).  Limit per
				      syscall is 256 Kbytes.  You _MUST_
                                      specify this if you want to use the
                                      :status: variable.
                  (:status:)        - if you want to keep a minion proc
                                      around, or catch the exit status
				      of the process, specify a var here.  
				      The minion process's PID and pipes 
                                      will be stored here.  The program
				      can access the proc again with 
                                      another syscall by using this var again.
                                      When the process exits, its exit code
                                      will be stored here.
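
A minimal sketch of SYSCALL that sends nothing to the command and
captures its output.  The command "ls -l" and the name :listing: are
arbitrary choices.

```crm114
{
    # The output var must already exist before the SYSCALL.
    isolate (:listing:) /empty/
    # Empty first parens, so nothing is fed to the command as input.
    syscall () (:listing:) /ls -l/
    output /:*:listing:/
}
```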


trap (:reason:) /trap_regex/     - traps faults from both FAULT statements
                                   and program errors occurring anywhere in
				   the preceding bracket-block.  If no fault
				   exists, TRAP does a SKIP to end of block.
				   If there is a fault and the fault reason
                                   string matches the trap_regex, the fault 
				   is trapped, and execution continues with
				   the line after the TRAP, otherwise the 
                                   fault is passed up to the next surrounding 
                                   trapped bracket block.
     (:reason:)                     - the fault message that caused this
                                      FAULT.  If it was a user fault, this
                                      is the text the user supplied in the
                                      FAULT statement.
          /trap_regex/              - the regex that determines what kind of
				      faults this TRAP will accept.  Putting
				      a wildcard here (e.g. /.*/ ) means that
				      ALL faults will be trapped here.
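
A sketch of FAULT and TRAP working together.  The fault text and the
name :why: are invented for the example.

```crm114
{
    isolate (:why:) /none/
    {
        fault /demo fault, nothing is really wrong/
        output /this statement is skipped:*:_nl:/
    }
    # Accept only faults whose reason string contains the word demo.
    trap (:why:) /demo/
    output /trapped, reason was :*:why::*:_nl:/
}
```

If the inner block had finished without a fault, the TRAP would skip
to the end of the outer block and the last OUTPUT would not run.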


union (:out:) [:var1: :var2: ...] - makes :out: contain the union of the data
                                   window segments that contain var1, var2... 
                                   plus any intervening text as well.  Any 
                                   ISOLATEd var is ignored.  This is 
                                   non-surgical, and does not alter the 
                                   data window.


window <flags> (:w-var:) (:s-var:) /cut-regex/ /add-regex/ - window slider.
				   This deletes up to and including the
				   cut-regex from :w-var: (default: use the 
                                   data window), then reads and adds from std. 
                                   input till add-regex (inclusive).
       <nocase>                    - ignore case when matching cut- and add-
				     regexes
       <bychar>                    - check input for add-regex every character
       <byline>                    - check input for add-regex every line
       <byeof>                     - wait for EOF to check for add-regex (extra
				     characters are kept around for later)
       <eofends>                   - read lots of input; the input is up to the
                                     regex match OR the contents till EOF
            (:w-var:)              - what var to window
	       (:s-var:)           - what var to use for source (defaults to
				     stdin; if you use a source var you _must_
                                     specify the windowed var).
              /cut-regex/          - var-expanded cut pattern
                      /add-regex/  - var-expanded add pattern, if absent 
                                     reads till EOF

                            *****    If both cut-regex and add-regex are 
				     omitted, and this window statement is 
				     the _first_ _executable_ statement in
				     the program, then CRM114 does _not_ wait
				     to read anything from standard input
				     before starting program execution.
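
A hedged sketch of the special case above: a bare WINDOW as the first
executable statement, so the program starts without first reading
standard input to EOF, followed by an INPUT that reads one line on
demand.  The name :line: is made up.

```crm114
window
isolate (:line:) /nothing yet/
input <byline> (:line:)
output /read :*:line::*:_nl:/
```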

     ------------ A Quick Regex Intro ---------

A regex is a pattern match.  Do a "man 7 regex" for details.

Matches are, by default, "first starting point that matches, then 
longest match possible that can fit".  

  a through z
  A through Z   - all match themselves
  0 through 9
  
  most punctuation - matches itself, but check below!

  .       - matches any character	

  *       - repeat preceding 0 or more times
  
  +       - repeat preceding 1 or more times
 
  ?       - repeat preceding 0 or 1 time 

  *?, +?, ??  - repeat preceding, but _shortest_ match that fits, given
   	        the already-selected start point of the regex. (only
	        supported by TRE regex, not GNU regex)

  [abcde]    any one of the letters a, b, c, d, or e
	      
  [a-q]      the letters a through q (just one of them)

  {n,m}      repetition count: match the preceding at least n and no more
	     than m times (POSIX restricts this to a maximum of 255
	     repeats)

  [[:<:]]    matches at the start of a word (GNU regex only)

  [[:>:]]    matches the end of a word (GNU regex only)

  ^          as first char of a match, matches the start of a line (ONLY in
                <nomultiline> matches)

  $          as last char of a match, matches at the end of a line (ONLY in 
                <nomultiline> matches)
 
  .         (a period) matches any _single_ character (except start-of-line or
            end of line "virtual characters", but it does match a newline).

  a|b        match a _or_ b

  (match)    - the () go away, and the string that matched inside is
	     available for capturing.  Use \\( and \\) to match actual 
	     parentheses (the first '\' says "show the second '\' to
	     the regex engine"; the second '\' forces a literalization
	     onto the parenthesis character).

  \N        - (N is a digit) matches the N'th parenthesized subexpression.
	      Remember to backslash-escape the backslash (e.g. write this
	      as \\1).  This is only if you're using TRE, not GNU regex.

The following are other POSIX expressions, which mostly do what you'd
guess they'd do from their names.

  [[:alnum:]]
  [[:alpha:]]
  [[:blank:]] 
  [[:cntrl:]]
  [[:digit:]] 
  [[:lower:]]
  [[:upper:]] 
  [[:graph:]]  <-- any character that puts ink on paper or lights a pixel
  [[:print:]]  <-- any character that moves the "print head" or cursor.
  [[:punct:]] 
  [[:space:]] 
  [[:xdigit:]]


    --------------  Notes on Sequence of Evaluation -------------

By default, CRM114 supports string length and mathematical evaluation
only in an EVAL statement, although it can be set to allow these in
any place where a var-expanded variable is allowed (see the -q flag).
The default value ( zero ) allows stringlength and math evaluation
only in EVAL statements, and uses non-precedence (that is, strict
left-to-right unless parentheses are used) algebraic notation.  -q 1
uses RPN instead of algebraic, again allowing stringlength and math
evaluation only in EVAL expressions.  Modes 2 and 3 allow stringlength
and math evaluation in _any_ var-expanded expression, with
non-precedence algebraic notation and RPN notation respectively.

Evaluation is always left-to-right; there is no precedence of
operators beyond the sequential passes noted below.

The evaluation is done in four sequential passes:

 1)   \-constants like \n, \o377 and \x3F are substituted 

 2)   :*:var: variables are substituted (note the difference between
      a constant like '\n' and a variable like ":*:_nl:" here - constants
      are substituted first, then variables are substituted.)

 3)   :#:var: string-length operations are performed

 4)   :@:expression: mathematical expressions are performed; syntax is
      either RPN or non-precedenced (parens required) algebraic
      notation.  Embedded non-evaluated strings in a mathematical
      expression are currently a no-no.
    
      Allowed operators are:  + - * / % > < = only.

      Only >, <, and = set logical results; they also evaluate to
      1 and 0 for continued chain operations - e.g. 

	((:*:a: > 3) + (:*:b: > 5) + (:*:c: > 9) > 1)

      is true IFF at least two of the three tests hold, that is, IFF
      any of the following is true

	 a > 3 and b > 5
	 a > 3 and c > 9
	 b > 5 and c > 9
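
The lack of operator precedence can be sketched like this, assuming
the default math mode (-q 0, algebraic notation, math only in EVAL).
The name :r: is arbitrary, and the exact number format printed is
left open.

```crm114
{
    isolate (:r:) /0/
    # No precedence, so this is evaluated as (2 + 3) * 4
    eval (:r:) /:@: 2 + 3 * 4 :/
    output /left to right gives :*:r::*:_nl:/
    # Parentheses force the multiplication to happen first
    eval (:r:) /:@: 2 + (3 * 4) :/
    output /with parens gives :*:r::*:_nl:/
}
```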


    -------------- Notes on Approximate REGEX matching ---------

Only the TRE engine supports approximate matching.  The GNU engine does
not support approximate matching.

Approximate matching is specified similarly to a "repetition count" in
a regular regex, using braces.  This approximation applies to the
previous parenthesized expression (again, just like repetition counts).
You can specify maximum total changes, and how many inserts, deletes,
and substitutions you wish to allow.  The minimum-error match is found
and reported, if it exists within the bounds you state.

The basic syntax is:
  
  (text-to-match){~[maxerrs] [#maxsubsts] [+maxinserts] [-maxdeletes]}

Note that the '~' (with an optional maxerr count) is _required_ (that's how
we know it's an approximate regex rather than just a rep-count); if you
don't specify a max error count, you will get the best match, if you do,
the match will have at most that many errors.

Remember that you specify the changes to the text in the _pattern_
necessary to make it match the text in the string being searched.

You cannot use approximate regexes and backrefs (like \1) in the same 
regex.  This is a limitation in TRE at this point.

You can also use an inequality in addition to the basic syntax above:

  (text-to-match){~[maxerrs] [basic-syntax] [ni + md + os < K] }

where n, m, and o are the costs per insertion, deletion, and substitution
respectively, 'i', 'd', and 's' are indicators to tell which cost goes
with which kind of error, and K is the bound on the total cost of the
errors; the cost of the errors is always strictly less than K.

Here are some examples.

  (foobar)       - exactly matches "foobar"

  (foobar){~}    - finds the closest match to "foobar", with the minimum number
		   of inserts, deletes, and substitutions.  Always succeeds.

  (foobar){~3}   - finds the closest match to "foobar", with no more than 3
	           inserts, deletes, or substitutions

  (foobar){~2 +2 -1 #1} - find the closest match to "foobar", with at most
		   two errors total, and at most two inserts, one delete,
		   and one substitution.

  (foobar){~4 #1 1i + 2d < 5 } - find the closest match to "foobar",
		   with at most four errors total, at most one substitution,
		   and with the number of insertions plus 2x the number of
		   deletions less than 5.
  
  (foo){~1}(bar){~1} - find the closest match to "foobar", with at most one
		   error in the "foo" and one error in the "bar".
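
A sketch of an approximate regex inside a MATCH statement.  It
assumes CRM114 was built with the TRE engine; the input text and the
name :hit: are invented, and exactly which characters end up in
:hit: is up to TRE.

```crm114
# Assumed standard input for this sketch:  please fobar the widget
{
    # Allow at most one insert, delete or substitution.
    match (:hit:) /(foobar){~1}/
    output /closest match was :*:hit::*:_nl:/
}
```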


				     
     ------------ Overall Language Notes ------------

Here's how to remember what goes where in the CRM114 language.

Unlike most computer languages, CRM114 uses inflection (or declension)
rather than position to describe what role each part of a statement
plays.  The declensions are marked by the delimiters- the /, ( and ), <
and >, and [ and ].

By and large, you can mix up the arguments to each kind of statement
without changing their meaning.  Only the ACTION needs to be first.
Other parts of the statement can occur in any order, save that
multiple (paren_args) and /pattern_args/ must stay in their nominal
order but can go anywhere in the statement.  They do not need to be
consecutive.

The parts of a CRM114 statement are:

	  ACTION	     - the verb.  This is at the start of the 
                               statement.

          /pattern/	     - the overall pattern the verb should 
                               use, analogous to the "subject" of the 
                               statement.

	  <flags>	     - modifies how the ACTION does the work. 
			       You'd call these "adverbs" in human 
                               languages.

	  (vars)	     - what variables to use as adjuncts in 
                               the action (what would be called the 
                               "direct objects").  These can get changed
			       when the action happens.

	  [limited-to]       - where the action is allowed to take place 
			       (think of it as the "indirect object").  
			       These are not directly changed by the action.
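
As an illustration of declension over position, the two MATCH
statements below are intended to mean the same thing; only the verb
keeps a fixed place.  The pattern and the name :greeting: are
placeholders.

```crm114
{
    match <nocase> (:greeting:) [:_dw:] /hello/
    output /1st form matched :*:greeting::*:_nl:/
}
{
    match /hello/ [:_dw:] (:greeting:) <nocase>
    output /2nd form matched :*:greeting::*:_nl:/
}
```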
			       





