TSGrep.pas    copyright (C) 2000   R. Collins
rcoll@rtd.com

LEGAL STUFF:

This software is provided "as-is".  This software comes without warranty 
or guarantee, explicit or implied.  Use this software at your own risk.  
The author will not be liable for any damage to equipment, data, or information
that may result while using this software.

By using this software, you agree to the conditions stated above.

This software may be used freely, provided the author's name and copyright
statement remain a part of the source code.

THE CODE:

TSGrep is a "simple grep" component that does not rely on an external DLL or
library.  All functionality is contained within the TSGrep object.  All
regular-expression parsing, searching, and matching are contained within this
one source-code file.  Since it does not rely upon an external library, this
makes it easier to export your program to other machines without dragging along
extra files.  However, since it is a simple search and match, perhaps not all 
the functions you need are implemented.

If you have need of a function that is not implemented, and do not know how to
include it yourself, e-mail me and I'll see if I can code it for you.  Be aware: 
I reserve the right to charge a fee for my time (I also reserve the right to work 
for free).  In either case, I work cheap.

This component was written using Borland's Delphi 3, but should be source-code
compatible with later versions of Delphi.  I cannot state if the code will compile
on Delphi 1 or Delphi 2.  However, since you have the source code, you should be
able to modify it to work on the earlier versions of Delphi.

It is written with the understanding that other programmers will need to read it, 
and possibly modify it for their own specific needs. Comments are meant to be 
descriptive and informative; however, I am assuming an experienced software engineer 
level of experience.  This code is not written for beginners.

The TSGrep component is a descendant of TObject, which means it is a top-level
object.  It will have the basic behavior common to all objects in Delphi.


KNOWN PROBLEMS and THINGS TO DO:

The parsing and sorting routines are rather slow and inefficient.  They are fine
for short and simple searching, but for longer or complicated regular expressions,
the parse and sort to build the search tree could be better.  However, once the
search tree has been build, the actual searching is pretty quick.

There is currently no way to save a search tree (after parsing the regular
expression) to a file.  Saving a search tree that is used often or across several
applications would considerably speed up initial parsing and sorting.

Complicated regular expressions can result in a lot of "dead code" in the search
tree.  This is just wasted memory.  Currently, there is no "garbage collection" 
procedure to return this memory to the system until the entire TSGrep object is 
released (see TSGrep.Free) and all memory is returned to the system.

The TSGrep object only searches through strings.  It would be good to have it
also search through files or streams.

REGULAR EXPRESSIONS

Almost everyone is familiar with the famous MS-DOS style of searching for
files *.TXT, or *.*.  These are trivial regular expressions.  In MS-DOS,
the "*" character matches any number of any characters; the "?" matches
any single character.

A regular expression search engine simply expands on this concept.  Various
characters or strings are used to match different types of targets; in addition,
the use of boolean operators AND "&" and OR "|" allow multiple possibilities to 
be found within the source string.

When searching a source string for targets, there are two values that define
the current search position: the ANCHOR and the BEAD.  The anchor is defined
to be the starting position of the search; i.e., this value 'anchors' the
pattern matching to a specific starting point in the source string.  By
default all positions in the source string are tested (anchored) until the 
first match is found.  Setting a value to the anchor insures that only targets
found at that position are matched.  The bead (sometimes called the cursor)
is the current point in the source string where a match is declared a
success.  Taken together, the string slice SOURCE[ANCHOR..BEAD] define the
substring where a matching target is found.


Following are the pattern matching operators used by SGrep:

?		question mark
		Matches one character of anything (MS-DOS style).
*		asterik
		Matches any number of anything (MS-DOS style).
"xxx"		double-quotes
		Matches literal characters inside the quotes.
\000		back-slash
		Match literal character; if the number starts with 0,
		then the numeric code is octal.
\999		back-slash
		Match literal character; if the number starts with 1-9,
		then the numeric code is in decimal.
\xHH		back-slash
		Match literal character; if the digit string starts with
		a lower-case "x", then the next 2 characters are hex digits.
\n		Match a literal line-feed.
\r		Match a literal carraige-return.
\t		Match a literal tab character.
\f		Match a literal form-feed character.
[xxx]		square brackets
		Match any single character within the brackets.  You can use
		the dash "-" to span characters in sequence.  For example,
		[1-9] is the same as [123456789].
[-xxx]	square brackets negated
		Match any single character NOT within the brackets.  If the
		first character after the opening "[" is a dash "-", then the
		character set is negated; i.e., any character not within the
		set is matched.  You can still use a dash to span sequential
		characters; [-1-9] is the same as [-123456789], and will match
		any character that is not a digit 1 through 9.  The set
		[-a-zA-Z] is the same as 
		[-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ] and will
		match any character that is not an alphabetic character.
%x		percent sign
		Special character span; where "x" specifies the type of spanning.
		%W	match any number of white space characters (space or tab)
		%w	match a single white space
		%N	match any number of digits "0123456789"
		%n	match any single digit "0123456789"
		%A	match any number of alphabetics "a-z" or "A-Z"
		%a	match any single alphabetic "a-z" or "A-Z"
		%X	match any number of alphanumerics "0-9", "a-z" or "A-Z"
		%x	match any single alphanumeric "0-9", "a-z", or "A-Z"
%-x		percent sign negated
		Special character span; where "x" specifies the type of spanning.
		In this case, the "-" negates the span, causing the pattern to
		match a target that does NOT consist of the span set.
		%W	match any number characters that are NOT white space (space or tab)
		%w	match a single character that is NOT white space
		%N	match any number of characters that are NOT digits "0123456789"
		%n	match any single character that is NOT a digit "0123456789"
		%A	match any number of characters that are NOT "a-z" or "A-Z"
		%a	match any single character that is NOY  "a-z" or "A-Z"
		%X	match any number of characters that are NOT "0-9", "a-z" or "A-Z"
		%x	match any single character that is NOT "0-9", "a-z", or "A-Z"
@nnn		at
		Match bead position.  The "nnn" is a decimal string specifying a position
		within the source string.  The match succeeds if the current bead is at
		the given position.  The source bead positions start at 0 (for beginning
		of line) and extend to the length of the string (for end of line matching).
		Specifying a bead position matches targets only if they start at that
		bead position.  For example, the pattern
			@0 "Cat"
		matches the characters "Cat" only if they appear at the beginning of the
		source string.
@-nnn		at negated
		The bead position, counted from the end of the string.  For example,
			"Cat" @-0
		matches the character "Cat" only if they appear at the end of the source
		string.
&		ampersand
		Boolean operator AND.  This forces the two expressions on either side
		of the & to be evaluated, and both must be TRUE to succeed.  This
		operator is not normally needed, since a logical AND is the default
		conjunction of a series of expressions.
|		bar
		Boolean operator OR.  Allows either of the two expressions on either
		side to be evaluated.  If either is TRUE, then the match succeeds.
		For example
			"Cat" | "Dog"
		will match either the characters "Cat" or "Dog".  The left expression
		is evaluated first; if it succeeds, then the right expression is
		skipped; if the left expression fails, then the right-side expression
		is tested.
(xxx)		parentheses
		Group expressions into a hierarchy.  For example
			(%n%n%n "-" %n%n%n%n) | "Cat"
		will match either a string of characters that look like a phone
		number (123-4567), or the characters "Cat", whichever it finds first.
{min:max}	curly braces
		Defines a repeat count for the next expression.  The next expression
		must be matched a minimum of "min" times, and no more than "max"
		times.  If only one number is given, (for example {5}) then the next
		expression will be matched an exact number of times.  For example
			{3} ("Cat" | "Dog")
		will match exactly 3 occurrences of either "Cat" or "Dog".
!x		shrike
		Writes output via the OnWrite event.  The "x" decides what value is
		to be written:
		!A		writes the current value of the anchor as a string
		!B		writes the current value of the bead as a string
		!S		writes the current matching substring (indexed anchor
				to bead)
		!T		writes the current matching total string (indexed
				0 to bead)
		See the OnWrite event for more information.


INTERFACE

constructor Create;
	Creates a new (empty) grep component.

constructor CreateFromPattern(const Pattern: string);
	Creates a new grep component, and parses the pattern into a search  tree.
	Once parsed, the pattern may be used multiple times with multiple source 
	strings.  The overhead of parsing the pattern is performed only once.

procedure Free;
	Releases the memory used by a grep component, and returns all resources used
	by the grep component back to the system.  This grep component may not be 
	used again until it is re-created by a constructor call.

function MatchFirst(Source: string): integer;
	Searches the source string for the first instance of the pattern, and returns
	the index of the match.  The index of a string is counted as 1..n, where n
	is the length of the string.  If the pattern is not found, 0 is returned.

function MatchNext: integer;
	Used after a call to MatchFirst, returns the index of the next matching
	occurrence.  If the pattern is not found, 0 is returned.

function Execute(Patter, Source: string): boolean;
	Parses the pattern and initializes the search tree, then searches the source
	string for any matching pattern.  Returns TRUE if the pattern is found, or
	FALSE if it is not found.

procedure Dump_OpList(Strings: TStrings);
	Used for debugging purposes; writes a formatted copy of the search tree to
	the string list.

property Tag: integer;			(read write)
	A general purpose integer for use by the programmer.  Typically used to 
	uniquely identify a component.

property Pattern: string;		(read write)
	Defines the current pattern to search for.  When set, the pattern is parsed
	and sorted into a search tree.  The overhead of parsing and sorting is
	performed only once, although the pattern may be used to search several
	different source strings.

property Source: string;		(read write)
	Defines the string to be searched.  No parsing or sorting is performed on
	the source string; it is simply scanned, character by character, to find
	a matching pattern.

property Success: boolean;		(read only)
	Returns the success TRUE or FALSE of the last search performed by MatchFirst,
	MatchNext, or Execute.  When read within an event handler, returns the
	current state of the search, even though the search may not yet be finished.

property Anchor: integer;		(read write)
	Defines the beginning index in a source string from which to start a search.
	If a search is already started, it returns the current index of a source
	string that is being tested for a pattern match.

property Bead: integer;			(read only)
	During a pattern search, returns the current index of the character to be
	matched against the pattern.  The bead is always greater than or equal to
	the anchor.

property SubString: string;		(read only)
	Returns the string that matched the pattern.  When used in an event handler,
	the SubString is the current incomplete matching string.

property OnError: TGrepEvent;		(read write)
	Defines the procedure to call when an error occurs.  Most errors occur during
	the parsing or sorting phase, and represent an invalid operator syntax.  The
	OnError event provides a character string containing a diagnostic of the
	error.
	TgrepEvent = procedure(Sender: TObject; Message: string);

property OnWrite: TGrepEvent;		(read write)
	Provides an event for the write operations (!S, !T, !A, !B).  The string 
	passed in the event handler is the string written by the operator.
	TgrepEvent = procedure(Sender: TObject; Message: string);