REGEXP(3)                                                            REGEXP(3)



NAME
       regcomp, regexec, regsub, regerror - regular expression handler

SYNOPSIS
       #include <regexp.h>

       regexp *regcomp(exp)
       char *exp;

       int regexec(prog, string)
       regexp *prog;
       char *string;

       regsub(prog, source, dest)
       regexp *prog;
       char *source;
       char *dest;

       regerror(msg)
       char *msg;

DESCRIPTION
       These  functions  implement egrep(1)-style regular expressions and sup-
       porting facilities.

       Regcomp compiles a regular expression into a structure of type  regexp,
       and  returns  a pointer to it.  The space has been allocated using mal-
       loc(3) and may be released by free.

       Regexec matches a NUL-terminated string against  the  compiled  regular
       expression  in  prog.   It returns 1 for success and 0 for failure, and
       adjusts the contents of prog's startp and endp (see below) accordingly.

       The  members  of a regexp structure include at least the following (not
       necessarily in order):

              char *startp[NSUBEXP];
              char *endp[NSUBEXP];

       where NSUBEXP is defined (as 10) in the header file.  Once a successful
       regexec has been done using the regexp, each startp-endp pair describes
       one substring within the string, with the startp pointing to the  first
       character of the substring and the endp pointing to the first character
       following the substring.  The 0th substring is the substring of  string
       that  matched  the whole regular expression.  The others are those sub-
       strings that  matched  parenthesized  expressions  within  the  regular
       expression,  with  parenthesized  expressions numbered in left-to-right
       order of their opening parentheses.
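
       A minimal sketch of reading one startp-endp pair follows.  The struct
       demo_regexp and the helper copy_submatch are hypothetical stand-ins
       for illustration, not part of this library; in real use the pointers
       are filled in by regexec.  The sketch also assumes that an unused
       pair holds NULL pointers, which this manual does not promise.

```c
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Hypothetical stand-in holding only the two documented members. */
struct demo_regexp {
    char *startp[NSUBEXP];
    char *endp[NSUBEXP];
};

/*
 * Copy substring n into buf as a NUL-terminated string.
 * Returns its length, or -1 if n is out of range, the pair is unset
 * (assumed to mean NULL), or buf is too small.
 */
int copy_submatch(const struct demo_regexp *prog, int n,
                  char *buf, size_t buflen)
{
    size_t len;

    if (n < 0 || n >= NSUBEXP)
        return -1;
    if (prog->startp[n] == NULL || prog->endp[n] == NULL)
        return -1;
    /* endp points at the first character following the substring. */
    len = (size_t)(prog->endp[n] - prog->startp[n]);
    if (len + 1 > buflen)
        return -1;
    memcpy(buf, prog->startp[n], len);
    buf[len] = '\0';
    return (int)len;
}
```

       Because endp points one past the last matched character, the length
       is simply endp[n] - startp[n], and no terminator is stored in the
       original string.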

       Regsub copies source to dest, making  substitutions  according  to  the
       most  recent  regexec  performed  using  prog.  Each instance of `&' in
       source is replaced by the substring indicated by startp[0] and endp[0].
       Each instance of `\n', where n is a digit, is replaced by the substring
       indicated by startp[n] and endp[n].  To get a literal `&' or `\n'  into
       dest,  prefix  it with `\'; to get a literal `\' preceding `&' or `\n',
       prefix it with another `\'.
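
       The substitution rules above can be sketched as follows.  This is an
       illustration, not the library's code: demo_regexp and demo_regsub are
       hypothetical names, the destlen argument and the int return value are
       additions (the documented regsub takes no size), and an unset
       substring is assumed to be marked by NULL pointers.

```c
#include <stddef.h>
#include <string.h>

#define NSUBEXP 10

/* Hypothetical stand-in holding only the two documented members. */
struct demo_regexp {
    char *startp[NSUBEXP];
    char *endp[NSUBEXP];
};

/*
 * Expand source into dest, writing at most destlen bytes including the
 * terminating NUL.  Returns 0 on success, -1 if dest is too small.
 */
int demo_regsub(const struct demo_regexp *prog, const char *source,
                char *dest, size_t destlen)
{
    size_t out = 0;
    char c;

    while ((c = *source++) != '\0') {
        int no = -1;                /* substring to insert, if any */

        if (c == '&') {
            no = 0;                 /* whole match */
        } else if (c == '\\' && *source >= '0' && *source <= '9') {
            no = *source++ - '0';   /* \n, n a digit */
        } else if (c == '\\' && (*source == '\\' || *source == '&')) {
            c = *source++;          /* escaped: emit it literally */
        }

        if (no < 0) {
            if (out + 1 >= destlen)
                return -1;
            dest[out++] = c;
        } else if (prog->startp[no] != NULL && prog->endp[no] != NULL) {
            size_t len = (size_t)(prog->endp[no] - prog->startp[no]);

            if (out + len >= destlen)
                return -1;
            memcpy(dest + out, prog->startp[no], len);
            out += len;
        }
    }
    if (out >= destlen)
        return -1;
    dest[out] = '\0';
    return 0;
}
```

       With substring 0 set to `abc' and substring 1 set to `b', the source
       text `[&] \1 \& \\1' expands to `[abc] b & \1' under this sketch.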

       Regerror is called whenever an error is detected in  regcomp,  regexec,
       or regsub.  The default regerror writes the string msg, with a suitable
       indicator of origin, on the standard error output and invokes  exit(2).
       Regerror can be replaced by the user if other actions are desirable.
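
       One possible replacement, shown only as a sketch, records the message
       and returns instead of exiting, so that the caller can check for a
       NULL result from regcomp.  The variables last_regerror and
       regerror_count are hypothetical, and the void return type is an
       assumption, since the synopsis declares regerror without one.

```c
#include <stdio.h>

/* Hypothetical state the rest of the program could inspect. */
static char last_regerror[128];
static int regerror_count;

/*
 * User-supplied regerror: keep a copy of the message and return to
 * the caller rather than printing it and exiting.
 */
void regerror(char *msg)
{
    /* snprintf truncates long messages and always NUL-terminates. */
    snprintf(last_regerror, sizeof last_regerror, "%s", msg);
    regerror_count++;
}
```

       A definition like this would be linked into the program in place of
       the default one.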

REGULAR EXPRESSION SYNTAX
       A  regular  expression  is zero or more branches, separated by `|'.  It
       matches anything that matches one of the branches.

       A branch is zero or more pieces, concatenated.  It matches a match  for
       the first, followed by a match for the second, etc.

       A piece is an atom possibly followed by `*', `+', or `?'.  An atom fol-
       lowed by `*' matches a sequence of 0 or more matches of the  atom.   An
       atom  followed  by  `+'  matches a sequence of 1 or more matches of the
       atom.  An atom followed by `?' matches a match of the atom, or the null
       string.

       An  atom  is  a regular expression in parentheses (matching a match for
       the regular expression), a range (see below), `.'  (matching any single
       character), `^' (matching the null string at the beginning of the input
       string), `$' (matching the null string at the end of the input string),
       a  `\'  followed  by a single character (matching that character), or a
       single character with no other significance (matching that  character).

       A  range  is  a  sequence  of characters enclosed in `[]'.  It normally
       matches any single character from the sequence.  If the sequence begins
       with  `^',  it  matches  any  single character not from the rest of the
       sequence.  If two characters in the sequence are separated by `-', this
       is  shorthand  for the full list of ASCII characters between them (e.g.
       `[0-9]' matches any decimal digit).  To include a literal  `]'  in  the
       sequence,  make  it the first character (following a possible `^').  To
       include a literal `-', make it the first or last character.
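
       The range rules can be sketched as a test of one character against
       the text between the brackets.  The function range_matches is a
       hypothetical helper written for this illustration, not part of the
       library; it expects the closing `]' to have been stripped already and
       compares characters by their ASCII codes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if c is matched by the range whose contents (the text
 * between `[' and the closing `]') are given in body.
 */
bool range_matches(const char *body, char c)
{
    unsigned char uc = (unsigned char)c;
    bool negate = false;
    bool found = false;
    size_t i, n;

    if (body[0] == '^') {           /* leading `^' inverts the set */
        negate = true;
        body++;
    }
    n = strlen(body);
    for (i = 0; i < n; i++) {
        /* A `-' with a character on each side denotes a span. */
        if (i + 2 < n && body[i + 1] == '-') {
            unsigned char lo = (unsigned char)body[i];
            unsigned char hi = (unsigned char)body[i + 2];

            if (uc >= lo && uc <= hi)
                found = true;
            i += 2;                 /* skip the `-' and the upper bound */
        } else if ((unsigned char)body[i] == uc) {
            /* Includes a `-' that is first or last, and a leading `]'. */
            found = true;
        }
    }
    return found != negate;
}
```

       A `-' in the first or last position never has a character on both
       sides, so it falls through to the literal comparison, which is why
       those positions are the safe places for it.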

AMBIGUITY
       If a regular expression could match two different parts  of  the  input
       string, it will match the one which begins earliest.  If both begin  in
       the same place but match different lengths, or match  the  same  length
       in different ways, life gets messier, as follows.

       In general, the possibilities in a list of branches are considered in
       left-to-right order, the possibilities for `*', `+', and `?' are
       considered longest-first, nested constructs are considered from the
       outermost in, and concatenated constructs are considered leftmost-first.
       The match that will be chosen is the one that uses the earliest
       possibility in the first choice that has to be made.  If there is more
       than one choice, the next will be made in the same manner (earliest
       possibility) subject to the decision on the first choice.  And so forth.

       For example, `(ab|a)b*c' could match `abc' in one  of  two  ways.   The
       first  choice  is between `ab' and `a'; since `ab' is earlier, and does
       lead to a successful overall match, it is chosen.   Since  the  `b'  is
       already spoken for, the `b*' must match its last possibility--the empty
       string--since it must respect the earlier choice.

       In the particular case where no `|'s are present and there is only  one
       `*',  `+',  or  `?',  the net effect is that the longest possible match
       will be  chosen.   So  `ab*',  presented  with  `xabbbby',  will  match
       `abbbb'.   Note  that  if  `ab*'  is tried against `xabyabbbz', it will
       match `ab' just after  `x',  due  to  the  begins-earliest  rule.   (In
       effect, the decision on where to start the match is the first choice to
       be made, hence subsequent choices must respect it even  if  this  leads
       them to less-preferred alternatives.)
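
       The order of choices for the `(ab|a)b*c' example can be sketched by
       hand.  The function demo_search below is a hypothetical illustration
       hard-wired to that one expression; it is not how the library is
       implemented.  It tries start positions earliest first, then branches
       left to right, then `b*' from its longest possibility downward.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hand-coded search for `(ab|a)b*c' in s.  On success returns a
 * pointer to the start of the match and stores the lengths matched by
 * the parenthesized branch, by `b*', and by the whole expression.
 * Returns NULL if there is no match.
 */
const char *demo_search(const char *s, size_t *branch_len,
                        size_t *star_len, size_t *match_len)
{
    static const char *const branches[] = { "ab", "a" };
    const char *start;
    size_t b;

    /* First choice: where to start; the earliest position wins. */
    for (start = s; *start != '\0'; start++) {
        /* Next choice: the branches, left to right. */
        for (b = 0; b < sizeof branches / sizeof branches[0]; b++) {
            size_t bl = strlen(branches[b]);
            size_t k = 0;

            if (strncmp(start, branches[b], bl) != 0)
                continue;
            /* Last choice: `b*', longest possibility first. */
            while (start[bl + k] == 'b')
                k++;
            for (;;) {
                if (start[bl + k] == 'c') {
                    *branch_len = bl;
                    *star_len = k;
                    *match_len = bl + k + 1;
                    return start;
                }
                if (k == 0)
                    break;
                k--;                /* back off to a shorter `b*' */
            }
        }
    }
    return NULL;
}
```

       Against `xabcy' this sketch starts at the `a', takes the `ab' branch,
       and leaves `b*' with the empty string, as described above.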

SEE ALSO
       egrep(1), expr(1)

DIAGNOSTICS
       Regcomp  returns  NULL for a failure (regerror permitting), where fail-
       ures are syntax errors, exceeding implementation  limits,  or  applying
       `+' or `*' to a possibly-null operand.

HISTORY
       Both code and manual page were written at U of T.  They are intended to
       be compatible with the Bell V8 regexp(3), but are not derived from Bell
       code.

BUGS
       Empty branches and empty regular expressions are not portable to V8.

       The  restriction against applying `*' or `+' to a possibly-null operand
       is an artifact of the simplistic implementation.

       Does not support egrep's newline-separated branches; neither  does  the
       V8 regexp(3), though.

       Due  to  emphasis  on  compactness  and simplicity, it's not strikingly
       fast.  It does give special attention to handling simple cases quickly.



                                     local                           REGEXP(3)