REGEX(3)                                                              REGEX(3)



NAME
       regcomp, regexec, regerror, regfree - regular-expression library

SYNOPSIS
       #include <sys/types.h>
       #include <regex.h>

       int regcomp(regex_t *preg, const char *pattern, int cflags);

       int regexec(const regex_t *preg, const char *string, size_t nmatch,
                 regmatch_t pmatch[], int eflags);

       size_t regerror(int errcode, const regex_t *preg, char *errbuf,
                 size_t errbuf_size);

       void regfree(regex_t *preg);

DESCRIPTION
       These routines implement POSIX 1003.2 regular expressions (``RE''s);
       see regex(7).  Regcomp compiles an RE written as a string into an
       internal form, regexec matches that internal form against a string and
       reports results, regerror transforms error codes from either into
       human-readable messages, and regfree frees any dynamically-allocated
       storage used by the internal form of an RE.

       The header <regex.h> declares two structure types, regex_t and
       regmatch_t, the former for compiled internal forms and the latter for
       match reporting.  It also declares the four functions, a type regoff_t,
       and a number of constants with names starting with ``REG_''.

       Regcomp compiles the regular expression contained in the pattern
       string, subject to the flags in cflags, and places the results in the
       regex_t structure pointed to by preg.  Cflags is the bitwise OR of zero
       or more of the following flags:

       REG_EXTENDED  Compile modern (``extended'') REs, rather than the  obso-
                     lete (``basic'') REs that are the default.

       REG_BASIC     This  is  a  synonym  for 0, provided as a counterpart to
                     REG_EXTENDED to improve readability.

       REG_NOSPEC    Compile with recognition of all special characters turned
                     off.  All characters are thus considered ordinary, so the
                     ``RE'' is a literal string.  This is an  extension,  com-
                     patible  with  but  not  specified  by  POSIX 1003.2, and
                     should be used with caution in software intended to be
                     portable to other systems.  REG_EXTENDED and REG_NOSPEC
                     may not be used in the same call to regcomp.

       REG_ICASE     Compile for matching that ignores upper/lower case
                     distinctions.  See regex(7).

       REG_NOSUB     Compile  for  matching  that  need only report success or
                     failure, not what was matched.

       REG_NEWLINE   Compile for newline-sensitive matching.  By default, new-
                     line  is  a completely ordinary character with no special
                     meaning in either REs or strings.  With this flag, `[^'
                     bracket expressions and `.' never match newline, a `^'
                     anchor matches the null string after any newline in the
                     string in addition to its normal function, and the `$'
                     anchor matches the null string before any newline in the
                     string in addition to its normal function.

       REG_PEND      The  regular  expression  ends, not at the first NUL, but
                     just before the character pointed to by the re_endp  mem-
                     ber  of  the  structure  pointed to by preg.  The re_endp
                     member is of type const char *.  This flag permits
                     inclusion of NULs in the RE; they are considered ordinary
                     characters.  This is an extension, compatible with but
                     not specified by POSIX 1003.2, and should be used with
                     caution in software intended to be portable to other
                     systems.
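
       The effect of the portable cflags can be observed with a small
       yes/no matcher.  The following is a minimal sketch, not part of the
       library: the helper name matches is hypothetical, and REG_NOSUB is
       added because only success or failure is wanted.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper (not part of <regex.h>): returns 1 if pattern
 * matches somewhere in text under the given cflags, 0 if it does not,
 * and -1 if the pattern fails to compile. */
static int matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: we only need success/failure, not match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       With this helper, matches("^b$", "a\nb\nc", REG_EXTENDED) is expected
       to fail, while adding REG_NEWLINE lets the anchors match around the
       embedded newlines.  Likewise, "a.c" matches "a\nc" by default but not
       under REG_NEWLINE.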

       When successful, regcomp returns 0 and fills in the structure pointed
       to by preg.  One member of that structure (other than re_endp) is
       publicized: re_nsub, of type size_t, contains the number of
       parenthesized subexpressions within the RE (except that the value of
       this member is undefined if the REG_NOSUB flag was used).  If regcomp
       fails, it returns a non-zero error code; see DIAGNOSTICS.
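
       As an illustration of the return value and the re_nsub member, the
       sketch below compiles an extended RE and reports its subexpression
       count.  The helper name count_subexpressions is an assumption made
       for this example only.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: compile an extended RE and store the number of
 * parenthesized subexpressions in *nsub.  Returns 0 on success, or the
 * non-zero regcomp error code on failure (*nsub is then untouched). */
static int count_subexpressions(const char *pattern, size_t *nsub)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0)
        return rc;          /* compilation failed; no compiled RE exists */
    *nsub = re.re_nsub;     /* well defined: REG_NOSUB was not used */
    regfree(&re);
    return 0;
}
```

       For the pattern "(a)(b(c))" this stores 3, since subexpressions are
       counted by their opening parentheses.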

       Regexec matches the compiled RE pointed to by preg against the string,
       subject to the flags in eflags, and reports results using nmatch,
       pmatch, and the returned value.  The RE must have been  compiled  by  a
       previous  invocation of regcomp.  The compiled form is not altered dur-
       ing execution of regexec, so a single compiled RE can be used  simulta-
       neously by multiple threads.

       By default, the NUL-terminated string pointed to by string is considered
       to be the text of an entire line, minus any terminating newline.
       The  eflags argument is the bitwise OR of zero or more of the following
       flags:

       REG_NOTBOL    The first character of the string is not the beginning of
                     a  line,  so  the  `^' anchor should not match before it.
                     This does not  affect  the  behavior  of  newlines  under
                     REG_NEWLINE.

       REG_NOTEOL    The NUL terminating the string does not end a line, so
                     the `$' anchor should not match before it.  This does not
                     affect the behavior of newlines under REG_NEWLINE.

       REG_STARTEND  The string is considered to start at string +
                     pmatch[0].rm_so and to have a terminating NUL located at
                     string + pmatch[0].rm_eo (there need not actually be a
                     NUL at that location), regardless of the value of nmatch.
                     See below for the definition of pmatch and nmatch.  This
                     is an extension, compatible with but not specified by
                     POSIX 1003.2, and should be used with caution in software
                     intended to be portable to other systems.  Note that a
                     non-zero rm_so does not imply REG_NOTBOL; REG_STARTEND
                     affects only the location of the string, not how it is
                     matched.
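
       A common use of REG_NOTBOL is scanning one line for successive
       matches: after the first match, the search resumes in the middle of
       the line, so `^' must no longer match there.  The sketch below is one
       possible way to do this; count_matches is a hypothetical name, and
       stepping one character past an empty match is a design choice of the
       example, not something the library prescribes.

```c
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: count successive non-overlapping matches of an
 * extended RE in text.  Returns -1 if the pattern fails to compile. */
static int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    const char *p = text;
    int eflags = 0;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (regexec(&re, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == m.rm_so) {
            /* Empty match: advance one character to guarantee progress. */
            if (p[m.rm_eo] == '\0')
                break;
            p += m.rm_eo + 1;
        } else {
            p += m.rm_eo;       /* resume just after this match */
        }
        eflags = REG_NOTBOL;    /* p is no longer the start of the line */
    }
    regfree(&re);
    return count;
}
```

       With REG_NOTBOL set on the later calls, the pattern "^a" is counted
       once in "aaa"; without it, each resumed search would wrongly treat
       its starting point as the beginning of a line.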

       See regex(7) for a discussion of what is matched in situations where an
       RE or a portion thereof could match any of several substrings of
       string.

       Normally,   regexec  returns  0  for  success  and  the  non-zero  code
       REG_NOMATCH for failure.  Other non-zero error codes may be returned in
       exceptional situations; see DIAGNOSTICS.

       If REG_NOSUB was specified in the compilation of the RE, or if nmatch
       is 0, regexec ignores the pmatch argument (but see below for  the  case
       where REG_STARTEND is specified).  Otherwise, pmatch points to an array
       of nmatch structures of type regmatch_t.  Such a structure has at least
       the members rm_so and rm_eo, both of type regoff_t (a signed arithmetic
       type at least as large as an off_t and a ssize_t),  containing  respec-
       tively  the offset of the first character of a substring and the offset
       of the first character after the end of  the  substring.   Offsets  are
       measured from the beginning of the string argument given to regexec.
       An empty substring is denoted by equal  offsets,  both  indicating  the
       character following the empty substring.

       The 0th member of the pmatch array is filled in to indicate what
       substring of string was matched by the entire RE.  Remaining members
       report what substring was matched by parenthesized subexpressions
       within the RE; member i reports subexpression i, with subexpressions
       counted (starting at 1) by the order of their opening parentheses in
       the RE, left to right.  Unused entries in the array--corresponding
       either to subexpressions that did not participate in the match at all,
       or to subexpressions that do not exist in the RE (that is, i >
       preg->re_nsub)--have both rm_so and rm_eo set to -1.  If a
       subexpression participated in the match several times, the reported
       substring is the last one it matched.  (Note, as an example in
       particular, that when the RE `(b*)+' matches `bbb', the parenthesized
       subexpression matches each of the three `b's and then an infinite
       number of empty strings following the last `b', so the reported
       substring is one of the empties.)
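
       The offsets in pmatch can be turned back into text by copying the
       bytes between rm_so and rm_eo.  The following sketch shows one way to
       do so; copy_submatch and MAX_GROUPS are names invented for this
       example, and the fixed array size is an arbitrary illustrative limit.

```c
#include <string.h>
#include <sys/types.h>
#include <regex.h>

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/* Hypothetical helper: match an extended RE against text and copy the
 * substring for subexpression idx (0 = whole match) into out.
 * Returns 0 on success, 1 if there is no match or the subexpression is
 * unused, and -1 on a bad argument or compile failure. */
static int copy_submatch(const char *pattern, const char *text,
                         size_t idx, char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int result = 1;

    if (idx >= MAX_GROUPS || outsize == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0 && pm[idx].rm_so != -1) {
        size_t len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);

        if (len >= outsize)
            len = outsize - 1;              /* truncate to fit */
        memcpy(out, text + pm[idx].rm_so, len);
        out[len] = '\0';
        result = 0;
    }
    regfree(&re);
    return result;
}
```

       Matching "([a-z]+)=([0-9]+)" against "key=42" yields "key=42" for
       index 0, "key" for index 1 and "42" for index 2.  Index 3 reports an
       unused entry, because the RE has only two subexpressions.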

       If REG_STARTEND is specified, pmatch must point to at least one
       regmatch_t (even if nmatch is 0 or REG_NOSUB was specified), to hold
       the input offsets for REG_STARTEND.  Use for output is still entirely
       controlled by nmatch; if nmatch is 0 or REG_NOSUB was specified, the
       value of pmatch[0] will not be changed by a successful regexec.
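
       Because REG_STARTEND is an extension, a portable program may want to
       guard its use.  The sketch below assumes the flag is visible as a
       preprocessor macro, which is how the implementations known to the
       editor expose it.  The helper name search_range is hypothetical.

```c
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: search text[from..to) for an extended RE.
 * On a match returns 0 and stores the match offsets, measured from the
 * start of text, in *so and *eo.  Returns REG_NOMATCH if nothing in the
 * range matches, and -1 if the pattern does not compile or the
 * REG_STARTEND extension is unavailable. */
static int search_range(const char *pattern, const char *text,
                        regoff_t from, regoff_t to,
                        regoff_t *so, regoff_t *eo)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t m;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    m.rm_so = from;     /* input: where the searched region begins */
    m.rm_eo = to;       /* input: where its notional NUL is located */
    rc = regexec(&re, text, 1, &m, REG_STARTEND);
    if (rc == 0) {
        *so = m.rm_so;  /* output offsets are relative to text */
        *eo = m.rm_eo;
    }
    regfree(&re);
    return rc;
#else
    (void)pattern; (void)text; (void)from; (void)to; (void)so; (void)eo;
    return -1;          /* extension not provided by this <regex.h> */
#endif
}
```

       Where the extension is available, searching "foo bar foo" for "foo"
       in the range 4 to 11 should report offsets 8 and 11.  Narrowing the
       range to end at 10 cuts the second "foo" short and yields
       REG_NOMATCH.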

       Regerror  maps  a  non-zero errcode from either regcomp or regexec to a
       human-readable, printable message.  If preg is non-NULL, the error code
       should have arisen from use of the regex_t pointed to by preg, and if
       the error code came from regcomp, it should have been the result from
       the  most  recent regcomp using that regex_t.  (Regerror may be able to
       supply a more detailed message using  information  from  the  regex_t.)
       Regerror  places  the NUL-terminated message into the buffer pointed to
       by  errbuf,  limiting  the  length  (including  the  NUL)  to  at  most
       errbuf_size  bytes.   If  the whole message won't fit, as much of it as
       will fit before the terminating NUL is  supplied.   In  any  case,  the
       returned  value  is the size of buffer needed to hold the whole message
       (including terminating NUL).  If errbuf_size is 0,  errbuf  is  ignored
       but the return value is still correct.
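
       Since regerror reports the buffer size needed even when errbuf_size is
       0, a message of any length can be fetched in two calls: one to learn
       the size, one to fill the buffer.  The following is a minimal sketch;
       the helper name regex_error_message is an assumption of this example.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

/* Hypothetical helper: return a freshly allocated copy of the message
 * for errcode, or NULL if memory cannot be obtained.  The caller frees
 * the result with free(). */
static char *regex_error_message(int errcode, const regex_t *preg)
{
    /* First call: errbuf is ignored, the needed size (with NUL) returns. */
    size_t needed = regerror(errcode, preg, NULL, 0);
    char *msg = malloc(needed);

    if (msg != NULL)
        regerror(errcode, preg, msg, needed);   /* second call: fill it */
    return msg;
}
```

       A typical caller passes the non-zero value returned by regcomp or
       regexec together with the regex_t that produced it, prints the
       message, and then frees it.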

       If the errcode given to regerror is first ORed with REG_ITOA, the
       ``message'' that results is the printable name of the error code, e.g.
       ``REG_NOMATCH'', rather than an explanation thereof.  If errcode is
       REG_ATOI, then preg shall be non-NULL and the re_endp member of the
       structure it points to must point to the printable name of an error
       code; in this case, the result in errbuf is the decimal digits of the
       numeric value of the error code (0 if the name is not recognized).
       REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
       they are extensions, compatible with but not specified by POSIX 1003.2,
       and should be used with caution in software intended to be portable to
       other systems.  Be warned also that they are considered experimental
       and changes are possible.

       Regfree frees any dynamically-allocated  storage  associated  with  the
       compiled  RE  pointed to by preg.  The remaining regex_t is no longer a
       valid compiled RE and the effect of supplying it to regexec or regerror
       is undefined.

       None  of  these functions references global variables except for tables
       of constants; all are safe for use from multiple threads if the
       arguments are safe.

IMPLEMENTATION CHOICES
       There  are a number of decisions that 1003.2 leaves up to the implemen-
       tor, either by explicitly saying ``undefined'' or  by  virtue  of  them
       being  forbidden by the RE grammar.  This implementation treats them as
       follows.

       See regex(7) for a discussion of the definition of case-independent
       matching.

       There  is  no  particular limit on the length of REs, except insofar as
       memory is limited.  Memory usage is approximately linear in RE size,
       and  largely  insensitive  to RE complexity, except for bounded repeti-
       tions.  See BUGS for one short RE using them that will run  almost  any
       system out of memory.

       A backslashed character other than one specifically given a magic
       meaning by 1003.2 (such magic meanings occur only in obsolete
       [``basic''] REs) is taken as an ordinary character.

       Any unmatched [ is a REG_EBRACK error.

       Equivalence classes cannot begin or end bracket-expression ranges.  The
       endpoint of one range cannot begin another.

       RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is
       255.

       A  repetition operator (?, *, +, or bounds) cannot follow another repe-
       tition operator.  A repetition operator cannot begin an  expression  or
       subexpression or follow `^' or `|'.

       `|' cannot appear first or last in a (sub)expression or after another
       `|', i.e. an operand of `|' cannot be an empty subexpression.  An empty
       parenthesized subexpression, `()', is legal and matches an empty
       (sub)string.  An empty string is not a legal RE.

       A `{' followed by a digit is considered the beginning of bounds  for  a
       bounded  repetition,  which  must then follow the syntax for bounds.  A
       `{' not followed by a digit is considered an ordinary character.

       `^' and `$' beginning and ending subexpressions in obsolete (``basic'')
       REs are anchors, not ordinary characters.

SEE ALSO
       grep(1), regex(7)

       POSIX  1003.2,  sections  2.8  (Regular Expression Notation) and B.5 (C
       Binding for Regular Expression Matching).

DIAGNOSTICS
       Non-zero error codes from regcomp and regexec include the following:

       REG_NOMATCH    regexec() failed to match
       REG_BADPAT     invalid regular expression
       REG_ECOLLATE   invalid collating element
       REG_ECTYPE     invalid character class
       REG_EESCAPE    \ applied to unescapable character
       REG_ESUBREG    invalid backreference number
       REG_EBRACK     brackets [ ] not balanced
       REG_EPAREN     parentheses ( ) not balanced
       REG_EBRACE     braces { } not balanced
       REG_BADBR      invalid repetition count(s) in { }
       REG_ERANGE     invalid character range in [ ]
       REG_ESPACE     ran out of memory
       REG_BADRPT     ?, *, or + operand invalid
       REG_EMPTY      empty (sub)expression
       REG_ASSERT     ``can't happen''--you found a bug
       REG_INVARG     invalid argument, e.g. negative-length string

HISTORY
       Written by Henry Spencer at University of Toronto,
       henry@zoo.toronto.edu.

BUGS
       This is an alpha release with known defects.  Please report problems.

       There is one known functionality bug.  The implementation of
       internationalization is incomplete: the locale is always assumed to be
       the default one of 1003.2, and only the collating elements etc. of that
       locale are available.

       The back-reference code is subtle and doubts linger about its
       correctness in complex cases.

       Regexec  performance  is  poor.  This will improve with later releases.
       Nmatch exceeding 0 is expensive; nmatch exceeding 1 is worse.   Regexec
       is largely insensitive to RE complexity except that back references are
       massively expensive.  RE length does matter; in particular, there is a
       strong  speed  bonus  for  keeping RE length under about 30 characters,
       with most special characters counting roughly double.

       Regcomp implements bounded repetitions by  macro  expansion,  which  is
       costly in time and space if counts are large or bounded repetitions are
       nested.  An RE like, say,
       `((((a{1,100}){1,100}){1,100}){1,100}){1,100}' will (eventually) run
       almost any existing machine out of swap space.

       There are suspected problems with response to obscure error conditions.
       Notably,  certain  kinds  of  internal overflow, produced only by truly
       enormous REs or by multiply nested bounded  repetitions,  are  probably
       not handled well.

       Due to a mistake in 1003.2, things like `a)b' are legal REs because `)'
       is a special character only in the presence of a previous unmatched
       `('.  This can't be fixed until the spec is fixed.

       The  standard's  definition  of back references is vague.  For example,
       does `a\(\(b\)*\2\)*d' match `abbbd'?  Until the standard is clarified,
       behavior in such cases should not be relied on.

       The  implementation of word-boundary matching is a bit of a kludge, and
       bugs may lurk in combinations of word-boundary matching and anchoring.



                                  17 May 1993                         REGEX(3)
