distrib/share/tcl/help/tcl/strings/re_syntax

   1 NAME
   2        re_syntax - Syntax of Tcl regular expressions.
   3
   4
   5 DESCRIPTION
   6        A  regular  expression describes strings of characters.  It's a pattern
   7        that matches certain strings and doesn't match others.
   8
   9
  10 DIFFERENT FLAVORS OF REs
  11        Regular expressions (``RE''s), as defined by POSIX, come  in  two  fla-
  12        vors:  extended  REs  (``EREs'')  and  basic  REs (``BREs'').  EREs are
  13        roughly those of the traditional egrep, while BREs are roughly those of
  14        the  traditional ed.  This implementation adds a third flavor, advanced
  15        REs (``AREs''), basically EREs with some significant extensions.
  16
  17        This manual page primarily describes AREs.  BREs mostly exist for back-
  18        ward  compatibility in some old programs; they will be discussed at the
  19        end.  POSIX EREs are almost an exact subset of AREs.  Features of  AREs
  20        that are not present in EREs will be indicated.
  21
  22
  23 REGULAR EXPRESSION SYNTAX
  24        Tcl  regular  expressions  are implemented using the package written by
  25        Henry Spencer, based on the 1003.2 spec and some (not quite all) of the
  26        Perl5  extensions (thanks, Henry!).  Much of the description of regular
  27        expressions below is copied verbatim from his manual entry.
  28
  29        An ARE is one or more branches, separated  by  `|',  matching  anything
  30        that matches any of the branches.
  31
  32        A branch is zero or more constraints or quantified atoms, concatenated.
  33        It matches a match for the first, followed by a match for  the  second,
  34        etc.; an empty branch matches the empty string.
  35
  36        A  quantified atom is an atom possibly followed by a single quantifier.
  37        Without a quantifier, it matches a match for  the  atom.   The  quanti-
  38        fiers, and what a so-quantified atom matches, are:
  39
  40          *     a sequence of 0 or more matches of the atom
  41
  42          +     a sequence of 1 or more matches of the atom
  43
  44          ?     a sequence of 0 or 1 matches of the atom
  45
  46          {m}   a sequence of exactly m matches of the atom
  47
  48          {m,}  a sequence of m or more matches of the atom
  49
  50          {m,n} a  sequence  of  m through n (inclusive) matches of the atom; m
  51                may not exceed n
  52
  53          *?  +?  ??  {m}?  {m,}?  {m,n}?
  54                non-greedy quantifiers, which match the same possibilities, but
  55                prefer  the  smallest  number rather than the largest number of
  56                matches (see MATCHING)
  57
  58        The forms using { and } are known as bounds.  The numbers m and  n  are
  59        unsigned  decimal integers with permissible values from 0 to 255 inclu-
  60        sive.
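
            As a rough illustration, the sketch below runs a few quantifiers
            through the regexp command.  The variable names are arbitrary and
            exist only for this example.

```tcl
# Illustrative only: the variable names are made up for this sketch.
set s "aaa"

# Greedy: + takes as many a's as it can.
regexp -- {a+} $s greedy      ;# greedy is "aaa"

# Non-greedy: +? takes as few a's as it can.
regexp -- {a+?} $s lazy       ;# lazy is "a"

# Bound: exactly two matches of the atom.
regexp -- {a{2}} $s pair      ;# pair is "aa"
```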
  61
  62        An atom is one of:
  63
  64          (re)  (where re is any regular expression) matches a  match  for  re,
  65                with the match noted for possible reporting
  66
  67          (?:re)
  68                as  previous, but does no reporting (a ``non-capturing'' set of
  69                parentheses)
  70
  71          ()    matches an empty string, noted for possible reporting
  72
  73          (?:)  matches an empty string, without reporting
  74
  75          [chars]
  76                a bracket expression,  matching  any  one  of  the  chars  (see
  77                BRACKET EXPRESSIONS for more detail)
  78
  79          .     matches any single character
  80
  81          \k    (where  k is a non-alphanumeric character) matches that charac-
  82                ter taken as an ordinary character, e.g. \\ matches a backslash
  83                character
  84
  85          \c    where  c  is  alphanumeric  (possibly followed by other charac-
  86                ters), an escape (AREs only), see ESCAPES below
  87
  88          {     when followed by a character other than a  digit,  matches  the
  89                left-brace  character  `{'; when followed by a digit, it is the
  90                beginning of a bound (see above)
  91
  92          x     where x is a  single  character  with  no  other  significance,
  93                matches that character.
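
            A small sketch of how capturing and non-capturing parentheses and
            the \k form might be exercised with the regexp command.  The
            variable names are hypothetical.

```tcl
# (re) is reported as a submatch; (?:re) groups without reporting.
regexp -- {(a)(?:b)(c)} "abc" whole first second
# whole is "abc", first is "a", second is "c"

# \k with a non-alphanumeric k matches k as an ordinary character.
set anyChar [regexp -- {a\.c} "abc"]   ;# 0: \. is a literal dot
set literal [regexp -- {a\.c} "a.c"]   ;# 1
```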
  94
  95        A  constraint matches an empty string when specific conditions are met.
  96        A constraint may not be followed by  a  quantifier.   The  simple  con-
  97        straints  are  as  follows;  some more constraints are described later,
  98        under ESCAPES.
  99
 100          ^       matches at the beginning of a line
 101
 102          $       matches at the end of a line
 103
 104          (?=re)  positive lookahead (AREs only), matches at any point where  a
 105                  substring matching re begins
 106
 107          (?!re)  negative lookahead (AREs only), matches at any point where no
 108                  substring matching re begins
 109
 110        The lookahead constraints may not contain back references (see  later),
 111        and all parentheses within them are considered non-capturing.
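
            The lookahead constraints can be sketched as follows.  The sample
            strings and variable names are assumptions made for illustration.

```tcl
# Positive lookahead: "foo" matches only where "bar" follows.
# The lookahead itself consumes nothing, so only "foo" is reported.
regexp -- {foo(?=bar)} "foobar" ahead            ;# ahead is "foo"

# Negative lookahead: "foo" matches only where "bar" does not follow.
set blocked [regexp -- {foo(?!bar)} "foobar"]    ;# 0
set allowed [regexp -- {foo(?!bar)} "foobaz"]    ;# 1
```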
 112
 113        An RE may not end with `\'.
 114
 115
 116 BRACKET EXPRESSIONS
 117        A bracket expression is a list of characters enclosed in `[]'.  It nor-
 118        mally matches any single character from the list (but see  below).   If
 119        the  list  begins  with  `^',  it matches any single character (but see
 120        below) not from the rest of the list.
 121
 122        If two characters in the list are separated by `-', this  is  shorthand
 123        for  the  full range of characters between those two (inclusive) in the
 124        collating sequence, e.g.  [0-9] in ASCII  matches  any  decimal  digit.
 125        Two  ranges  may  not  share  an  endpoint,  so e.g.  a-c-e is illegal.
 126        Ranges are very  collating-sequence-dependent,  and  portable  programs
 127        should avoid relying on them.
 128
 129        To  include  a  literal  ]  or - in the list, the simplest method is to
 130        enclose it in [. and .]  to make it a collating  element  (see  below).
 131        Alternatively,  make it the first character (following a possible `^'),
 132        or (AREs only) precede it with `\'.  Alternatively, for  `-',  make  it
 133        the  last  character, or the second endpoint of a range.  To use a lit-
 134        eral - as the first endpoint of a range, make it a collating element or
 135        (AREs  only)  precede  it  with `\'.  With the exception of these, some
 136        combinations using [ (see next paragraphs), and escapes, all other spe-
 137        cial  characters  lose  their  special  significance  within  a bracket
 138        expression.
 139
 140        Within a bracket expression, a collating element (a character, a multi-
 141        character sequence that collates as if it were a single character, or a
 142        collating-sequence name for either) enclosed in [. and .]   stands  for
 143        the  sequence of characters of that collating element.  The sequence is
 144        a single element of the bracket expression's list.  A  bracket  expres-
 145        sion  in  a locale that has multi-character collating elements can thus
 146        match more than one character.  So (insidiously), a bracket  expression
 147        that starts with ^ can match multi-character collating elements even if
 148        none of them appear in the bracket expression!   (Note:  Tcl  currently
 149        has  no  multi-character  collating elements.  This information is only
 150        for illustration.)
 151
 152        For example, assume the collating sequence includes a ch  multi-charac-
 153        ter  collating element.  Then the RE [[.ch.]]*c (zero or more ch's fol-
 154        lowed by c) matches the first five characters of `chchcc'.   Also,  the
 155        RE [^c]b matches all of `chb' (because [^c] matches the multi-character
 156        ch).
 157
 158        Within a bracket expression, a collating element enclosed in [= and  =]
 159        is  an  equivalence  class, standing for the sequences of characters of
 160        all collating elements equivalent to that one, including  itself.   (If
 161        there  are  no other equivalent collating elements, the treatment is as
 162        if the enclosing delimiters were `[.' and `.]'.)  For example, if o and
 163        ô  are  the members of an equivalence class, then `[[=o=]]', `[[=ô=]]',
 164        and `[oô]' are all synonymous.  An equivalence class may not be an
 165        endpoint of a range.  (Note: Tcl currently implements only the Unicode
 166        locale.  It doesn't define any equivalence classes.  The examples above
 167        are just illustrations.)
 168
 169        Within  a bracket expression, the name of a character class enclosed in
 170        [: and :] stands for the list of all characters (not all collating ele-
 171        ments!)  belonging to that class.  Standard character classes are:
 172
 173               alpha       A letter.
 174               upper       An upper-case letter.
 175               lower       A lower-case letter.
 176               digit       A decimal digit.
 177               xdigit      A hexadecimal digit.
 178               alnum       An alphanumeric (letter or digit).
 179               print       An alphanumeric (same as alnum).
 180               blank       A space or tab character.
 181               space       A character producing white space in displayed text.
 182               punct       A punctuation character.
 183               graph       A character with a visible representation.
 184               cntrl       A control character.
 185
 186        A locale may provide others.  (Note that the current Tcl implementation
 187        has only one locale: the Unicode locale.)  A character class may not be
 188        used as an endpoint of a range.
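
            A minimal sketch of character classes inside bracket expressions,
            using the regexp command with arbitrary sample strings:

```tcl
# Each class name appears inside [: :], which in turn sits inside the
# brackets of the bracket expression.
regexp -- {[[:alpha:]]+} "abc123" letters        ;# letters is "abc"
regexp -- {[[:digit:]]+} "abc123" digits         ;# digits is "123"

# A leading ^ complements the list, here "anything that is not white space".
regexp -- {[^[:space:]]+} "  hello world" word   ;# word is "hello"
```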
 189
 190        There are two special cases of bracket expressions: the bracket expres-
 191        sions [[:<:]] and [[:>:]] are constraints, matching  empty  strings  at
 192        the  beginning  and end of a word respectively.  A word is defined as a
 193        sequence of word characters that is neither preceded  nor  followed  by
 194        word  characters.   A word character is an alnum character or an under-
 195        score (_).  These special bracket expressions are deprecated; users  of
 196        AREs should use constraint escapes instead (see below).
 197
 198 ESCAPES
 199        Escapes  (AREs  only), which begin with a \ followed by an alphanumeric
 200        character, come in several varieties:  character  entry,  class  short-
 201        hands,  constraint  escapes,  and  back references.  A \ followed by an
 202        alphanumeric character but not constituting a valid escape  is  illegal
 203        in  AREs.  In EREs, there are no escapes: outside a bracket expression,
 204        a \ followed by an alphanumeric character merely stands for that  char-
 205        acter  as  an ordinary character, and inside a bracket expression, \ is
 206        an ordinary character.  (The latter is the one  actual  incompatibility
 207        between EREs and AREs.)
 208
 209        Character-entry  escapes (AREs only) exist to make it easier to specify
 210        non-printing and otherwise inconvenient characters in REs:
 211
 212          \a   alert (bell) character, as in C
 213
 214          \b   backspace, as in C
 215
 216          \B   synonym for \ to help reduce backslash doubling in some applica-
 217               tions where there are multiple levels of backslash processing
 218
 219          \cX  (where  X is any character) the character whose low-order 5 bits
 220               are the same as those of X, and whose other bits are all zero
 221
 222          \e   the character whose collating-sequence name is `ESC', or failing
 223               that, the character with octal value 033
 224
 225          \f   formfeed, as in C
 226
 227          \n   newline, as in C
 228
 229          \r   carriage return, as in C
 230
 231          \t   horizontal tab, as in C
 232
 233          \uwxyz
 234               (where  wxyz  is  exactly  four  hexadecimal digits) the Unicode
 235               character U+wxyz in the local byte ordering
 236
 237          \Ustuvwxyz
 238               (where stuvwxyz is exactly eight  hexadecimal  digits)  reserved
 239               for a somewhat-hypothetical Unicode extension to 32 bits
 240
 241          \v   vertical tab, as in C
 242
 243          \xhhh
 244               (where  hhh is any sequence of hexadecimal digits) the character
 245               whose hexadecimal value is 0xhhh (a single character  no  matter
 246               how many hexadecimal digits are used).
 247
 248          \0   the character whose value is 0
 249
 250          \xy  (where  xy is exactly two octal digits, and is not a back refer-
 251               ence (see below)) the character whose octal value is 0xy
 252
 253          \xyz (where xyz is exactly three octal digits, and is not a back ref-
 254               erence (see below)) the character whose octal value is 0xyz
 255
 256        Hexadecimal digits are `0'-`9', `a'-`f', and `A'-`F'.  Octal digits are
 257        `0'-`7'.
 258
 259        The character-entry escapes are always taken  as  ordinary  characters.
 260        For  example, \135 is ] in ASCII, but \135 does not terminate a bracket
 261        expression.  Beware, however, that some applications (e.g.,  C  compil-
 262        ers)  interpret such sequences themselves before the regular-expression
 263        package gets to see them,  which  may  require  doubling  (quadrupling,
 264        etc.) the `\'.
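
            In Tcl, the interpreter is itself one of the applications that may
            process backslashes first.  The sketch below (variable names are
            illustrative) shows the braced form, which hands the escape to the
            regular-expression package untouched, next to a double-quoted form
            that needs the `\' doubled.

```tcl
# Braces stop Tcl from interpreting the backslash, so the RE package
# sees the two characters \t and treats them as a tab escape.
set tab [regexp -- {a\tb} "a\tb"]         ;# 1

# In double quotes Tcl processes backslashes first, so the backslash
# must be doubled for the RE package to receive \t.
set dbl [regexp -- "a\\tb" "a\tb"]        ;# 1

# Hexadecimal and Unicode character entry.
set hex [regexp -- {^\x41$} "A"]          ;# 1: 0x41 is A
set uni [regexp -- {^\u0041$} "A"]        ;# 1: U+0041 is A
```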
 265
 266        Class-shorthand escapes (AREs only) provide shorthands for certain com-
 267        monly-used character classes:
 268
 269          \d        [[:digit:]]
 270
 271          \s        [[:space:]]
 272
 273          \w        [[:alnum:]_] (note underscore)
 274
 275          \D        [^[:digit:]]
 276
 277          \S        [^[:space:]]
 278
 279          \W        [^[:alnum:]_] (note underscore)
 280
 281        Within bracket expressions, `\d',  `\s',  and  `\w'  lose  their  outer
 282        brackets,  and `\D', `\S', and `\W' are illegal.  (So, for example, [a-
 283        c\d] is equivalent to [a-c[:digit:]].  Also, [a-c\D], which is  equiva-
 284        lent to [a-c^[:digit:]], is illegal.)
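
            The class shorthands can be tried out as in the following sketch.
            The sample strings and variable names are made up.

```tcl
regexp -- {\d+} "order 66 shipped" num     ;# num is "66"
regexp -- {\w+} "  foo_bar baz" ident      ;# ident is "foo_bar" (note underscore)

# Inside a bracket expression \d loses its outer brackets,
# so [a-c\d] behaves like [a-c[:digit:]].
regexp -- {[a-c\d]+} "xxb2a9zz" mix        ;# mix is "b2a9"
```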
 285
 286        A  constraint  escape  (AREs  only) is a constraint, matching the empty
 287        string if specific conditions are met, written as an escape:
 288
 289          \A    matches only at the beginning  of  the  string  (see  MATCHING,
 290                below, for how this differs from `^')
 291
 292          \m    matches only at the beginning of a word
 293
 294          \M    matches only at the end of a word
 295
 296          \y    matches only at the beginning or end of a word
 297
 298          \Y    matches  only  at a point that is not the beginning or end of a
 299                word
 300
 301          \Z    matches only at the end of the string (see MATCHING, below, for
 302                how this differs from `$')
 303
 304          \m    (where m is a nonzero digit) a back reference, see below
 305
 306          \mnn  (where  m  is  a nonzero digit, and nn is some more digits, and
 307                the decimal value mnn is not greater than the number of closing
 308                capturing parentheses seen so far) a back reference, see below
 309
 310        A word is defined as in the specification of [[:<:]] and [[:>:]] above.
 311        Constraint escapes are illegal within bracket expressions.
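
            A sketch of the word-related constraint escapes, assuming the
            -all and -indices switches of the regexp command.  The sample
            strings are arbitrary.

```tcl
# \m and \M pin the match to the start and end of a word, so the "cat"
# at the end of "concat" is not counted.
set count [regexp -all -- {\mcat\M} "cat concat cat"]   ;# 2

# \y matches at either edge of a word; only the standalone word qualifies.
regexp -indices -- {\ycat\y} "concat cat" where          ;# where is "7 9"

# \Y matches only where there is no word edge, i.e. inside a word.
set inside [regexp -- {\Ycat} "concat"]                  ;# 1
```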
 312
 313        A back reference (AREs only) matches the same  string  matched  by  the
 314        parenthesized  subexpression  specified  by  the number, so that (e.g.)
 315        ([bc])\1 matches bb  or  cc  but  not  `bc'.   The  subexpression  must
 316        entirely precede the back reference in the RE.  Subexpressions are num-
 317        bered in the order of their leading parentheses.  Non-capturing  paren-
 318        theses do not define subexpressions.
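
            The back-reference example from the text can be checked with a
            short sketch like this one (variable names are illustrative):

```tcl
# \1 must repeat exactly what the first parenthesized subexpression matched.
regexp -- {([bc])\1} "abccd" doubled       ;# doubled is "cc"

# "bc" alone does not qualify: b followed by c is not a repeat.
set none [regexp -- {([bc])\1} "abcd"]     ;# 0

# Non-capturing parentheses are not numbered, so \1 refers to ([bc]).
set skip [regexp -- {(?:a)([bc])\1} "abb"] ;# 1
```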
 319
 320        There is an inherent historical ambiguity between octal character-entry
 321        escapes and back references, which is resolved by heuristics, as hinted
 322        at  above.   A leading zero always indicates an octal escape.  A single
 323        non-zero digit, not followed by another digit, is  always  taken  as  a
 324        back  reference.   A  multi-digit  sequence not starting with a zero is
 325        taken as a back reference if it comes after  a  suitable  subexpression
 326        (i.e.  the number is in the legal range for a back reference), and oth-
 327        erwise is taken as octal.
 328
 329 METASYNTAX
 330        In addition to the main syntax described above, there are some  special
 331        forms and miscellaneous syntactic facilities available.
 332
 333        Normally the flavor of RE being used is specified by application-depen-
 334        dent means.  However, this can be overridden by a director.  If  an  RE
 335        of  any flavor begins with `***:', the rest of the RE is an ARE.  If an
 336        RE of any flavor begins with `***=', the rest of the RE is taken to  be
 337        a literal string, with all characters considered ordinary characters.
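
            A brief sketch of the two directors, using arbitrary sample
            strings:

```tcl
# ***= makes the rest of the RE a literal string, so . is just a dot.
set lit1 [regexp -- {***=a.c} "a.c"]    ;# 1
set lit2 [regexp -- {***=a.c} "abc"]    ;# 0

# ***: makes the rest of the RE an ARE, so . matches any character.
set are [regexp -- {***:a.c} "abc"]     ;# 1
```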
 338
 339        An ARE may begin with embedded options: a sequence (?xyz) (where xyz is
 340        one or more alphabetic characters) specifies options affecting the rest
 341        of  the  RE.  These supplement, and can override, any options specified
 342        by the application.  The available option letters are:
 343
 344          b  rest of RE is a BRE
 345
 346          c  case-sensitive matching (usual default)
 347
 348          e  rest of RE is an ERE
 349
 350          i  case-insensitive matching (see MATCHING, below)
 351
 352          m  historical synonym for n
 353
 354          n  newline-sensitive matching (see MATCHING, below)
 355
 356          p  partial newline-sensitive matching (see MATCHING, below)
 357
 358          q  rest of RE is a literal (``quoted'') string, all ordinary  charac-
 359             ters
 360
 361          s  non-newline-sensitive matching (usual default)
 362
 363          t  tight syntax (usual default; see below)
 364
 365          w  inverse partial newline-sensitive (``weird'') matching (see MATCH-
 366             ING, below)
 367
 368          x  expanded syntax (see below)
 369
 370        Embedded options take effect at the ) terminating the  sequence.   They
 371        are  available  only  at the start of an ARE, and may not be used later
 372        within it.
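
            Embedded options might be used as in the sketch below.  The
            strings and variable names are assumptions for illustration.

```tcl
# (?i) turns on case-insensitive matching for the rest of the RE.
set ci [regexp -- {(?i)hello} "HeLLo"]   ;# 1

# (?q) makes the rest of the RE a literal string, so + is not a quantifier.
set q1 [regexp -- {(?q)a+b} "a+b"]       ;# 1
set q2 [regexp -- {(?q)a+b} "aab"]       ;# 0
```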
 373
 374        In addition to the usual (tight) RE syntax, in which all characters are
 375        significant,  there  is an expanded syntax, available in all flavors of
 376        RE with the -expanded switch, or in AREs with the  embedded  x  option.
 377        In  the  expanded  syntax,  white-space  characters are ignored and all
 378        characters between a # and the following newline (or the end of the RE)
 379        are  ignored,  permitting  paragraphing  and  commenting  a complex RE.
 380        There are three exceptions to that basic rule:
 381
 382          a white-space character or `#' preceded by `\' is retained
 383
 384          white space or `#' within a bracket expression is retained
 385
 386          white space and comments are illegal within  multi-character  symbols
 387          like the ARE `(?:' or the BRE `\('
 388
 389        Expanded-syntax white-space characters are blank, tab, newline, and any
 390        character that belongs to the space character class.
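
            As an illustration of expanded syntax, the sketch below lays out
            a date-like pattern over several lines with comments.  The pattern
            and the variable names are hypothetical; the embedded x option at
            the very start of the RE switches expanded syntax on.

```tcl
# White space and everything from # to end of line are ignored inside
# the RE, so each piece can be documented in place.
set re {(?x)
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
}

regexp -- $re "released 2024-03-15" whole year month day
# whole is "2024-03-15", year is "2024", month is "03", day is "15"
```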
 391
 392        Finally, in an ARE, outside bracket expressions, the sequence `(?#ttt)'
 393        (where  ttt  is any text not containing a `)') is a comment, completely
 394        ignored.  Again, this is not allowed between the characters  of  multi-
 395        character  symbols  like  `(?:'.   Such  comments are more a historical
 396        artifact than a useful facility, and their use is deprecated;  use  the
 397        expanded syntax instead.
 398
 399        None of these metasyntax extensions is available if the application (or
 400        an initial ***= director)  has  specified  that  the  user's  input  be
 401        treated as a literal string rather than as an RE.
 402
 403 MATCHING
 404        In  the event that an RE could match more than one substring of a given
 405        string, the RE matches the one starting earliest in the string.  If the
 406        RE  could  match  more  than  one substring starting at that point, its
 407        choice is determined by its preference: either the  longest  substring,
 408        or the shortest.
 409
 410        Most  atoms,  and all constraints, have no preference.  A parenthesized
 411        RE has the same preference (possibly none) as  the  RE.   A  quantified
 412        atom  with  quantifier  {m}  or {m}?  has the same preference (possibly
 413        none) as the atom itself.  A quantified atom with other normal  quanti-
 414        fiers  (including  {m,n}  with  m equal to n) prefers longest match.  A
 415        quantified atom with other  non-greedy  quantifiers  (including  {m,n}?
 416        with m equal to n) prefers shortest match.  A branch has the same pref-
 417        erence as the first quantified atom in it which has a  preference.   An
 418        RE  consisting  of  two  or  more  branches connected by the | operator
 419        prefers longest match.
 420
 421        Subject to the constraints imposed by the rules for matching the  whole
 422        RE,  subexpressions  also  match  the longest or shortest possible sub-
 423        strings, based on their preferences, with subexpressions starting  ear-
 424        lier  in  the  RE  taking priority over ones starting later.  Note that
 425        outer subexpressions thus take priority over their component subexpres-
 426        sions.
 427
 428        Note  that the quantifiers {1,1} and {1,1}?  can be used to force long-
 429        est and shortest preference, respectively,  on  a  subexpression  or  a
 430        whole RE.
 431
 432        Match  lengths  are measured in characters, not collating elements.  An
 433        empty string is considered longer than no match at all.   For  example,
 434        bb*    matches    the    three    middle    characters    of   `abbbc',
 435        (week|wee)(night|knights) matches all ten characters  of  `weeknights',
 436        when  (.*).*   is  matched  against abc the parenthesized subexpression
 437        matches all three characters, and when (a*)* is matched against bc both
 438        the whole RE and the parenthesized subexpression match an empty string.
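
            The examples in the paragraph above can be reproduced with the
            regexp command, roughly as follows (variable names are arbitrary):

```tcl
# Earliest start, then longest: the three b's in the middle.
regexp -- {bb*} "abbbc" run                      ;# run is "bbb"

# The whole RE prefers the longest match, which forces wee + knights.
regexp -- {(week|wee)(night|knights)} "weeknights" all head tail
# all is "weeknights", head is "wee", tail is "knights"

# The earlier subexpression takes priority and grabs everything.
regexp -- {(.*).*} "abc" all2 sub                ;# sub is "abc"

# An empty match counts as a match.
set ok [regexp -- {(a*)*} "bc" all3 sub3]        ;# ok is 1, all3 and sub3 are ""
```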
 439
 440        If case-independent matching is specified, the effect is much as if all
 441        case  distinctions  had  vanished  from the alphabet.  When a letter
 442        that exists in multiple cases appears as an ordinary character  outside
 443        a  bracket  expression,  it  is  effectively transformed into a bracket
 444        expression containing both cases, so that x becomes  `[xX]'.   When  it
 445        appears  inside  a  bracket expression, all case counterparts of it are
 446        added to the bracket expression, so that  [x]  becomes  [xX]  and  [^x]
 447        becomes `[^xX]'.
 448
 449        If  newline-sensitive matching is specified, .  and bracket expressions
 450        using ^ will never match the newline character (so  that  matches  will
 451        never  cross newlines unless the RE explicitly arranges it) and ^ and $
 452        will match the empty string after and before a newline respectively, in
 453        addition  to matching at beginning and end of string respectively.  ARE
 454        \A and \Z continue to match beginning or end of string only.
 455
 456        If partial newline-sensitive matching is specified, this affects .  and
 457        bracket  expressions  as with newline-sensitive matching, but not ^ and
 458        `$'.
 459
 460        If  inverse  partial  newline-sensitive  matching  is  specified,  this
 461        affects  ^  and  $  as  with newline-sensitive matching, but not .  and
 462        bracket expressions.  This isn't very useful but is provided for symme-
 463        try.
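
            The four newline modes can be compared side by side.  This sketch
            selects each mode with an embedded option; the two-line sample
            text and the variable names are made up.

```tcl
set text "first\nsecond"

# Default (non-newline-sensitive): ^ and $ see only the string ends,
# and . matches a newline.
set d1 [regexp -- {^second$} $text]          ;# 0
set d2 [regexp -- {first.second} $text]      ;# 1

# n: newline-sensitive. ^ and $ also match around newlines, . does not
# match a newline.
set n1 [regexp -- {(?n)^second$} $text]      ;# 1
set n2 [regexp -- {(?n)first.second} $text]  ;# 0

# p: partial. Only . and bracket expressions are affected.
set p1 [regexp -- {(?p)^second$} $text]      ;# 0
set p2 [regexp -- {(?p)first.second} $text]  ;# 0

# w: inverse partial. Only ^ and $ are affected.
set w1 [regexp -- {(?w)^second$} $text]      ;# 1
set w2 [regexp -- {(?w)first.second} $text]  ;# 1
```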
 464
 465 LIMITS AND COMPATIBILITY
 466        No particular limit is imposed on the length of REs.  Programs intended
 467        to be highly portable should not employ REs longer than 256 bytes, as a
 468        POSIX-compliant implementation can refuse to accept such REs.
 469
 470        The  only feature of AREs that is actually incompatible with POSIX EREs
 471        is that \ does not lose its special significance inside bracket expres-
 472        sions.  All other ARE features use syntax which is illegal or has unde-
 473        fined or unspecified effects in POSIX EREs; the *** syntax of directors
 474        likewise is outside the POSIX syntax for both BREs and EREs.
 475
 476        Many  of  the ARE extensions are borrowed from Perl, but some have been
 477        changed to clean them up, and a few Perl extensions  are  not  present.
 478        Incompatibilities  of  note  include  `\b',  `\B',  the lack of special
 479        treatment for a trailing newline, the addition of complemented  bracket
 480        expressions  to  the things affected by newline-sensitive matching, the
 481        restrictions on parentheses  and  back  references  in  lookahead  con-
 482        straints,  and  the  longest/shortest-match  (rather  than first-match)
 483        matching semantics.
 484
 485        The matching rules for REs containing both normal and non-greedy  quan-
 486        tifiers  have  changed  since early beta-test versions of this package.
 487        (The new rules are much simpler and cleaner, but don't work as hard  at
 488        guessing the user's real intentions.)
 489
 490        Henry  Spencer's  original 1986 regexp package, still in widespread use
 491        (e.g., in pre-8.1 releases of Tcl), implemented  an  early  version  of
 492        today's  EREs.  There are four incompatibilities between regexp's near-
 493        EREs (`RREs' for short) and AREs.  In roughly increasing order of  sig-
 494        nificance:
 495
 496               In  AREs,  \  followed by an alphanumeric character is either an
 497               escape or an error, while in RREs, it was just  another  way  of
 498               writing  the alphanumeric.  This should not be a problem because
 499               there was no reason to write such a sequence in RREs.
 500
 501               { followed by a digit in an ARE is the  beginning  of  a  bound,
 502               while  in  RREs,  {  was  always  an  ordinary  character.  Such
 503               sequences should be rare, and will  often  result  in  an  error
 504               because following characters will not look like a valid bound.
 505
 506               In AREs, \ remains a special character within `[]', so a literal
 507               \ within [] must be written `\\'.  \\ also  gives  a  literal  \
 508               within [] in RREs, but only truly paranoid programmers routinely
 509               doubled the backslash.
 510
 511               AREs report the longest/shortest match for the RE,  rather  than
 512               the  first  found  in a specified search order.  This may affect
 513               some RREs which were written in the expectation that  the  first
 514               match would be reported.  (The careful crafting of RREs to opti-
 515               mize the search order for fast matching is obsolete (AREs  exam-
 516               ine  all  possible matches in parallel, and their performance is
 517               largely insensitive to their complexity)  but  cases  where  the
 518               search  order  was  exploited to deliberately find a match which
 519               was not the longest/shortest will need rewriting.)
 520
 521
 522 BASIC REGULAR EXPRESSIONS
 523        BREs differ from EREs in several respects.  `|', `+', and ?  are  ordi-
 524        nary  characters  and  there  is no equivalent for their functionality.
 525        The delimiters for bounds are \{ and `\}', with { and }  by  themselves
 526        ordinary  characters.  The parentheses for nested subexpressions are \(
 527        and `\)', with ( and ) by themselves  ordinary  characters.   ^  is  an
 528        ordinary  character  except at the beginning of the RE or the beginning
 529        of a parenthesized subexpression, $ is an ordinary character except  at
 530        the end of the RE or the end of a parenthesized subexpression, and * is
 531        an ordinary character if it appears at the beginning of the RE  or  the
 532        beginning  of  a  parenthesized subexpression (after a possible leading
 533        `^').  Finally, single-digit back references are available, and \<  and
 534        \>  are synonyms for [[:<:]] and [[:>:]] respectively; no other escapes
 535        are available.
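
            A sketch of a few BRE differences, assuming the embedded b option
            is used to select the BRE flavor from within the regexp command.
            The variable names are illustrative.

```tcl
# In a BRE, + is an ordinary character rather than a quantifier.
set plus1 [regexp -- {(?b)a+} "aaa"]          ;# 0
set plus2 [regexp -- {(?b)a+} "a+"]           ;# 1

# Bounds are delimited by \{ and \}; bare braces are ordinary characters.
set bound [regexp -- {(?b)a\{2\}} "aa"]       ;# 1
set brace [regexp -- {(?b)a{2}} "a{2}"]       ;# 1

# Subexpressions use \( and \), and single-digit back references work.
set back [regexp -- {(?b)\(ab\)\1} "abab"]    ;# 1
```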
 536
 537
 538 SEE ALSO
 539        RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
 540
 541
 542 KEYWORDS