Initial version of donated sources by Avertec, 3.4p5.
1 NAME
2 re_syntax - Syntax of Tcl regular expressions.
3
4
5 DESCRIPTION
6 A regular expression describes strings of characters. It's a pattern
7 that matches certain strings and doesn't match others.
8
9
10 DIFFERENT FLAVORS OF REs
11 Regular expressions (``RE''s), as defined by POSIX, come in two fla-
12 vors: extended REs (``EREs'') and basic REs (``BREs''). EREs are
13 roughly those of the traditional egrep, while BREs are roughly those of
14 the traditional ed. This implementation adds a third flavor, advanced
15 REs (``AREs''), basically EREs with some significant extensions.
16
17 This manual page primarily describes AREs. BREs mostly exist for back-
18 ward compatibility in some old programs; they will be discussed at the
19 end. POSIX EREs are almost an exact subset of AREs. Features of AREs
20 that are not present in EREs will be indicated.
21
22
23 REGULAR EXPRESSION SYNTAX
24 Tcl regular expressions are implemented using the package written by
25 Henry Spencer, based on the 1003.2 spec and some (not quite all) of the
26 Perl5 extensions (thanks, Henry!). Much of the description of regular
27 expressions below is copied verbatim from his manual entry.
28
29 An ARE is one or more branches, separated by `|', matching anything
30 that matches any of the branches.
31
32 A branch is zero or more constraints or quantified atoms, concatenated.
33 It matches a match for the first, followed by a match for the second,
34 etc.; an empty branch matches the empty string.
35
36 A quantified atom is an atom possibly followed by a single quantifier.
37 Without a quantifier, it matches a match for the atom. The quanti-
38 fiers, and what a so-quantified atom matches, are:
39
40 * a sequence of 0 or more matches of the atom
41
42 + a sequence of 1 or more matches of the atom
43
44 ? a sequence of 0 or 1 matches of the atom
45
46 {m} a sequence of exactly m matches of the atom
47
48 {m,} a sequence of m or more matches of the atom
49
50 {m,n} a sequence of m through n (inclusive) matches of the atom; m
51 may not exceed n
52
53 *? +? ?? {m}? {m,}? {m,n}?
54 non-greedy quantifiers, which match the same possibilities, but
55 prefer the smallest number rather than the largest number of
56 matches (see MATCHING)
57
58 The forms using { and } are known as bounds. The numbers m and n are
59 unsigned decimal integers with permissible values from 0 to 255 inclu-
60 sive.
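
As a rough illustration of the quantifiers above, the following sketch calls Tcl's regexp command with the -inline switch (see regexp(n)). The sample strings and variable names are invented for the example.

```tcl
# Greedy bounds take as many repetitions as allowed; non-greedy ones as few.
set greedy   [regexp -inline {a{2,4}} "aaaaa"]    ;# aaaa
set lazy     [regexp -inline {a{2,4}?} "aaaaa"]   ;# aa
set exact    [regexp -inline {a{3}} "aaaaa"]      ;# aaa
set optional [regexp -inline {colou?r} "color"]   ;# color
puts "$greedy $lazy $exact $optional"
```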
61
62 An atom is one of:
63
64 (re) (where re is any regular expression) matches a match for re,
65 with the match noted for possible reporting
66
67 (?:re)
68 as previous, but does no reporting (a ``non-capturing'' set of
69 parentheses)
70
71 () matches an empty string, noted for possible reporting
72
73 (?:) matches an empty string, without reporting
74
75 [chars]
76 a bracket expression, matching any one of the chars (see
77 BRACKET EXPRESSIONS for more detail)
78
79 . matches any single character
80
81 \k (where k is a non-alphanumeric character) matches that charac-
82 ter taken as an ordinary character, e.g. \\ matches a backslash
83 character
84
85 \c where c is alphanumeric (possibly followed by other charac-
86 ters), an escape (AREs only), see ESCAPES below
87
88 { when followed by a character other than a digit, matches the
89 left-brace character `{'; when followed by a digit, it is the
90 beginning of a bound (see above)
91
92 x where x is a single character with no other significance,
93 matches that character.
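
The difference between capturing and non-capturing parentheses, and the effect of a backslash before a non-alphanumeric character, can be sketched as follows. The input strings are assumptions chosen for illustration; \w is one of the ARE escapes described under ESCAPES.

```tcl
# Capturing parentheses report their submatch; (?:...) groups without reporting.
regexp {(\w+)=(\w+)} "key=value" whole name val
puts "$whole | $name | $val"                 ;# key=value | key | value
set m [regexp -inline {(?:\w+)=(\w+)} "key=value"]
puts [llength $m]                            ;# 2: whole match plus one submatch
puts [lindex $m 1]                           ;# value
# A backslash before a non-alphanumeric makes it ordinary: \. is a literal dot.
set dotAny [regexp {a.c} "abc"]              ;# 1
set dotLit [regexp {a\.c} "abc"]             ;# 0
puts "$dotAny $dotLit"
```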
94
95 A constraint matches an empty string when specific conditions are met.
96 A constraint may not be followed by a quantifier. The simple con-
97 straints are as follows; some more constraints are described later,
98 under ESCAPES.
99
100 ^ matches at the beginning of a line
101
102 $ matches at the end of a line
103
104 (?=re) positive lookahead (AREs only), matches at any point where a
105 substring matching re begins
106
107 (?!re) negative lookahead (AREs only), matches at any point where no
108 substring matching re begins
109
110 The lookahead constraints may not contain back references (see later),
111 and all parentheses within them are considered non-capturing.
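
A minimal sketch of the two lookahead constraints, using made-up input. The -indices switch of regexp(n) is used to show which occurrence was matched.

```tcl
# Positive lookahead: "foo" only where "bar" follows; "bar" is not consumed.
set pos  [regexp -inline {foo(?=bar)} "foobar"]      ;# foo
set none [regexp {foo(?=bar)} "foobaz"]              ;# 0
# Negative lookahead: "foo" only where "bar" does not follow.
set idx  [lindex [regexp -inline -indices {foo(?!bar)} "foobar foobaz"] 0]
puts "$pos $none $idx"                               ;# foo 0 7 9
```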
112
113 An RE may not end with `\'.
114
115
116 BRACKET EXPRESSIONS
117 A bracket expression is a list of characters enclosed in `[]'. It nor-
118 mally matches any single character from the list (but see below). If
119 the list begins with `^', it matches any single character (but see
120 below) not from the rest of the list.
121
122 If two characters in the list are separated by `-', this is shorthand
123 for the full range of characters between those two (inclusive) in the
124 collating sequence, e.g. [0-9] in ASCII matches any decimal digit.
125 Two ranges may not share an endpoint, so e.g. a-c-e is illegal.
126 Ranges are very collating-sequence-dependent, and portable programs
127 should avoid relying on them.
128
129 To include a literal ] or - in the list, the simplest method is to
130 enclose it in [. and .] to make it a collating element (see below).
131 Alternatively, make it the first character (following a possible `^'),
132 or (AREs only) precede it with `\'. Alternatively, for `-', make it
133 the last character, or the second endpoint of a range. To use a lit-
134 eral - as the first endpoint of a range, make it a collating element or
135 (AREs only) precede it with `\'. With the exception of these, some
136 combinations using [ (see next paragraphs), and escapes, all other spe-
137 cial characters lose their special significance within a bracket
138 expression.
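
The placement rules for a literal ] or - can be tried out as below. This is only a sketch with an invented test string; -all makes regexp return the number of matches.

```tcl
set s {a]b-c}
# "]" placed first and "-" placed last are both taken literally.
set byPosition [regexp -all {[]-]} $s]           ;# 2
# AREs only: the same set written with backslashes.
set byEscape   [regexp -all {[\]\-]} $s]         ;# 2
# A leading ^ complements the list.
set nonDigits  [regexp -inline {[^0-9]+} "42abc7"]   ;# abc
puts "$byPosition $byEscape $nonDigits"
```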
139
140 Within a bracket expression, a collating element (a character, a multi-
141 character sequence that collates as if it were a single character, or a
142 collating-sequence name for either) enclosed in [. and .] stands for
143 the sequence of characters of that collating element. The sequence is
144 a single element of the bracket expression's list. A bracket expres-
145 sion in a locale that has multi-character collating elements can thus
146 match more than one character. So (insidiously), a bracket expression
147 that starts with ^ can match multi-character collating elements even if
148 none of them appear in the bracket expression! (Note: Tcl currently
149 has no multi-character collating elements. This information is only
150 for illustration.)
151
152 For example, assume the collating sequence includes a ch multi-charac-
153 ter collating element. Then the RE [[.ch.]]*c (zero or more ch's fol-
154 lowed by c) matches the first five characters of `chchcc'. Also, the
155 RE [^c]b matches all of `chb' (because [^c] matches the multi-character
156 ch).
157
158 Within a bracket expression, a collating element enclosed in [= and =]
159 is an equivalence class, standing for the sequences of characters of
160 all collating elements equivalent to that one, including itself. (If
161 there are no other equivalent collating elements, the treatment is as
162 if the enclosing delimiters were `[.' and `.]'.) For example, if o and
163 ô are the members of an equivalence class, then `[[=o=]]', `[[=ô=]]',
164 and `[oô]' are all synonymous. An equivalence class may not be an
165 endpoint of a range. (Note: Tcl currently implements only the Unicode
166 locale. It doesn't define any equivalence classes. The examples above
167 are just illustrations.)
168
169 Within a bracket expression, the name of a character class enclosed in
170 [: and :] stands for the list of all characters (not all collating ele-
171 ments!) belonging to that class. Standard character classes are:
172
173 alpha A letter.
174 upper An upper-case letter.
175 lower A lower-case letter.
176 digit A decimal digit.
177 xdigit A hexadecimal digit.
178 alnum An alphanumeric (letter or digit).
179 print An alphanumeric (same as alnum).
180 blank A space or tab character.
181 space A character producing white space in displayed text.
182 punct A punctuation character.
183 graph A character with a visible representation.
184 cntrl A control character.
185
186 A locale may provide others. (Note that the current Tcl implementation
187 has only one locale: the Unicode locale.) A character class may not be
188 used as an endpoint of a range.
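
The character classes can be exercised as in the following sketch; the sample text is an assumption, not taken from the manual.

```tcl
set digits [regexp -all -inline {[[:digit:]]+} "tel 555 0199"]        ;# 555 0199
set word   [regexp -inline {[[:upper:]][[:lower:]]+} "hello World"]   ;# World
set blanks [regexp -all {[[:space:]]} "a b\tc"]                       ;# 2
# Classes combine with other list members inside one bracket expression.
set ident  [regexp -inline {[[:alpha:]_][[:alnum:]_]*} "9 my_var2"]   ;# my_var2
puts "$digits | $word | $blanks | $ident"
```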
189
190 There are two special cases of bracket expressions: the bracket expres-
191 sions [[:<:]] and [[:>:]] are constraints, matching empty strings at
192 the beginning and end of a word respectively. A word is defined as a
193 sequence of word characters that is neither preceded nor followed by
194 word characters. A word character is an alnum character or an under-
195 score (_). These special bracket expressions are deprecated; users of
196 AREs should use constraint escapes instead (see below).
197
198 ESCAPES
199 Escapes (AREs only), which begin with a \ followed by an alphanumeric
200 character, come in several varieties: character entry, class short-
201 hands, constraint escapes, and back references. A \ followed by an
202 alphanumeric character but not constituting a valid escape is illegal
203 in AREs. In EREs, there are no escapes: outside a bracket expression,
204 a \ followed by an alphanumeric character merely stands for that char-
205 acter as an ordinary character, and inside a bracket expression, \ is
206 an ordinary character. (The latter is the one actual incompatibility
207 between EREs and AREs.)
208
209 Character-entry escapes (AREs only) exist to make it easier to specify
210 non-printing and otherwise inconvenient characters in REs:
211
212 \a alert (bell) character, as in C
213
214 \b backspace, as in C
215
216 \B synonym for \ to help reduce backslash doubling in some applica-
217 tions where there are multiple levels of backslash processing
218
219 \cX (where X is any character) the character whose low-order 5 bits
220 are the same as those of X, and whose other bits are all zero
221
222 \e the character whose collating-sequence name is `ESC', or failing
223 that, the character with octal value 033
224
225 \f formfeed, as in C
226
227 \n newline, as in C
228
229 \r carriage return, as in C
230
231 \t horizontal tab, as in C
232
233 \uwxyz
234 (where wxyz is exactly four hexadecimal digits) the Unicode
235 character U+wxyz in the local byte ordering
236
237 \Ustuvwxyz
238 (where stuvwxyz is exactly eight hexadecimal digits) reserved
239 for a somewhat-hypothetical Unicode extension to 32 bits
240
241 \v vertical tab, as in C
242
243 \xhhh
244 (where hhh is any sequence of hexadecimal digits) the character
245 whose hexadecimal value is 0xhhh (a single character no matter
246 how many hexadecimal digits are used).
247
248 \0 the character whose value is 0
249
250 \xy (where xy is exactly two octal digits, and is not a back refer-
251 ence (see below)) the character whose octal value is 0xy
252
253 \xyz (where xyz is exactly three octal digits, and is not a back ref-
254 erence (see below)) the character whose octal value is 0xyz
255
256 Hexadecimal digits are `0'-`9', `a'-`f', and `A'-`F'. Octal digits are
257 `0'-`7'.
258
259 The character-entry escapes are always taken as ordinary characters.
260 For example, \135 is ] in ASCII, but \135 does not terminate a bracket
261 expression. Beware, however, that some applications (e.g., C compil-
262 ers) interpret such sequences themselves before the regular-expression
263 package gets to see them, which may require doubling (quadrupling,
264 etc.) the `\'.
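
A short sketch of the character-entry escapes. Writing the RE in braces keeps Tcl from interpreting the backslashes itself, so the regular-expression package sees them; the test strings are assumptions.

```tcl
set tabbed "a\tb"                       ;# here Tcl itself turns \t into a tab
set rb "\]"                             ;# a single right bracket
set t [regexp {a\tb} $tabbed]           ;# 1: the RE engine interprets \t
set x [regexp {^\x41$} "A"]             ;# 1: hexadecimal 41
set u [regexp {^\u0041$} "A"]           ;# 1: Unicode U+0041
set o [regexp {^\101$} "A"]             ;# 1: octal 101 (no group, so not a back reference)
set b [regexp {^[\135]$} $rb]           ;# 1: octal 135 is "]" and does not close the brackets
puts "$t $x $u $o $b"
```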
265
266 Class-shorthand escapes (AREs only) provide shorthands for certain com-
267 monly-used character classes:
268
269 \d [[:digit:]]
270
271 \s [[:space:]]
272
273 \w [[:alnum:]_] (note underscore)
274
275 \D [^[:digit:]]
276
277 \S [^[:space:]]
278
279 \W [^[:alnum:]_] (note underscore)
280
281 Within bracket expressions, `\d', `\s', and `\w' lose their outer
282 brackets, and `\D', `\S', and `\W' are illegal. (So, for example, [a-
283 c\d] is equivalent to [a-c[:digit:]]. Also, [a-c\D], which is equiva-
284 lent to [a-c^[:digit:]], is illegal.)
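
The class shorthands might be used as follows (a sketch with invented input):

```tcl
set nums  [regexp -all -inline {\d+} "10 apples, 25 pears"]   ;# 10 25
set word  [regexp -inline {\w+} "foo_bar baz"]                ;# foo_bar
set solid [regexp -inline {\S+} "  padded  "]                 ;# padded
# Inside a bracket expression \d loses its outer brackets.
set mixed [regexp -all -inline {[a-c\d]} "a1z9"]              ;# a 1 9
puts "$nums | $word | $solid | $mixed"
```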
285
286 A constraint escape (AREs only) is a constraint, matching the empty
287 string if specific conditions are met, written as an escape:
288
289 \A matches only at the beginning of the string (see MATCHING,
290 below, for how this differs from `^')
291
292 \m matches only at the beginning of a word
293
294 \M matches only at the end of a word
295
296 \y matches only at the beginning or end of a word
297
298 \Y matches only at a point that is not the beginning or end of a
299 word
300
301 \Z matches only at the end of the string (see MATCHING, below, for
302 how this differs from `$')
303
304 \m (where m stands for a nonzero digit, not the letter m) a back reference, see below
305
306 \mnn (where m is a nonzero digit, and nn is some more digits, and
307 the decimal value mnn is not greater than the number of closing
308 capturing parentheses seen so far) a back reference, see below
309
310 A word is defined as in the specification of [[:<:]] and [[:>:]] above.
311 Constraint escapes are illegal within bracket expressions.
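
To illustrate the word and string constraints, the sketch below counts matches in a made-up sentence using regexp -all.

```tcl
set s "cat concat catalog cat"
set whole  [regexp -all {\mcat\M} $s]   ;# 2: "cat" as a complete word
set starts [regexp -all {\mcat} $s]     ;# 3: cat, catalog, cat
set ends   [regexp -all {cat\M} $s]     ;# 3: cat, concat, cat
set inside [regexp -all {\Ycat} $s]     ;# 1: only the "cat" inside "concat"
set atTop  [regexp {\Acat} $s]          ;# 1: the string begins with "cat"
set notTop [regexp {\Aconcat} $s]       ;# 0: "concat" is not at the start
puts "$whole $starts $ends $inside $atTop $notTop"
```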
312
313 A back reference (AREs only) matches the same string matched by the
314 parenthesized subexpression specified by the number, so that (e.g.)
315 ([bc])\1 matches bb or cc but not `bc'. The subexpression must
316 entirely precede the back reference in the RE. Subexpressions are num-
317 bered in the order of their leading parentheses. Non-capturing paren-
318 theses do not define subexpressions.
319
320 There is an inherent historical ambiguity between octal character-entry
321 escapes and back references, which is resolved by heuristics, as hinted
322 at above. A leading zero always indicates an octal escape. A single
323 non-zero digit, not followed by another digit, is always taken as a
324 back reference. A multi-digit sequence not starting with a zero is
325 taken as a back reference if it comes after a suitable subexpression
326 (i.e. the number is in the legal range for a back reference), and oth-
327 erwise is taken as octal.
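
The back-reference example from the text, plus a doubled-word search on an invented sentence, can be sketched like this:

```tcl
set same [regexp {([bc])\1} "bb"]       ;# 1
set diff [regexp {([bc])\1} "bc"]       ;# 0
# Find a word immediately repeated after a single space.
set m [regexp -inline {(\w+) \1} "it is is so"]
puts "$same $diff"
puts [lindex $m 0]                      ;# is is
puts [lindex $m 1]                      ;# is
```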
328
329 METASYNTAX
330 In addition to the main syntax described above, there are some special
331 forms and miscellaneous syntactic facilities available.
332
333 Normally the flavor of RE being used is specified by application-depen-
334 dent means. However, this can be overridden by a director. If an RE
335 of any flavor begins with `***:', the rest of the RE is an ARE. If an
336 RE of any flavor begins with `***=', the rest of the RE is taken to be
337 a literal string, with all characters considered ordinary characters.
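
A small sketch of the two directors, assuming the RE is passed to Tcl's regexp command; the strings are illustrative only.

```tcl
# ***= : everything after the director is a literal string.
set lit1 [regexp {***=a.c} "abc"]       ;# 0: the dot is not a wildcard here
set lit2 [regexp {***=a.c} "xa.cx"]     ;# 1
# ***: : the rest is an ARE, so the dot matches any character.
set are  [regexp {***:a.c} "abc"]       ;# 1
puts "$lit1 $lit2 $are"
```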
338
339 An ARE may begin with embedded options: a sequence (?xyz) (where xyz is
340 one or more alphabetic characters) specifies options affecting the rest
341 of the RE. These supplement, and can override, any options specified
342 by the application. The available option letters are:
343
344 b rest of RE is a BRE
345
346 c case-sensitive matching (usual default)
347
348 e rest of RE is an ERE
349
350 i case-insensitive matching (see MATCHING, below)
351
352 m historical synonym for n
353
354 n newline-sensitive matching (see MATCHING, below)
355
356 p partial newline-sensitive matching (see MATCHING, below)
357
358 q rest of RE is a literal (``quoted'') string, all ordinary charac-
359 ters
360
361 s non-newline-sensitive matching (usual default)
362
363 t tight syntax (usual default; see below)
364
365 w inverse partial newline-sensitive (``weird'') matching (see MATCH-
366 ING, below)
367
368 x expanded syntax (see below)
369
370 Embedded options take effect at the ) terminating the sequence. They
371 are available only at the start of an ARE, and may not be used later
372 within it.
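
Embedded options can be compared with the equivalent application-level switch, as in this sketch (the -nocase switch is documented in regexp(n); the inputs are made up):

```tcl
set emb  [regexp {(?i)hello} "HeLLo"]        ;# 1: embedded i option
set sw   [regexp -nocase {hello} "HeLLo"]    ;# 1: same effect via a switch
set q1   [regexp {(?q)a+b} "a+b"]            ;# 1: q makes the rest a literal string
set q2   [regexp {(?q)a+b} "aab"]            ;# 0: "+" is no longer a quantifier
puts "$emb $sw $q1 $q2"
```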
373
374 In addition to the usual (tight) RE syntax, in which all characters are
375 significant, there is an expanded syntax, available in all flavors of
376 RE with the -expanded switch, or in AREs with the embedded x option.
377 In the expanded syntax, white-space characters are ignored and all
378 characters between a # and the following newline (or the end of the RE)
379 are ignored, permitting paragraphing and commenting a complex RE.
380 There are three exceptions to that basic rule:
381
382 a white-space character or `#' preceded by `\' is retained
383
384 white space or `#' within a bracket expression is retained
385
386 white space and comments are illegal within multi-character symbols
387 like the ARE `(?:' or the BRE `\('
388
389 Expanded-syntax white-space characters are blank, tab, newline, and any
390 character that belongs to the space character class.
391
392 Finally, in an ARE, outside bracket expressions, the sequence `(?#ttt)'
393 (where ttt is any text not containing a `)') is a comment, completely
394 ignored. Again, this is not allowed between the characters of multi-
395 character symbols like `(?:'. Such comments are more a historical
396 artifact than a useful facility, and their use is deprecated; use the
397 expanded syntax instead.
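
The expanded syntax lends itself to commented, multi-line REs. The following is a sketch only; the pattern and the sample number are invented for the example.

```tcl
set re {(?x)
    (\d{3})    # first group: three digits
    [- ]       # a hyphen or a space; white space inside brackets is kept
    (\d{4})    # second group: four digits
}
set ok [regexp $re "555-0199" whole first second]
puts "$ok $whole $first $second"             ;# 1 555-0199 555 0199
# The -expanded switch selects the same syntax without an embedded option.
set num [regexp -inline -expanded {\d+   # one or more digits} "abc 42"]
puts $num                                    ;# 42
```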
398
399 None of these metasyntax extensions is available if the application (or
400 an initial ***= director) has specified that the user's input be
401 treated as a literal string rather than as an RE.
402
403 MATCHING
404 In the event that an RE could match more than one substring of a given
405 string, the RE matches the one starting earliest in the string. If the
406 RE could match more than one substring starting at that point, its
407 choice is determined by its preference: either the longest substring,
408 or the shortest.
409
410 Most atoms, and all constraints, have no preference. A parenthesized
411 RE has the same preference (possibly none) as the RE. A quantified
412 atom with quantifier {m} or {m}? has the same preference (possibly
413 none) as the atom itself. A quantified atom with other normal quanti-
414 fiers (including {m,n} with m equal to n) prefers longest match. A
415 quantified atom with other non-greedy quantifiers (including {m,n}?
416 with m equal to n) prefers shortest match. A branch has the same pref-
417 erence as the first quantified atom in it which has a preference. An
418 RE consisting of two or more branches connected by the | operator
419 prefers longest match.
420
421 Subject to the constraints imposed by the rules for matching the whole
422 RE, subexpressions also match the longest or shortest possible sub-
423 strings, based on their preferences, with subexpressions starting ear-
424 lier in the RE taking priority over ones starting later. Note that
425 outer subexpressions thus take priority over their component subexpres-
426 sions.
427
428 Note that the quantifiers {1,1} and {1,1}? can be used to force long-
429 est and shortest preference, respectively, on a subexpression or a
430 whole RE.
431
432 Match lengths are measured in characters, not collating elements. An
433 empty string is considered longer than no match at all. For example,
434 bb* matches the three middle characters of `abbbc',
435 (week|wee)(night|knights) matches all ten characters of `weeknights',
436 when (.*).* is matched against abc the parenthesized subexpression
437 matches all three characters, and when (a*)* is matched against bc both
438 the whole RE and the parenthesized subexpression match an empty string.
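
The examples in the preceding paragraph can be checked directly with regexp -inline, which returns the overall match followed by each parenthesized submatch. The last two lines add a greedy versus non-greedy comparison on an invented string.

```tcl
set m1 [regexp -inline {bb*} "abbbc"]                              ;# bbb
set m2 [regexp -inline {(week|wee)(night|knights)} "weeknights"]   ;# weeknights wee knights
set m3 [regexp -inline {(.*).*} "abc"]                             ;# abc abc
set m4 [regexp -inline {(a*)*} "bc"]                               ;# two empty strings
set m5 [regexp -inline {a.*c} "abcabc"]                            ;# abcabc
set m6 [regexp -inline {a.*?c} "abcabc"]                           ;# abc
puts "$m1 | $m2 | $m3 | [llength $m4] | $m5 | $m6"
```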
439
440 If case-independent matching is specified, the effect is much as if all
441 case distinctions had vanished from the alphabet. When an alphabetic
442 that exists in multiple cases appears as an ordinary character outside
443 a bracket expression, it is effectively transformed into a bracket
444 expression containing both cases, so that x becomes `[xX]'. When it
445 appears inside a bracket expression, all case counterparts of it are
446 added to the bracket expression, so that [x] becomes [xX] and [^x]
447 becomes `[^xX]'.
448
449 If newline-sensitive matching is specified, . and bracket expressions
450 using ^ will never match the newline character (so that matches will
451 never cross newlines unless the RE explicitly arranges it) and ^ and $
452 will match the empty string after and before a newline respectively, in
453 addition to matching at beginning and end of string respectively. ARE
454 \A and \Z continue to match beginning or end of string only.
455
456 If partial newline-sensitive matching is specified, this affects . and
457 bracket expressions as with newline-sensitive matching, but not ^ and
458 `$'.
459
460 If inverse partial newline-sensitive matching is specified, this
461 affects ^ and $ as with newline-sensitive matching, but not . and
462 bracket expressions. This isn't very useful but is provided for symme-
463 try.
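
The three newline-sensitivity modes correspond to the -line, -linestop and -lineanchor switches of regexp(n) (equivalently the embedded n, p and w options). A sketch on a two-line string chosen for illustration:

```tcl
set text "one\ntwo"
set plain  [regexp {^two$} $text]                 ;# 0: ^ only matches at the start of the string
set line   [regexp -line {^two$} $text]           ;# 1: newline-sensitive
set embed  [regexp {(?n)^two$} $text]             ;# 1: same, as an embedded option
set anchor [regexp -lineanchor {^two$} $text]     ;# 1: only ^ and $ are affected
set stop   [regexp -linestop {^two$} $text]       ;# 0: ^ and $ are not affected
set first  [regexp -inline -line {.+} $text]      ;# one: "." stops at the newline
set all    [string length [lindex [regexp -inline {.+} $text] 0]]   ;# 7: "." crosses the newline
set top    [regexp -line {\Atwo} $text]           ;# 0: \A still means start of string
puts "$plain $line $embed $anchor $stop $first $all $top"
```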
464
465 LIMITS AND COMPATIBILITY
466 No particular limit is imposed on the length of REs. Programs intended
467 to be highly portable should not employ REs longer than 256 bytes, as a
468 POSIX-compliant implementation can refuse to accept such REs.
469
470 The only feature of AREs that is actually incompatible with POSIX EREs
471 is that \ does not lose its special significance inside bracket expres-
472 sions. All other ARE features use syntax which is illegal or has unde-
473 fined or unspecified effects in POSIX EREs; the *** syntax of directors
474 likewise is outside the POSIX syntax for both BREs and EREs.
475
476 Many of the ARE extensions are borrowed from Perl, but some have been
477 changed to clean them up, and a few Perl extensions are not present.
478 Incompatibilities of note include `\b', `\B', the lack of special
479 treatment for a trailing newline, the addition of complemented bracket
480 expressions to the things affected by newline-sensitive matching, the
481 restrictions on parentheses and back references in lookahead con-
482 straints, and the longest/shortest-match (rather than first-match)
483 matching semantics.
484
485 The matching rules for REs containing both normal and non-greedy quan-
486 tifiers have changed since early beta-test versions of this package.
487 (The new rules are much simpler and cleaner, but don't work as hard at
488 guessing the user's real intentions.)
489
490 Henry Spencer's original 1986 regexp package, still in widespread use
491 (e.g., in pre-8.1 releases of Tcl), implemented an early version of
492 today's EREs. There are four incompatibilities between regexp's near-
493 EREs (`RREs' for short) and AREs. In roughly increasing order of sig-
494 nificance:
495
496 In AREs, \ followed by an alphanumeric character is either an
497 escape or an error, while in RREs, it was just another way of
498 writing the alphanumeric. This should not be a problem because
499 there was no reason to write such a sequence in RREs.
500
501 { followed by a digit in an ARE is the beginning of a bound,
502 while in RREs, { was always an ordinary character. Such
503 sequences should be rare, and will often result in an error
504 because following characters will not look like a valid bound.
505
506 In AREs, \ remains a special character within `[]', so a literal
507 \ within [] must be written `\\'. \\ also gives a literal \
508 within [] in RREs, but only truly paranoid programmers routinely
509 doubled the backslash.
510
511 AREs report the longest/shortest match for the RE, rather than
512 the first found in a specified search order. This may affect
513 some RREs which were written in the expectation that the first
514 match would be reported. (The careful crafting of RREs to opti-
515 mize the search order for fast matching is obsolete (AREs exam-
516 ine all possible matches in parallel, and their performance is
517 largely insensitive to their complexity) but cases where the
518 search order was exploited to deliberately find a match which
519 was not the longest/shortest will need rewriting.)
520
521
522 BASIC REGULAR EXPRESSIONS
523 BREs differ from EREs in several respects. `|', `+', and ? are ordi-
524 nary characters and there is no equivalent for their functionality.
525 The delimiters for bounds are \{ and `\}', with { and } by themselves
526 ordinary characters. The parentheses for nested subexpressions are \(
527 and `\)', with ( and ) by themselves ordinary characters. ^ is an
528 ordinary character except at the beginning of the RE or the beginning
529 of a parenthesized subexpression, $ is an ordinary character except at
530 the end of the RE or the end of a parenthesized subexpression, and * is
531 an ordinary character if it appears at the beginning of the RE or the
532 beginning of a parenthesized subexpression (after a possible leading
533 `^'). Finally, single-digit back references are available, and \< and
534 \> are synonyms for [[:<:]] and [[:>:]] respectively; no other escapes
535 are available.
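
Since Tcl's regexp command treats its pattern as an ARE by default, a BRE can be tried by starting the pattern with the embedded b option described under METASYNTAX. The following sketch uses made-up inputs:

```tcl
set plus1 [regexp {(?b)a+} "a+"]          ;# 1: "+" is an ordinary character in a BRE
set plus2 [regexp {(?b)a+} "aa"]          ;# 0
set bound [regexp {(?b)^a\{2\}$} "aa"]    ;# 1: bounds are written \{ and \}
# Subexpressions use \( and \); single-digit back references are available.
set m [regexp -inline {(?b)\(ab\)\1} "abab"]
puts "$plus1 $plus2 $bound"
puts [lindex $m 0]                        ;# abab
puts [lindex $m 1]                        ;# ab
```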
536
537
538 SEE ALSO
539 RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
540
541
542 KEYWORDS