org.apache.xerces.impl.xpath.regex
Class RegularExpression
java.lang.Object
org.apache.xerces.impl.xpath.regex.RegularExpression
- java.io.Serializable
public class RegularExpression
extends java.lang.Object
implements java.io.Serializable
A regular expression matching engine using a non-deterministic finite automaton (NFA).
This engine does not conform to POSIX regular expressions.
How to use
RegularExpression re = new RegularExpression(regex);
if (re.matches(text)) { ... }
RegularExpression re = new RegularExpression(regex);
Match match = new Match();
if (re.matches(text, match)) {
... // You can refer to captured text with methods of the Match class.
}
Case-insensitive matching
RegularExpression re = new RegularExpression(regex, "i");
if (re.matches(text)) { ... }
Options
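You can specify options to RegularExpression(regex, options) or setPattern(regex, options).
The options parameter consists of the following characters.
"i"
- Case-insensitive matching.
"m"
- ^ and $ consider the EOL characters within the text.
"s"
- . matches any one character, including line terminators.
"u"
- Redefines \d \D \w \W \s \S \b \B \< \> according to Unicode.
"w"
- \b \B \< \> are processed with the method of 'Unicode Regular Expression Guidelines'. When "w" and "u" are specified at the same time, \b \B \< \> are processed for the "w" option.
","
- The parser treats a comma in a character class as a range separator. Without this option, [a,b] matches a or , or b. With this option, [a,b] matches a or b.
"X"
- The engine conforms to XML Schema: Regular Expression. The matches() method performs entire-string matching rather than substring matching.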
Syntax
Differences from the Perl 5 regular expression
- There is a 6-digit hexadecimal character representation (\vHHHHHH).
- Supports subtraction, union, and intersection operations for character classes.
- Not supported: \ooo (octal character representations), \G, \C, \lc, \uc, \L, \U, \E, \Q, \N{name}, (?{code}), (??{code})
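The character-class set operations mentioned above use this engine's own (?[ranges]op[ranges]) syntax. As a rough point of comparison only, the sketch below uses java.util.regex (a different engine, not this class), where subtraction is written as an intersection with a negated class. It assumes the document's example that (?[A-Z]-[CF]) is equivalent to [A-BD-EG-Z] and checks that the java.util.regex analogue accepts the same characters. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class. java.util.regex is a stand-in here, NOT the Xerces engine.
class ClassSubtractionSketch {
    // java.util.regex analogue of the Xerces form (?[A-Z]-[CF]):
    // A-Z intersected with "anything except C or F".
    static final Pattern SUBTRACTED = Pattern.compile("[A-Z&&[^CF]]");

    // The explicit range list the document gives as equivalent.
    static final Pattern ENUMERATED = Pattern.compile("[A-BD-EG-Z]");

    // Returns true if both patterns agree on every character in U+0000-U+007F.
    static boolean agreeOnAscii() {
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (SUBTRACTED.matcher(s).matches() != ENUMERATED.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreeOnAscii());                      // true
        System.out.println(SUBTRACTED.matcher("C").matches());   // false
        System.out.println(SUBTRACTED.matcher("D").matches());   // true
    }
}
```

The Xerces syntax itself is not exercised here; the sketch only shows what set subtraction on a character class means.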
Meta characters are `. * + ? { [ ( ) | \ ^ $'.
- Character
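- .
Matches any one character except line terminators. When the "s" option is specified, it matches any character, including line terminators.
- \e \f \n \r \t
Matches ESCAPE (U+001B), FORM FEED (U+000C), LINE FEED (U+000A), CARRIAGE RETURN (U+000D), and HORIZONTAL TABULATION (U+0009), respectively.
- \cC
Matches a control character. C must be one of @, A-Z, [, \, ], ^, _. It matches the control character whose code is less than the code of C by 0x0040. For example, \cJ matches LINE FEED (U+000A), and \c[ matches ESCAPE (U+001B).
- \ followed by a meta character
Matches the meta character.
- \xHH \x{HHHH}
Matches the character whose code point is HH (hexadecimal) in Unicode. Write exactly 2 digits for \xHH, and a variable number of digits for \x{HHHH}.
- \uHHHH
Matches the character whose code point is HHHH (hexadecimal) in Unicode.
- \vHHHHHH
Matches the character whose code point is HHHHHH (hexadecimal) in Unicode.
- \g
Matches a grapheme. It is equivalent to
(?[\p{ASSIGNED}]-[\p{M}\p{C}])?(?:\p{M}|[\x{094D}\x{09CD}\x{0A4D}\x{0ACD}\x{0B3D}\x{0BCD}\x{0C4D}\x{0CCD}\x{0D4D}\x{0E3A}\x{0F84}]\p{L}|[\x{1160}-\x{11A7}]|[\x{11A8}-\x{11FF}]|[\x{FF9E}\x{FF9F}])*
- \X
Matches a combining character sequence. It is equivalent to (?:\PM\pM*)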
- Character class
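- [R1R2...Rn] (without the "," option)
- [R1,R2,...,Rn] (with the "," option)
Positive character class. It matches a character in the ranges.
- Rn:
- A character (including \e \f \n \r \t \xHH \x{HHHH} \uHHHH \vHHHHHH)
This range matches the character.
- C1-C2
This range matches a character whose code point is >= C1's code point and <= C2's code point.
- A POSIX character class: [:alpha:] [:alnum:] [:ascii:] [:cntrl:] [:digit:] [:graph:] [:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:],
and negative POSIX character classes in Perl like [:^alpha:]
...
- \d \D \s \S \w \W \p{name} \P{name}
These expressions specify the same ranges as the corresponding expressions described below.
Enumerated ranges are merged (union operation).
[a-ec-z] is equivalent to [a-z].
- [^R1R2...Rn] (without the "," option)
- [^R1,R2,...,Rn] (with the "," option)
Negative character class. It matches a character not in the ranges.
- (?[ranges]op[ranges]op[ranges] ... ) (op is - or + or &)
Subtraction, union, or intersection of character classes.
For example, (?[A-Z]-[CF]) is equivalent to [A-BD-EG-Z], and (?[0x00-0x7f]-[K]&[\p{Lu}]) is equivalent to [A-JL-Z].
The result of these operations is a positive character class even if the expression includes negative character classes. Take care with this in case-insensitive matching. For instance, (?[^b]) is equivalent to [\x00-ac-\x{10ffff}], which is equivalent to [^b] in case-sensitive matching. In case-insensitive matching, however, (?[^b]) matches any character because it includes B and B matches b, whereas [^b] is processed as [^Bb].
- [R1R2...-[RnRn+1...]] (with the "X" option)
Character class subtraction for XML Schema. This syntax is available when the "X" option is specified.
- \d
Equivalent to [0-9]. When the "u" option is set, it is equivalent to \p{Nd}.
- \D
Equivalent to [^0-9]. When the "u" option is set, it is equivalent to \P{Nd}.
- \s
Equivalent to [ \f\n\r\t]. When the "u" option is set, it is equivalent to [ \f\n\r\t\p{Z}].
- \S
Equivalent to [^ \f\n\r\t]. When the "u" option is set, it is equivalent to [^ \f\n\r\t\p{Z}].
- \w
Equivalent to [a-zA-Z0-9_]. When the "u" option is set, it is equivalent to [\p{Lu}\p{Ll}\p{Lo}\p{Nd}_].
- \W
Equivalent to [^a-zA-Z0-9_]. When the "u" option is set, it is equivalent to [^\p{Lu}\p{Ll}\p{Lo}\p{Nd}_].
- \p{name}
Matches one character in the specified General Category (as listed in UnicodeData.txt) or the specified Block. The following names are available: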
- Unicode General Categories:
L, M, N, Z, C, P, S, Lu, Ll, Lt, Lm, Lo, Mn, Me, Mc, Nd, Nl, No, Zs, Zl, Zp,
Cc, Cf, Cn, Co, Cs, Pd, Ps, Pe, Pc, Po, Sm, Sc, Sk, So
- Unicode Blocks:
Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B,
IPA Extensions, Spacing Modifier Letters, Combining Diacritical Marks, Greek,
Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Tibetan, Georgian,
Hangul Jamo, Latin Extended Additional, Greek Extended, General Punctuation,
Superscripts and Subscripts, Currency Symbols, Combining Marks for Symbols,
Letterlike Symbols, Number Forms, Arrows, Mathematical Operators,
Miscellaneous Technical, Control Pictures, Optical Character Recognition,
Enclosed Alphanumerics, Box Drawing, Block Elements, Geometric Shapes,
Miscellaneous Symbols, Dingbats, CJK Symbols and Punctuation, Hiragana,
Katakana, Bopomofo, Hangul Compatibility Jamo, Kanbun,
Enclosed CJK Letters and Months, CJK Compatibility, CJK Unified Ideographs,
Hangul Syllables, High Surrogates, High Private Use Surrogates, Low Surrogates,
Private Use, CJK Compatibility Ideographs, Alphabetic Presentation Forms,
Arabic Presentation Forms-A, Combining Half Marks, CJK Compatibility Forms,
Small Form Variants, Arabic Presentation Forms-B, Specials,
Halfwidth and Fullwidth Forms
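- Others:
ALL (equivalent to [\u0000-\v10FFFF])
ASSIGNED (\p{ASSIGNED} is equivalent to \P{Cn})
UNASSIGNED (\p{UNASSIGNED} is equivalent to \p{Cn})
- \P{name}
Matches one character not in the specified General Category or the specified Block.
Selection and Quantifier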
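- X|Y
Matches either X or Y.
- X*
Matches 0 or more X.
- X+
Matches 1 or more X.
- X?
Matches 0 or 1 X.
- X{number}
Matches X exactly number times.
- X{min,}
Matches X at least min times.
- X{min,max}
Matches X at least min times and at most max times.
- X*?
- X+?
- X??
- X{min,}?
- X{min,max}?
Non-greedy matching.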
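The difference between the greedy quantifiers and their non-greedy counterparts can be sketched as follows. This is an illustration using java.util.regex as a stand-in, on the assumption that these quantifiers behave the same way in both engines; it does not call this class. The class and helper names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class. java.util.regex stands in for the Xerces engine here.
class QuantifierSketch {
    // Returns the first substring of text matched by regex, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "<a><b>";
        System.out.println(firstMatch("<.*>", text));        // greedy: <a><b>
        System.out.println(firstMatch("<.*?>", text));       // non-greedy: <a>
        System.out.println(firstMatch("o{2,3}", "foooo"));   // greedy: ooo
        System.out.println(firstMatch("o{2,3}?", "foooo"));  // non-greedy: oo
    }
}
```

A greedy quantifier takes as many repetitions as it can and gives them back only when the rest of the pattern fails; a non-greedy one starts from the minimum and extends only when needed.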
Grouping, Capturing, and Back-reference
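- (?:X)
Grouping. "foo+" matches "foo" or "foooo". To match "foofoo" or "foofoofoo", write "(?:foo)+".
- (X)
Grouping with capturing. It makes a group, and applications can learn where in the target text the group matched by using methods of a Match instance after matches(String,Match). The 0th group means the whole of this regular expression. The Nth group is the inside of the Nth left parenthesis.
For instance, if the regular expression is " *([^<:]*) +<([^>]*)> *" and the target text is "From: TAMURA Kent <kent@trl.ibm.co.jp>":
Match.getCapturedText(0): " TAMURA Kent <kent@trl.ibm.co.jp>"
Match.getCapturedText(1): "TAMURA Kent"
Match.getCapturedText(2): "kent@trl.ibm.co.jp"
- \1 \2 \3 \4 \5 \6 \7 \8 \9
Back-references to the text captured by the corresponding group.
- (?>X)
Independent expression group.
- (?options:X)
- (?options-options2:X)
Applies options to X and, when given, turns off options2 for X. Neither options nor options2 may contain 'u'.
- (?options)
- (?options-options2)
Sets options for the rest of the enclosing expression. Per the BNF below, these expressions appear at the beginning of a regex.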
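The capturing example above (the "From:" header) and a back-reference can be sketched as follows. The sketch uses java.util.regex as a stand-in, where Matcher.group(n) plays the role that Match.getCapturedText(n) plays for this class; it assumes both engines treat this particular pattern identically. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class. java.util.regex stands in for the Xerces engine here.
class CaptureSketch {
    // The pattern from the capturing example in this document.
    static final Pattern HEADER = Pattern.compile(" *([^<:]*) +<([^>]*)> *");

    // Returns groups 0..n of the first match in text, or null if nothing matches.
    static String[] capture(String text) {
        Matcher m = HEADER.matcher(text);
        if (!m.find()) {
            return null;
        }
        String[] groups = new String[m.groupCount() + 1];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    // Uses the back-reference \1 to detect an immediately repeated word.
    static boolean hasDoubledWord(String text) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        String[] g = capture("From: TAMURA Kent <kent@trl.ibm.co.jp>");
        System.out.println("[" + g[0] + "]");  // [ TAMURA Kent <kent@trl.ibm.co.jp>]
        System.out.println("[" + g[1] + "]");  // [TAMURA Kent]
        System.out.println("[" + g[2] + "]");  // [kent@trl.ibm.co.jp]
        System.out.println(hasDoubledWord("the the cat"));  // true
    }
}
```

Group 0 starts at the space after the colon because the colon is excluded from the first group, so no match can begin earlier in the text.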
Anchor
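- \A
Matches the beginning of the text.
- \Z
Matches the end of the text, or before an EOL character at the end of the text.
- \z
Matches the end of the text.
- ^
Matches the beginning of the text. It is equivalent to \A. When the "m" option is set, it matches the beginning of the text or the position after an EOL character.
- $
Matches the end of the text, or before an EOL character at the end of the text. When the "m" option is set, it matches the end of the text or the position before an EOL character.
- \b
Matches a word boundary. (See the "w" option.)
- \B
Matches a non-word boundary. (See the "w" option.)
- \<
Matches the beginning of a word. (See the "w" option.)
- \>
Matches the end of a word. (See the "w" option.)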
Lookahead and lookbehind
- (?=X)
Lookahead.
- (?!X)
Negative lookahead.
- (?<=X)
Lookbehind.
- (?<!X)
Negative lookbehind.
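The four lookaround forms listed above test the surrounding text without including it in the match. A minimal sketch, again using java.util.regex as a stand-in on the assumption that these constructs behave the same in both engines (the class name, helper, and sample strings are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class. java.util.regex stands in for the Xerces engine here.
class LookaroundSketch {
    // Returns the first substring of text matched by regex, or null if there is none.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Lookahead: digits that are followed by " USD".
        System.out.println(firstMatch("\\d+(?= USD)", "10 EUR, 25 USD"));        // 25
        // Negative lookahead: a whole number that is not followed by " USD".
        System.out.println(firstMatch("\\b\\d+\\b(?! USD)", "25 USD, 10 EUR"));  // 10
        // Lookbehind: digits that are preceded by "$".
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $42"));             // 42
        // Negative lookbehind: a number that is not preceded by "$".
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", "$42 and 7"));          // 7
    }
}
```

The word boundaries in the negative examples matter: without them the engine can backtrack to a shorter run of digits and still satisfy the negative condition.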
Misc.
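- (?(condition)yes-pattern|no-pattern)
- (?(condition)yes-pattern)
Conditional matching. Per the BNF below, the condition is a group number, an anchor, or a lookahead/lookbehind expression.
- (?#comment)
Comment. A comment string consists of characters other than ')'.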
BNF for the regular expression
regex ::= ('(?' options ')')? term ('|' term)*
term ::= factor+
factor ::= anchors | atom (('*' | '+' | '?' | minmax ) '?'? )?
| '(?#' [^)]* ')'
minmax ::= '{' ([0-9]+ | [0-9]+ ',' | ',' [0-9]+ | [0-9]+ ',' [0-9]+) '}'
atom ::= char | '.' | char-class | '(' regex ')' | '(?:' regex ')' | '\' [0-9]
| '\w' | '\W' | '\d' | '\D' | '\s' | '\S' | category-block | '\X'
| '(?>' regex ')' | '(?' options ':' regex ')'
| '(?' ('(' [0-9] ')' | '(' anchors ')' | looks) term ('|' term)? ')'
options ::= [imsw]* ('-' [imsw]+)?
anchors ::= '^' | '$' | '\A' | '\Z' | '\z' | '\b' | '\B' | '\<' | '\>'
looks ::= '(?=' regex ')' | '(?!' regex ')'
| '(?<=' regex ')' | '(?<!' regex ')'
char ::= '\\' | '\' [efnrtv] | '\c' [@-_] | code-point | character-1
category-block ::= '\' [pP] category-symbol-1
| ('\p{' | '\P{') (category-symbol | block-name
| other-properties) '}'
category-symbol-1 ::= 'L' | 'M' | 'N' | 'Z' | 'C' | 'P' | 'S'
category-symbol ::= category-symbol-1 | 'Lu' | 'Ll' | 'Lt' | 'Lm' | 'Lo'
| 'Mn' | 'Me' | 'Mc' | 'Nd' | 'Nl' | 'No'
| 'Zs' | 'Zl' | 'Zp' | 'Cc' | 'Cf' | 'Cn' | 'Co' | 'Cs'
| 'Pd' | 'Ps' | 'Pe' | 'Pc' | 'Po'
| 'Sm' | 'Sc' | 'Sk' | 'So'
block-name ::= (See above)
other-properties ::= 'ALL' | 'ASSIGNED' | 'UNASSIGNED'
character-1 ::= (any character except meta-characters)
char-class ::= '[' ranges ']'
| '(?[' ranges ']' ([-+&] '[' ranges ']')? ')'
ranges ::= '^'? (range ','?)+
range ::= '\d' | '\w' | '\s' | '\D' | '\W' | '\S' | category-block
| range-char | range-char '-' range-char
range-char ::= '\[' | '\]' | '\\' | '\' [,-efnrtv] | code-point | character-2
code-point ::= '\x' hex-char hex-char
| '\x{' hex-char+ '}'
| '\v' hex-char hex-char hex-char hex-char hex-char hex-char
hex-char ::= [0-9a-fA-F]
character-2 ::= (any character except \[]-,)
TODO
Parsing performance
Version: $Id: RegularExpression.java,v 1.10 2005/03/22 03:26:24 mrglavas Exp $
Author: TAMURA Kent <kent@trl.ibm.co.jp>
RegularExpression(String regex) - Creates a new RegularExpression instance.
RegularExpression(String regex, String options) - Creates a new RegularExpression instance with options.
boolean | equals(Object obj) - Returns true if the patterns are the same and the options are equivalent.
int | getNumberOfGroups() - Returns the number of regular expression groups.
String | getOptions() - Returns an option string.
String | getPattern()
int | hashCode()
boolean | matches(CharacterIterator target) - Checks whether the target text contains this pattern or not.
boolean | matches(CharacterIterator target, Match match) - Checks whether the target text contains this pattern or not.
boolean | matches(String target) - Checks whether the target text contains this pattern or not.
boolean | matches(String target, int start, int end) - Checks whether the target text contains this pattern in the specified range or not.
boolean | matches(String target, int start, int end, Match match) - Checks whether the target text contains this pattern in the specified range or not.
boolean | matches(String target, Match match) - Checks whether the target text contains this pattern or not.
boolean | matches(char[] target) - Checks whether the target text contains this pattern or not.
boolean | matches(char[] target, int start, int end) - Checks whether the target text contains this pattern in the specified range or not.
boolean | matches(char[] target, int start, int end, Match match) - Checks whether the target text contains this pattern in the specified range or not.
boolean | matches(char[] target, Match match) - Checks whether the target text contains this pattern or not.
void | setPattern(String newPattern)
void | setPattern(String newPattern, String options)
String | toString() - Represents this instance as a String.
RegularExpression
public RegularExpression(String regex)
throws ParseException
Creates a new RegularExpression instance.
regex
- A regular expression
RegularExpression
public RegularExpression(String regex,
String options)
throws ParseException
Creates a new RegularExpression instance with options.
regex
- A regular expression
options
- A String consisting of "i" "m" "s" "u" "w" "," "X"
equals
public boolean equals(Object obj)
Returns true if the patterns are the same and the options are equivalent.
getNumberOfGroups
public int getNumberOfGroups()
Returns the number of regular expression groups.
This method returns 1 when the regular expression has no capturing-parenthesis.
getOptions
public String getOptions()
Returns an option string.
The order of letters in it may differ from the string specified in a constructor or setPattern().
See Also: RegularExpression(java.lang.String,java.lang.String), setPattern(java.lang.String,java.lang.String)
getPattern
public String getPattern()
hashCode
public int hashCode()
matches
public boolean matches(CharacterIterator target)
Checks whether the target text contains this pattern or not.
- true if the target is matched to this regular expression.
matches
public boolean matches(CharacterIterator target,
Match match)
Checks whether the target text contains this pattern or not.
match
- A Match instance for storing matching result.
- true if the target is matched to this regular expression.
matches
public boolean matches(String target)
Checks whether the target text contains this pattern or not.
- true if the target is matched to this regular expression.
matches
public boolean matches(String target,
int start,
int end)
Checks whether the target text contains this pattern
in specified range or not.
start
- Start offset of the range.
end
- End offset + 1 of the range.
- true if the target is matched to this regular expression.
matches
public boolean matches(String target,
int start,
int end,
Match match)
Checks whether the target text contains this pattern
in specified range or not.
start
- Start offset of the range.
end
- End offset + 1 of the range.
match
- A Match instance for storing matching result.
- true if the target is matched to this regular expression.
matches
public boolean matches(String target,
Match match)
Checks whether the target text contains this pattern or not.
match
- A Match instance for storing matching result.
- true if the target is matched to this regular expression.
matches
public boolean matches(char[] target)
Checks whether the target text contains this pattern or not.
- true if the target is matched to this regular expression.
matches
public boolean matches(char[] target,
int start,
int end)
Checks whether the target text contains this pattern
in specified range or not.
start
- Start offset of the range.
end
- End offset + 1 of the range.
- true if the target is matched to this regular expression.
matches
public boolean matches(char[] target,
int start,
int end,
Match match)
Checks whether the target text contains this pattern
in specified range or not.
start
- Start offset of the range.
end
- End offset + 1 of the range.
match
- A Match instance for storing matching result.
- true if the target is matched to this regular expression.
matches
public boolean matches(char[] target,
Match match)
Checks whether the target text contains this pattern or not.
match
- A Match instance for storing matching result.
- true if the target is matched to this regular expression.
setPattern
public void setPattern(String newPattern)
throws ParseException
setPattern
public void setPattern(String newPattern,
String options)
throws ParseException
toString
public String toString()
Represents this instance as a String.
Copyright © 1999-2005 Apache XML Project. All Rights Reserved.