Although Apache considers Jakarta ORO the more complete regular-expression package, regexp is also very widely used, probably because it is so simple. Below are my study notes on regexp.
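2) The RE class is a very important class in the regexp package. It is an efficient, lightweight regular-expression evaluator/matcher; RE is short for "regular expression". A regular expression is a pattern that can perform complex string matching, and when a string matches a pattern you can extract the matching parts, which is very useful when parsing text. The regular-expression syntax is discussed below.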
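To compile a regular expression, simply construct an RE matcher object with the pattern as its argument. You can then call any of the RE.match methods to check a string against the pattern; they return true if the match succeeds and false if it fails.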
RE.getParen retrieves the matched character sequence, or a part of it (if the pattern contains the corresponding parentheses), and related methods return attributes such as its position and length. For example:
RE r = new RE("(a*)b"); // compile the pattern
boolean matched = r.match("xaaaab"); // match it against "xaaaab"
String wholeExpr = r.getParen(0); // wholeExpr will be 'aaaab'
String insideParens = r.getParen(1); // insideParens will be 'aaaa'
int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
int endWholeExpr = r.getParenEnd(0); // endWholeExpr will be index 6
int lenWholeExpr = r.getParenLength(0); // lenWholeExpr will be 5
int startInside = r.getParenStart(1); // startInside will be index 1
int endInside = r.getParenEnd(1); // endInside will be index 5
int lenInside = r.getParenLength(1); // lenInside will be 4
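For readers who want to check these index values without the Jakarta Regexp jar on the classpath, here is a minimal sketch that uses the JDK's own java.util.regex instead. This is a stand-in, not the RE class: `group`, `start`, and `end` are assumed here to play the roles of getParen, getParenStart, and getParenEnd, and the class and method names (`ParenDemo`, `describeGroup`) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParenDemo {
    /**
     * Describes capture group `index` of the first match of `regex` in `input`
     * as "text [start,end) len=N", or returns null if nothing matches.
     */
    static String describeGroup(String regex, String input, int index) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null; // no match anywhere in the input
        }
        int start = m.start(index); // analogue of getParenStart(index)
        int end = m.end(index);     // analogue of getParenEnd(index)
        return m.group(index) + " [" + start + "," + end + ") len=" + (end - start);
    }

    public static void main(String[] args) {
        System.out.println(describeGroup("(a*)b", "xaaaab", 0)); // whole match
        System.out.println(describeGroup("(a*)b", "xaaaab", 1)); // inside the parentheses
    }
}
```

Group 0 is the whole match and group 1 is the text inside the first pair of parentheses, which is the same numbering the getParen calls use.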
RE also supports backreferences in regular expressions.
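As a rough illustration of what a backreference does, the sketch below again uses java.util.regex as a stand-in for RE so that it runs on a plain JDK. The pattern `([0-9]+)=\1` and the names `BackrefDemo` and `sameOnBothSides` are chosen for this example only. `\1` must match exactly the text that the first parenthesized group captured.

```java
import java.util.regex.Pattern;

public class BackrefDemo {
    // Illustrative pattern: some digits, '=', then the same digits again.
    // In Java source the backslash of \1 has to be doubled.
    static final Pattern SAME_ON_BOTH_SIDES = Pattern.compile("([0-9]+)=\\1");

    /** Returns true if the whole string has the form n=n, e.g. "12=12". */
    static boolean sameOnBothSides(String s) {
        return SAME_ON_BOTH_SIDES.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(sameOnBothSides("12=12"));
        System.out.println(sameOnBothSides("12=13"));
    }
}
```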
3) RE supports the following regular-expression syntax:
Characters
| unicodeChar | Matches any identical unicode character |
| \ | Used to quote a meta-character (like '*') |
| \\ | Matches a single '\' character |
| \0nnn | Matches a given octal character |
| \xhh | Matches a given 8-bit hexadecimal character |
| \uhhhh | Matches a given 16-bit hexadecimal character |
| \t | Matches an ASCII tab character |
| \n | Matches an ASCII newline character |
| \r | Matches an ASCII return character |
| \f | Matches an ASCII form feed character |
| [abc] | Simple character class |
| [a-zA-Z] | Character class with ranges |
| [^abc] | Negated character class |
| [:alnum:] | Alphanumeric characters. |
| [:alpha:] | Alphabetic characters. |
| [:blank:] | Space and tab characters. |
| [:cntrl:] | Control characters. |
| [:digit:] | Numeric characters. |
| [:graph:] | Characters that are printable and are also visible. (A space is printable, but not visible, while an 'a' is both.) |
| [:lower:] | Lower-case alphabetic characters. |
| [:print:] | Printable characters (characters that are not control characters.) |
| [:punct:] | Punctuation characters (characters that are not letters, digits, control characters, or space characters). |
| [:space:] | Space characters (such as space, tab, and formfeed, to name a few). |
| [:upper:] | Upper-case alphabetic characters. |
| [:xdigit:] | Characters that are hexadecimal digits. |
| [:javastart:] | Start of a Java identifier |
| [:javapart:] | Part of a Java identifier |
| . | Matches any character other than newline |
| \w | Matches a "word" character (alphanumeric plus "_") |
| \W | Matches a non-word character |
| \s | Matches a whitespace character |
| \S | Matches a non-whitespace character |
| \d | Matches a digit character |
| \D | Matches a non-digit character |
| ^ | Matches only at the beginning of a line |
| $ | Matches only at the end of a line |
| \b | Matches only at a word boundary |
| \B | Matches only at a non-word boundary |
| A* | Matches A 0 or more times (greedy) |
| A+ | Matches A 1 or more times (greedy) |
| A? | Matches A 1 or 0 times (greedy) |
| A{n} | Matches A exactly n times (greedy) |
| A{n,} | Matches A at least n times (greedy) |
| A*? | Matches A 0 or more times (reluctant) |
| A+? | Matches A 1 or more times (reluctant) |
| A?? | Matches A 0 or 1 times (reluctant) |
| AB | Matches A followed by B |
| A|B | Matches either A or B |
| (A) | Used for subexpression grouping |
| (?:A) | Used for subexpression clustering (just like grouping but no backrefs) |
| \1 | Backreference to 1st parenthesized subexpression |
| \2 | Backreference to 2nd parenthesized subexpression |
| \3 | Backreference to 3rd parenthesized subexpression |
| \4 | Backreference to 4th parenthesized subexpression |
| \5 | Backreference to 5th parenthesized subexpression |
| \6 | Backreference to 6th parenthesized subexpression |
| \7 | Backreference to 7th parenthesized subexpression |
| \8 | Backreference to 8th parenthesized subexpression |
| \9 | Backreference to 9th parenthesized subexpression |
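The difference between the greedy and reluctant closures in the table is easiest to see side by side. The following is a small sketch using java.util.regex as a stand-in (not the RE class itself), on the assumption that `+` and `+?` behave the same way for this input. The class name `ClosureDemo`, the helper `firstMatch`, and the sample input are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClosureDemo {
    /** Returns the text of the first match of `regex` in `input`, or null if none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: .+ takes as much as it can, so the match runs to the last '>'
        System.out.println(firstMatch("<.+>", input));
        // Reluctant: .+? takes as little as it can, so the match stops at the first '>'
        System.out.println(firstMatch("<.+?>", input));
    }
}
```

Both forms match the same set of strings overall; they differ only in which of the possible matches is tried first, which matters when you extract the matched text.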
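The program that RE executes is first compiled by the RECompiler class. For efficiency reasons, the RE matcher does not include the regular-expression compiler itself. If you want to precompile one or more regular expressions, you can run the 'recompile' class from the command line.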
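Building the RE matcher object from a precompiled expression avoids the cost of compiling at run time. If you need to construct regular expressions dynamically, you can create a single RECompiler object and use it to compile each expression. Note that neither RE nor RECompiler is thread-safe (for efficiency reasons), so in a multithreaded program you need to create a separate compiler and matcher for each thread.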
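One common way to give each thread its own matcher is a ThreadLocal. The sketch below shows the idea with java.util.regex as a stand-in: there the compiled Pattern may be shared, but a Matcher may not, so the Matcher plays the role that a per-thread RE would play. All names (`PerThreadMatcher`, `matches`, `countMatchesInParallel`) and the pattern `a*b` are illustrative assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerThreadMatcher {
    private static final Pattern PATTERN = Pattern.compile("a*b");

    // Each thread lazily creates its own Matcher; instances are never shared.
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> PATTERN.matcher(""));

    /** Matches the whole input using the calling thread's private Matcher. */
    static boolean matches(String input) {
        return MATCHER.get().reset(input).matches();
    }

    /** Checks every input on its own thread and returns how many matched. */
    static int countMatchesInParallel(String[] inputs) {
        AtomicInteger hits = new AtomicInteger();
        Thread[] workers = new Thread[inputs.length];
        for (int i = 0; i < inputs.length; i++) {
            final String input = inputs[i];
            workers[i] = new Thread(() -> {
                if (matches(input)) {
                    hits.incrementAndGet();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join(); // wait for every worker to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return hits.get();
    }

    public static void main(String[] args) {
        System.out.println(countMatchesInParallel(new String[] {"aaab", "b", "abc", "xb"}));
    }
}
```

With Jakarta Regexp the same shape would presumably hold an RE (and, if patterns are built dynamically, an RECompiler) in the ThreadLocal instead.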