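Regular Expressions

Basic regular expression syntax

Start and end anchors

^ matches the start of the string; to match a literal ^, write \^. For example, ^Yao matches a string that begins with Yao.
$ matches the end of the string; to match a literal $, write \$. For example, Yao$ matches a string that ends with Yao.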
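
As a minimal sketch of the two anchors, the helpers below (hypothetical names, not part of the standard library) test the Yao examples with std::regex_search. Note that a backslash has to be doubled inside an ordinary C++ string literal.

```cpp
#include <regex>
#include <string>

// Returns true if s begins with "Yao" (uses the ^ anchor).
bool starts_with_yao(const std::string& s) {
    static const std::regex re("^Yao");
    return std::regex_search(s, re);
}

// Returns true if s ends with "Yao" (uses the $ anchor).
bool ends_with_yao(const std::string& s) {
    static const std::regex re("Yao$");
    return std::regex_search(s, re);
}

// Returns true if s contains a literal '$'. The regex is \$ ; the
// backslash is doubled because it sits in a C++ string literal.
bool contains_dollar(const std::string& s) {
    static const std::regex re("\\$");
    return std::regex_search(s, re);
}
```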

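Letters and digits

\w matches any letter, digit, or underscore.
\W matches any character that is not a letter, digit, or underscore.
\d matches any single digit from 0 to 9; equivalent to [0-9].
\D matches any non-digit character; equivalent to [^0-9].
[[:alpha:]] matches any letter.
[[:alnum:]] matches any letter or digit.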
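
The character classes above can be tried out with a short sketch like the following. The function names are illustrative assumptions; each one uses std::regex_match, so the whole string has to consist of the given class.

```cpp
#include <regex>
#include <string>

// True if s is one or more letters, digits, or underscores (\w+).
bool is_word_chars(const std::string& s) {
    static const std::regex re("\\w+");
    return std::regex_match(s, re);
}

// True if s is one or more digits (\d+).
bool is_all_digits(const std::string& s) {
    static const std::regex re("\\d+");
    return std::regex_match(s, re);
}

// True if s is one or more letters ([[:alpha:]]+).
bool is_all_letters(const std::string& s) {
    static const std::regex re("[[:alpha:]]+");
    return std::regex_match(s, re);
}
```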

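Special characters

. matches any character except a newline.
| is the logical OR of two alternatives.
\ marks the next character as a special character, a literal, a backreference, or an octal escape.
\s matches any whitespace character (space, tab, form feed, and so on).
\S matches any non-whitespace character.
\f matches a form feed; equivalent to \x0c and \cL.
\n matches a newline; equivalent to \x0a and \cJ.
\r matches a carriage return; equivalent to \x0d and \cM.
\t matches a tab; equivalent to \x09 and \cI.
\v matches a vertical tab; equivalent to \x0b and \cK.
\b matches a word boundary, that is, the position between a word character and a non-word character (such as a space) or the start or end of the input; it consumes no characters. For example, er\b matches the er in never but not the er in verb.
\B matches a non-word boundary. For example, er\B matches the er in verb but not the er in never.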
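
The never/verb example for \b and \B can be checked with a small sketch. The two helper names are assumptions made for illustration only.

```cpp
#include <regex>
#include <string>

// True if s contains "er" immediately followed by a word boundary (er\b).
bool er_at_word_end(const std::string& s) {
    static const std::regex re("er\\b");
    return std::regex_search(s, re);
}

// True if s contains "er" that is NOT followed by a word boundary (er\B).
bool er_inside_word(const std::string& s) {
    static const std::regex re("er\\B");
    return std::regex_search(s, re);
}
```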

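Repetition counts

{n} matches exactly n times.
{n,} matches at least n times.
{n,m} matches between n and m times.
* matches zero or more times; equivalent to {0,}.
? matches at most once (zero or one time); equivalent to {0,1}.
+ matches at least once; equivalent to {1,}.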
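
A quick way to experiment with the quantifiers is a one-line wrapper around std::regex_match. The helper name full_match is an assumption; it compiles the pattern on every call, which is acceptable for a demonstration but not for hot code.

```cpp
#include <regex>
#include <string>

// True if the whole of s matches pattern (ECMAScript grammar).
bool full_match(const std::string& pattern, const std::string& s) {
    return std::regex_match(s, std::regex(pattern));
}
```

For instance, full_match("colou?r", "color") and full_match("colou?r", "colour") both hold, while full_match("\\d{2,4}", "12345") does not, because five digits exceed the upper bound.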

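Sets and complements

[] denotes a character set; use - to express a range. For example, [0-9] is the set of the ten digits 0 through 9.
[^] denotes the complement of a set. For example, [^7-9] matches any single character other than 7, 8, and 9, so it accepts the digits 0 through 6 but also letters and other characters.

Grouping

() groups a subexpression, and | separates alternatives. For example: \.(cpp|cxx)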
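
The following sketch combines a complemented set and a group with alternatives. Both function names are hypothetical; cpp_extension returns the text captured by the first group, and is_not_7_to_9 shows that [^7-9] also accepts non-digit characters.

```cpp
#include <regex>
#include <string>

// Returns the extension captured by group 1 ("cpp" or "cxx"),
// or an empty string if the file name does not end in either.
std::string cpp_extension(const std::string& filename) {
    static const std::regex re("\\.(cpp|cxx)$");
    std::smatch m;
    if (std::regex_search(filename, m, re))
        return m.str(1);
    return "";
}

// True if c is any character other than '7', '8', or '9'.
bool is_not_7_to_9(char c) {
    static const std::regex re("[^7-9]");
    return std::regex_match(std::string(1, c), re);
}
```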

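Regular expression library

The C++ regular expression library (RE library) is part of the C++11 standard library and is defined in the header <regex>.
Components of the regex library:

Component	Description
regex	Class that represents a regular expression
regex_match	Matches an entire character sequence against a regular expression
regex_search	Finds the first subsequence that matches the regular expression
regex_replace	Replaces text that matches a regular expression using a given format
sregex_iterator	Iterator adaptor that calls regex_search to iterate over all matches in a string
smatch	Container class that holds the results of searching a string
ssub_match	Result for a matched subexpression in a string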

regex

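Constructing a regex

regex r(re)
regex r(re,f)

Notes:

re is a regular expression. It can be a string, an iterator pair denoting a range of characters, a pointer to a null-terminated character array, a character pointer together with a count, or a braced list of characters.
f is a flag that specifies how the object is processed; the default is ECMAScript. The available flags are:
- icase: ignore case during matching.
- nosubs: do not store matched subexpressions.
- optimize: favor execution speed over construction speed.
- ECMAScript: use the grammar specified by ECMA-262.
- basic: use POSIX basic regular expression syntax.
- extended: use POSIX extended regular expression syntax.
- awk: use the grammar of the POSIX version of awk.
- grep: use the grammar of the POSIX version of grep.
- egrep: use the grammar of the POSIX version of egrep.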
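
To illustrate the effect of a construction flag, the sketch below builds the same pattern with and without icase. The helper names are assumptions introduced for this example.

```cpp
#include <regex>
#include <string>

// Case-insensitive check for a C++ source-file extension.
bool has_cpp_ext_icase(const std::string& name) {
    static const std::regex re("[[:alnum:]]+\\.(cpp|cxx|cc)$",
                               std::regex::icase);
    return std::regex_search(name, re);
}

// Same pattern with the default flags (ECMAScript, case-sensitive).
bool has_cpp_ext_exact(const std::string& name) {
    static const std::regex re("[[:alnum:]]+\\.(cpp|cxx|cc)$");
    return std::regex_search(name, re);
}
```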

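regex assignment

Operation	Description
r1 = re	Replaces the regular expression in r1 with re.
r1.assign(re,f)	Same effect as the assignment operator =; the optional flag f has the same meaning as the corresponding constructor parameter.

Constructors and assignment operations may throw a regex_error exception.

Other regex operations

Operation	Description
r.mark_count()	Number of subexpressions in r
r.flags()	Returns the set of flags for r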
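
A short sketch of assign, mark_count, and flags follows. The function names are illustrative assumptions; the calls on std::regex are the standard members listed in the tables.

```cpp
#include <regex>
#include <string>

// Number of marked subexpressions (capture groups) in pattern.
unsigned count_groups(const std::string& pattern) {
    return std::regex(pattern).mark_count();
}

// Re-assigns a regex with the icase flag, then checks both the stored
// flag set and the resulting case-insensitive behaviour.
bool ignores_case_after_assign() {
    std::regex r("abc");
    r.assign("[a-z]+", std::regex::icase);
    const bool flag_set =
        (r.flags() & std::regex::icase) == std::regex::icase;
    return flag_set && std::regex_match("HELLO", r);
}
```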

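When a regular expression contains an error, the library throws an exception of type regex_error at run time:

The what member of regex_error describes the error that occurred.
The code member of regex_error returns a numeric code for the error type.

The standard errors that the RE library can throw are listed below. The numeric values of the codes are implementation-defined; on many implementations they are numbered from 0 in the order shown:

Error type	Description
error_collate	Invalid collating element request
error_ctype	Invalid character class
error_escape	Invalid escape character or trailing escape
error_backref	Invalid back reference
error_brack	Mismatched square brackets []
error_paren	Mismatched parentheses ()
error_brace	Mismatched braces {}
error_badbrace	Invalid range inside {}
error_range	Invalid character range, such as [z-a]
error_space	Insufficient memory to handle this regular expression
error_badrepeat	A repetition character (*, ?, +, or {) was not preceded by a valid regular expression
error_complexity	The requested match is too complex
error_stack	Insufficient stack space to evaluate the match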

try
{
	//@ the closing bracket after alnum is missing, so the constructor throws
	regex r("[[:alnum:]+\\.(cpp|cxx|cc)$", regex::icase);
}
catch (const regex_error& e)
{
	cout << e.what() << "\ncode:" << e.code() << endl;
}

Output (the exact message text and code value depend on the implementation):

regex_error(error_brack): The expression contained mismatched [ and ].
code:4

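Avoid creating unnecessary regular expressions

The "program" that a regular expression represents is compiled at run time, not at compile time.
Compiling a regular expression can be slow. To reduce this overhead, avoid constructing more regex objects than necessary. In particular, if a regular expression is used inside a loop, construct it outside the loop rather than recompiling it on every iteration.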
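
The advice above might look like this in practice. The function name count_matching_lines is an assumption; the point of the sketch is only that the regex object is constructed once, before the loop.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Counts the lines that contain a match for pattern.
std::size_t count_matching_lines(const std::vector<std::string>& lines,
                                 const std::string& pattern) {
    const std::regex re(pattern);  // compiled once, outside the loop
    std::size_t count = 0;
    for (const auto& line : lines) {
        if (std::regex_search(line, re))  // reuses the compiled regex
            ++count;
    }
    return count;
}
```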

Regex classes and input sequence types

Input type	Regular expression classes
string	regex, smatch, ssub_match, sregex_iterator
const char*	regex, cmatch, csub_match, cregex_iterator
wstring	wregex, wsmatch, wssub_match, wsregex_iterator
const wchar_t*	wregex, wcmatch, wcsub_match, wcregex_iterator

regex r("[[:alnum:]]+\\.(cpp|cxx|cc)$", regex::icase);
smatch results;
if (regex_search("myfile.cc", results, r))  //@ error: the input is a char*, but smatch expects a string
	cout << results.str() << endl;

To search a character array, use a cmatch object instead:

regex r("[[:alnum:]]+\\.(cpp|cxx|cc)$", regex::icase);
cmatch results;
if (regex_search("myfile.cc", results, r))  //@ OK: cmatch matches a char* input
	cout << results.str() << endl;

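regex_search and regex_match

Arguments to regex_search and regex_match

regex_search and regex_match accept the following argument lists:

(seq,m,r,mft)
(seq,r,mft)

Notes:

Both functions return a bool that indicates whether a match was found. regex_match succeeds only if the entire sequence matches, whereas regex_search succeeds if any subsequence matches.
seq is the character sequence; it can be a string, a pair of iterators denoting a range, or a pointer to a null-terminated character array.
m is a match object that stores the details of the match; m and seq must have compatible types.
mft is an optional regex_constants::match_flag_type value that affects the matching process.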
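
The two argument forms, and the difference between matching the whole sequence and searching within it, can be sketched as follows. The helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// (seq, r) form of regex_match: the entire string must be digits.
bool whole_is_number(const std::string& s) {
    static const std::regex re("\\d+");
    return std::regex_match(s, re);
}

// (seq, r) form of regex_search: a run of digits anywhere is enough.
bool contains_number(const std::string& s) {
    static const std::regex re("\\d+");
    return std::regex_search(s, re);
}

// (seq, m, r) form: the smatch object receives the match details.
// Returns the first run of digits, or an empty string if there is none.
std::string first_number(const std::string& s) {
    static const std::regex re("\\d+");
    std::smatch m;
    return std::regex_search(s, m, re) ? m.str() : std::string();
}
```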

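Matching and regex iterator types

A regex iterator is an iterator adaptor that is bound to an input sequence and a regex object.

Operations on sregex_iterator, cregex_iterator, wsregex_iterator, and wcregex_iterator:

Operation	Description
sregex_iterator it(b,e,r);	An sregex_iterator that iterates over the string denoted by iterators b and e. It calls regex_search(b,e,r) to position it on the first match in the input.
sregex_iterator end;	The off-the-end sregex_iterator.
*it	Returns a reference to the smatch object from the most recent call to regex_search.
it->	Returns a pointer to the smatch object from the most recent call to regex_search.
++it	Calls regex_search on the input sequence, starting just after the current match. Returns the incremented iterator.
it++	Calls regex_search on the input sequence, starting just after the current match. Returns the old value.
it1 == it2	Two sregex_iterators are equal if both are off-the-end iterators; two non-end iterators are equal if they were constructed from the same input sequence and regex object.
it1 != it2	The negation of it1 == it2.

When an sregex_iterator is bound to a string and a regex object, the iterator is automatically positioned on the first match in that string. That is, the sregex_iterator constructor calls regex_search on the given string and regex. Dereferencing the iterator yields the smatch object for the most recent search. Incrementing the iterator calls regex_search to find the next match in the input string.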

string test_str = "freind white receipt receive theif";
string pattern("[^c]ei");
pattern = "[[:alpha:]]*" + pattern + "[[:alpha:]]*";
regex r(pattern,regex::icase);
for (sregex_iterator it(test_str.begin(), test_str.end(), r), end_it;
it != end_it; ++it)
	cout << it->str() << endl;

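smatch operations

These operations also apply to cmatch, wsmatch, and wcmatch, and to the corresponding csub_match, wssub_match, and wcsub_match types. In the operations that take an index, n defaults to 0, and index 0 refers to the overall match:

Operation	Description
m.ready()	Returns true if m has been set by a call to regex_search or regex_match; otherwise returns false. If ready returns false, operations on m are undefined.
m.size()	Returns 0 if the match failed; otherwise returns one plus the number of subexpressions in the most recently matched regular expression.
m.empty()	Returns true if m.size() is 0.
m.prefix()	An ssub_match object representing the sequence before the current match.
m.suffix()	An ssub_match object representing the sequence after the current match.
m.format(...)	Produces formatted output from the match; see the regex_replace section.
m.length(n)	Size of the nth matched subexpression.
m.position(n)	Distance of the nth subexpression from the start of the sequence.
m.str(n)	The string matched by the nth subexpression.
m[n]	The ssub_match object for the nth subexpression.
m.begin(),m.end()	Iterators over the sub_match elements in m.
m.cbegin(),m.cend()	Iterators over the sub_match elements in m; cbegin and cend return const_iterator.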
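
The sketch below collects several smatch operations into one place. The PhoneMatch struct and the find_phone function are assumptions made for this example; the pattern reuses the phone-number shape that appears later in these notes.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical holder for the pieces of a match.
struct PhoneMatch {
    bool found = false;
    std::size_t size = 0;       // m.size(): whole match + subexpressions
    std::string prefix;         // text before the match
    std::string area;           // subexpression 1
    std::string number;         // subexpression 2
    std::string suffix;         // text after the match
    long position = 0;          // offset of the whole match
    long length = 0;            // length of the whole match
};

PhoneMatch find_phone(const std::string& text) {
    static const std::regex re("(\\d{3})-(\\d{8})");
    PhoneMatch out;
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return out;             // m.size() is 0 here
    out.found = true;
    out.size = m.size();
    out.prefix = m.prefix().str();
    out.area = m.str(1);
    out.number = m[2].str();
    out.suffix = m.suffix().str();
    out.position = static_cast<long>(m.position(0));
    out.length = static_cast<long>(m.length(0));
    return out;
}
```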

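Submatch operations

These operations apply to ssub_match, csub_match, wssub_match, and wcsub_match:

Operation	Description
matched	A public bool data member that indicates whether this ssub_match was matched.
first	A public data member: an iterator to the first element of the matched sequence.
second	A public data member: an iterator one past the end of the matched sequence. If there was no match, first and second are equal.
length()	Size of the match; returns 0 if matched is false.
str()	Returns a string containing the matched portion of the input; returns an empty string if matched is false.
s=ssub	Converts the ssub_match object ssub to the string s; equivalent to s=ssub.str(). The conversion operator is not explicit.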
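
An optional group is a convenient way to see the matched member in action. In this sketch the pattern and the function name extension_or_default are assumptions; group 4 participates in the match only when a trailing "-digits" part is present.

```cpp
#include <regex>
#include <string>

// Returns the optional trailing extension of a "ddd-dddd[-ext]" string,
// "none" if the extension is absent, or "invalid" if s has another shape.
std::string extension_or_default(const std::string& s) {
    static const std::regex re("(\\d{3})-(\\d{4})(-(\\d+))?");
    std::smatch m;
    if (!std::regex_match(s, m, re))
        return "invalid";
    const std::ssub_match& ext = m[4];
    if (!ext.matched)           // the optional group did not participate
        return "none";
    std::string text = ext;     // implicit conversion, same as ext.str()
    return text;
}
```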

regex_replace

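Regular expression replacement operations:

m.format(dest,fmt,mft)
m.format(fmt,mft)

Notes:

fmt is the format string used to produce the formatted output; it can be a string or a pointer to a null-terminated character array.
mft is a match_flag_type value; it defaults to format_default.
The first version writes to the destination that the iterator dest refers to; the second version returns a string.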
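
As a minimal sketch of the second form of format, the hypothetical helper below rewrites the first phone-like match using a format string in which $1 and $2 stand for the two subexpressions. Only the formatted match is returned; the text around it is not copied.

```cpp
#include <regex>
#include <string>

// Formats the first "ddd-dddddddd" match as "(ddd) dddddddd".
// Returns an empty string if there is no match.
std::string format_first_phone(const std::string& text) {
    static const std::regex re("(\\d{3})-(\\d{8})");
    std::smatch m;
    if (!std::regex_search(text, m, re))
        return "";
    return m.format("($1) $2");  // default flag: format_default
}
```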

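To find and replace a regular expression in an input sequence, call regex_replace.

regex_replace(dest,seq,r,fmt,mft)
regex_replace(seq,r,fmt,mft)

Notes:

Iterates through seq, using regex_search to find the substrings that match the regex object r.
Uses the format string fmt and the optional match_flag_type flags to produce the output.
The first version writes the output to the position given by the iterator dest and takes seq as a pair of iterators denoting a range. The second version returns a string that holds the output.
In the second version, seq can be either a string or a pointer to a null-terminated character array; fmt can be either of these in both versions.
mft defaults to match_default.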

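Match flags

Flag	Description
match_default	Equivalent to format_default
match_not_bol	Do not treat the first character as the beginning of a line
match_not_eol	Do not treat the last character as the end of a line
match_not_bow	Do not treat the first character as the beginning of a word
match_not_eow	Do not treat the last character as the end of a word
match_any	If more than one match is possible, any of them may be returned
match_not_null	Do not match an empty sequence
match_continuous	The match must begin with the first character of the input
match_prev_avail	The input sequence has characters before the first one
format_default	Replacement string uses ECMAScript rules
format_sed	Replacement string uses POSIX sed rules
format_no_copy	Do not output the unmatched parts of the input
format_first_only	Replace only the first occurrence of the regular expression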
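
A few of these flags can be exercised with a short sketch. The function names are assumptions; the flags themselves come from std::regex_constants.

```cpp
#include <regex>
#include <string>

// Replaces only the first run of digits with '#'.
std::string mask_first_number(const std::string& s) {
    static const std::regex re("\\d+");
    return std::regex_replace(s, re, "#",
                              std::regex_constants::format_first_only);
}

// Keeps only the matched digit runs; unmatched text is not copied.
// "$&" in the format string stands for the whole match.
std::string digits_only(const std::string& s) {
    static const std::regex re("\\d+");
    return std::regex_replace(s, re, "$&",
                              std::regex_constants::format_no_copy);
}

// With match_not_bol, ^ no longer matches at the first character.
bool starts_with_a_not_bol(const std::string& s) {
    static const std::regex re("^a");
    return std::regex_search(s, re,
                             std::regex_constants::match_not_bol);
}
```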

Examples

regex_match

void test_regex_match()
{
	string pattern{"\\d{3}-\\d{8}|\\d{4}-\\d{7}"}; //@ phone number
	vector<string> str{ "010-12345678","0411-1234567","021-12345678","0100-12345678" };
	try{
		regex re(pattern);

		for (const auto& tmp : str)
		{
			if (regex_match(tmp, re))
				cout << tmp << " matched" << endl;
			else
				cout << tmp << " unmatched" << endl;
		}
	}
	catch (const regex_error& e)
	{
		cout << e.what() << "\ncode:" << e.code() << endl;
	}
}

regex_search

void test_regex_search()
{
	//@ url; the group keeps ^ and :// applied to both alternatives
	string pattern{"^(http|https)://\\S*$"};
	vector<std::string> str{ "http://www.baidu.com", "https://www.baidu.com",
		"abcd://124.456", "abcd http://www.baidu.com 123" };

	try {
		regex re(pattern);

		for (const auto& tmp : str)
		{
			if (regex_search(tmp, re))
				cout << tmp << " can search" << endl;
			else
				cout << tmp << " can not search" << endl;
		}
	}
	catch (const regex_error& e)
	{
		cout << e.what() << "\ncode:" << e.code() << endl;
	}
}

void test_regex_search2()
{
	string pattern{ "[a-zA-Z]+://[^\\s]*" };  //@ url
	string str{ "baidu addr is: http://www.baidu.com , google addr is: https://www.google.com " };

	smatch results;
	try {
		regex re(pattern);
		//@ search repeatedly, continuing from the text after each match
		while (regex_search(str, results, re))
		{
			for (auto s : results)
				cout << s << endl;
			str = results.suffix().str();
		}
	}
	catch (const regex_error& e)
	{
		cout << e.what() << "\ncode:" << e.code() << endl;
	}
}

regex_replace

void test_regex_replace()
{
	string pattern{ "\\d{18}|\\d{17}X" }; //@ id card
	vector<std::string> str{ "123456789012345678", "abcd123456789012345678efgh",
		"abcdefbg", "12345678901234567X" };
	string fmt{ "********" };

	try {
		regex re(pattern);
		for (const auto& tmp : str)
		{
			string res = regex_replace(tmp, re, fmt);
			cout << "src:" << tmp << " " << "ret:" << res << endl;
		}
	}
	catch (const regex_error& e)
	{
		cout << e.what() << "\ncode:" << e.code() << endl;
	}
}

int test_regex_replace2()
{
	std::string s("there is a subsequence in the string\n");
	std::regex e("\\b(sub)([^ ]*)");   // matches words beginning with "sub"

	// using string/c-string (3) version:
	std::cout << std::regex_replace(s, e, "sub-$2");

	// using range/c-string (6) version:
	std::string result;
	std::regex_replace(std::back_inserter(result), s.begin(), s.end(), e, "$2");
	std::cout << result;

	// with flags:
	std::cout << std::regex_replace(s, e, "$1 and $2", std::regex_constants::format_no_copy);
	std::cout << std::endl;

	return 0;
}