Extracting links and titles from content

Suppose the content is HTML that contains many links, and you want to extract the URL and the title of each link.

For example:

<a href='http://www.xx.cn/art/2017/12/26/art_8801_1776064.html' title='标题1' target="_blank"></a>        <p>2017-12-26</p>    </li>    ]]></record>
<record><![CDATA[
    <li> <a href='http://www.xx.gov.cn/art/2017/12/26/art_8801_1776063.html' title='标题2' target="_blank"></a>        <p>2017-12-26</p>    </li>    ]]></record>
<record><![CDATA[
    <li>        <a href='http://www.xx.gov.cn/art/2017/12/26/art_8801_1776060.html' title='标题3' target="_blank"></a>        <p>2017-12-26</p>    </li>    ]]></record>
<record><![CDATA[
    <li>        <a href='http://www.xx.gov.cn/art/2017/12/26/art_8801_1776059.html' title='标题4' target="_blank"></a>        <p>2017-12-26</p>    </li>    ]]></record>
<record><![CDATA[
    <li>        <a href='http://www.xx.gov.cn/art/2017/12/25/art_8801_1775473.html' title='标题5' target="_blank"></a>        <p>2017-12-25</p>    </li>    ]]></record>
<record><![CDATA[
    <li>        <a href='http://www.xx.gov.cn/art/2017/12/22/art_8801_1775476.html' title='标题6' target="_blank"></a>        <p>2017-12-22</p>    </li>    ]]></record>
<record><![CDATA[

Method: regular expression

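using System.Text.RegularExpressions;

string htmlcontext = "";  // the HTML content to scan

// "url" captures the href value (double-quoted, single-quoted, or unquoted).
// "title" captures the title attribute, because the <a> elements in the sample
// have no inner text. This pattern assumes href appears before title in the tag.
Regex regex = new Regex(@"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^>\s]+))[^>]*?\btitle\s*=\s*(?:""(?<title>[^""]*)""|'(?<title>[^']*)')", RegexOptions.IgnoreCase);

for (Match m = regex.Match(htmlcontext); m.Success; m = m.NextMatch())
{
    string stringurl = m.Groups["url"].Value;      // link address
    string stringtitle = m.Groups["title"].Value;  // link title
}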

Output (for the first record in the sample):

http://www.xx.cn/art/2017/12/26/art_8801_1776064.html   标题1
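
The extraction can also be packaged as a reusable helper. The following is a minimal sketch, not part of the original post: the names `LinkExtractor`, `Extract`, and `AttributePattern` are hypothetical. It assumes that attribute values never contain a `>` character and that a link's title lives in the `title` attribute, as in the sample. Rather than one large pattern, it first matches each opening `<a ...>` tag and then looks up `href` and `title` separately, so the order of the attributes should not matter.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LinkExtractor
{
    // Matches one opening <a ...> tag and captures its raw attribute text.
    // Assumption: no attribute value contains '>'.
    private static readonly Regex AnchorTag =
        new Regex(@"<a\b(?<attrs>[^>]*)>", RegexOptions.IgnoreCase);

    private static readonly Regex Href = AttributePattern("href");
    private static readonly Regex Title = AttributePattern("title");

    // Builds a pattern for name="value", name='value', or name=value.
    // The lookbehind keeps names such as data-href from matching href.
    private static Regex AttributePattern(string name)
    {
        return new Regex(
            @"(?<![\w-])" + name +
            @"\s*=\s*(?:""(?<v>[^""]*)""|'(?<v>[^']*)'|(?<v>[^\s>]+))",
            RegexOptions.IgnoreCase);
    }

    // Returns one (Url, Title) pair per <a> tag that has an href.
    // Title is an empty string when the tag has no title attribute.
    public static List<(string Url, string Title)> Extract(string html)
    {
        var links = new List<(string Url, string Title)>();
        foreach (Match tag in AnchorTag.Matches(html))
        {
            string attrs = tag.Groups["attrs"].Value;
            Match href = Href.Match(attrs);
            if (!href.Success)
            {
                continue;  // skip anchors without an href
            }
            Match title = Title.Match(attrs);
            links.Add((href.Groups["v"].Value,
                       title.Success ? title.Groups["v"].Value : string.Empty));
        }
        return links;
    }
}
```

Calling `LinkExtractor.Extract(htmlcontext)` on the sample above should yield one pair per `<a>` tag. For input that is not this regular, a real HTML parser is likely to be more robust than regular expressions.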

Original source: https://www.cnblogs.com/yopo/p/8124608.html