[java] Implementing an Exchange Rate Converter (3)

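1 Series Article Links
2 Preface
3 Extracting Simple Table Information
- 3.1 Simple Table Extraction with Java Regular Expressions
- 3.2 Reorganizing the HtmlTable Class
4 Summary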

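1 Series Article Links

2 Preface

The previous article covered the details of using regular expressions. Here we combine that knowledge with the java.util.regex library to implement the HtmlTableParse class, which extracts the contents of a table from a web page.

3 Extracting Simple Table Information

A sample HTML table looks like this: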

<table border="1">
<tr>
  <th>Month</th>
  <th>Savings</th>
</tr>
<tr>
  <td>January</td>
  <td>$100</td>
</tr>
</table>

Collapsing the markup onto a single line gives:

<table border="1"><tr><th>Month</th><th>Savings</th></tr><tr><td>January</td><td>$100</td></tr></table>

For that single line, we write a regular expression that captures the contents of the table:

<table.*?>((<tr>.*?</tr>)+?)</table>

With this, $1 corresponds to <tr><th>Month</th><th>Savings</th></tr><tr><td>January</td><td>$100</td></tr>. The matched result is then processed again, this time with the regular expression:

<tr>(.*?)</tr>

This yields the contents of each row, e.g. $1 = <th>Month</th><th>Savings</th>. Next, applying the regular expression:

<th>(.*?)</th>

gives the individual elements, e.g. Month, Savings.
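
Note that the pattern <th>(.*?)</th> only picks up header cells, while the data row of the sample table uses <td>. The following is a minimal sketch of the difference, and of how an alternation over both tags covers both kinds of rows. The class name CellPatternDemo and its helper cells are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
class CellPatternDemo {
    // Collect group(1) of every match of regex in input.
    static List<String> cells(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String header = "<th>Month</th><th>Savings</th>";
        String data = "<td>January</td><td>$100</td>";
        String thOnly = "<th>(.*?)</th>";
        String thOrTd = "(?:<th>|<td>)(.*?)(?:</th>|</td>)";

        System.out.println(cells(thOnly, header)); // [Month, Savings]
        System.out.println(cells(thOnly, data));   // [] (a data row has no <th>)
        System.out.println(cells(thOrTd, data));   // [January, $100]
    }
}
```

The lazy quantifier .*? matters here: a greedy .* would run from the first opening tag to the last closing tag and return the whole row as a single cell.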

3.1 Simple Table Extraction with Java Regular Expressions

import java.util.regex.*;

public class HtmlTable {
    public static void main(String[] args) {
        // Target string
        String target = "<table border=\"1\"><tr><th>Month</th><th>Savings</th></tr><tr><td>January</td><td>$100</td></tr></table>";

        // Regular expressions
        String regexTable = "<table.*?>((<tr>.*?</tr>)+?)</table>";
        String regexRow = "<tr>(.*?)</tr>";
        String regexEle = "(?:<th>|<td>)(.*?)(?:</th>|</td>)";

        Pattern r = Pattern.compile(regexTable);

        // Match the table
        Matcher mTable = r.matcher(target);

        while (mTable.find()) {
            String strRow = mTable.group(1);
            System.out.println("Row: "+strRow);

            // Match each row of the table
            Matcher mRow = Pattern.compile(regexRow).matcher(strRow);
            while (mRow.find()) {
                String strEle = mRow.group(1);
                System.out.println("\tTh or td: " + strEle);

                // Match each element within the row
                Matcher mEle = Pattern.compile(regexEle).matcher(strEle);
                while (mEle.find()) {
                    String result = mEle.group(1);
                    System.out.println("\t\tElement: " + result);
                }
            }
        }
    }
}

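However, when the program above was applied directly to www.usd-cny.com, the final output was empty: nothing matched at all. The reason is that the patterns above are too specific. They expect lowercase tags with no attributes and no whitespace or nested markup around the cell text. More general patterns are given below, which can match markup of the following form: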

Format:
<TR bgcolor=""></TR>
<TD WIDTH=""></TD>
<TD>  <DIV ALIGN="center"><b><font color="">element</font></b></td>

Matching rules:
   final static String REGEX_TABLE = "<table.*?>\\s*?((<tr.*?>.*?</tr>)+?)\\s*?</table>";
   final static String REGEX_ROW = "<tr.*?>\\s*?(.*?)\\s*?</tr>";
   final static String REGEX_ELE = "(?:<th.*?>|<td.*?>)(?:\\s*<.*?>)*(?:&nbsp;)?(.*?)(?:&nbsp;)?(?:\\s*<.*?>)*?\\s*(?:</th>|</td>)";
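
To see why the stricter patterns come back empty on such markup, the sketch below runs the strict row pattern and the relaxed patterns against one row built from the format shown above. It assumes the patterns are compiled with Pattern.CASE_INSENSITIVE so that <TR> and </td> both match. The class RelaxedPatternDemo and its helper first are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the two relaxed patterns come from the article.
class RelaxedPatternDemo {
    static final String STRICT_ROW = "<tr>(.*?)</tr>";
    static final String REGEX_ROW = "<tr.*?>\\s*?(.*?)\\s*?</tr>";
    static final String REGEX_ELE =
        "(?:<th.*?>|<td.*?>)(?:\\s*<.*?>)*(?:&nbsp;)?(.*?)(?:&nbsp;)?(?:\\s*<.*?>)*?\\s*(?:</th>|</td>)";

    // Return group(1) of the first match, or null if nothing matches.
    static String first(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String row = "<TR bgcolor=\"\"><TD>  <DIV ALIGN=\"center\"><b>"
                   + "<font color=\"\">element</font></b></td></TR>";

        // Case-sensitive pattern without attribute support: no match.
        System.out.println(first(STRICT_ROW, 0, row)); // null

        // Relaxed row pattern, then relaxed element pattern on its result.
        String cellMarkup = first(REGEX_ROW, Pattern.CASE_INSENSITIVE, row);
        System.out.println(first(REGEX_ELE, Pattern.CASE_INSENSITIVE, cellMarkup)); // element
    }
}
```

The element pattern skips any run of nested tags before and after the text, so only the innermost text ends up in group 1.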

3.2 Reorganizing the HtmlTable Class

package com.cnblogs.grassandmoon;

import java.util.regex.*;
import java.io.*;

public class HtmlTable {
    final static String ELEMENT_SEPARATOR = "01";
    final static String ROW_SEPARATOR = "02";

    final static String REGEX_TABLE = "<table.*?>\\s*?((<tr.*?>.*?</tr>)+?)\\s*?</table>";
    final static String REGEX_ROW = "<tr.*?>\\s*?(.*?)\\s*?</tr>";
    final static String REGEX_ELE = "(?:<th.*?>|<td.*?>)(?:\\s*<.*?>)*(?:&nbsp;)?(.*?)(?:&nbsp;)?(?:\\s*<.*?>)*?\\s*(?:</th>|</td>)";


    public static String extract(int nStartLine, int nEndLine, BufferedReader br)
    throws IOException {
        String line;
        String target = "";
        String elements = "";
        int i = 0;
        // iStartLine[0] = 78;
        // iEndLine[0] = 303;

        while ((line = br.readLine()) != null) {
            ++i;
            if (i < nStartLine) continue;
            line = line.trim(); // String is immutable, so the trimmed result must be reassigned
            target = target + line;
            if (i >= nEndLine) break;
        }

        // Regular expression
        Pattern r = Pattern.compile(REGEX_TABLE, Pattern.CASE_INSENSITIVE);

        // Match the table
        Matcher mTable = r.matcher(target);

        if (mTable.find()) {
            String strRows = mTable.group(1).trim();

            // Match each row of the table
            Matcher mRow = Pattern.compile(REGEX_ROW, Pattern.CASE_INSENSITIVE).matcher(strRows);
            while (mRow.find()) {
                boolean firstEle = true;
                String strEle = mRow.group(1).trim();
                // System.out.println("\nTh or td: " + strEle);

                // Match each element within the row
                Matcher mEle = Pattern.compile(REGEX_ELE, Pattern.CASE_INSENSITIVE).matcher(strEle);

                if (!elements.equals(""))
                    elements = elements + ROW_SEPARATOR;
                while (mEle.find()) {
                    String result = mEle.group(1).trim();
                    if (firstEle)
                        elements = elements + result;
                    else
                        elements = elements + ELEMENT_SEPARATOR + result;
                    firstEle = false;
                    // System.out.println("\nElement: " + result);
                }
                if (!elements.equals("")) {
                    // Drop the last two characters of what has been accumulated so far
                    int len = elements.length();
                    elements = elements.substring(0, len-2);
                }
            }
        }

        return elements;
    }
}
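
The extract method returns one flat string in which the cells of a row are joined by ELEMENT_SEPARATOR and the rows by ROW_SEPARATOR. As a rough sketch of how a caller might turn such a string back into rows and cells, the hypothetical helper below splits on both separators. The class SeparatedTableDemo, the method decode and the sample input are assumptions made for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; it shows one way to consume a separator-joined table string.
class SeparatedTableDemo {
    // Split the flat string into rows, then each row into cells.
    static List<List<String>> decode(String encoded, String elementSep, String rowSep) {
        List<List<String>> rows = new ArrayList<>();
        if (encoded.isEmpty()) {
            return rows;
        }
        // Pattern.quote makes split treat the separators as literal text, not as regexes.
        for (String row : encoded.split(Pattern.quote(rowSep), -1)) {
            rows.add(Arrays.asList(row.split(Pattern.quote(elementSep), -1)));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumed sample input, using the separator values "01" and "02" from the class above.
        String encoded = "Month" + "01" + "Savings" + "02" + "January" + "01" + "$100";
        System.out.println(decode(encoded, "01", "02")); // [[Month, Savings], [January, $100]]
    }
}
```

One caveat of this layout: a cell whose own text contains a separator value (for example a rate such as 6.0123 with the separator "01") would be split in the wrong place, so the separators should be strings that cannot occur in the table data.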

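4 Summary

The implementation was then tidied up once more. The complete code is available at: RateExchange @ git

A later article will show how to use jsoup to extract the corresponding information from a web page.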

Date: 2014-05-12 Mon

Author: Zhong Xiewei
