A simple Python crawler: parsing tables, including irregular ones, with pandas

url = http://www.hnu.edu.cn/xyxk/xkzy/zylb.htm

Part of the table, as shown in the figure:

Part of the HTML source:

<table class="MsoNormalTable" style="353.0pt;margin-left:4.65pt;border-collapse:collapse;border:none;    mso-border-alt:solid windowtext .5pt;mso-padding-alt:0cm 5.4pt 0cm 5.4pt;    mso-border-insideh:.5pt solid windowtext;mso-border-insidev:.5pt solid windowtext" width="471" cellspacing="0" cellpadding="0" border="1">
 <tbody>
  <tr class="firstRow" style="mso-yfti-irow:0;mso-yfti-firstrow:yes;height:36.75pt">
   <td style="170.0pt;border:solid windowtext 1.0pt;mso-border-alt:            solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:36.75pt" width="227"><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm;            margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right:            0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm;            mso-pagination:widow-orphan"><strong><span style="font-size:9.0pt;font-family:            宋体;mso-bidi-font-family:宋体;mso-font-kerning:0pt">学院<span lang="EN-US">
        <o:p></o:p></span></span></strong></p></td>
   <td style="183.0pt;border:solid windowtext 1.0pt;            border-left:none;mso-border-left-alt:solid windowtext .5pt;mso-border-alt:            solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:36.75pt" width="244" nowrap=""><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm;            margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right:            0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm;            mso-pagination:widow-orphan"><strong><span style="font-size:9.0pt;font-family:            宋体;mso-bidi-font-family:宋体;mso-font-kerning:0pt">专业名称<span lang="EN-US">
        <o:p></o:p></span></span></strong></p></td>
  </tr>
  <tr style="mso-yfti-irow:1;height:16.5pt">
   <td rowspan="4" style="170.0pt;border:solid windowtext 1.0pt;            border-top:none;mso-border-top-alt:solid windowtext .5pt;mso-border-alt:solid windowtext .5pt;            padding:0cm 5.4pt 0cm 5.4pt;height:16.5pt" width="227"><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm;            margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right:            0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm;            mso-pagination:widow-orphan"><span style="font-size:9.0pt;font-family:宋体;            mso-bidi-font-family:宋体;mso-font-kerning:0pt">土木工程学院<span lang="EN-US">450
       <o:p></o:p></span></span></p></td>
   <td style="183.0pt;border-top:none;border-left:none;            border-bottom:solid windowtext 1.0pt;border-right:solid windowtext 1.0pt;            mso-border-top-alt:solid windowtext .5pt;mso-border-left-alt:solid windowtext .5pt;            mso-border-alt:solid windowtext .5pt;padding:0cm 5.4pt 0cm 5.4pt;height:16.5pt" width="244" nowrap=""><p class="MsoNormal" style="text-align:center;margin-top:6.0pt;margin-right:0cm;            margin-bottom:6.0pt;margin-left:0cm;mso-para-margin-top:.5gd;mso-para-margin-right:            0cm;mso-para-margin-bottom:.5gd;mso-para-margin-left:0cm;            mso-pagination:widow-orphan"><span style="font-size:9.0pt;font-family:宋体;            mso-bidi-font-family:宋体;mso-font-kerning:0pt">土木工程<span lang="EN-US">
       <o:p></o:p></span></span></p></td>
  </tr>
    ......
 </tbody>
</table>
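
The `rowspan="4"` on the college cell is what makes this table irregular: one college cell covers several major rows. As a minimal sketch of how `pandas.read_html` treats such a cell, the snippet below parses a small hypothetical table and prints the result. The names `College A`, `Major A1`, and so on are placeholders, not data from the page. It assumes a reasonably recent pandas (rowspan handling was added around version 0.24) and an installed HTML parser such as lxml.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature table: placeholder names, not data from the real page.
SAMPLE_HTML = """
<table border="1">
  <tr><td>College</td><td>Major</td></tr>
  <tr><td rowspan="2">College A</td><td>Major A1</td></tr>
  <tr><td>Major A2</td></tr>
  <tr><td>College B</td><td>Major B1</td></tr>
</table>
"""

# header=0 uses the first row as column names, because the header cells
# here are <td> rather than <th>, just like in the page's table.
tables = pd.read_html(StringIO(SAMPLE_HTML), header=0)
df = tables[0]

# The spanned "College A" cell is copied onto both rows it covers.
print(df)
```

Because the spanned value is repeated on each row, every major ends up on a row that also names its college, so no manual forward-filling should be needed.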

Parse the table with pandas. The code is as follows:

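import pandas as pd

url = 'http://www.hnu.edu.cn/xyxk/xkzy/zylb.htm'

# read_html returns a list of DataFrames, one per <table> on the page;
# it needs an HTML parser installed (lxml, or bs4 with html5lib)
tables = pd.read_html(url)
pd.set_option('display.max_rows', None)  # show all rows
with open("湖南大学学院与专业.txt", "wt", encoding='utf8') as out_file:  # save as a txt file
    for table in tables:
        out_file.write(str(table) + '\n')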

The output is as follows (partial):

Very concise and efficient!
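
If the parsed tables are meant for further processing rather than reading, writing CSV files may be more convenient than saving the printed form of each DataFrame. The following is only a sketch of that variant, not part of the original code: `save_tables_as_csv`, its parameters, and the `table_0.csv` naming are hypothetical, and the demo HTML reuses only the header and first row shown in the excerpt above.

```python
import tempfile
from io import StringIO
from pathlib import Path

import pandas as pd


def save_tables_as_csv(html_text, out_dir, prefix="table"):
    """Parse every <table> in html_text and write each one to its own CSV file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    frames = pd.read_html(StringIO(html_text), header=0)
    for index, frame in enumerate(frames):
        path = out_dir / f"{prefix}_{index}.csv"
        # utf-8-sig writes a BOM, which helps Excel detect UTF-8 Chinese text.
        frame.to_csv(path, index=False, encoding="utf-8-sig")
        paths.append(path)
    return paths


# Tiny demo input: header and first data row taken from the excerpt above.
DEMO_HTML = """
<table>
  <tr><td>学院</td><td>专业名称</td></tr>
  <tr><td>土木工程学院</td><td>土木工程</td></tr>
</table>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = save_tables_as_csv(DEMO_HTML, tmp_dir)
    # Read the file back while the temporary directory still exists.
    roundtrip = pd.read_csv(paths[0], encoding="utf-8-sig")

print(roundtrip)
```

For the real page, the HTML text would first have to be downloaded and decoded; that step is left out here so the sketch runs without network access.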

Original article: https://www.cnblogs.com/cttcarrotsgarden/p/10769097.html