A Simple Web Scraper in Python

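There are many open-source web scraping programs available, in just about every language.

This post walks through scraping a website with Python, starting from scratch.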

Step 1: Install Python

Download a suitable version from https://www.python.org/

I installed Python 2.7.11.

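Step 2: Install a Python IDE. Any IDE will do; PyCharm is used here.

Download: http://www.jetbrains.com/pycharm/download/#section=windows

After installing it, create a new project and put the .py files you want to run into that project.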

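Step 3: Install the required packages

The first run fails on the imports of two packages, BeautifulSoup and xlwt, because neither ships with Python. The former parses HTML markup; the latter exports the extracted data to an Excel file.

Download BeautifulSoup

Download xlwt

Both are installed the same way, much like installing a dependency from source on Linux.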

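The usual installation steps:

1. Add the Python installation directory to the system PATH environment variable.

2. Unpack each package, open a CMD window, change to the unpacked package directory, and run python setup.py build followed by python setup.py install.

Once this has been done for both packages, the installation is complete.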
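To confirm that the installation worked before running the scraper, a small check like the following can help. It is only a sketch: missing_modules is a hypothetical helper name, and the check relies on "bs4" and "xlwt" being the import names of the two packages, which is how the scraper itself imports them.

```python
import importlib


def missing_modules(names):
    """Return the entries of `names` that cannot be imported (hypothetical helper)."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "bs4" is the import name of BeautifulSoup 4; xlwt is imported as "xlwt".
    not_found = missing_modules(["bs4", "xlwt"])
    if not_found:
        print("Still missing: " + ", ".join(not_found))
    else:
        print("Both packages are installed.")
```

If a name is reported as missing, repeat the setup.py steps for that package and make sure the python on PATH is the interpreter the IDE project uses.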

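Step 4: Run the script

Below is the scraping code I ran; adapt it to your own needs. It simply reads the pages and extracts the data, which is fairly straightforward.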

#coding:utf-8
import urllib2
from bs4 import BeautifulSoup  # HTML parsing library
import xlwt  # Excel export library

row = 0  # index of the last row written; row 0 of the sheet stays empty

style0 = xlwt.easyxf('font: name Times SimSun')
wb = xlwt.Workbook(encoding='utf-8')
ws = wb.add_sheet('Sheet1')

# Request headers sent with every page, so the requests look like a browser's
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0",
    "Referer": "http://www.xxx.com/suppliers.asp"
}

for num in range(1, 100):  # page range: pages 1 to 99
    # Build the URL of the current listing page
    url = "http://www.xxx.com/Suppliers.asp?page=" + str(num) + "&hdivision="
    req = urllib2.Request(url, data=None, headers=header)
    ope = urllib2.urlopen(req)
    # Listing page fetched; parse it
    soup = BeautifulSoup(ope.read(), 'html.parser')

    for cell in soup.find_all("td", class_="a_blue"):
        # Company name: line breaks become spaces, '|' is dropped
        companyname = cell.a.string.encode('utf-8').replace("\n", " ").replace('|', '')
        detailc = ''  # basic company details
        a_href = 'http://www.xxx.com/' + cell.a['href']  # detail (second-level) page
        temphref = cell.a['href'].encode('utf-8')
        if temphref.find("otherproduct") == -1:
            print companyname
            print a_href
            reqs = urllib2.Request(a_href.encode('utf-8'), data=None, headers=header)
            opes = urllib2.urlopen(reqs)
            detail_soup = BeautifulSoup(opes.read(), 'html.parser')
            # Contact details table; if several tables match, the last one is kept
            for content in detail_soup.find_all("table", class_="zh_table"):
                detailc = content.text.encode('utf-8').replace("\n", "")
                # print detailc  # uncomment to print the details
            row = row + 1  # move on to the next row
            ws.write(row, 0, companyname, style0)  # arguments: row, column, value, style
            ws.write(row, 1, detailc, style0)
            print 'Scraped record ' + str(row)

wb.save('bio-equip11-20.xls')
print 'Done!'

When the run finishes, an Excel file containing the collected data is created in the project's directory under PycharmProjects.
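When adapting the script to another site, it can help to pull the pure string handling into small functions that can be checked without touching the network. The following is a minimal sketch of that idea, not part of the original script: the function names are hypothetical, BASE_URL is the same placeholder host used above, and the logic mirrors what the script does inline (build the page URL, flatten line breaks in the company name, skip "otherproduct" links).

```python
# -*- coding: utf-8 -*-
BASE_URL = "http://www.xxx.com/"  # placeholder host, as in the script above


def build_list_url(page):
    """Build the listing URL for one page number (hypothetical helper)."""
    return BASE_URL + "Suppliers.asp?page=" + str(page) + "&hdivision="


def clean_name(text):
    """Replace line breaks with spaces and drop '|' characters."""
    return text.replace("\n", " ").replace("|", "")


def is_detail_link(href):
    """True for links the script follows, i.e. those without 'otherproduct'."""
    return "otherproduct" not in href
```

For example, build_list_url(3) gives the URL of page 3, and clean_name can be applied to the link text before it is written to the sheet.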