Learning Python web crawling

Download
https://www.python.org/downloads/release/python-352/

Implementing a simple crawler in Python
http://www.cnblogs.com/fnng/p/3576154.html

Fix for the missing api-ms-win-crt-runtime-l1-1-0.dll
https://www.microsoft.com/zh-cn/download/confirmation.aspx?id=48145

TypeError: cannot use a string pattern on a bytes-like object
page.read() returns bytes; decode it to str before matching:
imglist = re.findall(imgre,html.decode('GBK'))
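
A minimal sketch of why this error appears and two ways around it. The sample markup and variable names are made up for illustration; a hard-coded byte string stands in for the result of page.read().

```python
import re

# Hypothetical sample standing in for page.read(), which returns bytes.
raw = '<img src="http://example.com/a.jpg" pic_ext="jpeg">'.encode('gbk')

pattern = re.compile(r'src="(.+?\.jpg)" pic_ext')

# Passing bytes to a str pattern raises TypeError.
try:
    pattern.findall(raw)
    raised = False
except TypeError:
    raised = True

# Option 1: decode the bytes to str first (the fix used in these notes).
urls_from_str = pattern.findall(raw.decode('gbk'))

# Option 2: keep the bytes and compile a bytes pattern instead.
bytes_pattern = re.compile(rb'src="(.+?\.jpg)" pic_ext')
urls_from_bytes = bytes_pattern.findall(raw)
```

Option 1 is usually more convenient here, because urlretrieve expects the URL as a str.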

TabError: inconsistent use of tabs and spaces in indentation
Replace the tabs with spaces.
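
Most editors can do this conversion directly. As a rough sketch, the same replacement could also be scripted; the helper name tabs_to_spaces and the tab size of 4 are assumptions, not part of the original notes.

```python
import re

def tabs_to_spaces(source, tab_size=4):
    """Expand tabs in the leading whitespace of every line."""
    def expand(match):
        # Only the indentation is touched; tabs inside the code stay as-is.
        return match.group(0).expandtabs(tab_size)
    return re.sub(r'^[ \t]+', expand, source, flags=re.MULTILINE)

# Hypothetical snippet indented with tabs.
sample = "def f():\n\tx = 1\n\treturn x\n"
fixed = tabs_to_spaces(sample)
```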

UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 197: illegal multibyte sequence
The page is not GBK-encoded; decode it as UTF-8 instead:
html.decode('utf-8')
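
When the encoding is not known in advance, one possible approach is to try a short list of candidates in order. This is only a sketch: the helper name decode_html and the candidate list are assumptions. UTF-8 is tried first because it rejects most GBK byte sequences, whereas GBK often accepts UTF-8 bytes and silently produces garbage.

```python
def decode_html(raw, encodings=('utf-8', 'gbk')):
    """Decode raw bytes with the first encoding that succeeds."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway and mark the undecodable bytes.
    return raw.decode(encodings[0], errors='replace')

# Hypothetical pages: the same text in two different encodings.
utf8_page = '中文'.encode('utf-8')
gbk_page = '中文'.encode('gbk')
```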

The following works with Python 3.5.2:

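#coding=utf-8
import urllib.request
import re

def getHtml(url):
    # Fetch the page and return its raw bytes.
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # Capture every .jpg URL that is followed by a pic_ext attribute.
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    # read() returns bytes, so decode before matching a str pattern.
    imglist = re.findall(imgre,html.decode('utf-8'))
    x = 0
    for imgurl in imglist:
        # Save the images as D:/0.jpg, D:/1.jpg, ...
        urllib.request.urlretrieve(imgurl,'D:/%s.jpg' % x)
        x+=1
    print(x)

html = getHtml("http://tieba.baidu.com/p/2460150866")

getImg(html)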
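
The image pattern can be tried offline before running it against a live page. The sketch below uses a made-up HTML string (the example.com URLs are placeholders) and compares the pattern from the script with a stricter variant that uses [^"] so that a match cannot run past the closing quote of the src attribute.

```python
import re

# Hypothetical markup imitating the attributes the pattern looks for.
sample_html = (
    '<img src="http://example.com/pic/1.jpg" pic_ext="jpeg">'
    '<img src="http://example.com/pic/2.png" pic_ext="png">'
    '<img src="http://example.com/pic/3.jpg" pic_ext="jpeg">'
)

# Pattern as in the script: ".+?" may cross attribute boundaries.
loose = re.compile(r'src="(.+?\.jpg)" pic_ext')
# Stricter variant: the URL may not contain a double quote.
strict = re.compile(r'src="([^"]+?\.jpg)" pic_ext')

loose_urls = loose.findall(sample_html)
strict_urls = strict.findall(sample_html)
```

On this sample the loose pattern starts its second match at the .png image and runs on until the next .jpg, while the strict pattern returns only the two .jpg URLs.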

If the page uses the GBK character set (its source declares charset=gbk), decode with 'gbk' instead:

#coding=utf-8
import urllib.request
import re
import datetime

def getHtml(url):
    # Fetch the page and return its raw bytes.
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # This forum puts the image URL in a file="..." attribute.
    reg = r'file="(.+?\.jpg)"'
    imgre = re.compile(reg)
    # The page declares charset=gbk, so decode as GBK.
    imglist = re.findall(imgre,html.decode('gbk'))
    x = 0
    for imgurl in imglist:
        # The target folder must already exist.
        urllib.request.urlretrieve(imgurl,'D:/06_Download/py/%s.jpg' % x)
        x+=1
    print("Total files downloaded:",x)


# Time the whole run.
starttime= datetime.datetime.now()
html = getHtml("http://www.cmfish.com/bbs/forum.php?mod=viewthread&tid=306167&extra=page%3D1")
getImg(html)
usetime= datetime.datetime.now()-starttime
print('Elapsed time:',usetime)
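
Instead of checking the page source by hand for the charset declaration, the declared charset could be read from the downloaded bytes. This is a simplified sketch, not a full HTML parser: the helper name sniff_charset, the 1024-byte window, and the utf-8 default are all assumptions.

```python
import re

def sniff_charset(raw, default='utf-8'):
    """Return the charset declared near the top of an HTML page."""
    # Search the raw bytes, since the text cannot be decoded yet.
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:1024], re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii').lower()
    return default

# Hypothetical page headers.
gbk_page = (b'<html><head><meta http-equiv="Content-Type" '
            b'content="text/html; charset=gbk"></head></html>')
html5_page = b'<html><head><meta charset="UTF-8"></head></html>'
bare_page = b'<html></html>'
```

With such a helper, html.decode(sniff_charset(html)) would replace the hard-coded 'gbk' or 'utf-8' argument.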


Original article: https://www.cnblogs.com/sui84/p/6777018.html