urllib: web crawler

1. A simple crawler

from urllib import request

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()  # read the response body as bytes
    with open('url.html', 'wb') as out:  # save the page to a local file
        out.write(data)
    print('%d bytes received from %s.' % (len(data), url))

f('http://www.cnblogs.com/alex3714/articles/5248247.html')

Output:

C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: http://www.cnblogs.com/alex3714/articles/5248247.html
91829 bytes received from http://www.cnblogs.com/alex3714/articles/5248247.html.

Process finished with exit code 0
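
The function above always writes to url.html and waits indefinitely if the server never answers. As a minimal sketch (not the original code), the variant below takes the output path as a parameter, passes a timeout to urlopen, and closes both the response and the file with `with` blocks. The names `fetch`, `out_path`, and `timeout` are assumptions for illustration, and a `data:` URL stands in for a real page so the sketch runs without network access; any http(s) URL can be substituted.

```python
import os
import tempfile
from urllib import request

def fetch(url, out_path, timeout=10):
    """Download url, save the body to out_path, and return the byte count."""
    # The response object is a context manager, so it is closed automatically.
    with request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, 'wb') as out:
        out.write(data)
    return len(data)

# Hypothetical offline stand-in for a real page URL.
url = 'data:text/plain,hello%20crawler'
out_path = os.path.join(tempfile.mkdtemp(), 'url.html')
size = fetch(url, out_path)
print('%d bytes received from %s.' % (size, url))
```

Returning the size instead of only printing it makes the function easier to reuse from other code, for example from greenlets.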

2. Crawling multiple pages

from urllib import request
import gevent

def f(url):
    print('GET: %s' % url)
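    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()  # read the response body
    print('%d bytes received from %s.' % (len(data), url))

# start three greenlets (coroutines), passing each one its URL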
gevent.joinall([
        gevent.spawn(f, 'https://www.python.org/'),
        gevent.spawn(f, 'https://www.yahoo.com/'),
        gevent.spawn(f, 'https://github.com/'),
])

Output:

GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
479631 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.

Process finished with exit code 0
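
gevent.spawn(f, arg) schedules f(arg) in a greenlet, and gevent.joinall waits until all of them have finished. If the function returns a value instead of printing it, the result can be read afterwards from each greenlet's `value` attribute. The following is only a sketch: `fake_fetch` and the sizes in `FAKE_SIZES` are hypothetical stand-ins for a real download, so nothing is fetched from the network.

```python
import gevent

# Hypothetical stand-in data: made-up sizes, not real measurements.
FAKE_SIZES = {'https://www.python.org/': 3, 'https://github.com/': 5}

def fake_fetch(url):
    gevent.sleep(0.01)      # cooperative pause standing in for network wait
    return FAKE_SIZES[url]  # pretend this is len(data)

jobs = [gevent.spawn(fake_fetch, url) for url in FAKE_SIZES]
gevent.joinall(jobs)                   # block until every greenlet is done
results = [job.value for job in jobs]  # return values, in spawn order
print(results)
```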

3. Measuring the run time

from urllib import request
import gevent
import time

def f(url):
    print('GET: %s' % url)
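    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()  # read the response body
    print('%d bytes received from %s.' % (len(data), url))

start_time=time.time()
# start three greenlets (coroutines), passing each one its URL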
gevent.joinall([
        gevent.spawn(f, 'https://www.python.org/'),
        gevent.spawn(f, 'https://www.yahoo.com/'),
        gevent.spawn(f, 'https://github.com/'),
])
print('cost is %s:'%(time.time()-start_time))

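Output: the requests still ran one after another, as the strictly alternating GET/received lines and the 4.5 s total suggest. By default, gevent cannot detect that urllib is doing blocking I/O, so it never switches to another greenlet while a request is waiting.

C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py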
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
488624 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.
cost is 4.5304529666900635:

Process finished with exit code 0
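
The serial behaviour can be reproduced without any network access. A greenlet only hands over control when it calls something gevent knows how to wait on; an unpatched standard-library call blocks the whole thread. In the sketch below, `time.sleep` stands in for an unpatched blocking call and `gevent.sleep` for a cooperative one. The 0.2 s delay and the helper names are assumptions for illustration, and the sketch assumes monkey.patch_all() has not been called.

```python
import time
import gevent

DELAY = 0.2  # hypothetical per-task wait, in seconds

def blocking_task():
    time.sleep(DELAY)    # unpatched: blocks every greenlet in the thread

def cooperative_task():
    gevent.sleep(DELAY)  # yields to the gevent loop while waiting

def timed(task, n=3):
    """Run n greenlets of task and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    gevent.joinall([gevent.spawn(task) for _ in range(n)])
    return time.perf_counter() - start

blocking_cost = timed(blocking_task)        # roughly n * DELAY
cooperative_cost = timed(cooperative_task)  # roughly DELAY
print('blocking: %.2fs, cooperative: %.2fs' % (blocking_cost, cooperative_cost))
```

The blocking version takes about three times as long because each greenlet must finish its wait before the next one starts, which is the same pattern as the urllib output above.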

4. Comparing synchronous and asynchronous run times

from urllib import request
import gevent
import time
#from gevent import monkey

#monkey.patch_all()  # left disabled here; it would make blocking standard-library I/O cooperative
def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()  # read the response body
    print('%d bytes received from %s.' % (len(data), url))

urls=['https://www.python.org/','https://www.yahoo.com/','https://github.com/']
start_time=time.time()
for url in urls:
    f(url)
print('sync cost is %s:'%(time.time()-start_time))


async_time_start=time.time()  # start time of the asynchronous run
gevent.joinall([
        gevent.spawn(f, 'https://www.python.org/'),
        gevent.spawn(f, 'https://www.yahoo.com/'),
        gevent.spawn(f, 'https://github.com/'),
])
print('async cost is %s:'%(time.time()-async_time_start))

Run time: the two runs are in the same range (about 7.1 s and 4.5 s), and the gevent run still handles the URLs one after another, so the gap is more likely network variation than concurrency. The asynchronous version shows no real advantage yet.

C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
480499 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55394 bytes received from https://github.com/.
sync cost is 7.112711191177368:
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
485666 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55390 bytes received from https://github.com/.
async cost is 4.510450839996338:

Process finished with exit code 0
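
Both measurements repeat the same start/stop pattern around time.time(). One way to factor it out is a small context manager, sketched below. `stopwatch` is a hypothetical helper name, the `time.sleep` calls are placeholders for the two real workloads, and `time.perf_counter()` is used because it is intended for measuring intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def stopwatch(label, results):
    """Time the enclosed block and store the seconds under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print('%s cost is %s' % (label, results[label]))

timings = {}
with stopwatch('sync', timings):
    time.sleep(0.05)  # placeholder for the synchronous loop over urls
with stopwatch('async', timings):
    time.sleep(0.01)  # placeholder for the gevent.joinall(...) call
```

A single run over the network is noisy, so repeating each measurement a few times gives a fairer comparison than one pair of numbers.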

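5. By default gevent cannot tell that urllib is performing I/O. To connect the two, import gevent's monkey module and apply its patch, ideally as early in the program as possible:

from gevent import monkey

monkey.patch_all()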

from gevent import monkey
monkey.patch_all()  # make blocking standard-library I/O (sockets, ssl, ...) cooperative; call before the other imports

from urllib import request
import gevent
import time

def f(url):
    print('GET: %s' % url)
    resp = request.urlopen(url)  # send the request; returns a response object
    data = resp.read()  # read the response body
    print('%d bytes received from %s.' % (len(data), url))

urls=['https://www.python.org/','https://www.yahoo.com/','https://github.com/']
start_time=time.time()
for url in urls:
    f(url)
print('sync cost is %s:'%(time.time()-start_time))


async_time_start=time.time()  # start time of the asynchronous run
gevent.joinall([
        gevent.spawn(f, 'https://www.python.org/'),
        gevent.spawn(f, 'https://www.yahoo.com/'),
        gevent.spawn(f, 'https://github.com/'),
])
print('async cost is %s:'%(time.time()-async_time_start))

Output: with the patch applied, all three requests are issued before any response arrives, and the asynchronous run takes about 1.9 s against about 5.8 s for the synchronous loop.

C:\abccdxddd\Oldboy\python-3.5.2-embed-amd64\python.exe C:/abccdxddd/Oldboy/Py_Exercise/Day10/爬虫.py
GET: https://www.python.org/
48751 bytes received from https://www.python.org/.
GET: https://www.yahoo.com/
487577 bytes received from https://www.yahoo.com/.
GET: https://github.com/
55392 bytes received from https://github.com/.
sync cost is 5.784578323364258:
GET: https://www.python.org/
GET: https://www.yahoo.com/
GET: https://github.com/
480662 bytes received from https://www.yahoo.com/.
48751 bytes received from https://www.python.org/.
55394 bytes received from https://github.com/.
async cost is 1.8721871376037598:

Process finished with exit code 0
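
A quick way to confirm that the patch is active, without touching the network, is sketched below. It calls monkey.patch_all() first, asks gevent whether the socket module has been replaced, and then checks that three plain `time.sleep` calls in separate greenlets overlap instead of running back to back. The 0.2 s delay and the names `task`, `patched_cost`, and `socket_patched` are assumptions for illustration.

```python
from gevent import monkey
monkey.patch_all()  # patch first, before importing anything that does I/O

import time
import gevent

def task():
    time.sleep(0.2)  # after patching, this sleep yields to other greenlets

start = time.perf_counter()
gevent.joinall([gevent.spawn(task) for _ in range(3)])
patched_cost = time.perf_counter() - start  # close to 0.2 s, not 0.6 s

# True once gevent has swapped in its cooperative socket implementation.
socket_patched = monkey.is_module_patched('socket')
print('socket patched: %s, 3 sleeps took %.2fs' % (socket_patched, patched_cost))
```

Since urllib ultimately goes through the socket module, a patched socket is what lets the three downloads in this section overlap.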