My Python Journey: A Simple Web Crawler

Here we use urllib to quickly fetch a web page.

.......................

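#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request  # import the urllib.request module
file = urllib.request.urlopen("http://www.baidu.com/")  # open the URL and assign the response object to a variable
dat = file.read()  # read the entire response body (bytes)
datline = file.readline()  # read one line; returns b'' here because read() has already consumed the body
# Note the two ways of reading the page: read() returns everything, readline() returns a single line
# urllib.request.urlcleanup()  # removes temporary files left behind by earlier urlretrieve() calls
print(dat)
print(datline)
# faan = open("D:python爬虫文件目录/baidu", "wb")  # save the page in this directory; /baidu is the file name, "wb" means write in binary mode
# faan.write(dat)
# faan.close()
ta = file.info()  # the response headers
ta1 = file.getcode()  # the HTTP status code; 200 means success (urlopen raises HTTPError on 4xx/5xx)
tp = file.geturl()  # the URL that was actually retrieved
print(ta)
print(ta1)
print(tp)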
....................
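
The commented-out lines that save the page can also be written with `with` blocks, so that both the response and the file get closed even if something fails halfway. Here is a minimal sketch; `save_page` and its `out_path` argument are names made up for this example, not part of urllib.

```python
import urllib.request


def save_page(url, out_path, timeout=10):
    """Fetch url, write the raw bytes to out_path, return the byte count."""
    # The with blocks close the response and the file automatically.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as fh:  # "wb": write in binary mode
        fh.write(data)
    return len(data)
```

For instance, `save_page("http://www.baidu.com/", "baidu.html")` would write the home page to `baidu.html` in the current directory.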

1. C:\Python36\pythonw.exe D:/python3练习代码/Urllib库.py
b'<!DOCTYPE html> <!--STATUS OK--> <html> <head> <meta http-equiv="content-type" content="text/html;charset=utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=Edge"> <meta content="always" name="referrer"> <meta name="theme-color" content="#2932e1"> <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /> <link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="\xe7\x99\xbe\xe5\xba\xa6\xe6\x90\x9c\xe7\xb4\xa2" /> <link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu.svg"> <link rel="dns-prefetch" href="//s1.bdstatic.com"/> <link rel="dns-prefetch" href="//t1.baidu.com"/> <link rel="dns-prefetch" href="//t2.baidu.com"/> <link rel="dns-prefetch" href="//t3.baidu.com"/> <link rel="dns-prefetch" href="//t10.baidu.com"/> <link rel="dns-prefetch" href="//t11.baidu.com"/> <link rel="dns-prefetch" href="//t12.baidu.com"/> <link rel="dns-prefetch" href="//b1.bdstatic.com"/> <title>\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93</title> <style id="css_index" index="index" type="text/css">html,body{height:100%} html{overflow-y:auto} body{font:12px arial;text-align:;background:#fff} body,p,form,ul,li{margin:0;padding:0;list-style:none} body,form,#fm{position:relative} td{text-align:left} img{border:0} a{color:#00c} a:active{color:#f60} input{border:0;padding:0} #wrapper{position:relative;_position:;min-height:100%} #head{padding-bottom:100px;text-align:center;*z-index:1} #ftCon{height:50px;position:absolute;bottom:47px;text-align:left;100%;margin:0 auto;z-index:0;overflow:hidden} .ftCon-Wrapper{overflow:hidden;margin:0 auto;text-align:center;*640px} .qrcodeCon{text-align:center;position:absolute;bottom:140px;height:60px;100%} #qrcode{display:inline-block;*float:left;*margin-top:4px} #qrcode .qrcode-item{float:left} #qrcode .qrcode-item-2{margin-left:33px} #qrcode .qrcode-img{60px;height:60px} #qrcode .qrcode-item-1 
.qrcode-img{background:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/zbios_efde696.png) 0 0 no-repeat} #qrcode .qrcode-item-2 .qrcode-img{background:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/nuomi_365eabd.png) 0 0 no-repeat} @media only screen and (-webkit-min-device-pixel-ratio:2){#qrcode .qrcode-item-1 .qrcode-img{background-image:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/zbios_x2_9d645d9.png);background-size:60px 60px} #qrcode .qrcode-item-2 .qrcode-img{background-image:url(http://s1.bdstatic.com/r/www/cache/static/home/img/qrcode/nuomi_x2_55dc5b7.png);background-size:60px 60px}} #qrcode .qrcode-text{color:#999;line-height:23px;margin:3px 0 0 5px} #qrcode .qrcode-text a{color:#999;

2. b''
3. Date: Wed, 19 Apr 2017 13:22:59 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: Close
Vary: Accept-Encoding
Set-Cookie: BAIDUID=63163489DEB125756CD4AB8A983EF41F:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=63163489DEB125756CD4AB8A983EF41F; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=1492608179; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=0; path=/
Set-Cookie: H_PS_PSSID=22583_1466_21125_21673_22074; path=/; domain=.baidu.com
P3P: CP=" OTI DSP COR IVA OUR IND COM "
Cache-Control: private
Cxy_all: baidu+26dc0e179821564f021cb338cbce2955
Expires: Wed, 19 Apr 2017 13:22:18 GMT
X-Powered-By: HPHP
Server: BWS/1.1
X-UA-Compatible: IE=Edge,chrome=1
BDPAGETYPE: 1
BDQID: 0xe85e276e00051c5d
BDUSERID: 0


4. 200
5. http://www.baidu.com/
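
Output 1 is a bytes object, which is why the Chinese page title appears as escaped bytes rather than readable characters. The headers in output 3 say the page is `charset=utf-8`, so decoding the bytes gives normal text. A small sketch of that step follows; `decode_body` is a made-up helper name, and falling back to UTF-8 when the header names no charset is an assumption of this example.

```python
import urllib.request


def decode_body(resp, default_charset="utf-8"):
    """Read a urlopen() response and return its body as str."""
    raw = resp.read()
    # info() returns the headers; get_content_charset() parses Content-Type.
    charset = resp.info().get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")


# The first four characters of the Baidu page title, decoded by hand.
title_bytes = b"\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b"
title_text = title_bytes.decode("utf-8")  # the str "百度一下"
```

With the script at the top of the post, `decode_body(file)` could be called in place of `file.read()` to get a str instead of bytes.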

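Output 2 is b'' because the response behaves like a stream: once read() has consumed the whole body, nothing is left for readline(). The sketch below shows the effect without touching the network by opening a `data:` URL, which urlopen also accepts. The demo URL and the helper name `read_both_ways` are assumptions made for this illustration.

```python
import urllib.request

# A tiny two-line document embedded in a data: URL (%0A is a newline).
DEMO_URL = "data:text/plain,first%0Asecond%0A"


def read_both_ways(url):
    """Return (read-then-readline, readline-then-read) results for url."""
    with urllib.request.urlopen(url) as resp:
        everything = resp.read()      # consumes the whole body
        leftover = resp.readline()    # nothing left, so this is b''
    with urllib.request.urlopen(url) as resp:
        first_line = resp.readline()  # only the first line
        rest = resp.read()            # whatever follows the first line
    return (everything, leftover), (first_line, rest)


print(read_both_ways(DEMO_URL))
```

So to see the first line of a page, readline() has to be called before read(), or on a freshly opened response.
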
Original article: https://www.cnblogs.com/alsely/p/6736006.html