Python: A Simple Web-Scraping Snippet

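Today we'll build a simple weather lookup program. It relies mainly on urllib2 (part of the Python 2 standard library) and JSON (JavaScript Object Notation). Python 2.6 and later already ship a json module, so the simplejson install below is only needed on older versions or if you prefer simplejson. Installation steps:

The simplejson wheel can be downloaded here: https://www.lfd.uci.edu/~gohlke/pythonlibs/#simplejson

Open cmd and change to the Scripts folder inside your Python installation directory, for example D:\Program Files\Python\Scripts. Then install the downloaded whl file with pip (pip.exe install *.whl), for example: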

cd /d "D:\Program Files\python\Scripts"
pip.exe install D:\python\simplejson-3.10.0-cp36-cp36m-win_amd64.whl




Once pip reports a successful install, simplejson appears under Python\Lib\site-packages.
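To confirm that the install worked, a quick check like the one below can help. It is only a minimal sketch: it prefers simplejson when it is present and otherwise falls back to the standard-library json module, so the rest of the code can keep using the name json. The sample string is made up for illustration and is not real weather data.

```python
# Prefer simplejson if it was installed; otherwise fall back to the
# standard-library json module (available since Python 2.6).
try:
    import simplejson as json
except ImportError:
    import json

# Illustrative payload only, not real weather data.
sample = '{"ok": true, "count": 2}'
parsed = json.loads(sample)   # JSON text -> Python dict
print(parsed["count"])
```

If the import and the loads call both run without an error, the JSON tooling is ready to use.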

urllib2 fetches the content of a web page, and json parses that content for further processing.

Here is a simple fetch:



import urllib2
web = urllib2.urlopen("http://www.sina.com")          # the http:// prefix is required; a bare domain will not work
content = web.read()
print content

Example:

import urllib2
web = urllib2.urlopen("http://www.weather.com.cn/data/cityinfo/101200101.html")          # again, include the http:// prefix
content = web.read()
print content
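The two fetches above assume the address already carries a scheme and that the request succeeds. As a hedged sketch of how that could be made a little more forgiving, the helpers below add a missing http:// prefix and return None on a failed request. The names ensure_scheme and fetch, and the 10-second timeout, are assumptions made for illustration; the import falls back to urllib.request so the same sketch also runs on Python 3.

```python
try:
    from urllib2 import urlopen          # Python 2, as used in this post
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent


def ensure_scheme(url):
    """Prepend http:// when the address has no scheme (hypothetical helper)."""
    if url.startswith(("http://", "https://")):
        return url
    return "http://" + url


def fetch(url, timeout=10):
    """Return the raw page body, or None if the request fails."""
    try:
        response = urlopen(ensure_scheme(url), timeout=timeout)
        try:
            return response.read()
        finally:
            response.close()
    except IOError as exc:  # URLError and socket timeouts are IOError subclasses
        print("request failed: %s" % exc)
        return None
```

With these in place, fetch("www.sina.com") would request http://www.sina.com and hand back the page body, or None if the site cannot be reached.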
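The weather data comes from China Weather (www.weather.com.cn). Open http://www.weather.com.cn/data/cityinfo/101010100.html and you will see that 101010100 returns the weather for Beijing. The site looks cities up by code, so we build a dictionary that maps city names to codes, save it as city.py, and share it on a network drive (https://pan.baidu.com/s/1c0Nw4m?errno=0&errmsg=Auth%20Login%20Sucess&&bduss=&ssnerror=0&traceid=). To use it, put the file in the same directory as your code and import it with: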

from city import city

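The first "city" is the module name, that is, the name of the .py file; the second "city" is the name of the variable defined inside that module.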
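For readers who do not have the shared file, here is a rough idea of what city.py might contain. This is a hypothetical excerpt, not the actual file: it uses three of the name-to-code pairs that appear in this post, and lookup_code is a helper name invented for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical excerpt of city.py: a plain dict that maps a city name to
# its weather.com.cn city code. The real file lists many more cities.
city = {
    "北京": "101010100",
    "上海": "101020100",
    "天津": "101030100",
}


def lookup_code(name):
    """Return the code for a city name, or None when it is not listed."""
    return city.get(name)
```

Because dict.get returns None for an unknown key, the caller can test the result before building a URL from it.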

Looking at the content of http://www.weather.com.cn/data/cityinfo/101010100.html, everything we need is there, and replacing 101010100 with another code returns the weather for a different city. So:



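# Python 2 assumes ASCII source files by default; a "# -*- coding: UTF-8 -*-" line switches the source encoding to UTF-8.
# It is a declaration rather than an ordinary comment, and city.py needs one as well.
# In this program, though, adding "# -*- coding: UTF-8 -*-" as the first line actually broke it, and I am not sure why.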


import urllib2
import json   


city = {
    '北京': '101010100',
    '上海': '101020100',
    '天津': '101030100',
    '兰州': '101160101',
    '鄂州': '101200301',    
    '荆州': '101200801',
    '香港': '101320101',
    '新界': '101320103',
    '澳门': '101330101',
    '台北': '101340101',
    '高雄': '101340201',
    '台中': '101340401'
}


cityname = raw_input("The weather in which city do you want ?")
citycode = city.get(cityname)
print citycode             # debug output; confirmed working


url = ("http://www.weather.com.cn/data/cityinfo/%s.html"  %citycode)    # note the %s placeholder, which is filled in with citycode
pagecontent = urllib2.urlopen(url).read()
print pagecontent
Bingo.

This returns the following data:

{"weatherinfo":
    {"city":"武汉",
     "cityid":"101200101",
     "temp1":"7℃",
     "temp2":"19℃",
     "weather":"小雨转多云",
     "img1":"n7.gif",
     "img2":"d1.gif",
     "ptime":"18:00"}
}

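Next comes the analysis. The response is a nested dictionary, and we only need temp1, temp2 and weather from it.

So how do we extract them?

This is where json comes in. For background, see: http://www.w3school.com.cn/json/

import json

data = json.loads(pagecontent)

At this point data is already a dictionary, even though printing it to the console looks much like pagecontent.

The visible difference is in how the strings are encoded: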



{u'weatherinfo': {u'city': u'\u5357\u4eac', u'ptime': u'11:00', u'cityid': u'101190101', u'temp2': u'28\u2103', u'temp1': u'37\u2103', u'weather': u'\u591a\u4e91', u'img2': u'n1.gif', u'img1': u'd1.gif'}}



But if you check their types with type():



print type(pagecontent)

print type(data)



the difference becomes clear: pagecontent is a str, while data is a dict.
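The string-versus-dictionary distinction can be tried without touching the network. The sketch below uses the sample response shown earlier as a stand-in for what urlopen(...).read() returns; it assumes the file is saved as UTF-8 and is written so that it also runs on Python 3.

```python
# -*- coding: utf-8 -*-
import json

# Stand-in for the raw text returned by urlopen(...).read(),
# copied from the sample response shown earlier in this post.
pagecontent = ('{"weatherinfo": {"city": "武汉", "cityid": "101200101", '
               '"temp1": "7℃", "temp2": "19℃", "weather": "小雨转多云", '
               '"img1": "n7.gif", "img2": "d1.gif", "ptime": "18:00"}}')

data = json.loads(pagecontent)   # JSON text -> nested Python dict

print(type(pagecontent))         # a string type
print(type(data))                # a dict

info = data["weatherinfo"]       # the inner dictionary
wanted = dict((key, info[key]) for key in ("weather", "temp1", "temp2"))
print(sorted(wanted))            # the three fields we care about
```

Only after json.loads can the fields be reached by key; indexing pagecontent with a key such as "weatherinfo" would fail because it is still plain text.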

import urllib2
import json


city = {
    "北京":"101010100",
    "武汉":"101200101"
    }
cityname = raw_input("which city?\n")
citycode = city.get(cityname)
print citycode
print


if citycode:
    url = ("http://www.weather.com.cn/data/cityinfo/%s.html"  %citycode)
    print url
    print
    page = urllib2.urlopen(url).read()
    print page   # the raw response already contains everything we need
    print

    # parse the response with json
    data = json.loads(page)    # loads is one of the json module's functions
    result = data["weatherinfo"]
    str_temp = ("%s\t%s - %s") % (
        result["weather"],
        result["temp1"],
        result["temp2"]
        )
    print str_temp
else:
    print "Can not find this city."
This prints:

晴	-2℃ - 16℃
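The program above assumes the site always answers with well-formed JSON containing a weatherinfo entry. As a hedged sketch of a more defensive variant, the parsing and formatting step could be pulled into a function that returns None when the response is not what we expect. The name format_weather is an assumption made for illustration, and the function works on either Python 2 or Python 3.

```python
# -*- coding: utf-8 -*-
import json


def format_weather(page):
    """Turn a raw cityinfo response into 'weather<TAB>temp1 - temp2'.

    Returns None when the text is not valid JSON or lacks the expected keys.
    """
    try:
        info = json.loads(page)["weatherinfo"]
        return "%s\t%s - %s" % (info["weather"], info["temp1"], info["temp2"])
    except (ValueError, KeyError, TypeError):
        # ValueError: not JSON; KeyError: field missing;
        # TypeError: JSON of an unexpected shape (for example a list).
        return None
```

The caller can then print the result when it is not None and fall back to a message such as "Can not find this city." otherwise.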