First Web Crawler and Testing

One way to test a function in Python is with try...except:

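def gameover(a, b):
    # Both players have at least 10 points and one leads by exactly 2
    if a >= 10 and b >= 10 and abs(a - b) == 2:
        return True
    # Exactly one of the two players has reached 11 points
    if (a >= 11 and b < 11) or (a < 11 and b >= 11):
        return True
    return False

try:
    a = gameover(10, 11)
    print(a)
except Exception:
    # Any error raised by the call above ends up here
    print("Error")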

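gameover is the function under test. It takes two arguments, a and b, and returns True or False.

try: attempt to call gameover(); if no exception is raised, the call completes normally.

except: otherwise, print 'Error'.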

Here the pair 10, 11 is used for the test. The result is:

runfile('D:/新建文件夹/chesi.py', wdir='D:/新建文件夹')
True

The program runs normally and the result is correct.

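If no arguments are passed, the result should be Error: calling gameover() without a and b raises a TypeError, which the except branch catches.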
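Both cases can be checked in one run. The following is a minimal sketch, not the original test script: safe_call is a hypothetical helper introduced here for illustration, and it catches only TypeError, the exception Python raises when required arguments are missing.

```python
def gameover(a, b):
    # Same rules as the function under test above
    if a >= 10 and b >= 10 and abs(a - b) == 2:
        return True
    if (a >= 11 and b < 11) or (a < 11 and b >= 11):
        return True
    return False


def safe_call(*args):
    """Hypothetical helper: return gameover(*args), or "Error" if the call raises."""
    try:
        return gameover(*args)
    except TypeError:
        # Raised when the required arguments a and b are missing
        return "Error"


print(safe_call(10, 11))  # normal call with both scores
print(safe_call())        # no arguments: the except branch runs
```

Returning the value instead of printing it inside the try block makes the outcome easy to compare against an expected value later.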

requests is a concise, easy-to-use third-party library for handling HTTP requests.

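get() corresponds to the HTTP GET method and is the most common way to fetch a web page. Adding the timeout=n parameter sets the timeout for each request to n seconds.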

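text is an attribute (not a method) holding the HTTP response content as a string, that is, the content of the page at the given url.

content is an attribute holding the HTTP response content in binary form (bytes).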
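The relationship between the two attributes can be sketched without any network access. In the example below, content, encoding and text are plain variables standing in for r.content, r.encoding and r.text, and the page body is a made-up placeholder. The sketch only illustrates that the string form is the byte form decoded with the response's encoding, which is why their lengths can differ.

```python
# Stand-in for r.content: the raw bytes a server might send (hypothetical page body)
content = "<title>搜狗搜索</title>".encode("utf-8")

# Stand-in for r.encoding: the codec used to turn the bytes into a string
encoding = "utf-8"

# Roughly what r.text holds: the same data decoded into a str
text = content.decode(encoding)

print(type(content).__name__, len(content))  # length counted in bytes
print(type(text).__name__, len(text))        # length counted in characters
```

Each Chinese character takes several bytes in UTF-8, so len(content) is larger than len(text) for pages containing Chinese text.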

Open Sogou 20 times with requests:

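import requests

try:
    for i in range(20):
        # timeout=5 is an arbitrary choice; it stops one slow request from hanging the loop
        r = requests.get("https://www.sogou.com/", timeout=5)
        r.raise_for_status()
        r.encoding = 'utf-8'
        print(r)
    # Lengths of the last response: characters in text, bytes in content
    print(len(r.text))
    print(len(r.content))
except requests.RequestException:
    print("Error")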

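The result is one printed Response object per request (for example <Response [200]>), followed by the lengths of r.text and r.content for the last response.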
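A variation on the loop above counts how many of the requests succeed instead of stopping at the first failure. This is a sketch under stated assumptions: fetch_repeatedly and its get parameter are hypothetical names introduced here (the get parameter exists only so the loop can be tried without a network connection), and the default timeout of 5 seconds is arbitrary.

```python
import requests


def fetch_repeatedly(url, times=20, timeout=5, get=requests.get):
    """Request url `times` times; return (successes, text_length, content_length).

    The lengths describe the last successful response, or 0 if none succeeded.
    """
    successes = 0
    text_len = content_len = 0
    for _ in range(times):
        try:
            r = get(url, timeout=timeout)
            r.raise_for_status()
        except requests.RequestException:
            continue  # skip failed requests and keep going
        r.encoding = "utf-8"
        successes += 1
        text_len, content_len = len(r.text), len(r.content)
    return successes, text_len, content_len
```

Calling fetch_repeatedly("https://www.sogou.com/") would then report how many of the 20 requests completed, along with the two lengths.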

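Use the BeautifulSoup4 library to extract useful information from a page's source code

Below is the source code of the web page accessed in this exercise: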

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>菜鸟教程（runoob.com)</title> 
</head>
<body>
    <h1>我的第一个标题</h1>
    <p id="first">我的第一个段落。</p>
    <table border="1">
        <tr>
            <td>row 1, cell 1</td>
            <td>row 1, cell 2</td>
        </tr>
        <tr>
            <td>row 2, cell 1</td>
            <td>row 2, cell 2</td>
        </tr>
    </table>
</body>
</html>

Note: a Chinese-language page needs <meta charset="utf-8"> to declare its encoding, otherwise the text will appear garbled.
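To see what BeautifulSoup can pull out of a page like this one, the markup can be parsed directly from a string, with no request involved. The sketch below uses a trimmed copy of the page above. The variable names are illustrative, and "html.parser" is Python's built-in parser, chosen here so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample page, held in a string instead of being downloaded
html = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>菜鸟教程（runoob.com)</title>
</head>
<body>
<h1>我的第一个标题</h1>
<p id="first">我的第一个段落。</p>
<table border="1">
<tr><td>row 1, cell 1</td><td>row 1, cell 2</td></tr>
<tr><td>row 2, cell 1</td><td>row 2, cell 2</td></tr>
</table>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title.string                         # text inside <title>
first_para = soup.find("p", id="first").get_text()  # look up a tag by its id attribute
# One inner list per table row, one string per cell
rows = [[td.get_text() for td in tr.find_all("td")] for tr in soup.find_all("tr")]

print(title)
print(first_para)
print(rows)
```

The same calls work on soup objects built from r.text once a page has been downloaded.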

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.baidu.com")
r.encoding = "utf-8"
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r.text, "html.parser")
print("Contents of the head tag:\n", soup.head, "\n")
print("Contents of the body tag:\n", soup.body, "\n")
a = soup.find_all('a')      # list of every <a> tag on the page
print(soup.a.string, "\n")  # text of the first <a> tag