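Python for Informatics, Chapter 12: Networked Programs, Part 6 (Translation)

Note: The original text is "Python for Informatics" by Dr. Charles Severance. The code in this post has been rewritten for Python 3.4 and tested on my own machine.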

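12.9 Glossary

BeautifulSoup: A Python library for parsing HTML documents and extracting data from them. It compensates for most of the HTML imperfections that browsers generally ignore. You can download the BeautifulSoup code from www.crummy.com.

port: A number that generally indicates which server application you are contacting when you make a socket connection to a server. For example, web traffic usually uses port 80 and email traffic uses port 25.

scrape: When a program pretends to be a web browser, retrieves a web page, and then looks at the page content. Such programs often follow the links in one page to find the next page, so that they can traverse a network of pages or a social network.

socket: A network connection between two applications, over which the applications can send and receive data in either direction.

spider: The act of a web search engine retrieving a page, then all the pages linked from that page, and so on, until it has nearly all of the pages on the Internet, which it uses to build its search index.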
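
To make the scrape and spider entries concrete, here is a minimal sketch of the traversal a spider performs. It deliberately stays off the network: the PAGES dictionary, its example.com URLs, and the crawl function are hypothetical stand-ins invented for this illustration and do not come from the book. A real spider would download each page and extract its links, for example with BeautifulSoup.

```python
from collections import deque

# Hypothetical miniature "web": each page URL maps to the links found on that page.
PAGES = {
    'http://example.com/a': ['http://example.com/b', 'http://example.com/c'],
    'http://example.com/b': ['http://example.com/c'],
    'http://example.com/c': ['http://example.com/a'],
}

def crawl(start, pages):
    """Visit every page reachable from start, breadth first, each page only once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        # Follow each link on this page unless it has been queued before
        for link in pages.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl('http://example.com/a', PAGES))
```

The seen set is what keeps the loop from running forever when pages link back to each other, as page c does here.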

12.10 Exercises

The solution code for the exercises below was written by the translator and is for reference only.

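Exercise 12.1 Change socket1.py to prompt the user for the URL so that the program can read any web page. You can use split('/') to break the URL into its component parts and extract the host name for the socket connect call. Add error checking with try and except to handle the case where the user enters an improperly formatted or non-existent URL.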

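import socket
import re
import sys

url = input('Enter a URL like this: http://www.py4inf.com/code/socket1.py\n')
# Accept only http:// URLs that have a host name followed by a path
if re.search(r'^http://[A-Za-z0-9.-]+/', url):
    words = url.split('/')
    hostname = words[2]  # split gives ['http:', '', host, path, ...]
    mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        mysocket.connect((hostname, 80))  # note the double parentheses: connect() takes one (host, port) tuple
    except OSError:
        print(hostname, 'is not a correct web server')
        sys.exit(1)  # a bare `exit` does nothing; sys.exit() actually stops the program
    mysocket.send(str.encode('GET ' + url + ' HTTP/1.0\n\n'))
    while True:
        # errors='replace' avoids an exception when a multi-byte character is split across two chunks
        data = mysocket.recv(1024).decode('utf-8', errors='replace')
        if len(data) < 1:
            break
        print(data)
    mysocket.close()
else:
    print("The URL you entered is badly formatted")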
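
The host-name step in this exercise can be tried without opening a socket. The sketch below isolates the split('/') idea in a small function; the name extract_host and its return-None convention are assumptions made for this illustration, not part of the book or of socket1.py.

```python
def extract_host(url):
    """Return the host name of an http:// URL, or None if the URL is malformed."""
    words = url.split('/')
    # A well-formed URL splits into ['http:', '', host, path, ...]
    if len(words) < 3 or words[0] != 'http:' or words[1] != '' or words[2] == '':
        return None
    return words[2]

print(extract_host('http://www.py4inf.com/code/socket1.py'))
print(extract_host('www.py4inf.com'))
```

Checking the pieces produced by split('/') does the same job as the regular expression in the solution, and it makes explicit why the host name is at index 2.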

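Exercise 12.2 Change your socket program so that it counts the number of characters it has received and stops displaying text after it has shown 3000 characters. The program should still retrieve the entire document, count the total number of characters, and display that count at the end of the document.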

import socket
import re
import sys

url = input('Enter a URL like this: http://www.py4inf.com/code/socket1.py\n')
# Accept only http:// URLs that have a host name followed by a path
if re.search(r'^http://[A-Za-z0-9.-]+/', url):
    words = url.split('/')
    hostname = words[2]
    mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        mysocket.connect((hostname, 80))  # note the double parentheses: connect() takes one (host, port) tuple
    except OSError:
        print(hostname, 'is not a correct server')
        sys.exit(1)  # a bare `exit` does nothing; sys.exit() actually stops the program
    mysocket.send(str.encode('GET ' + url + ' HTTP/1.0\n\n'))
    count = 0
    while True:
        data = mysocket.recv(3000).decode('utf-8', errors='replace')
        if len(data) < 1:
            break
        if count < 3000:
            # Show only as much of this chunk as still fits in the 3000-character limit
            print(data[:3000 - count], end='')
        count = count + len(data)
    print()
    print("The total character count of this page is", count)
    mysocket.close()
else:
    print("The URL you entered is badly formatted")
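
Decoding each received chunk on its own can go wrong when a multi-byte UTF-8 character is split across two recv calls. As a sketch of one way around this, the function below runs the chunks through an incremental decoder from the standard codecs module and applies the display limit at the same time. The name show_and_count, its limit parameter, and the sample chunks are assumptions made for this illustration; it works on any iterable of byte chunks, so it can be tried without a network connection.

```python
import codecs

def show_and_count(chunks, limit=3000):
    """Return (text to display, total characters) for an iterable of byte chunks."""
    decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
    shown = []
    total = 0
    for chunk in chunks:
        # The decoder holds back an incomplete character until its remaining bytes arrive
        text = decoder.decode(chunk)
        if total < limit:
            shown.append(text[:limit - total])
        total += len(text)
    # Flush anything still buffered in the decoder
    tail = decoder.decode(b'', final=True)
    if total < limit:
        shown.append(tail[:limit - total])
    total += len(tail)
    return ''.join(shown), total

# The two bytes of 'é' are deliberately split across the two chunks
chunks = [b'caf\xc3', b'\xa9 au lait']
text, total = show_and_count(chunks, limit=5)
print(text, total)
```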

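Exercise 12.3 Use urllib to replicate the previous exercise: (1) retrieve the document from a URL, (2) display up to 3000 characters, and (3) count the overall number of characters in the document. Don't worry about the headers for this exercise; simply show the first 3000 characters of the document contents.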

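import urllib.request
import re
import sys

url = input('Enter a URL like this: http://www.py4inf.com/code/socket1.py\n')

# Accept only http:// URLs that have a host name followed by a path
if re.search(r'^http://[A-Za-z0-9.-]+/', url):
    try:
        web = urllib.request.urlopen(url)
    except OSError:  # urllib.error.URLError is a subclass of OSError
        print(url, 'is not a valid URL')
        sys.exit(1)  # a bare `exit` does nothing; sys.exit() actually stops the program

    counts = 0
    while True:
        data = web.read(3000)  # read() returns bytes, so the limit and the total are byte counts
        if len(data) < 1:
            break
        if counts < 3000:
            # Show only as much of this chunk as still fits in the 3000 limit
            print(data[:3000 - counts].decode('utf-8', errors='replace'), end='')
        counts = counts + len(data)
    print()
    print("The total count for this page is", counts)
else:
    print("The URL you entered is badly formatted")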

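Exercise 12.4 Change the urllinks.py program to extract and count the paragraph (p) tags in the retrieved HTML document and display the number of paragraph tags. Do not display the paragraph text; only count the tags. Test your program on several small web pages as well as some larger ones.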

from bs4 import BeautifulSoup
import urllib.request

url = input('Enter - ')
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('p')  # calling the soup object is shorthand for soup.find_all('p')
counts = len(tags)  # no loop needed: the result behaves like a list
print('This page has', counts, 'p tags.')
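
If BeautifulSoup is not installed, the same count can be approximated with the standard library alone. The following is a minimal sketch, assuming that counting opening p tags is what is wanted; the ParagraphCounter class, the count_paragraphs function, and the sample HTML string are hypothetical names invented for this illustration and are not part of urllinks.py.

```python
from html.parser import HTMLParser

class ParagraphCounter(HTMLParser):
    """Counts opening <p> tags as the parser encounters them."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <P> is counted too
        if tag == 'p':
            self.count += 1

def count_paragraphs(html):
    counter = ParagraphCounter()
    counter.feed(html)
    counter.close()
    return counter.count

sample = '<html><body><p>One</p><P class="x">Two</P><div>no</div></body></html>'
print(count_paragraphs(sample))
```

To use it on a downloaded page, decode the bytes returned by urlopen(url).read() to a string before passing them to count_paragraphs.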

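Exercise 12.5 (Advanced) Change the socket program so that it only shows the data that comes after the headers and the blank line. Remember that recv receives characters (newlines and all), not lines.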

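import socket
import re
import sys

url = input('Enter a URL like this: http://www.py4inf.com/code/socket1.py\n')
# Accept only http:// URLs that have a host name followed by a path
if re.search(r'^http://[A-Za-z0-9.-]+/', url):
    words = url.split('/')
    hostname = words[2]

    mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        mysocket.connect((hostname, 80))  # note the double parentheses: connect() takes one (host, port) tuple
    except OSError:
        print(hostname, 'is not a correct server')
        sys.exit(1)  # a bare `exit` does nothing; sys.exit() actually stops the program

    mysocket.send(str.encode('GET ' + url + ' HTTP/1.0\n\n'))
    # Collect the whole response as bytes before looking for the end of the headers
    web = b''
    while True:
        data = mysocket.recv(1024)
        if len(data) < 1:
            break
        web = web + data
    mysocket.close()

    # The headers end at the first blank line, that is, CRLF CRLF (4 bytes)
    pos = web.find(b'\r\n\r\n')
    if pos == -1:
        print("No blank line found after the headers")
    else:
        print(web[pos+4:].decode('utf-8', errors='replace'))
else:
    print("The URL you entered is badly formatted")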
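
The header/body split can be checked offline against a hand-written response. The sketch below uses bytes.partition, which splits at the first occurrence of the separator only, so blank lines inside the body are left alone. The split_response function and the sample response bytes are assumptions made for this illustration.

```python
def split_response(raw):
    """Split raw HTTP response bytes into (headers, body)."""
    headers, separator, body = raw.partition(b'\r\n\r\n')
    if not separator:
        # No blank line found: treat everything as headers, with an empty body
        return raw, b''
    return headers, body

# Hypothetical response; note the second blank line, which belongs to the body
raw = b'HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nHello\r\n\r\nWorld'
headers, body = split_response(raw)
print(body.decode('utf-8'))
```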
