Advanced Data Analysis: Handling More Complex Data Formats

Goal: practice using BeautifulSoup and XML parsing to clean data

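Exercise 1: Get the list of airlines

Description: Given the HTML file (options.html), return a list of all airlines. Exclude every aggregate entry such as "All U.S. Carriers". The function should return a list of carrier codes.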

from bs4 import BeautifulSoup
html_page = "options.html"


def extract_carriers(page):
    data = []  # result list
    with open(page, 'r') as html:
        # Parse the HTML file with BeautifulSoup, using the lxml parser
        soup = BeautifulSoup(html, 'lxml')
        # Find the element with id=CarrierList (a select tag)
        carryList = soup.find(id='CarrierList')
        # Find all option elements under the select tag
        options = carryList.find_all('option')
        for option in options:
            # Skip options whose value starts with "All"
            if not option['value'].startswith('All'):
                # Collect the remaining values for the returned list
                data.append(option['value'])
        return data

def test():
    # Test function: if no assert fails, the code meets the requirements
    data = extract_carriers(html_page)
    assert len(data) == 16
    assert "FL" in data
    assert "NK" in data
   
if __name__ == "__main__":
    test()

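Exercise 2: Get the list of airports

Description: Given the same HTML file, return a list of all airports. As in Exercise 1, exclude every aggregate entry (the airport counterparts of "All U.S. Carriers", whose values start with "All"). The function should return a list of airport codes.

The approach is the same as above, so it is not described in detail.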

def extract_airports(page):
    data = []
    with open(page,'r') as html:
        soup = BeautifulSoup(html,'lxml')
        airportList = soup.find(id='AirportList')
        options = airportList.find_all('option')
        for option in options:
            if not option['value'].startswith('All'):
                data.append(option['value'])
        return data

def test():
    data = extract_airports(html_page)
    assert len(data) == 15
    assert "ATL" in data
    assert "ABR" in data

if __name__ == "__main__":
    test()
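
Because Exercises 1 and 2 differ only in the id they look up, the shared logic could be folded into one helper. The following is a minimal sketch, not the course's reference solution: the name extract_option_values and the inline sample HTML are made up for illustration, and it uses Python's built-in 'html.parser' instead of lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup


def extract_option_values(html, element_id):
    """Return the option values under the element with the given id,
    skipping every value that starts with "All" (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    select = soup.find(id=element_id)
    return [option['value']
            for option in select.find_all('option')
            if not option['value'].startswith('All')]


# Made-up sample that mimics the structure of options.html
sample_html = """
<select id="CarrierList">
  <option value="All">All U.S. and Foreign Carriers</option>
  <option value="AllUS">All U.S. Carriers</option>
  <option value="FL">AirTran Airways</option>
  <option value="NK">Spirit Air Lines</option>
</select>
<select id="AirportList">
  <option value="All">All</option>
  <option value="AllMajors">All Major Airports</option>
  <option value="ATL">Atlanta, GA</option>
</select>
"""

carriers = extract_option_values(sample_html, 'CarrierList')
airports = extract_option_values(sample_html, 'AirportList')
print(carriers)
print(airports)
```

Passing the id as a parameter keeps the filtering rule in one place, so a change to the "All" check applies to both lists.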

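Exercise 3: Process all the data

Description: Each flight-information HTML file contains a table with the class "dataTDRight". Extract the flight data from that table and return it as a list of dictionaries, where each dictionary holds the relevant information from the file together with one table row. The returned data structure should look like this: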

data = [
    {"courier": "FL",
     "airport": "ATL",
     "year": 2012,
     "month": 12,
     "flights": {"domestic": 100,
                 "international": 100}
    },
    {"courier": "..."}
]

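Approach:

1. Parse the HTML file with BeautifulSoup.

2. Build dictionaries in the format shown above.

3. Find the table with class='dataTDRight', then walk its tr rows and their td cells to read the values.

4. Put the qualifying values into a dictionary and append it to the list that is returned.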
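
The row handling in steps 3 and 4 can be sketched without any HTML at all. This is a minimal illustration under stated assumptions: the function name rows_to_entries and the sample rows are made up, and each row is assumed to be a list of four cell strings (year, month, domestic, international). The point it shows is that every row gets its own dictionary and that comma-separated counts are converted to integers.

```python
def rows_to_entries(courier, airport, rows):
    """Turn rows of cell text into flight dictionaries (hypothetical helper)."""
    entries = []
    for year, month, domestic, international in rows:
        # Skip the header row and the yearly TOTAL rows
        if month in ('Month', 'TOTAL'):
            continue
        # A new dictionary is created for every row, so entries never share state
        entries.append({
            "courier": courier,
            "airport": airport,
            "year": int(year),
            "month": int(month),
            "flights": {
                "domestic": int(domestic.replace(',', '')),
                "international": int(international.replace(',', '')),
            },
        })
    return entries


# Made-up sample rows, already reduced to the text of each td cell
rows = [
    ['Year', 'Month', 'DOMESTIC', 'INTERNATIONAL'],
    ['2012', '10', '1,200', '300'],
    ['2012', '11', '1,100', '280'],
    ['2012', 'TOTAL', '2,300', '580'],
]
entries = rows_to_entries('FL', 'ATL', rows)
print(entries)
```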

from bs4 import BeautifulSoup
import os
datadir = "data"


def process_all(datadir):
    # List the names of the files in the given directory
    files = os.listdir(datadir)
    return files

def process_file(f):
    data = []  # result list
    # The carrier and airport codes come from the first six characters
    # of the file name, e.g. "FL-ATL"
    courier, airport = f[:6].split("-")
    with open("{}/{}".format(datadir, f), "r") as html:
        soup = BeautifulSoup(html, 'lxml')
        # Locate the target table by tag name and class
        table_data = soup.find('table', {'class': 'dataTDRight'})
        # Walk every tr in the table
        for tr in table_data.find_all('tr'):
            # Collect the td cells of this row
            td = tr.find_all('td')
            # Skip the header row and the rows holding a year's TOTAL
            if td[1].text == 'Month' or td[1].text == 'TOTAL':
                continue
            # Build a new dictionary for each row; reusing a single dictionary
            # would leave every list element pointing at the last row.
            # Year, month and flight counts must be integers.
            entry = {"courier": courier,
                     "airport": airport,
                     "year": int(td[0].text),
                     "month": int(td[1].text),
                     "flights": {"domestic": int(td[2].text.replace(',', '')),
                                 "international": int(td[3].text.replace(',', ''))}
                    }
            # Append the dictionary to the list that is returned
            data.append(entry)
    return data

def test():
    # Test function: process every file listed by process_all() and compare
    # some values; if no assert fails, the code meets the requirements
    print("Running a simple test...")
    files = process_all(datadir)
    data = []
    # Test will loop over three data files.
    for f in files:
        data += process_file(f)
        
    assert len(data) == 399  # Total number of rows
    for entry in data[:3]:
        assert type(entry["year"]) == int
        assert type(entry["month"]) == int
        assert type(entry["flights"]["domestic"]) == int
        assert len(entry["airport"]) == 3
        assert len(entry["courier"]) == 2
    assert data[0]["courier"] == 'FL'
    assert data[0]["month"] == 10
    assert data[-1]["airport"] == "ATL"
    assert data[-1]["flights"] == {'international': 108289, 'domestic': 701425}
    
    print("... success!")

Exercise 4: Patent database

Description: Run the given code, read the error message, and work out why it fails.

import xml.etree.ElementTree as ET

PATENTS = 'patent.data'

def get_root(fname):

    tree = ET.parse(fname)
    return tree.getroot()


get_root(PATENTS)

Error message

File "<string>", line unknown
ParseError: junk after document element: line 657, column 0

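The error message shows that another <?xml ...?> declaration begins at line 657 of the file. An XML document may have only one declaration and one root element, so the parser treats everything after the first root element as junk. In other words, the file is really several XML documents concatenated together.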
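
The failure can be reproduced on a small scale. The sketch below is illustrative only: the helper try_parse and the two-document sample string are made up, and patent.data itself is not needed. It parses a string holding two concatenated documents, then the first document alone.

```python
import xml.etree.ElementTree as ET

# Two complete XML documents glued together (made-up sample data)
concatenated = (
    '<?xml version="1.0"?>\n'
    '<patent id="1"/>\n'
    '<?xml version="1.0"?>\n'
    '<patent id="2"/>\n'
)


def try_parse(text):
    """Return the ParseError raised while parsing text, or None on success
    (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err
    return None


error = try_parse(concatenated)  # fails at the second declaration
first_only = try_parse('<?xml version="1.0"?>\n<patent id="1"/>\n')  # parses
print(error)
print(first_only)
```

Parsing succeeds once each document is handled on its own, which is the motivation for splitting the file.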

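Exercise 5: Processing patents

Description:

1. The file is not valid XML because it contains multiple header declarations: <?xml version="1.0" encoding="UTF-8"?>

2. Split the file at each XML declaration into several separate, valid XML files.

Approach:

1. Split the whole file A into separate sub-files (A1, A2, A3, ...).

2. While A is open, open a sub-file handle in w mode, naming the sub-file from the file name and a sequence number.

3. Loop over A. Whenever a line starts with <?xml, close the current sub-file, add 1 to the sequence number, and open a new sub-file handle for writing.

4. Write each line to the current sub-file.

5. When the loop over A finishes, close the sub-file.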
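
The splitting rule in the steps above can be tried in memory before touching any files. This is a minimal sketch under stated assumptions: split_documents and the sample text are made up for illustration, and chunks are kept as strings instead of being written to sub-files. Each chunk is then parsed to confirm that it is a valid document on its own.

```python
import xml.etree.ElementTree as ET


def split_documents(lines):
    """Group lines into documents, starting a new one at every XML
    declaration (hypothetical in-memory helper)."""
    chunks = []
    for line in lines:
        if line.startswith('<?xml'):
            chunks.append([])       # a declaration opens a new document
        if chunks:                  # ignore anything before the first declaration
            chunks[-1].append(line)
    return [''.join(chunk) for chunk in chunks]


# Made-up sample standing in for the concatenated patent file
sample = (
    '<?xml version="1.0"?>\n'
    '<patent id="1"/>\n'
    '<?xml version="1.0"?>\n'
    '<patent id="2"/>\n'
)

documents = split_documents(sample.splitlines(True))
roots = [ET.fromstring(doc) for doc in documents]  # each chunk parses alone
ids = [root.get('id') for root in roots]
print(ids)
```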

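PATENTS = 'patent.data'  # target file (A)


def split_file(filename):
    # Open file A, the data file to be split
    with open(filename, 'r') as input_file:
        file_num = -1       # number of the current sub-file
        output_file = None  # no sub-file is open until the first declaration
        for line in input_file:  # loop over file A
            # A declaration line marks the start of a new document
            if line.startswith('<?xml'):
                # Close the current sub-file (if any), add 1 to the number, and
                # open a new handle in w mode, named 'filename-number'
                if output_file is not None:
                    output_file.close()
                file_num += 1
                output_file = open('{}-{}'.format(filename, file_num), 'w')
            # Write every line, including the declaration, to the current sub-file
            if output_file is not None:
                output_file.write(line)
        # Loop finished: close the last sub-file
        if output_file is not None:
            output_file.close()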

def test():
    # Test function: each of the first four sub-files must start with a declaration
    split_file(PATENTS)
    for n in range(4):
        try:
            fname = "{}-{}".format(PATENTS, n)
            f = open(fname, "r")
            if not f.readline().startswith("<?xml"):
                print("You have not split the file {} in the correct boundary!".format(fname))
            f.close()
        except IOError:
            print("Could not find file {}. Check if the filename is correct!".format(fname))

if __name__ == "__main__":
    test()
