Screen scraping 3

Use BeautifulSoup to pull the job links out of the page:

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup as BS

# Download the page; decode as UTF-8 and drop any bytes that fail to decode
text = urlopen("http://www.python.org/community/jobs/").read()
soup = BS(text.decode('utf-8', 'ignore'), 'html.parser')

jobs = set()
for header in soup('h2'):
    # <a class="reference"> elements inside this heading
    links = header('a', 'reference')
    if not links:
        continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))

print('\n'.join(sorted(jobs, key=lambda s: s.lower())))
Collecting the entries in a set eliminates duplicates; the names are then printed in case-insensitive sorted order.
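
The dedupe-and-sort step can be tried on its own, without touching the network. This is a minimal sketch; the strings in `found` are made up for illustration and only stand in for what the scraper would collect.

```python
# Hypothetical job strings, standing in for what the scraper collects
found = [
    'python Developer (/jobs/1)',
    'Data Engineer (/jobs/2)',
    'python Developer (/jobs/1)',   # exact duplicate
    'Zope Admin (/jobs/3)',
]

jobs = set(found)   # the set keeps one copy of each distinct string

# Sort on the lowercased string so that case does not affect the order
ordered = sorted(jobs, key=lambda s: s.lower())
print('\n'.join(ordered))
```

Without the key, sorted() compares code points, so every entry starting with an uppercase letter would come before 'python Developer'.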

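soup('h2'): returns a list of all h2 elements in the document (calling the soup object is shorthand for soup.find_all('h2'))
header('a', 'reference'): returns a list of the a elements inside that header whose CSS class is reference (all descendants are searched, not only direct children)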
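
To see what the two calls return, here is a small sketch run against an inline HTML string instead of the live page. The markup is invented for illustration and is not what python.org actually serves.

```python
from bs4 import BeautifulSoup

# Made-up markup, loosely shaped like a page of job headings
html = """
<h2><a class="reference" href="/jobs/1">Python Developer</a></h2>
<h2>No link here</h2>
<h2><a class="other" href="/x">Skipped</a> <a class="reference" href="/jobs/2">Data Engineer</a></h2>
"""
soup = BeautifulSoup(html, 'html.parser')

headers = soup('h2')   # shorthand for soup.find_all('h2')

results = []
for header in headers:
    # shorthand for header.find_all('a', class_='reference')
    links = header('a', 'reference')
    results.append([(a.string, a['href']) for a in links])

print(results)
```

The second heading yields an empty list, which is why the scraper checks `if not links` before using `links[0]`.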

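Author: Shane
Source: http://bluescorpio.cnblogs.com
The copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original article must be given in a prominent place on the page; otherwise the right to pursue legal liability is reserved.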