Python crawler ----> Python projects on GitHub

  This post crawls some high-star Python projects on GitHub as a way to practice BeautifulSoup and pymysql. I always thought the mountain was the water's story, the cloud was the wind's story, and you were my story; yet I never knew whether I was your story.

A Python crawler for GitHub

Crawler requirement: crawl high-quality Python-related projects on GitHub. The code below is only a test case and does not fetch much data.

1. A crawler version with the basic functionality

This example covers batch inserts with pymysql, parsing HTML with BeautifulSoup, and fetching data with GET requests from the requests library. For more on using pymysql, see the post: python框架---->pymysql的使用.
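Before the full program, here is a minimal, self-contained sketch of the two BeautifulSoup calls the crawler relies on. The HTML snippet is made up for illustration; find_all() walks the repeated result items, while find()['href'] and get_text() pull an attribute and text out of a tag.

from bs4 import BeautifulSoup

# Made-up HTML snippet, shaped like one GitHub search result item
html = '''
<div class="repo-list-item">
  <a class="v-align-middle" href="/vinta/awesome-python">awesome-python</a>
  <a class="muted-link">41.4k</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all('div', class_='repo-list-item'):
    link = item.find('a', attrs={'class': 'v-align-middle'})
    stars = item.find('a', attrs={'class': 'muted-link'}).get_text().strip()
    print(link['href'], stars)  # prints: /vinta/awesome-python 41.4k

The full crawler below applies exactly these calls to the real search results page.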

import requests
import pymysql.cursors
from bs4 import BeautifulSoup

def get_effect_data(data):
    # Parse the search-results HTML and collect one tuple per repository
    results = list()
    soup = BeautifulSoup(data, 'html.parser')
    # Each search hit sits in a div with class 'repo-list-item' (GitHub's markup as of late 2017)
    projects = soup.find_all('div', class_='repo-list-item')
    for project in projects:
        # The repo link's href looks like '/owner/repo', so split('/') yields ['', owner, repo]
        writer_project = project.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
        project_language = project.find('div', attrs={'class': 'd-table-cell col-2 text-gray pt-2'}).get_text().strip()
        project_starts = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
        update_desc = project.find('p', attrs={'class': 'f6 text-gray mb-0 mt-2'}).get_text().strip()

        result = (writer_project.split('/')[1], writer_project.split('/')[2],
                  project_language, project_starts, update_desc)
        results.append(result)
    return results


def get_response_data(page):
    # Request one page of GitHub search results for Python repositories, ordered by stars descending
    request_url = 'https://github.com/search'
    params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'Repositories', 'p': page}
    resp = requests.get(request_url, params=params)
    return resp.text


def insert_datas(data):
    connection = pymysql.connect(host='localhost',
                                 user='root',
                                 password='root',
                                 db='test',
                                 charset='utf8mb4',
                                 cursorclass=pymysql.cursors.DictCursor)
    try:
        with connection.cursor() as cursor:
            sql = 'insert into project_info(project_writer, project_name, project_language, project_starts, update_desc) VALUES (%s, %s, %s, %s, %s)'
            # executemany sends the whole list of tuples as one batched insert
            cursor.executemany(sql, data)
        connection.commit()
    finally:
        # Always release the connection, even if the insert fails
        connection.close()


if __name__ == '__main__':
    total_page = 2  # total number of search result pages to crawl
    datas = list()
    for page in range(total_page):
        res_data = get_response_data(page + 1)
        data = get_effect_data(res_data)
        datas += data
    insert_datas(datas)
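
The program assumes a project_info table already exists in the test database. The original post does not show the schema; the snippet below is a hedged sketch in which the column names match the INSERT statement above but the column types are assumptions (project_starts is stored as text such as '41.4k').

import pymysql

# Assumed schema: column names follow the INSERT in insert_datas(), types are guesses
create_sql = '''
CREATE TABLE IF NOT EXISTS project_info (
    id INT AUTO_INCREMENT PRIMARY KEY,
    project_writer VARCHAR(100),
    project_name VARCHAR(200),
    project_language VARCHAR(50),
    project_starts VARCHAR(20),
    update_desc VARCHAR(100)
) DEFAULT CHARSET=utf8mb4
'''
connection = pymysql.connect(host='localhost', user='root', password='root', db='test', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute(create_sql)
    connection.commit()
finally:
    connection.close()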

After the run, the database contains rows like the following (columns: id, project_writer, project_name, project_language, project_starts, update_desc):

11 tensorflow tensorflow C++ 78.7k Updated Nov 22, 2017
12 robbyrussell oh-my-zsh Shell 62.2k Updated Nov 21, 2017
13 vinta awesome-python Python 41.4k Updated Nov 20, 2017
14 jakubroztocil httpie Python 32.7k Updated Nov 18, 2017
15 nvbn thefuck Python 32.2k Updated Nov 17, 2017
16 pallets flask Python 31.1k Updated Nov 15, 2017
17 django django Python 29.8k Updated Nov 22, 2017
18 requests requests Python 28.7k Updated Nov 21, 2017
19 blueimp jQuery-File-Upload JavaScript 27.9k Updated Nov 20, 2017
20 ansible ansible Python 26.8k Updated Nov 22, 2017
21 justjavac free-programming-books-zh_CN JavaScript 24.7k Updated Nov 16, 2017
22 scrapy scrapy Python 24k Updated Nov 22, 2017
23 scikit-learn scikit-learn Python 23.1k Updated Nov 22, 2017
24 fchollet keras Python 22k Updated Nov 21, 2017
25 donnemartin system-design-primer Python 21k Updated Nov 20, 2017
26 certbot certbot Python 20.1k Updated Nov 20, 2017
27 aymericdamien TensorFlow-Examples Jupyter Notebook 18.1k Updated Nov 8, 2017
28 tornadoweb tornado Python 14.6k Updated Nov 17, 2017
29 python cpython Python 14.4k Updated Nov 22, 2017
30 reddit reddit Python 14.2k Updated Oct 17, 2017
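
Note that the CSS class names used above match GitHub's search page as it looked in late 2017; the markup has changed since, so the selectors will likely need adjusting before the crawler works again. A more stable route, sketched below and not part of the original post, is GitHub's public REST search API, which returns JSON and needs no HTML parsing (unauthenticated requests are rate-limited).

import requests

# Hedged alternative sketch: query the REST search API instead of scraping HTML
resp = requests.get('https://api.github.com/search/repositories',
                    params={'q': 'language:python', 'sort': 'stars', 'order': 'desc', 'page': 1})
for repo in resp.json().get('items', []):
    print(repo['full_name'], repo['language'], repo['stargazers_count'])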

Friendly link

Original article: https://www.cnblogs.com/huhx/p/usepythongithubspider.html