Getting Website Data in R with rvest: A Web Scraping Primer

Python is often praised for its powerful scraping tools, yet dynamically loaded pages and login-gated sites are still hard even in Python. For most ordinary scraping tasks, R is perfectly convenient. This post introduces scraping with the R rvest package, mainly using the functions read_html(), html_nodes(), html_text(), and html_attrs().

rvest: Easily Harvest (Scrape) Web Pages

CRAN - Package rvest (r-project.org)

tidyverse/rvest: Simple web scraping for R (github.com)

First, install rvest:

install.packages("rvest")

  

Once installed, load it:

library(rvest)

  

Function reference:
read_html() — read an HTML page
html_nodes() — extract all nodes matching a selector
html_node() — return the first match for each element of the input (a list of the same length, equivalent to taking html_nodes(...)[[1]])
html_table() — parse the tables inside <table> tags; trim=TRUE by default, set header=TRUE to include the header row; returns data frames
html_text() — extract the text contained in a tag; set trim=TRUE to strip leading and trailing whitespace
html_attrs(nodes) — extract all attributes and their values for the given nodes; returns a list
html_attr(nodes, attr) — extract the value of one named attribute
html_children() — extract the child nodes of a node
html_session() — create a browsing session
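A minimal sketch exercising the functions above on an inline HTML fragment (the fragment itself is made up for illustration):

```r
library(rvest)

## a tiny inline page to try the functions on
page <- read_html('<div><a href="https://example.com">Example</a>
                   <table><tr><th>x</th></tr><tr><td>1</td></tr></table></div>')

links <- html_nodes(page, "a")         # all matching <a> nodes
html_text(links, trim = TRUE)          # "Example"
html_attr(links, "href")               # "https://example.com"
html_attrs(links)                      # list of all attributes per node
html_table(html_nodes(page, "table"))  # list containing one data frame
```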

Worked examples:

1. Scraping the SSE Composite Index constituent list

Site: Shanghai Stock Exchange, SSE Composite Index constituent list (sse.com.cn)    http://www.sse.com.cn/market/sseindex/indexlist/s/i000001/const_list.shtml

 

First use Chrome to get the XPath of the part of the page containing the table: right-click the start of the table and choose "Inspect", which opens an HTML source panel on the right side of the browser.

The highlighted portion is the source for the start of the table. Click up to the enclosing <table class="tablestyle"> element, right-click it, and choose "Copy → Copy XPath" to obtain the following XPath: '//*[@id="content_ab"]/div[1]/table'

Then use rvest's html_nodes() function to extract the part of the page specified by the XPath, and html_table() to convert the HTML table into a data frame. The result is a list of data frames; since there is only one, take the first element of the list.

library(rvest)

## page URL
urlb <- "http://www.sse.com.cn/market/sseindex/indexlist/s/i000001/const_list.shtml"
## xpath of the data table within the page
xpath <- '//*[@id="content_ab"]/div[1]/table'
## read the page and extract the table nodes
nodes <- html_nodes(read_html(urlb), xpath = xpath)
## convert the table nodes into a list of data frames
tables <- html_table(nodes)
restab <- tables[[1]]
head(restab)
##                  X1                X2                X3
## 1 浦发银行 (600000) 白云机场 (600004) 东风汽车 (600006)
## 2 中国国贸 (600007) 首创股份 (600008) 上海机场 (600009)

  

Each row holds three stocks. Next we strip the newline characters and spaces from the data, then convert it into a format with the name and code in separate columns:

library(tidyverse)

pat1 <- "^(.*?)\\((.*?)\\)"
tab1 <- restab %>%
  ## merge the three columns into one character vector
  reduce(c) %>% 
  ## strip spaces and newlines; result is a character vector
  stringr::str_replace_all("[[:space:]]", "") %>%
  ## capture the company name and code into the columns of a character matrix
  stringr::str_match(pat1) 
tab <- tibble(
  name = tab1[,2],
  code = tab1[,3])
head(tab)
## # A tibble: 6 x 2
##   name     code  
##   <chr>    <chr> 
## 1 浦发银行 600000
## 2 中国国贸 600007
## 3 包钢股份 600010
## 4 华夏银行 600015
## 5 上港集团 600018
## 6 上海电力 600021

  

str(tab)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1551 obs. of  2 variables:
##  $ name: chr  "浦发银行" "中国国贸" "包钢股份" "华夏银行" ...
##  $ code: chr  "600000" "600007" "600010" "600015" ...

  

 

For pages that do not follow a clean structure, you can download the page file with download.file(), strip unwanted components with str_replace_all() or gsub(), and locate the key lines with str_which() or grep().
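A hedged sketch of that fallback workflow; the URL and patterns below are placeholders for illustration, not taken from any real site:

```r
library(stringr)

url  <- "http://example.com/messy-page.html"   # hypothetical irregular page
dest <- tempfile(fileext = ".html")
download.file(url, dest, quiet = TRUE)         # save the raw page to a file

lines <- readLines(dest, warn = FALSE)
## strip components we do not need, e.g. HTML comments
lines <- str_replace_all(lines, "<!--.*?-->", "")
## locate the key lines by a keyword
hits  <- str_which(lines, "keyword")
lines[hits]
```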

 

2. Scraping the 100 most popular 2016 feature films from IMDb

Feature Film, Released between 2016-01-01 and 2016-12-31 (Sorted by Popularity Ascending) - IMDb

# load the package
library('rvest')

# the url to scrape
url <- 'https://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature'

# read the html of the page
webpage <- read_html(url)

  

# use a CSS selector to grab the ranking section
rank_data_html <- html_nodes(webpage,'.text-primary')

# convert the ranks to text
rank_data <- html_text(rank_data_html)

# take a look at the data
head(rank_data)

[1] "1." "2." "3." "4." "5." "6."

  

# preprocessing: convert the ranks to numeric
rank_data<-as.numeric(rank_data)

# check again
head(rank_data)

[1] 1 2 3 4 5 6

  

# scrape the titles
title_data_html <- html_nodes(webpage,'.lister-item-header a')

# convert to text
title_data <- html_text(title_data_html)

# take a look
head(title_data)

[1] "Sing"          "Moana"         "Moonlight"     "Hacksaw Ridge"
[5] "Passengers"    "Trolls"

  

# scrape the descriptions
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')

# convert to text
description_data <- html_text(description_data_html)

# take a look
head(description_data)

[1] "\nIn a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anticipates even as its finalists' find that their lives will never be the same."

[2] "\nIn Ancient Polynesia, when a terrible curse incurred by the Demigod Maui reaches an impetuous Chieftain's daughter's island, she answers the Ocean's call to seek out the Demigod to set things right."

[3] "\nA chronicle of the childhood, adolescence and burgeoning adulthood of a young, African-American, gay man growing up in a rough neighborhood of Miami."

[4] "\nWWII American Army Medic Desmond T. Doss, who served during the Battle of Okinawa, refuses to kill people, and becomes the first man in American history to receive the Medal of Honor without firing a shot."

[5] "\nA spacecraft traveling to a distant colony planet and transporting thousands of people has a malfunction in its sleep chambers. As a result, two passengers are awakened 90 years early."

[6] "\nAfter the Bergens invade Troll Village, Poppy, the happiest Troll ever born, and the curmudgeonly Branch set off on a journey to rescue her friends."
# remove '\n'
description_data<-gsub("\n","",description_data)

# check again
head(description_data)

[1] "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anticipates even as its finalists' find that their lives will never be the same."

[2] "In Ancient Polynesia, when a terrible curse incurred by the Demigod Maui reaches an impetuous Chieftain's daughter's island, she answers the Ocean's call to seek out the Demigod to set things right."

[3] "A chronicle of the childhood, adolescence and burgeoning adulthood of a young, African-American, gay man growing up in a rough neighborhood of Miami."

[4] "WWII American Army Medic Desmond T. Doss, who served during the Battle of Okinawa, refuses to kill people, and becomes the first man in American history to receive the Medal of Honor without firing a shot."

[5] "A spacecraft traveling to a distant colony planet and transporting thousands of people has a malfunction in its sleep chambers. As a result, two passengers are awakened 90 years early."

[6] "After the Bergens invade Troll Village, Poppy, the happiest Troll ever born, and the curmudgeonly Branch set off on a journey to rescue her friends."

# scrape the runtime section
runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')

# convert to text
runtime_data <- html_text(runtime_data_html)

# take a look
head(runtime_data)

[1] "108 min" "107 min" "111 min" "139 min" "116 min" "92 min"

# preprocessing: strip " min" and convert the numbers to numeric

runtime_data <- gsub(" min","",runtime_data)
runtime_data <- as.numeric(runtime_data)

# check again
head(runtime_data)

[1] 108 107 111 139 116  92

# scrape the genre section
genre_data_html <- html_nodes(webpage,'.genre')

# convert to text
genre_data <- html_text(genre_data_html)

# take a look
head(genre_data)

[1] "\nAnimation, Comedy, Family "

[2] "\nAnimation, Adventure, Comedy "

[3] "\nDrama "

[4] "\nBiography, Drama, History "

[5] "\nAdventure, Drama, Romance "

[6] "\nAnimation, Adventure, Comedy "

# remove '\n'
genre_data<-gsub("\n","",genre_data)

# remove extra spaces
genre_data<-gsub(" ","",genre_data)

# keep only the first genre for each movie
genre_data<-gsub(",.*","",genre_data)

# convert to factor
genre_data<-as.factor(genre_data)

# check again
head(genre_data)

[1] Animation Animation Drama     Biography Adventure Animation

  

# scrape the IMDB rating section
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')

# convert to text
rating_data <- html_text(rating_data_html)

# take a look
head(rating_data)

[1] "7.2" "7.7" "7.6" "8.2" "7.0" "6.5"

# convert to numeric
rating_data<-as.numeric(rating_data)

# check again
head(rating_data)

[1] 7.2 7.7 7.6 8.2 7.0 6.5

# scrape the votes section
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')

# convert to text
votes_data <- html_text(votes_data_html)

# take a look
head(votes_data)

[1] "40,603"  "91,333"  "112,609" "177,229" "148,467" "32,497"

# remove the commas
votes_data<-gsub(",", "", votes_data)

# convert to numeric
votes_data<-as.numeric(votes_data)

# check again
head(votes_data)

[1]  40603  91333 112609 177229 148467  32497

# scrape the directors section
directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')

# convert to text
directors_data <- html_text(directors_data_html)

# take a look
head(directors_data)

[1] "Christophe Lourdelet" "Ron Clements"         "Barry Jenkins"
[4] "Mel Gibson"           "Morten Tyldum"        "Walt Dohrn"

# convert to factor
directors_data<-as.factor(directors_data)

# scrape the actors section
actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')

# convert to text
actors_data <- html_text(actors_data_html)

# take a look
head(actors_data)

[1] "Matthew McConaughey" "Auli'i Cravalho"     "Mahershala Ali"
[4] "Andrew Garfield"     "Jennifer Lawrence"   "Anna Kendrick"

# convert to factor
actors_data<-as.factor(actors_data)

  

# scrape the metascore section
metascore_data_html <- html_nodes(webpage,'.metascore')

# convert to text
metascore_data <- html_text(metascore_data_html)

# take a look
head(metascore_data)

[1] "59        " "81        " "99        " "71        " "41        "
[6] "56        "

# remove extra spaces
metascore_data<-gsub(" ","",metascore_data)

# check the length of the metascore data
length(metascore_data)

[1] 96
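Only 96 of the 100 films carry a metascore, so this vector no longer lines up with the other columns by position. One way to repair the alignment (a sketch, not the method from the original post) is to extract per movie node with html_node(), which yields NA for movies without a score:

```r
## one '.lister-item-content' node per movie on the results page
movie_nodes <- html_nodes(webpage, '.lister-item-content')
metascore_data <- sapply(movie_nodes, function(node) {
  ms <- html_node(node, '.metascore')       # first match, or a missing node
  as.numeric(html_text(ms, trim = TRUE))    # NA when the score is absent
})
length(metascore_data)   # now 100, with NA in the gaps
```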

  

Once the data is scraped, you can run some analysis and inference on it, or train machine learning models. Based on this dataset, I made a few visualizations to answer questions like the ones below.
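The plots below assume the scraped vectors have been combined into one data frame. A sketch of assembling it (note that this excerpt never scraped gross earnings, so the Gross_Earning_in_Mil column used in the last plot would require an extra scraping step not shown here):

```r
## combine the scraped vectors from above into one data frame
movies_df <- data.frame(
  Rank = rank_data, Title = title_data,
  Description = description_data, Runtime = runtime_data,
  Genre = genre_data, Rating = rating_data,
  Votes = votes_data, Director = directors_data, Actor = actors_data
)
str(movies_df)
```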

library('ggplot2')
qplot(data = movies_df,Runtime,fill = Genre,bins = 30)

  

ggplot(movies_df,aes(x=Runtime,y=Rating))+
  geom_point(aes(size=Votes,col=Genre))

  

ggplot(movies_df,aes(x=Runtime,y=Gross_Earning_in_Mil))+
  geom_point(aes(size=Rating,col=Genre))

  

图灵社区 (ituring.com.cn)

Beginner’s Guide on Web Scraping in R (using rvest) with example (analyticsvidhya.com)

Original post: https://www.cnblogs.com/adam012019/p/14862610.html