Udacity Data Analyst Nanodegree: P4 Project Key Points and Code Analysis

#P4 project key points

##P4 project overview

##R language key points

##Data analysis case studies

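#P4 project overview

The project uses R and EDA (exploratory data analysis) to explore a single variable or the relationships among several variables, and to examine distributions, outliers, and anomalies in a chosen data set. EDA is the numerical and graphical examination of the characteristics and relationships in the data, carried out before any formal, rigorous statistical analysis is applied.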
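As a minimal sketch of what such an exploration can look like, the snippet below uses R's built-in `mtcars` data set. The data set and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not part of the P4 project itself.

```r
# Illustrative EDA sketch on the built-in mtcars data set (an assumption,
# not the project's data).
data(mtcars)

# One variable: numeric summary and distribution
print(summary(mtcars$mpg))
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Two variables: scatter plot and correlation
plot(mtcars$wt, mtcars$mpg, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)
print(mpg_wt_cor)

# Outliers: flag values more than 1.5 * IQR beyond the quartiles
q <- quantile(mtcars$hp, c(0.25, 0.75))
iqr <- q[2] - q[1]
hp_outliers <- mtcars$hp[mtcars$hp < q[1] - 1.5 * iqr | mtcars$hp > q[2] + 1.5 * iqr]
print(hp_outliers)
```

The summary and histogram cover a single variable, the scatter plot and correlation cover a pair of variables, and the last step lists candidate outliers for closer inspection.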

#R language key points

1.R language overview

R is a powerful, free, and highly extensible open-source programming language for statistical computing. Because R relies on command-line scripting, you can store a series of complex data-analysis steps in a script. This lets you re-use your data-analysis work and makes it easier for others to validate research results and check your work for errors.

The language itself is fairly simple, but it is unconventional.

2.Data processing:

###ggplot2 - Multiple Plots in One graph using gridExtra

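Note the difference: facet_wrap and facet_grid split one data set into separate panels (facets) by the levels of a variable, whereas gridExtra places different, independent plots side by side in a single figure.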
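The contrast can be sketched as follows. This assumes the `ggplot2` and `gridExtra` packages are installed and uses the built-in `mtcars` data set purely for illustration. The variable names are hypothetical.

```r
library(ggplot2)
library(gridExtra)

# Faceting: ONE plot definition, split into panels by the levels of cyl
p_facets <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  facet_wrap(~cyl)

# facet_grid lays the panels out as a rows-by-columns grid (am by cyl)
p_grid <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_grid(am ~ cyl)

# gridExtra: two DIFFERENT plots arranged in one figure
p1 <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
p2 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
combined <- grid.arrange(p1, p2, ncol = 2)
```

Faceting suits comparing the same plot across subgroups. `grid.arrange` is the better fit when the plots show different variables or use different geoms.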

 

###Creating ordered variables (factor variables)

http://statistics.ats.ucla.edu/stat/r/modules/factor_variables.htm
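A small sketch of building an ordered factor in base R. The size labels are made-up example data.

```r
# Made-up example data: clothing sizes as plain character strings
sizes <- c("medium", "small", "large", "small")

# Declare the level order explicitly and mark the factor as ordered
size_f <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)

print(levels(size_f))      # level order as declared
print(table(size_f))       # counts are reported in level order
print(as.integer(size_f))  # underlying integer codes
print(size_f < "large")    # comparisons respect the declared order
```

Without `levels =`, R sorts the levels alphabetically ("large", "medium", "small"), which is rarely the order wanted in plots or comparisons.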

3.Data transformation

log transformation

1.For monetary amounts: incomes, customer value, account or purchase sizes

basic data work

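2.For data that spans several orders of magnitude

3.For data that changes multiplicatively. For example, a 2% price increase scales with the original price, so the absolute change may be 2, 200, or 20000.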
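The multiplicative case can be illustrated with a short sketch. The prices below are hypothetical values picked so that a 2% increase gives the changes 2, 200, and 20000 mentioned above.

```r
# Hypothetical prices spanning several orders of magnitude
price <- c(100, 10000, 1000000)

# On the raw scale, a 2% increase is a very different absolute amount
increase <- price * 0.02
print(increase)

# On the log10 scale, the same 2% increase is the same step for every price
log_change <- log10(price * 1.02) - log10(price)
print(log_change)
```

After the log transformation a fixed percentage change becomes a fixed additive shift, which is why such data is easier to plot and model on a log scale.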

signedlog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))  # values with |x| <= 1 map to 0
}
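A plain log is undefined for zero and negative values, so a signed variant is useful when the data can be negative (for example, account balances). The sketch below restates the function under the name `signedlog10` and applies it to a made-up vector.

```r
# Signed log10: keeps the sign of x and treats any value with |x| <= 1 as 0
signedlog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))
}

# Made-up values covering negative, small, and large magnitudes
balances <- c(-1000, -10, -0.5, 0, 0.5, 10, 1000)
print(signedlog10(balances))
```

Note the trade-off: every value between -1 and 1 collapses to 0, so the transformation only makes sense when differences at that scale do not matter.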

extracting key statistics out of a data set

explore a data set with basic graphics

reshape data to make it easier to analyze

More than 4,400 packages, and more than 18,000 members in the R group on LinkedIn

R's syntax is different from that of many other languages

##Data analysis case studies

Netflix Prize

The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films.

Training data set: <user, movie, date of grade, grade>

RMSE (root mean squared error): a measure of the differences between the values (sample and population values) predicted by a model or an estimator and the values actually observed.

The qualifying set contains only the three fields user, movie, and date of grade. Within it, the quiz set is used to check the prediction algorithm.

The goal is to improve the accuracy of the recommendation algorithm.
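As a rough sketch, RMSE can be computed in R as shown below. The function name `rmse` and the ratings are made up for illustration and are not Netflix Prize data.

```r
# RMSE: square root of the mean of the squared prediction errors
rmse <- function(predicted, observed) {
  sqrt(mean((predicted - observed)^2))
}

# Made-up ratings: model predictions vs. grades actually given
predicted <- c(3.5, 4.0, 2.0, 5.0)
observed  <- c(4, 4, 1, 5)

print(rmse(predicted, observed))
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones, and a lower RMSE means more accurate predictions.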

Foodborne Chicago finds dodgy restaurants with tweets, and R

http://blog.revolutionanalytics.com/2013/08/foodborne-chicago.html 

Original article: https://www.cnblogs.com/kong-xy/p/6366647.html