Introduction to Data Mining - 1

Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]

categorical/qualitative
1) Nominal:
mode
entropy
contingency correlation
χ² test (chi-square test)

2) Ordinal:
median/percentiles/rank correlation/run tests/sign tests
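
As a rough illustration of the statistics listed above for categorical attributes, the sketch below computes the mode and entropy of a nominal attribute and the median of an ordinal one. The sample data and the helper names `mode` and `entropy` are assumptions made for this example only.

```python
import math
from collections import Counter
from statistics import median

def mode(values):
    """Most frequent value; meaningful even for nominal attributes."""
    return Counter(values).most_common(1)[0][0]

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# Hypothetical nominal attribute: eye color (only equality is meaningful).
colors = ["brown", "blue", "brown", "green", "brown", "blue"]
# Hypothetical ordinal attribute: grades coded by rank (order is meaningful).
grades = [1, 3, 2, 2, 4, 3, 2]

color_mode = mode(colors)
color_entropy = entropy(colors)
grade_median = median(grades)
print(color_mode, round(color_entropy, 3), grade_median)
```

The mean is deliberately not computed for either attribute: it only becomes meaningful from the interval level upward.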

numeric/quantitative

3) Interval:
mean/standard deviation/Pearson's correlation/t and F tests
4) Ratio:
geometric mean/harmonic mean/percent variation
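
A minimal sketch of the numeric statistics above, using Python's standard `statistics` module. The data values are made up for illustration: temperatures in Celsius stand in for an interval attribute, and growth factors and speeds stand in for ratio attributes.

```python
from statistics import mean, stdev, geometric_mean, harmonic_mean

# Hypothetical interval attribute: temperature in Celsius.
# Differences are meaningful, ratios are not (0 degrees is arbitrary).
temps_c = [18.0, 21.0, 24.0]
temp_mean = mean(temps_c)
temp_sd = stdev(temps_c)  # sample standard deviation

# Hypothetical ratio attributes: there is a true zero, so ratios make sense.
growth_factors = [2.0, 8.0]  # multiplicative changes
speeds = [40.0, 60.0]        # speeds over two equal distances

growth_gm = geometric_mean(growth_factors)  # typical multiplicative factor
speed_hm = harmonic_mean(speeds)            # average speed over the whole trip

print(temp_mean, temp_sd, growth_gm, speed_hm)
```

The geometric and harmonic means are listed under ratio attributes because both rely on ratios of values, which an interval scale does not support.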


data quality problems:

1) Noise and outliers
2) missing values
why: 1. information not collected; 2. attributes not applicable to all objects
how: 1. eliminate data objects; 2. estimate missing values; 3. ignore missing values during analysis; 4. replace with all possible values (weighted by their probabilities)
3) duplicate data
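
The handling strategies above can be sketched in plain Python as follows. This is only an illustration: the records, the use of `None` as the missing-value marker, the choice of the column mean as the estimate, and all function names are assumptions of this example.

```python
from statistics import mean

# Hypothetical records: (age, income); None marks a missing value.
records = [(25, 3000.0), (32, None), (None, 4200.0), (41, 5100.0)]

def drop_incomplete(rows):
    """Strategy 1: eliminate data objects that have any missing value."""
    return [r for r in rows if None not in r]

def impute_column_mean(rows):
    """Strategy 2: estimate each missing value by the mean of its column."""
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [tuple(m if v is None else v for v, m in zip(r, means)) for r in rows]

def mean_ignoring_missing(rows, index):
    """Strategy 3: ignore missing values while computing a statistic."""
    return mean(r[index] for r in rows if r[index] is not None)

def drop_duplicates(rows):
    """Remove exact duplicate objects, keeping the first occurrence of each."""
    return list(dict.fromkeys(rows))

complete = drop_incomplete(records)
imputed = impute_column_mean(records)
print(complete)
print(imputed)
```

Dropping objects is the simplest option but discards information; imputation keeps every object at the cost of introducing estimated values.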


data preprocessing:
1) aggregation
2) sampling
3) dimensionality reduction
curse of dimensionality: as dimensionality increases, the data become increasingly sparse, and density and distance between points become less meaningful
how: Principal Component Analysis (PCA); Singular Value Decomposition (SVD)
4) feature subset selection

5) feature creation

feature extraction: domain-specific
mapping data to new space: Fourier transform/Wavelet transform
feature construction: combining features

6) discretization and binarization
7) attribute transformation
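
To make the curse of dimensionality noted under dimensionality reduction more concrete, the sketch below measures how the relative gap between the farthest and nearest point shrinks as the number of dimensions grows. The function name `distance_contrast`, the uniform random data, and the parameter defaults are assumptions chosen for this illustration.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """Return (max - min) / min over the distances from one random query
    point to n_points uniform random points in the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast should fall sharply as the dimensionality rises.
for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```

When the contrast approaches zero, the nearest and farthest neighbors are almost equally far away, which is why distance-based and density-based methods degrade in high dimensions.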





Euclidean density = number of points per unit volume
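
A small sketch of this definition in two dimensions, where "volume" is area. It shows two common ways of counting: within a circle around a center point, and per square grid cell. The function names and sample points are hypothetical.

```python
import math

def center_based_density(points, center, radius):
    """Number of points within `radius` of `center`, divided by the
    area of that circle (the 2-D 'volume')."""
    count = sum(1 for p in points if math.dist(p, center) <= radius)
    return count / (math.pi * radius ** 2)

def cell_based_density(points, cell_size):
    """Count points per square grid cell and divide by the cell area."""
    counts = {}
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] = counts.get(cell, 0) + 1
    return {cell: n / cell_size ** 2 for cell, n in counts.items()}

# Hypothetical 2-D points forming two small groups.
points = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (1.5, 1.5), (1.6, 1.4)]

print(center_based_density(points, (0.0, 0.0), 1.0))
print(cell_based_density(points, 1.0))
```

Both variants depend heavily on the chosen radius or cell size, so the resulting density is only comparable between regions measured with the same setting.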

Original post: https://www.cnblogs.com/pxy7896/p/6493064.html