COMP9313 Week8 Classification and PySpark MLlib

https://drive.google.com/drive/folders/13_vsxSIEU9TDg1TCjYEwOidh0x3dU6es

https://www.cse.unsw.edu.au/~cs9313/20T2/slides/L7.pdf

Machine Learning:

  1. Construct a model from training data, then use it to make predictions on new data

  2. 

Evaluation Metrics:

  Positive/Negative: Label ∈ {a,b,c,d}. Choose a as the positive class; all other labels are then negative

  False Positive: not a but classified as a

  False Negative: a but classified as b, c, or d

  True Positive: a and classified as a

  

  Precision = tp / (tp + fp)

  Recall = tp / (tp + fn)

  F1 = 2 * precision * recall / (precision + recall)

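  Micro: treat each class in turn as positive, pool the tp, fp, and fn counts over all classes, then compute F1 from the pooled counts

  Macro: unweighted mean of the F1 scores of each class label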
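
The formulas above can be checked on a small example. The following is a minimal sketch in plain Python, not taken from the lecture: the helper names (`per_class_counts`, `prf`, `macro_f1`, `micro_f1`) and the toy labels are assumptions made for illustration, and a metric whose denominator is zero is reported as 0.

```python
from collections import Counter

def per_class_counts(y_true, y_pred, labels):
    """Count tp, fp, fn for each label, treating that label as the positive class."""
    counts = {c: Counter() for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted p, but the true label was different
            counts[t]["fn"] += 1  # the true label t was missed
    return counts

def prf(tp, fp, fn):
    """Precision, recall and F1; any ratio with a zero denominator is set to 0 (assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(counts):
    """Unweighted mean of the per-class F1 scores."""
    return sum(prf(c["tp"], c["fp"], c["fn"])[2] for c in counts.values()) / len(counts)

def micro_f1(counts):
    """F1 computed from tp, fp, fn pooled over all classes."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return prf(tp, fp, fn)[2]

# Toy data (hypothetical): six examples over the labels a, b, c, d.
y_true = ["a", "a", "b", "c", "d", "a"]
y_pred = ["a", "b", "b", "c", "a", "a"]
counts = per_class_counts(y_true, y_pred, labels=["a", "b", "c", "d"])

print("macro F1:", macro_f1(counts))
print("micro F1:", micro_f1(counts))
```

When every example has exactly one true label and one predicted label, as here, the pooled fp and fn counts are equal, so micro F1 coincides with plain accuracy. Macro F1 weights every class equally, so the rare class d, which is never predicted correctly, pulls it down.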

Classification:

  1. Preprocessing and Feature Engineering 

    1) bag of words 

    2) remove high-frequency words

    

  2.  Train classifier

  3. Evaluate the classifier

    1) split a 'development set' from the training set 

    2) k-fold cross-validation, then take the average accuracy over the k folds
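
The k-fold procedure can be sketched as follows. This is an illustrative outline in plain Python rather than the course's implementation: `k_fold_indices`, `cross_validate`, and the majority-class stand-in classifier `train_majority` are hypothetical names, and the sketch assumes there are at least k examples so that no fold is empty.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_majority(train_xs, train_ys):
    """Hypothetical stand-in classifier: always predicts the most common training label."""
    majority = Counter(train_ys).most_common(1)[0][0]
    return lambda x: majority

def cross_validate(xs, ys, k, train_fn, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the k accuracies."""
    folds = k_fold_indices(len(xs), k, seed)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        predict = train_fn([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(predict(xs[i]) == ys[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

# Toy data (hypothetical): ten documents, eight labelled "a" and two labelled "b".
xs = [f"doc{i}" for i in range(10)]
ys = ["a"] * 8 + ["b"] * 2
folds = k_fold_indices(len(xs), k=5)
avg_acc = cross_validate(xs, ys, k=5, train_fn=train_majority)
print("average accuracy:", avg_acc)
```

Any function that takes training examples and labels and returns a predictor can be passed as `train_fn`, so the same loop can compare different classifiers or feature settings on the same folds.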

      

Text Classification:

  1. Input: document or sentence

  2. Output: class label C ∈ {c1, c2, …}

  3. Classification methods:

    • Naïve Bayes

    • Logistic regression

    • Support-vector machines

    • …

  4. Naïve Bayes

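    1) bag of words -> each document's features become a d-dimensional vector, with label c

    2) choose the class with the maximum a posteriori (MAP) probability

    3) assume the features are conditionally independent given the class, and that word position does not matter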

    4)  
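
The Naïve Bayes steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the lecture's code: it uses a multinomial word-count model, adds add-alpha (Laplace) smoothing, which the notes do not mention, so that unseen words do not produce zero probabilities, and works in log space to avoid underflow. The function names and the toy spam/ham documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(c) and log P(w | c) from tokenised documents."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)  # bag of words: only counts are kept, not positions
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Add-alpha smoothing (assumption, not in the notes).
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab

def predict_nb(model, doc):
    """Return the MAP class: argmax over c of log P(c) + sum of log P(w | c)."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        # Conditional independence lets the per-word terms simply add up;
        # words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in vocab)
    return max(scores, key=scores.get)

# Toy training set (hypothetical).
docs = [
    "cheap pills buy now".split(),
    "buy cheap watches".split(),
    "meeting agenda for monday".split(),
    "project meeting notes".split(),
]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)

print(predict_nb(model, "cheap pills".split()))
print(predict_nb(model, "meeting notes monday".split()))
```

Because the model only sums per-word log probabilities, reordering the words of a document never changes the prediction, which is the position-independence assumption in practice.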

    

  

  

   

PySpark MLlib:

   

Original post: https://www.cnblogs.com/ChevisZhang/p/13344371.html