TF-IDF: The Term Frequency-Inverse Document Frequency Algorithm

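I. Introduction

  1. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining.

  2. TF-IDF is a statistical method for evaluating how important a word is to one document within a document collection or corpus.

  3. A word's importance increases with the number of times it appears in the document, but decreases with how frequently it appears across the corpus.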

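II. Term Frequency

  Term frequency (TF) is the number of times a given word appears in a given document. This count is usually normalized to keep it from favoring long documents (the same word is likely to have a higher raw count in a long document than in a short one, regardless of how important the word is).

  Formula:

    $$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

  $n_{i,j}$ is the number of times the word $t_i$ appears in document $d_j$; the denominator is the total number of occurrences of all words in document $d_j$.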
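
The normalized count can be illustrated on an in-memory word list. This is a minimal sketch in plain Scala collections, not the Spark program shown later; the object name `TfSketch`, the function `termFrequency`, and the sample sentence are assumptions made up for illustration.

```scala
object TfSketch {
  // tf(word) = occurrences of word / total number of words in the document
  def termFrequency(words: Seq[String]): Map[String, Double] = {
    val total = words.size.toDouble
    words.groupBy(identity).map { case (word, occurrences) =>
      word -> occurrences.size / total
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample document, already split into words
    val doc = "spark is fast and spark is general".split(" ").toSeq
    termFrequency(doc).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Because every count is divided by the same document length, the frequencies of one document always sum to 1, which is what makes documents of different lengths comparable.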

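III. Inverse Document Frequency

  Inverse document frequency (IDF) is a measure of a word's general importance. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents that contain the word, then taking the logarithm of the quotient.

  Formula:

    $$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

  $|D|$: the total number of documents in the corpus

  $|\{j : t_i \in d_j\}|$: the number of documents that contain $t_i$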
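
The IDF definition can be sketched the same way on a small in-memory corpus. The names `IdfSketch` and `inverseDocumentFrequency` are hypothetical, and the natural logarithm is an assumption, since the text does not fix a base.

```scala
object IdfSketch {
  // idf(word) = log(|D| / number of documents containing word),
  // for every word that occurs in at least one document.
  // Assumption: natural logarithm (the base is not specified in the text).
  def inverseDocumentFrequency(docs: Seq[Seq[String]]): Map[String, Double] = {
    val totalDocs = docs.size.toDouble
    docs
      .flatMap(_.distinct) // count each word at most once per document
      .groupBy(identity)
      .map { case (word, hits) => word -> math.log(totalDocs / hits.size) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical three-document corpus
    val docs = Seq(
      Seq("spark", "is", "fast"),
      Seq("spark", "is", "general"),
      Seq("spark", "runs", "locally")
    )
    inverseDocumentFrequency(docs).toSeq.sortBy(_._1).foreach(println)
  }
}
```

A word that occurs in every document gets log(1) = 0, so it contributes nothing to the final weight; a word found in a single document gets the largest value.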

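IV. TF-IDF

  Formula: TF-IDF = TF * IDF

  Property: a high term frequency within a particular document, combined with a low document frequency across the whole corpus, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep important ones.

  Idea: if a word or phrase appears with high frequency (TF) in one article and rarely in other articles, it is considered to have good power to discriminate between categories and is well suited to classification.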
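
Putting the two factors together, the filtering effect described above can be seen on a toy corpus. This is a sketch under assumptions: `TfIdfSketch` and its helpers are hypothetical names, the corpus is invented, and the natural logarithm is used for IDF.

```scala
object TfIdfSketch {
  // Term frequency within a single document
  def tf(words: Seq[String]): Map[String, Double] = {
    val total = words.size.toDouble
    words.groupBy(identity).map { case (w, occ) => w -> occ.size / total }
  }

  // Inverse document frequency over the whole corpus (natural log, an assumption)
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val totalDocs = docs.size.toDouble
    docs.flatMap(_.distinct).groupBy(identity)
      .map { case (w, hits) => w -> math.log(totalDocs / hits.size) }
  }

  // TF-IDF of every word in docs(index), using all of docs as the corpus
  def tfIdf(docs: Seq[Seq[String]], index: Int): Map[String, Double] = {
    val idfs = idf(docs)
    tf(docs(index)).map { case (w, f) => w -> f * idfs(w) }
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      Seq("spark", "is", "fast"),
      Seq("spark", "is", "general"),
      Seq("spark", "runs", "locally")
    )
    // Print the words of the first document, highest weight first
    tfIdf(docs, 0).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

In this corpus "spark" occurs in all three documents and ends up with weight 0, while "fast", which occurs only in the first document, receives the highest weight there.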

V. Code Implementation

package big.data.analyse.tfidf

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

/**
  * Created by zhen on 2019/05/28.
  */
object TF_IDF {
  /**
    * Set the log level
    */
  Logger.getLogger("org").setLevel(Level.WARN)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("TF_IDF")
      .master("local[2]")
      .config("spark.sql.warehouse.dir", "file:///D://warehouse")
      .getOrCreate()
    val sc = spark.sparkContext

    /**
      * Compute TF for the target document (TF.txt)
      */
    val tf = sc.textFile("src/big/data/analyse/tfidf/TF.txt")
      .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " ")) // clean: commas and periods become spaces
      .flatMap(row => row.split(" ")) // split into words
      .map(row => (row, 1.0))
      .reduceByKey(_ + _) // number of occurrences of each word

    val tfSize = tf.map(row => row._2).sum() // total number of words in the document

    val tfed = tf.map(row => (row._1, row._2 / tfSize)) // normalized term frequency
    println("TF:")
    tfed.foreach(println)

    /**
      * Compute IDF: each of the three files contributes 1.0 per distinct word
      */
    val idf_0 = tf.map(row => (row._1, 1.0)) // the target document is part of the corpus
    println("Loading IDF1 file data...")
    val idf_1 = sc.textFile("src/big/data/analyse/tfidf/IDF1.txt")
      .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " "))
      .flatMap(row => row.split(" "))
      .map(row => (row, 1.0))
      .reduceByKey(_ + _)
      .map(row => (row._1, 1.0)) // keep one entry per distinct word

    println("Loading IDF2 file data...")
    val idf_2 = sc.textFile("src/big/data/analyse/tfidf/IDF2.txt")
      .map(row => row.replace(",", " ").replace(".", " ").replace("  ", " "))
      .flatMap(row => row.split(" "))
      .map(row => (row, 1.0))
      .reduceByKey(_ + _)
      .map(row => (row._1, 1.0))

    /**
      * Merge the corpus data
      */
    val idf = idf_0.union(idf_1).union(idf_2)
      .reduceByKey(_ + _) // number of files that contain each word
      // 3 is the number of files in the corpus. Note: this is the raw ratio
      // |D| / df, without the logarithm from the IDF formula given earlier;
      // math.log(3 / row._2) would follow that formula.
      .map(row => (row._1, 3 / row._2))
    println("IDF:")
    idf.foreach(println)

    /**
      * Join TF with IDF and compute TF-IDF
      */
    println("TF-IDF:")
    tfed.join(idf).map(row => (row._1, "%.4f".format(row._2._1 * row._2._2)))
      .foreach(println)

    spark.stop()
  }
}
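
The cleaning step in the program replaces only commas and periods and then splits on single spaces, so consecutive spaces or blank lines can produce empty-string tokens (an entry with an empty word appears in the IDF output). One possible alternative is to split on runs of whitespace and drop empty tokens. The sketch below is an assumption about how that could look, not part of the original program; `TokenizeSketch` and `tokenize` are hypothetical names.

```scala
object TokenizeSketch {
  // Replace commas and periods with spaces (as the program above does),
  // then split on runs of whitespace and drop empty tokens.
  // Case is left untouched, so "Spark" and "spark" stay distinct.
  def tokenize(line: String): Seq[String] =
    line.replace(",", " ").replace(".", " ")
      .split("\\s+")
      .filter(_.nonEmpty)
      .toSeq

  def main(args: Array[String]): Unit = {
    tokenize("Spark is fast,  and general.").foreach(println)
  }
}
```

In the Spark program this function could be passed to `flatMap` in place of the `replace` and `split(" ")` steps. Doing so would change the word counts, so the output would no longer match the results listed below.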

VI. Results

TF:
(GraphX,0.011494252873563218)
(are,0.011494252873563218)
(learning,0.011494252873563218)
(Python,0.011494252873563218)
(provides,0.011494252873563218)
(is,0.022988505747126436)
(Please,0.011494252873563218)
(higher-level,0.011494252873563218)
(general,0.011494252873563218)
(Security,0.034482758620689655)
(R,0.011494252873563218)
(fast,0.011494252873563218)
(SQL,0.022988505747126436)
(Apache,0.011494252873563218)
(Java,0.011494252873563218)
(data,0.011494252873563218)
(attack,0.011494252873563218)
(This,0.011494252873563218)
(cluster,0.011494252873563218)
(graph,0.011494252873563218)
(execution,0.011494252873563218)
(MLlib,0.011494252873563218)
(Scala,0.011494252873563218)
(computing,0.011494252873563218)
(downloading,0.011494252873563218)
(Streaming,0.011494252873563218)
(supports,0.022988505747126436)
(engine,0.011494252873563218)
(set,0.011494252873563218)
(running,0.011494252873563218)
(Spark,0.08045977011494253)
(you,0.011494252873563218)
(Overview,0.011494252873563218)
(general-purpose,0.011494252873563218)
(rich,0.011494252873563218)
(APIs,0.011494252873563218)
(vulnerable,0.011494252873563218)
(that,0.011494252873563218)
(a,0.022988505747126436)
(high-level,0.011494252873563218)
(processing,0.022988505747126436)
(OFF,0.011494252873563218)
(before,0.011494252873563218)
(including,0.011494252873563218)
(could,0.011494252873563218)
(optimized,0.011494252873563218)
(in,0.022988505747126436)
(to,0.011494252873563218)
(see,0.011494252873563218)
(graphs,0.011494252873563218)
(of,0.011494252873563218)
(also,0.011494252873563218)
(by,0.022988505747126436)
(structured,0.011494252873563218)
(tools,0.011494252873563218)
(It,0.022988505747126436)
(for,0.034482758620689655)
(mean,0.011494252873563218)
(an,0.011494252873563218)
(machine,0.011494252873563218)
(and,0.06896551724137931)
(system,0.011494252873563218)
(default,0.022988505747126436)
Loading IDF1 file data...
Loading IDF2 file data...
IDF:
(running,1.5)
(For,3.0)
(visit,3.0)
(The,3.0)
(you,1.0)
(website,1.5)
(than,3.0)
(7,3.0)
(PATH,3.0)
(that,1.0)
(was,1.5)
(a,1.0)
(main,3.0)
(old,3.0)
(high-level,1.5)
(be,1.5)
(quick,3.0)
(processing,1.5)
(could,1.5)
(all,3.0)
(augmenting,3.0)
(optimized,1.5)
(Downloads,3.0)
(follow,3.0)
(applications,3.0)
(classpath,3.0)
(structured,1.5)
(like,1.5)
(along,3.0)
(support,3.0)
(Spark’s,1.5)
(If,3.0)
(but,3.0)
(and,1.0)
(reference,3.0)
(1,3.0)
(g,3.0)
(system,1.5)
(your,3.0)
(10,3.0)
(It’s,3.0)
(are,1.0)
(learning,1.5)
(download,1.5)
(its,3.0)
(After,3.0)
(Building,3.0)
(can,1.5)
(Security,1.5)
(have,3.0)
(runs,3.0)
(6,3.0)
(build,3.0)
(0,1.5)
(SQL,1.0)
(with,1.5)
(locally,3.0)
(projects,3.0)
(their,3.0)
(Get,3.0)
(UNIX-like,3.0)
(This,1.0)
(,1.5)
(first,3.0)
(documentation,3.0)
(Since,3.0)
(still,3.0)
(Downloading,3.0)
(packaged,3.0)
(better,3.0)
(However,3.0)
(switch,3.0)
(hood,3.0)
(Linux,3.0)
(Streaming,1.5)
(supports,1.5)
(PyPI,3.0)
((2,3.0)
(vulnerable,1.5)
(RDD,3.0)
(Dataset,3.0)
(package,3.0)
(this,3.0)
(under,3.0)
(Python,1.0)
(provides,1.0)
(API,1.5)
(higher-level,1.5)
(introduction,3.0)
(Apache,1.5)
(will,1.5)
(Java,1.0)
(2,1.5)
(data,1.5)
(as,3.0)
(YARN,3.0)
(installed,3.0)
(pointing,3.0)
(optimizations,3.0)
(get,3.0)
(cluster,1.5)
(tutorial,3.0)
(graph,1.5)
(easy,3.0)
(execution,1.5)
(MLlib,1.5)
(We,3.0)
(you’d,3.0)
(supported,3.0)
(downloading,1.5)
(shell,3.0)
(handful,3.0)
(1+,3.0)
(Users,3.0)
(engine,1.5)
(version,1.5)
(11,3.0)
(set,1.5)
(performance,3.0)
(rich,1.5)
(systems,3.0)
(replaced,3.0)
(Spark,1.0)
(project,3.0)
(Overview,1.5)
(APIs,1.5)
(Mac,3.0)
(or,1.5)
(popular,3.0)
(Support,3.0)
(richer,3.0)
(downloads,3.0)
(OFF,1.5)
(future,3.0)
(detailed,3.0)
(GraphX,1.5)
(removed,3.0)
(4,3.0)
(installation,3.0)
(Please,1.5)
(is,1.0)
(guide,3.0)
(recommend,3.0)
(R,1.5)
(general,1.5)
(JAVA_HOME,3.0)
(fast,1.5)
(include,3.0)
(need,3.0)
(one,3.0)
(attack,1.5)
(how,3.0)
(uses,3.0)
(compatible,3.0)
(information,3.0)
(we,3.0)
(interactive,3.0)
(—,3.0)
(using,1.5)
(Note,1.5)
(7+/3,3.0)
(java,3.0)
(pre-packaged,3.0)
(Scala,1.0)
(any,1.5)
(computing,1.5)
(variable,3.0)
(users,3.0)
(from,1.5)
(has,3.0)
(won’t,3.0)
(through,3.0)
(at,3.0)
(more,3.0)
(3,3.0)
(versions,3.0)
(of,1.0)
(tools,1.5)
(8+,3.0)
(by,1.0)
(mean,1.5)
(RDDs,3.0)
((e,3.0)
(It,1.5)
(for,1.0)
(To,3.0)
(were,3.0)
(both,3.0)
(an,1.0)
(12,3.0)
(which,3.0)
(machine,1.5)
(libraries,3.0)
(introduce,3.0)
(environment,3.0)
((in,3.0)
(programming,3.0)
(See,3.0)
(use,1.5)
(default,1.5)
(the,1.5)
(write,3.0)
(highly,3.0)
(release,3.0)
(Resilient,3.0)
(interface,3.0)
(strongly-typed,3.0)
(about,3.0)
(run,3.0)
(general-purpose,1.5)
(5,3.0)
(Distributed,3.0)
(on,3.0)
(You,3.0)
(source,3.0)
(Scala),3.0)
(show,3.0)
(then,3.0)
(before,1.0)
(including,1.5)
(to,1.0)
(in,1.0)
(client,3.0)
(see,1.5)
(HDFS,1.5)
(graphs,1.5)
(Hadoop’s,3.0)
(also,1.5)
(“Hadoop,3.0)
(binary,3.0)
(x),3.0)
(free”,3.0)
(Maven,3.0)
(coordinates,3.0)
(Windows,3.0)
(deprecated,3.0)
(install,3.0)
((RDD),3.0)
(4+,3.0)
(page,3.0)
(OS),3.0)
(Hadoop,1.5)
TF-IDF:
(you,0.0115)
(that,0.0115)
(a,0.0230)
(high-level,0.0172)
(processing,0.0345)
(could,0.0172)
(optimized,0.0172)
(structured,0.0172)
(and,0.0690)
(system,0.0172)
(are,0.0115)
(learning,0.0172)
(Security,0.0517)
(SQL,0.0230)
(This,0.0115)
(Streaming,0.0172)
(supports,0.0345)
(vulnerable,0.0172)
(Spark,0.0805)
(Overview,0.0172)
(APIs,0.0172)
(OFF,0.0172)
(of,0.0115)
(tools,0.0172)
(by,0.0230)
(mean,0.0172)
(It,0.0345)
(for,0.0345)
(an,0.0115)
(machine,0.0172)
(default,0.0345)
(Python,0.0115)
(provides,0.0115)
(higher-level,0.0172)
(Apache,0.0172)
(GraphX,0.0172)
(Please,0.0172)
(is,0.0230)
(R,0.0172)
(general,0.0172)
(fast,0.0172)
(attack,0.0172)
(Java,0.0115)
(Scala,0.0115)
(computing,0.0172)
(data,0.0172)
(cluster,0.0172)
(graph,0.0172)
(execution,0.0172)
(MLlib,0.0172)
(downloading,0.0172)
(engine,0.0172)
(set,0.0172)
(rich,0.0172)
(general-purpose,0.0172)
(before,0.0115)
(including,0.0172)
(to,0.0115)
(in,0.0230)
(see,0.0172)
(graphs,0.0172)
(also,0.0172)

Process finished with exit code 0
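
As a quick check, the printed values can be multiplied by hand. The numbers below are taken directly from the TF and IDF listings above. An IDF of 1.5 corresponds to the raw ratio 3/2 (the word occurs in two of the three files), with no logarithm applied.

```latex
\begin{aligned}
\text{TF-IDF}(\text{Security}) &= 0.034482758620689655 \times 1.5 \approx 0.0517 \\
\text{TF-IDF}(\text{Spark})    &= 0.08045977011494253 \times 1.0 \approx 0.0805
\end{aligned}
```

Both results agree with the TF-IDF listing. "Spark" appears in all three files, so its ratio is 3/3 = 1.0 and its weight equals its term frequency. If the logarithm were applied, that factor would be log 1 = 0 and the weight would be zero.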
Original post: https://www.cnblogs.com/yszd/p/10939583.html