Computing Term Frequency-Inverse Document Frequency (TF-IDF) for Text Documents

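Preface: I am writing this post mainly to deepen my own understanding of text documents and to summarize some experience and lessons learned while implementing the algorithm. If you find any errors, please point them out; I would be very grateful.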

Computing TF-IDF:

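TF-IDF reflects how important a word is to a document within a collection of documents, and it is often used as a weighting factor in text mining and information retrieval. Within a given document, term frequency (TF) is the frequency with which a given term appears in that document. Inverse document frequency (IDF) measures how generally important a term is. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing that term, and then taking the logarithm of the quotient.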
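
Written as formulas, the definitions above look roughly like this. The symbols are notation introduced here for illustration: $f_{t,d}$ is the number of occurrences of term $t$ in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents containing $t$. The natural logarithm is assumed, matching the `Math.log` call in the code in this post.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

For example, a term that appears 3 times in a 100-word document has $\mathrm{tf} = 0.03$; if it occurs in 10 of 1000 documents, $\mathrm{idf} = \ln 100 \approx 4.605$, so its TF-IDF weight is about $0.138$.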

Related code:

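import java.util.List;
import java.util.regex.Pattern;

    // Delimiters between terms; '+' makes a run of delimiters count as a single separator
    private static Pattern r = Pattern.compile("[ \\t{}()\",:;.\\n]+");
    private static List<String> documentCollection;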

    //Calculates TF-IDF weight for each term t in document d
    private static float findTFIDF(String document, String term)
    {
        float tf = findTermFrequency(document, term);
        float idf = findInverseDocumentFrequency(term);
        return tf * idf;
    }

    // Term frequency: occurrences of the term divided by the number of tokens in the document
    private static float findTermFrequency(String document, String term)
    {
        int count = getFrequencyInOneDoc(document, term);

        return (float) count / (float) r.split(document).length;
    }

    // Counts case-insensitive occurrences of the term in one document
    private static int getFrequencyInOneDoc(String document, String term)
    {
        int count = 0;
        for (String s : r.split(document))
        {
            if (s.equalsIgnoreCase(term)) {
                count++;
            }
        }
        return count;
    }


    private static float findInverseDocumentFrequency(String term)
    {
        // Count the documents in the whole collection that contain the term at least once
        int count = 0;
        for (String doc : documentCollection)
        {
            if (getFrequencyInOneDoc(doc, term) > 0) {
                count++;
            }
        }
        /*
         * IDF is the log of the ratio of the total number of documents in the collection
         * to the number of documents containing the term.
         * Adding 1 to the denominator, Math.log(documentCollection.size() / (1 + count)),
         * is another way to deal with the divide-by-zero case.
         */
        if (count == 0) {
            return 0; // the term appears in no document, so avoid dividing by zero
        }
        return (float) Math.log((float) documentCollection.size() / (float) count);
    }
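
To see the numbers these formulas produce, here is a minimal self-contained sketch on a tiny made-up corpus. The class `TfIdfDemo`, its method names, and the sample sentences are hypothetical and only for illustration. It assumes IDF counts each document at most once, as in the definition given earlier, and returns 0 for a term that appears in no document.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative and not part of the original code.
final class TfIdfDemo {
    // Same delimiter characters as in the post; '+' merges runs of delimiters.
    private static final Pattern DELIMS = Pattern.compile("[ \\t{}()\",:;.\\n]+");

    // Splits a document into non-empty tokens.
    static String[] tokenize(String document) {
        return Arrays.stream(DELIMS.split(document))
                .filter(s -> !s.isEmpty())
                .toArray(String[]::new);
    }

    // Number of case-insensitive occurrences of the term in the document.
    static int countOccurrences(String document, String term) {
        int count = 0;
        for (String token : tokenize(document)) {
            if (token.equalsIgnoreCase(term)) count++;
        }
        return count;
    }

    // tf = occurrences of the term / total tokens in the document.
    static double tf(String document, String term) {
        int total = tokenize(document).length;
        return total == 0 ? 0.0 : (double) countOccurrences(document, term) / total;
    }

    // idf = ln(total documents / documents containing the term); 0 if none contain it.
    static double idf(List<String> docs, String term) {
        int containing = 0;
        for (String doc : docs) {
            if (countOccurrences(doc, term) > 0) containing++;
        }
        return containing == 0 ? 0.0 : Math.log((double) docs.size() / containing);
    }

    static double tfIdf(List<String> docs, String document, String term) {
        return tf(document, term) * idf(docs, term);
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the cat sat on the mat",
                "the dog barked",
                "a cat and a dog");
        String first = docs.get(0);
        System.out.println(tfIdf(docs, first, "the")); // tf = 2/6, idf = ln(3/2)
        System.out.println(tfIdf(docs, first, "mat")); // tf = 1/6, idf = ln(3)
    }
}
```

In this corpus "the" occurs more often in the first document than "mat" does, yet "mat" gets the higher weight (about 0.183 versus about 0.135) because it appears in only one of the three documents.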

Build a Vector Space Model of the documents and compute cosine similarity.

Related code:

public static float findCosineSimilarity(float[] vecA, float[] vecB)
{
    float dotProduct = dotProduct(vecA, vecB);
    float magnitudeOfA = magnitude(vecA);
    float magnitudeOfB = magnitude(vecB);
    float result = dotProduct / (magnitudeOfA * magnitudeOfB);
    // 0 divided by 0 yields NaN (for example, with a zero-magnitude vector), so return 0 in that case.
    if (Float.isNaN(result))
        return 0;
    else
        return result;
}

public static float dotProduct(float[] vecA, float[] vecB)
{

    float dotProduct = 0;
    for (int i = 0; i < vecA.length; i++)
    {
        dotProduct += (vecA[i] * vecB[i]);
    }

    return dotProduct;
}

// Magnitude of the vector is the square root of the dot product of the vector with itself.
public static float magnitude(float[] vector)
{
    return (float)Math.sqrt(dotProduct(vector, vector));
}
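
The functions above take ready-made vectors, so it may help to see one way the vectors themselves could be built. The sketch below is only an illustration under stated assumptions: the class `VectorSpaceDemo` and its methods are hypothetical, tokens are split on whitespace for brevity, and each vector has one TF-IDF weight per distinct term in the collection.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; class and method names are illustrative, not from the original code.
final class VectorSpaceDemo {
    // Lower-cased tokens split on whitespace (a simplification of the delimiter pattern).
    static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        for (String s : document.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!s.isEmpty()) tokens.add(s);
        }
        return tokens;
    }

    // The vocabulary fixes one vector dimension per distinct term, in first-seen order.
    static List<String> vocabulary(List<String> docs) {
        Set<String> vocab = new LinkedHashSet<>();
        for (String doc : docs) vocab.addAll(tokenize(doc));
        return new ArrayList<>(vocab);
    }

    // Builds the TF-IDF vector of one document over the shared vocabulary.
    static float[] toVector(String document, List<String> docs, List<String> vocab) {
        List<String> tokens = tokenize(document);
        float[] vector = new float[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            String term = vocab.get(i);
            int inDoc = Collections.frequency(tokens, term);
            int containing = 0;
            for (String d : docs) {
                if (tokenize(d).contains(term)) containing++;
            }
            double tf = tokens.isEmpty() ? 0.0 : (double) inDoc / tokens.size();
            double idf = containing == 0 ? 0.0 : Math.log((double) docs.size() / containing);
            vector[i] = (float) (tf * idf);
        }
        return vector;
    }

    // Cosine similarity; returns 0 when either vector has zero magnitude.
    static float cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        return denom == 0 ? 0f : (float) (dot / denom);
    }
}
```

Because every vector is built over the same vocabulary, all vectors have the same length and position i always refers to the same term, which is what the dot product needs. Documents that share no terms get a similarity of 0.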

Notes:

Stop-word filtering (stop-words filter)

Stop-word list:

ftp://ftp.cs.cornell.edu/pub/smart/english.stop
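
As a rough illustration of stop-word filtering, the sketch below drops tokens found in a stop-word set before any frequencies are counted. The class `StopWordFilterDemo` and the handful of words in `STOP_WORDS` are made up for this example; in practice the set would be loaded from a full list such as the one linked above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; names and the tiny word list are illustrative only.
final class StopWordFilterDemo {
    // A tiny stand-in for a real stop-word list, which would be much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "on", "in", "is"));

    // Returns the tokens that are not stop words, comparing case-insensitively.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "cat", "sat", "on", "the", "mat");
        System.out.println(removeStopWords(tokens, STOP_WORDS)); // prints [cat, sat, mat]
    }
}
```

Filtering before counting keeps very common words out of both the term counts and the document length used in the TF denominator.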

For more on TF-IDF, see:

Link: http://en.wikipedia.org/wiki/Tf*idf
