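K-means Clustering: Building the Document Vector Space Model (VSM)

Author: finallyliuyu. Please credit the source when reposting or reusing this post.

The previous post, "Kmeans聚类之特征词选择DF" (K-means clustering: feature word selection by DF), gave the code for feature word selection. This post gives the code that builds the document vector model and writes it out in the Weka data format. For details on the Weka data format, see the tutorial.

First, the code that writes the ARFF header: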

void Preprocess::WriteHeadArff()
{
	ofstream ofile(arffFileAddress,ios::binary);
	ofile<<"@relation article"<<endl;
	ofile<<"\n";
	vector<string> myKeys=GetFinalKeyWords();
	for(vector<string>::iterator it=myKeys.begin();it!=myKeys.end();it++)
	{
		// One numeric attribute per feature word; the quotes keep the name a single token.
		string temp="@attribute '";
		temp+=*it;
		temp+="' real";

		ofile<<temp<<endl;
	}
	ofile<<"\n"<<endl;
	ofile<<"@data"<<endl;
	ofile.close();
}
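
To make the header layout easier to see in isolation, here is a minimal sketch that builds the same text in memory instead of writing it to a file. `BuildArffHeader` and its parameters are hypothetical names introduced for illustration; they are not part of the `Preprocess` class.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of Preprocess): returns the ARFF header for
// a list of feature words, declaring one numeric attribute per word.
std::string BuildArffHeader(const std::string& relation,
                            const std::vector<std::string>& keywords)
{
    std::ostringstream out;
    out << "@relation " << relation << "\n\n";
    for (std::size_t i = 0; i < keywords.size(); ++i)
    {
        // The attribute name is quoted, as in WriteHeadArff above.
        out << "@attribute '" << keywords[i] << "' real\n";
    }
    // The data rows are appended after this marker.
    out << "\n@data\n";
    return out.str();
}
```

Called with the relation name `article` and the words `weka` and `kmeans`, this returns a header with two `@attribute` lines followed by `@data`. Note that the sketch does not escape quote characters inside a feature word.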

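The main topic of this post is building the document vector model with TF-IDF weights.

Before giving the code, here is a brief explanation of TF and DF.

For a particular term t:

TF is the number of times t occurs in a given article;

DF is the number of articles in the whole document collection that contain t.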
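
As an illustration of these two definitions, the sketch below counts TF and DF for a toy tokenized corpus. It stores the result in the same `map<string, vector<pair<int,int> > >` shape that the functions later in this post receive as `mymap`; reading each pair as (article id, TF) is my interpretation of how that code uses the map. `Postings` and `BuildPostings` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// term -> list of (article id, TF of the term in that article)
typedef std::map<std::string, std::vector<std::pair<int, int> > > Postings;

// Hypothetical helper: docs[i] holds the tokens of article i.
Postings BuildPostings(const std::vector<std::vector<std::string> >& docs)
{
    Postings postings;
    for (std::size_t id = 0; id < docs.size(); ++id)
    {
        // TF: count how often each term occurs in this article.
        std::map<std::string, int> tf;
        for (std::size_t k = 0; k < docs[id].size(); ++k)
        {
            ++tf[docs[id][k]];
        }
        // Append one (article id, TF) entry per distinct term.
        for (std::map<std::string, int>::const_iterator it = tf.begin();
             it != tf.end(); ++it)
        {
            postings[it->first].push_back(
                std::make_pair(static_cast<int>(id), it->second));
        }
    }
    return postings;
}
```

With this layout the DF of a term is simply `postings[term].size()`, because each article that contains the term contributes exactly one entry.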

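Vector Space Model (VSM): each article is represented as a vector. The attributes (dimensions) of the vector are the feature words chosen with the selection method from "Kmeans聚类之特征词选择DF".

The VSM of the whole document collection is stored as a matrix. Each row of the matrix represents one article, that is, one document vector.

TF-IDF has many weighting schemes. (Note: the screenshots below come from the course slides for "Modern Information Retrieval" by Wang Bin of the Institute of Computing Technology.) I would also like to recommend an excellent book, "Introduction to Information Retrieval" (《信息检索导论》). The first author of the original edition is Christopher D. Manning, associate professor of computer science and linguistics at Stanford University. Wang Bin translated the book into Chinese, and the translation has been published.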

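TF weighting schemes: (screenshot from the slides)

IDF weighting schemes: (screenshot from the slides)

Normalization schemes: (screenshot from the slides)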

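An article may contain none of the feature words chosen by DF selection, in which case its VSM would be {0,0,0,...,0}.

To avoid producing such all-zero vectors, I use the TF-IDF scheme

a-l-c.

Look these three letters up in the three tables above to find the corresponding formulas.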
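
For readers without the tables at hand, the weighting can be read off the expression in `VSMFormation` later in this post. The notation is mine: $N$ is the number of articles (`corpus_N`), $\mathrm{tf}_{t,d}$ is the TF of term $t$ in article $d$, $\mathrm{df}_t$ is its DF, and $\mathrm{maxTF}_t$ is the largest TF of term $t$ in any single article, as computed by `GetfinalKeysMaxTFDF`. The logarithm is the natural one, since the code calls `log`.

```latex
\begin{align*}
w_{t,d} &= \left(0.5 + 0.5\,\frac{\mathrm{tf}_{t,d}}{\mathrm{maxTF}_t}\right)\cdot\ln\frac{N}{\mathrm{df}_t} \\
\hat{w}_{t,d} &= \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
\end{align*}
```

The first line combines the augmented TF with the logarithmic IDF; the second is cosine normalization over all feature words of the same article.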

Now we start building the document vector model.

First, get the maxTF and DF of each feature word:

vector<pair<int,int> >Preprocess::GetfinalKeysMaxTFDF(map<string,vector<pair<int,int> > > &mymap)
{
	vector<pair<int,int> >maxTFandDF;
	vector<string>myKeys=GetFinalKeyWords();
	for(vector<string>::iterator it=myKeys.begin();it!=myKeys.end();it++)
	{  
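		// One entry per article containing the word; .second is the word's TF there.
		vector<pair<int,int> > &postings=mymap[*it];
		int DF=postings.size();  // number of articles containing the word
		int maxTF=0;             // largest TF of the word in any single article
		for(vector<pair<int,int> >::iterator subit=postings.begin();subit!=postings.end();subit++)
		{
			if(subit->second>maxTF)
			{
				maxTF=subit->second;
			}
		}
		maxTFandDF.push_back(make_pair(maxTF,DF));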
	}
	return maxTFandDF;
}

/************************************************************************/
/* Normalize a document vector (cosine normalization)                   */
/************************************************************************/
vector<pair<int,double> >Preprocess::NormalizationVSM(vector<pair<int,double> > tempVSM)
{
	// Sum of the squared weights.
	double sum=0;
	for(vector<pair<int,double> >::iterator vsmit=tempVSM.begin();vsmit!=tempVSM.end();++vsmit)
	{
		sum+=pow(vsmit->second,2);
	}
	// Compute the Euclidean length once; leave the vector unchanged if it is zero.
	double norm=sqrt(sum);
	if(norm==0)
	{
		return tempVSM;
	}
	for(vector<pair<int,double> >::iterator vsmit=tempVSM.begin();vsmit!=tempVSM.end();++vsmit)
	{
		vsmit->second/=norm;
	}
	return tempVSM;
}
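
The normalization step can be tried outside the class with a small free function. This is only a sketch: `SparseVector` and `NormalizeL2` are hypothetical names, and each pair is read as (attribute index, weight), as in the code above.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// (attribute index, weight) pairs for the non-zero entries of one article.
typedef std::vector<std::pair<int, double> > SparseVector;

// Hypothetical free-function version of the normalization step:
// divides every weight by the Euclidean length of the vector.
SparseVector NormalizeL2(SparseVector v)
{
    double sumSquares = 0.0;
    for (std::size_t k = 0; k < v.size(); ++k)
    {
        sumSquares += v[k].second * v[k].second;
    }
    const double norm = std::sqrt(sumSquares);
    if (norm == 0.0)
    {
        return v;  // nothing to scale; also avoids dividing by zero
    }
    for (std::size_t k = 0; k < v.size(); ++k)
    {
        v[k].second /= norm;
    }
    return v;
}
```

For a vector with weights 3 and 4 the length is 5, so the normalized weights are 0.6 and 0.8 and the attribute indices are left untouched.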

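With the helper functions above, we can build the document vector model for the collection and write the data section of the arff file at the same time.

Two more helper functions are needed first. They convert an integer and a floating-point number to strings.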

/************************************************************************/
/* Convert an integer to a string                                       */
/************************************************************************/

string Preprocess::do_fraction(int val)
{
	ostringstream out;
	out<<val;
	// The stream content contains no embedded NUL characters, so it can be
	// returned directly. (The original swap with a temporary string does not
	// compile on standard-conforming compilers.)
	return out.str();
}

/************************************************************************/
/* Convert a floating-point number to a string with a given precision   */
/************************************************************************/
string Preprocess::do_fraction(double val,int decplaces)
{
	// Fixed-point notation with decplaces digits after the decimal point
	// (needs <iomanip>). Unlike cutting the default output, this rounds the
	// value and never switches to scientific notation for small weights.
	ostringstream out;
	out<<fixed<<setprecision(decplaces)<<val;
	return out.str();
}

The function that turns the VSM of one article into a string:

/************************************************************************/
/* Convert a single document vector to a string                         */
/************************************************************************/
string Preprocess::FormatVSMtoString(vector<pair<int,double> > tempVSM)
{
	string ret="{";
	size_t written=0;  // number of entries written so far
	for(vector<pair<int,double> >::iterator vsmit=tempVSM.begin();vsmit!=tempVSM.end();++vsmit)
	{
		// Sparse row: "index value" pairs separated by commas.
		ret+=do_fraction(vsmit->first)+" "+do_fraction(vsmit->second,8);
		if(written+1<tempVSM.size())
		{
			ret+=",";
		}
		written++;
	}
	ret+="}";
	return ret;
}
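
The row layout is easier to check with a standalone version. The sketch below formats one sparse row the same way, as comma-separated "index value" pairs inside braces. `FormatSparseRow` and its `decimals` parameter are hypothetical names, not part of the `Preprocess` class.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: formats the non-zero entries of one article as a
// sparse row, e.g. "{0 0.60,2 0.80}".
std::string FormatSparseRow(const std::vector<std::pair<int, double> >& v,
                            int decimals)
{
    std::ostringstream out;
    // Fixed-point output with a constant number of decimals for the weights;
    // the integer indices are not affected by these flags.
    out << std::fixed << std::setprecision(decimals) << "{";
    for (std::size_t k = 0; k < v.size(); ++k)
    {
        if (k > 0)
        {
            out << ",";  // separator goes between entries, not after the last
        }
        out << v[k].first << " " << v[k].second;
    }
    out << "}";
    return out.str();
}
```

For the entries (0, 0.6) and (2, 0.8) with two decimals, the result is `{0 0.60,2 0.80}`. The indices are 0-based positions in the attribute list written by the header code, so they have to follow the same order as `GetFinalKeyWords()`.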

The function below calls FormatVSMtoString to fill in the data section of the arff file.

/************************************************************************/
/* Write the experimental data in arff @data format                     */
/************************************************************************/
void Preprocess::VSMFormation(map<string,vector<pair<int,int> > > &mymap)
{
	int corpus_N=endIndex-beginIndex+1;
	ofstream ofile1(articleIdsAddress,ios::binary);// file that stores the article ids
	ofstream ofile2(arffFileAddress,ios::binary|ios::app);

	vector<string> myKeys=GetFinalKeyWords();
	vector<pair<int,int> >maxTFandDF=GetfinalKeysMaxTFDF(mymap);
	for(int i=beginIndex;i<=endIndex;i++)
	{
		vector<pair<int,double> >tempVSM;
		for(vector<string>::size_type j=0;j<myKeys.size();j++)
		{
			// Find the entry for article i and read the word's TF from it
			// (assuming PredTFclass(i) matches the entry whose article id is i).
			// count_if would only tell whether such an entry exists (0 or 1).
			vector<pair<int,int> > &postings=mymap[myKeys[j]];
			vector<pair<int,int> >::iterator findit=find_if(postings.begin(),postings.end(),PredTFclass(i));
			double TF=(findit!=postings.end())?findit->second:0;

			// Augmented TF multiplied by logarithmic IDF.
			double weight=0.5+0.5*TF/maxTFandDF[j].first;
			weight*=log((double)corpus_N/maxTFandDF[j].second);
			if(weight!=0)
			{
				tempVSM.push_back(make_pair((int)j,weight));
			}
		}
		if(!tempVSM.empty())
		{
			// Only articles with at least one non-zero weight are written,
			// and their ids are recorded in the same order as the data rows.
			tempVSM=NormalizationVSM(tempVSM);
			string vsmStr=FormatVSMtoString(tempVSM);
			ofile1<<i<<endl;
			ofile2<<vsmStr<<endl;
		}
	}
	ofile1.close();
	ofile2.close();
}
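
The weight of a single (word, article) pair can also be computed in isolation, which makes the behaviour of the formula easy to check. This is a sketch under my own naming: `AugmentedTfIdf` and its parameters are hypothetical, and the guard for non-positive inputs is an addition that the code above does not have.

```cpp
#include <cmath>

// Hypothetical helper mirroring the weight expression in VSMFormation:
// (0.5 + 0.5 * tf / maxTf) * ln(corpusSize / df).
double AugmentedTfIdf(int tf, int maxTf, int df, int corpusSize)
{
    // Added guard (not in the original code): avoid dividing by zero.
    if (maxTf <= 0 || df <= 0 || corpusSize <= 0)
    {
        return 0.0;
    }
    const double tfWeight = 0.5 + 0.5 * static_cast<double>(tf) / maxTf;
    const double idf = std::log(static_cast<double>(corpusSize) / df);
    return tfWeight * idf;
}
```

Two consequences follow directly from the formula. A word that does not occur in an article (tf = 0) still receives half of its IDF, so an article never ends up as an all-zero vector unless every feature word has zero IDF. A word that occurs in every article (df equal to the corpus size) has IDF ln(1) = 0 and is therefore dropped from every row.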

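This completes the code for the module that builds the document vector model.

To be continued: next time we will show how to obtain the cluster centers computed by Weka and complete the text clustering.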