Study log, 3-11

2. Storing sparse matrices in Python in the various available formats; this post covers it in a lot of detail

https://www.cnblogs.com/hellojamest/p/11769467.html (to read)
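A quick sketch for myself before reading the post, assuming scipy is installed. The matrix values are made up for illustration; the example only shows how two common formats (COO and CSR) lay out the same data.

```python
import numpy as np
from scipy import sparse

# A small dense matrix that is mostly zeros (illustrative data).
dense = np.array([
    [0, 0, 3],
    [4, 0, 0],
    [0, 0, 5],
])

# COO stores one (row, col, value) triple per non-zero entry.
coo = sparse.coo_matrix(dense)
print(coo.row, coo.col, coo.data)

# CSR stores the values row by row, plus a pointer to where each row starts.
csr = coo.tocsr()
print(csr.data)     # non-zero values in row-major order
print(csr.indices)  # column index of each stored value
print(csr.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]
```

COO is convenient for building a matrix, while CSR is the format sklearn's vectorizers return and is better suited to row slicing and matrix products.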

3. Just as I suspected, the GPU ran out of memory.

RuntimeError: CUDA out of memory. Tried to allocate 1.28 GiB (GPU 0; 10.76 GiB total capacity;
 8.17 GiB already allocated; 825.56 MiB free; 1.02 GiB cached)

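https://github.com/pytorch/pytorch/issues/958 One example in this thread stacks a lot of fully connected layers and then overflows. The explanation there is great!

That leaves me with a question: if the GPU overflows, could I train on the CPU instead? But then what would be the point of the model...

https://zhuanlan.zhihu.com/p/61892329 This one really surprised me. As I read it, the overflow does not depend on the total size of the dataset, because only one batch sits on the GPU at a time. (Model size still matters, though: the parameters, gradients, and activations all live in GPU memory.)

So from now on, when I hit an out-of-memory error, the first thing to try is reducing batch_size until it fits.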
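A minimal sketch of the "keep shrinking batch_size" idea. Everything here is hypothetical: `largest_batch_that_fits` and `fake_step` are names I made up, and `fake_step` only simulates an out-of-memory error so the example runs without a GPU. In a real training script, the step function would run one forward/backward pass at the given batch size.

```python
def largest_batch_that_fits(try_step, start_batch_size, min_batch_size=1):
    """Halve the batch size until try_step(batch_size) stops raising an OOM error."""
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit")


# Stand-in for one training step: pretend anything above 48 samples overflows.
def fake_step(batch_size):
    if batch_size > 48:
        raise RuntimeError("CUDA out of memory. (simulated)")


print(largest_batch_that_fits(fake_step, start_batch_size=256))
```

Halving is just a convenient search strategy; a smaller batch size may also call for a learning-rate adjustment, which this sketch ignores.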

3-12————————————————

4. What to check when the model does not converge

https://zhuanlan.zhihu.com/p/36369878

http://theorangeduck.com/page/neural-network-not-working#batchsize

5. StandardScaler, another standardization method in sklearn

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Mean and variance (the standard deviation is the square root of var_, also available as scaler.scale_):

>>> scaler.mean_
array([0.5, 0.5])
>>> scaler.var_
array([0.25, 0.25])

The whole transform is z = (x - mean) / std. Note that it divides by the standard deviation, not the variance.
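To convince myself of the formula, here is a small check that computes it by hand and compares it with StandardScaler. I am assuming the toy data from the sklearn documentation example, which gives the mean of 0.5 and variance of 0.25 shown above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data assumed from the sklearn docs example (mean 0.5, variance 0.25 per column).
data = np.array([[0, 0], [0, 0], [1, 1], [1, 1]], dtype=float)

scaler = StandardScaler().fit(data)

# Manual version: subtract the column mean, divide by the column standard deviation.
mean = data.mean(axis=0)
std = data.std(axis=0)  # population standard deviation (ddof=0)
manual = (data - mean) / std

print(scaler.mean_, scaler.var_, scaler.scale_)
print(np.allclose(manual, scaler.transform(data)))
```

Dividing by `scaler.var_` instead would give values twice as large for this data, since the standard deviation here is 0.5 and the variance is 0.25.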

3-13————————————————————

1. TF-IDF (term frequency-inverse document frequency)

https://zhuanlan.zhihu.com/p/31197209

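Three things to understand: how term frequency is computed, how inverse document frequency is computed, and the pros and cons.

TF-IDF is the product of the two. It grows with the term frequency and with the inverse document frequency, which means it shrinks as more documents contain the term. This picks out words that appear often in one document but rarely in the others, so they can serve as a signature for that document.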
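The product described above can be sketched in plain Python. This uses the basic textbook definitions, tf = count / document length and idf = log(N / df), which are an assumption on my part and not what sklearn computes by default. The three tiny documents are made up.

```python
import math

# Made-up, already tokenized documents.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]


def tf(term, doc):
    # Share of the document's tokens that are this term.
    return doc.count(term) / len(doc)


def idf(term, docs):
    # The term must occur in at least one document, otherwise df is zero.
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


for term in ["the", "cat", "sat"]:
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

"the" appears in every document, so its idf is log(1) = 0 and its score vanishes, while "sat" appears only in the first document and gets the highest score there.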

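Pros and cons: it is simple to compute, but a key word does not always appear many times, and the score reflects nothing about context or word order. It looks at frequency only.

The blog post above explains this quite clearly.

Computing it in sklearn: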

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
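# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())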

print(X.shape)

# Output:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
>>> X
<4x9 sparse matrix of type '<class 'numpy.float64'>'
    with 21 stored elements in Compressed Sparse Row format>
>>> vectorizer.vocabulary  # constructor argument; None here, so nothing is printed
>>> vectorizer.vocabulary_
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
>>> len(vectorizer.vocabulary_)
9
>>> X[0]
<1x9 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
>>> X[0].toarray()  # the columns follow the vocabulary indices in ascending order, i.e. the terms sorted alphabetically
array([[0.        , 0.46979139, 0.58028582, 0.38408524, 0.        ,
        0.        , 0.38408524, 0.        , 0.38408524]])
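The numbers in that row can be reproduced by hand. The sketch below assumes the TfidfVectorizer defaults: raw counts as tf, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The regex is my approximation of the default tokenizer (lowercase, words of two or more characters), and the helper names are my own.

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Approximation of the default tokenization: lowercase, words of 2+ characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(corpus)


def smooth_idf(term):
    # Smoothed idf: acts as if one extra document contained every term.
    df = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(tokens):
    # Raw count times idf for every vocabulary term, then L2-normalize the row.
    raw = [tokens.count(term) * smooth_idf(term) for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


row0 = tfidf_row(tokenized[0])
print(vocab)
print([round(v, 8) for v in row0])
```

This also explains why "is", "the", and "this" share the same value in the first row: each appears once in the sentence and in all four documents, so they have identical tf and idf.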