Programming Tips Collection

Below are some small tips from my own programming, recorded and summarized to help myself improve.

1. For membership lookups over collections in Python, prefer dict or set.

Both dict and set are implemented internally in Python as hash tables, so a lookup takes O(1) time on average, far faster than a linear scan over a list.

Once a program performs more than roughly 1,000 lookups, you should start considering organizing the data as a dict or set.

#coding:utf-8
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import string
import operator
import datetime

commonWords = ["the", "be", "and", "of", "a", "in", "to", "have", "it", "i", "that", "for", "you", "he", "with", "on", "do", "say", "this", "they", "is", "an", "at", "but", "we", "his", "from", "that", "not", "by", "she", "or", "as", "what", "go", "their", "can", "who", "get", "if", "would", "her", "all", "my", "make", "about", "know", "will", "as", "up", "one", "time", "has", "been", "there", "year", "so", "think", "when", "which", "them", "some", "me", "people", "take", "out", "into", "just", "see", "him", "your", "come", "could", "now", "than", "like", "other", "how", "then", "its", "our", "two", "more", "these", "want", "way", "look", "first", "also", "new", "because", "day", "more", "use", "no", "man", "find", "here", "thing", "give", "many", "well"]
# Uncomment the next line to store commonWords as a set instead of a list,
# run the program both ways, and compare the timings to see which wins!
#commonWords = set(commonWords)

def isCommon(word):
    global commonWords
    if word in commonWords:
        return True
    return False


def cleanText(input):
    # Collapse newlines and repeated spaces, drop footnote markers like [1],
    # and strip non-ASCII characters.
    input = re.sub(r'\n+', " ", input).lower()
    input = re.sub(r'\[[0-9]*\]', "", input)
    input = re.sub(r' +', " ", input)
    input = re.sub(r'u\.s\.', "us", input)
    input = bytes(input, "UTF-8")
    input = input.decode("ascii", "ignore")
    return input

def cleanInput(input):
    input = cleanText(input)
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)

    # Filter out the common stop words; this is where the container type
    # of commonWords (list vs set) makes the difference.
    cleanContent = []
    for word in cleanInput:
        if not isCommon(word):
            cleanContent.append(word)
    return cleanContent

def getNgrams(input, n):
    input = cleanInput(input)
    output = {}
    for i in range(len(input)-n+1):
        ngramTemp = " ".join(input[i:i+n])
        if ngramTemp not in output:
            output[ngramTemp] = 0
        output[ngramTemp] += 1
    return output

def getFirstSentenceContaining(ngram, content):
    #print(ngram)
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""

content = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')

print('commonWords is stored as a', type(commonWords).__name__)
print('Begin:', datetime.datetime.now())
for i in range(50):
    ngrams = getNgrams(content, 2)
    sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)
print('End:', datetime.datetime.now())
print(sortedNGrams)
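
If you just want to see the effect in isolation, here is a minimal standalone sketch comparing list membership with set membership via the standard-library timeit module; the random word list and the query volume are made up purely for illustration.

import random
import string
import timeit

# Hypothetical data: 10,000 random 5-letter "words" and 1,000 lookups.
words_list = ["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(10000)]
words_set = set(words_list)
queries = [random.choice(words_list) for _ in range(1000)]

# Membership test against a list: O(n) per lookup.
t_list = timeit.timeit(lambda: [w in words_list for w in queries], number=10)
# Membership test against a set: O(1) on average per lookup.
t_set = timeit.timeit(lambda: [w in words_set for w in queries], number=10)

print("list lookups:", t_list)
print("set lookups: ", t_set)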

2. Inserting data into a database from Python

Before inserting a record, remember to run a query first to check whether the data is already in the database.

First, this makes the program more robust; second, it also avoids inserting the same data twice. A minimal sketch is shown below.
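
Here is a minimal sketch of this check-before-insert pattern using the standard-library sqlite3 module; the example.db file, the pages table, and its columns are assumptions for illustration only.

import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")

def insert_if_absent(url, title):
    # Query first: only insert when the record is not already present.
    cur.execute("SELECT 1 FROM pages WHERE url = ?", (url,))
    if cur.fetchone() is None:
        cur.execute("INSERT INTO pages (url, title) VALUES (?, ?)", (url, title))
        conn.commit()

insert_if_absent("http://example.com", "Example")
insert_if_absent("http://example.com", "Example")  # second call inserts nothing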

3. When creating database tables, it is best to add indexes

Recently I needed to insert millions of rows into a database. Once the table passed roughly a hundred thousand rows, the database became extremely slow and disk I/O was saturated.

Later I realized the number of queries was far too high, so I recreated the table and added indexes this time, in particular a unique index; my guess is that it is backed by a hash map.

With the index in place, the disk load dropped dramatically and query times stayed nearly constant, although read/write speed still fell somewhat as the database grew (which is only to be expected).
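
As a sketch under the same assumptions as the previous example (sqlite3, a hypothetical pages table), a unique index can be added like this so the existence check no longer has to scan the whole table:

import sqlite3

conn = sqlite3.connect("example.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")
# A unique index both enforces uniqueness and makes lookups on url fast,
# so the SELECT-before-INSERT check from tip 2 stops being a full scan.
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_pages_url ON pages (url)")
conn.commit()

With the unique index in place, SQLite also accepts INSERT OR IGNORE, which folds the existence check and the insert into a single statement.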

4. Handling escape characters in Python strings

In Python, this escape character takes up 4 columns, not the usual 8.

Original post: https://www.cnblogs.com/flyinghorse/p/5735276.html