Python_Crawler_Foundation3_CSS_Xpath_Json_XML

Python_Crawler_Foundation3_CSS_Xpath_Json_XML_RegExp

Python Simple Crawler

Using XML.DOM or XML.sax to parser XML files. (https://www.tutorialspoint.com/python/python_xml_processing.htm )

CSS (Cascading Style Sheets) 是级联样式表，它是用于解决如何显示HTML元素。要解决如果显示html元素，就要解决如果对html元素定位。

为什么要使用CSS来定义HTML元素，而不是直接用属性设置元素。

直接使用属性：<P font-size=” ” “color = > abcd</P>。这样你要手动一个个去修改属性。

而CSS可以通过ID class或其他方法快速定位到一大批元素。

h1就是一个selector。color和front-size就是属性，red和14px就是值。

head，p，h1都是元素。h1，h2代表段落标签，image代表图像标签，等。
class是元素的属性。
.important选择的是所有有这个类属性的元素，如<p class="important"> or <h1 class="important">中，P和h1都是会被选中，他们都有import这个类。
可以结合元素选择器来定位，如：P.important，选择的是具有import类的P元素。

元素选择器+类选择器

ID选择器：可以给每个元素一个ID。ID选择器与class选择器很像，但class是可以共享，如，不同的标签（head，body，p，等等）可以有共同的class选择器，如上，P.important和h1.important，但id是全局唯一的。

id选择器和class选择器都是属性选择器的特殊选择器。

这里css文件中，只适用了class选择器。

XPath

http://www.w3school.com.cn/xpath/index.asp

节点之间的关系像是文件系统中的文件路径的方式，可看作一个目录树，可以更方便的遍历和查找节点。这是CSS做不到的，CSS只能顺序，而XPath可以顺序也可以逆序。

html的语言和xml语言长得很像，但是html语言的语法要求不是很严格的，对浏览器而言，它会尽量解析。如果把一段html放到xml解析器里，很有可能会失败。

JQuery 来解析xml。JQuery是用xpath和CSS来定位元素的。

#outcome：

字典是不保证顺序的，所以当用json导数据的时候，顺序是随机的。

量小用DOM，量大就用SAX。

https://www.tutorialspoint.com/python/python_xml_processing.htm

Here is the easiest way to quickly load an XML document and to create a minidom object using the xml.dom module. The minidom object provides a simple parser method that quickly creates a DOM tree from the XML file.

The sample phrase calls the parse( file [,parser] ) function of the minidom object to parse the XML file designated by file into a DOM tree object.

books.xml

 1 <?xml version="1.0" encoding="ISO-8859-1"?>
 2 
 3 <bookstore shelf="new arrives">
 4 
 5 <book category="COOKING">
 6   <title lang="en">Everyday Italian</title>
 7   <author>Giada De Laurentiis</author>
 8   <year>2005</year>
 9   <price>30.00</price>
10 </book>
11 
12 <book category="CHILDREN">
13   <title lang="en">Harry Potter</title>
14   <author>J K. Rowling</author>
15   <year>2005</year>
16   <price>29.99</price>
17 </book>
18 
19 <book category="WEB">
20   <title lang="en">XQuery Kick Start</title>
21   <author>James McGovern</author>
22   <author>Per Bothner</author>
23   <author>Kurt Cagle</author>
24   <author>James Linn</author>
25   <author>Vaidyanathan Nagarajan</author>
26   <year>2003</year>
27   <price>49.99</price>
28 </book>
29 
30 <book category="WEB">
31   <title lang="en">Learning XML</title>
32   <author>Erik T. Ray</author>
33   <year>2003</year>
34   <price>39.95</price>
35 </book>
36 
37 </bookstore>

XML_Dom.py

 1 from xml.dom import minidom
 2 import xml.dom.minidom
 3 
 4 # Open XML document using minidom parser (parse, documentElement, xml.dom.minidom.Element)
 5 doc = minidom.parse('books.xml')
 6 root = doc.documentElement
 7 print(type(root))
 8 print(root.nodeName)
 9 
10 # getting attribute values. (hasAttribute, getAttribute)
11 if root.hasAttribute("shelf"):
12     print ("Root element : %s" % root.getAttribute("shelf"))   
13 else:
14     print('getting vlaue failure')
15 
16 # Print detail of each books (childNodes[0].data, xml.dom.minicompat.NodeList)
17 books = root.getElementsByTagName('book')
18 book_list= []
19 price_list = []
20 for book in books:
21     if book.hasAttribute("category"):
22         print("**** %s ****" %book.getAttribute("category"))
23     titles = book.getElementsByTagName('title')
24     print(type(titles))
25     book_list.append(titles[0].childNodes[0].data)
26     print("Book's Name: %s" %titles[0].childNodes[0].data)
27     prices = book.getElementsByTagName('price')
28     price_list.append(prices[0].childNodes[0].data)
29     print("prices: %s" %prices[0].childNodes[0].data)
30     print("
")
31 #print(titles[0].childNodes[0].nodeValue,prices[0].childNodes[0].nodeValue
32 print(book_list)
33 print(price_list)

#outcome

 1 <class 'xml.dom.minidom.Element'>
 2 bookstore
 3 Root element : new arrives
 4 
 5 **** COOKING ****
 6 <class 'xml.dom.minicompat.NodeList'>
 7 Book's Name: Everyday Italian
 8 prices: 30.00
 9 
10 
11 **** CHILDREN ****
12 <class 'xml.dom.minicompat.NodeList'>
13 Book's Name: Harry Potter
14 prices: 29.99
15 
16 
17 **** WEB ****
18 <class 'xml.dom.minicompat.NodeList'>
19 Book's Name: XQuery Kick Start
20 prices: 49.99
21 
22 
23 **** WEB ****
24 <class 'xml.dom.minicompat.NodeList'>
25 Book's Name: Learning XML
26 prices: 39.95
27 
28 
29 ['Everyday Italian', 'Harry Potter', 'XQuery Kick Start', 'Learning XML']
30 ['30.00', '29.99', '49.99', '39.95']

Outcomes:

book.xml

<?xml version="1.0"?>
<bookstore>　　　　　　#<bookstore> = <root>。 bookstore相当于root是一个根节点。
	<book>
		<title>Learn Python</title>
		<price>100</price>
	</book>
	<book>
		<title>Learn XML</title>
		<price>100</price>
	</book>
</bookstore>

XML_DOM.py

 1 from xml.dom import minidom
 2 
 3 doc = minidom.parse('book.xml')　　　　# <class 'xml.dom.minidom.Document'> doc是一个文档树，是从根节点开始遍历，对应的就是根节点。
 4 root = doc.documentElement　　　　　　　# <class 'xml.dom.minidom.Element'> 将doc对应的根节点传给root，所以root也对应根节点。
 5 print(type(root))　　　　　　　　
 6 #print(dir(root))       　　　　　　　　# use 'dir' to see all methods of the function
 7 print(root.nodeName)
 8 books = root.getElementsByTagName('book')   　 #getElements将所有元素给books，books是一个数组。<class 'xml.dom.minicompat.NodeList'> 
 9 for book in books:　　　　　　　　　　　　　　　　　#books,titles,price are <class 'xml.dom.minicompat.NodeList'>
10     titles = book.getElementsByTagName('title')  　　#在当前book节点下，找到所有元素名为title的元素。
11     prices = book.getElementsByTagName('price')
12     print(titles[0].childNodes[0].nodeValue,prices[0].childNodes[0].nodeValue)

#Outcome:

<class 'xml.dom.minidom.Element'>
bookstore
Learn Python 100
Learn XML 80

XML_SAX.py 　　SAX处理方式在XML数据库里处理比较多，用于处理大型的xml页面。

 1 import string
 2 from xml.parsers.expat import ParserCreate
 3 
 4 class DefaultSaxHandler(object):
 5     def start_element(self,name,attrs):
 6         self.name = name
 7         print('element: %s, attrs: %s' %(name,str(attrs)))
 8     def end_element(self,name):
 9         print('end element: %s' %name)
10     def char_data(self,text):
11         if text.strip():
12             print("%s's text is %s" %(self.name, text))
13 
14 handler = DefaultSaxHandler()
15 parser = ParserCreate()
16 parser.StartElementHandler = handler.start_element     #<book>
17 parser.EndElementHandler =  handler.end_element       #</book>
18 parser.CharacterDataHandler = handler.char_data        #<title>character data</title>
19 
20 with open('book.xml','r') as f:
21     parser.Parse(f.read())

#Outcome：

element: book, attrs:{}
element: title, attrs:{'lang':'eng'}
title's text is Learn Python
end element: title
element: price, attrs:{}
price's text is 100
end element: price
end element: book
....

SAX Example： url = ""http://www.ip138.com/post/""

Regular Expression

https://www.cnblogs.com/tlfox2006/p/8393289.html

Reg_EXP02.py

 1 import re
 2 
 3 #3位数字-3到8个数字
 4 mr = re.match(r'd{3}-d{3,8}','010-23232323')
 5 print(mr.string)
 6 
 7 #加（）分组
 8 mr2 = re.match(r'(d{3})-(d{3,8})','010-23232323')
 9 print(mr2.groups())
10 print(mr2.group(0))
11 print(mr2.group(1))
12 print(mr2.group(2))      
13 
14 t = '2:15:45'
15 tm = re.match(r'(d{0,2}):(d{0,2}):(d{0,2})',t)
16 print(tm.groups())
17 print(tm.group(0))
18 
19 #加（）分组
20 tm = re.match(r'(d{0,2}):(d{0,2}):(d{0,2})',t)
21 
22 #分割字符串
23 p = re.compile(r'd+')
24 print(p.split('one1two22three333'))

Selenium

https://selenium-python.readthedocs.io/index.html

回头看。