beautifulsoup

Table of contents
1. Parsing library
2. Basic usage
3. Tag selectors
3.1 Selecting elements
3.2 Getting the tag name
3.3 Getting attributes
3.4 Getting content
3.5 Nested selection
3.6 Children and descendants
3.7 Parent and ancestors
3.8 Siblings
4. Standard selectors
4.1 find_all(name, attrs, recursive, text, **kwargs)
4.1.1 name
4.1.2 attrs
4.1.3 text
4.2 find(name, attrs, recursive, text, **kwargs)
4.3 find_parents() and find_parent()
4.4 find_next_siblings() and find_next_sibling()
4.5 find_previous_siblings() and find_previous_sibling()
4.6 find_all_next() and find_next()
4.7 find_all_previous() and find_previous()
5. CSS selectors
5.1 Getting attributes
5.2 Getting content
6. Summary
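1. Parsing library
Beautiful Soup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers.
With it you can extract information from a page without writing regular expressions.
Install: pip3 install beautifulsoup4
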
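Parser    Usage    Advantages    Disadvantages
Python standard library    BeautifulSoup(markup, "html.parser")    Built into Python; moderate speed; lenient with malformed documents    Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser    BeautifulSoup(markup, "lxml")    Fast; lenient with malformed documents    Requires a C library to be installed
lxml XML parser    BeautifulSoup(markup, "xml")    Fast; the only XML parser supported    Requires a C library to be installed
html5lib    BeautifulSoup(markup, "html5lib")    Most lenient; parses pages the way a browser does; produces valid HTML5    Slow; requires installing the separate html5lib package (pure Python, no C extension)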
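As a quick illustration of the parser argument in the table above, here is a minimal sketch using the built-in "html.parser", chosen only because it needs no extra install. The markup string is made up for this example, and other parsers may repair the same fragment differently (lxml, for instance, wraps it in html and body tags).

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately unclosed markup (not from the article).
broken = "<p>Hello <b>world"

# The second argument names the parser, as in the table's "Usage" column.
soup = BeautifulSoup(broken, "html.parser")

# The parser closes the open tags, so the tree is still navigable.
bold_text = soup.b.string
paragraph_text = soup.p.get_text()
print(bold_text)
print(paragraph_text)
```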
2. Basic usage
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# prettify() returns the repaired document, indented one node per line
print(soup.prettify())
# text of the <title> tag
print(soup.title.string)
Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
3. Tag selectors
3.1 Selecting elements
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
# when several tags match, only the first one is returned
print(soup.p)
Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
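Attribute-style access such as soup.p returns only the first matching tag, which is why a single <p> is printed above. A minimal sketch of that behavior, assuming a made-up two-paragraph snippet and the built-in "html.parser" so that it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
snippet = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(snippet, "html.parser")

first_p = soup.p                # attribute access: first <p> only
same_as_find = soup.find("p")   # find() returns that same first match
missing = soup.table            # no <table> in the snippet

print(first_p)
print(first_p is same_as_find)
print(missing)
```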
3.2 Getting the tag name
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# .name is the tag's name as a string
print(soup.title.name)
Output:
title
3.3 Getting attributes
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# both forms read the same attribute dictionary
print(soup.p.attrs['name'])
print(soup.p['name'])
Output:
dromouse
dromouse
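Two details are easy to miss when reading attributes: multi-valued attributes such as class come back as a list, and indexing a missing attribute raises KeyError while .get() returns None. A minimal sketch, assuming a one-line made-up snippet and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
snippet = '<p class="title big" name="dromouse">text</p>'
soup = BeautifulSoup(snippet, "html.parser")

name_value = soup.p["name"]     # single-valued attribute: a string
class_value = soup.p["class"]   # multi-valued attribute: a list of strings
missing = soup.p.get("id")      # .get() avoids KeyError for absent attributes

print(name_value, class_value, missing)
```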
3.4 Getting content
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# .string is the text inside the first <p>
print(soup.p.string)
Output:
The Dormouse's story
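.string works here because the first <p> wraps a single child. When a tag has several children it is None, and for a tag holding only a comment it returns a Comment object. A minimal sketch of these cases, with get_text() as the fallback, assuming made-up markup and the built-in "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup for illustration only.
snippet = '<p>Hello <b>world</b></p><a href="#"><!-- Elsie --></a>'
soup = BeautifulSoup(snippet, "html.parser")

single = soup.b.string          # one child: its text
ambiguous = soup.p.string       # two children: None
all_text = soup.p.get_text()    # concatenates every string under <p>
comment = soup.a.string         # the comment node itself

print(single, ambiguous, all_text, type(comment).__name__)
```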
3.5 Nested selection
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# each step returns a Tag, so selections can be chained
print(soup.head.title.string)
Output:
The Dormouse's story
3.6 Children and descendants
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# .contents is a list of the direct children
print(soup.p.contents)
Output:
['
            Once upon a time there were three little sisters; and their names were
            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '
', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' 
            and
            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '
            and they lived at the bottom of a well.
        ']
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# .children is an iterator over the direct children, not a list
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
Output:
<list_iterator object at 0x1064f7dd8>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# .descendants is a generator that also walks into nested tags
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
Output:
<generator object descendants at 0x10650e678>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.
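As the output above shows, the whitespace between tags counts as a node too. If only element nodes are wanted, one option is to filter by type. A minimal sketch, assuming a shortened made-up paragraph and the built-in "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened markup for illustration only.
snippet = """<p class="story">
    Once upon a time
    <a id="link1"><span>Elsie</span></a>
    <a id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(snippet, "html.parser")

# .children yields direct children, including whitespace-only strings.
child_tags = [c.name for c in soup.p.children if isinstance(c, Tag)]

# .descendants also walks into nested tags such as <span>.
descendant_tags = [d.name for d in soup.p.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```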
3.7 Parent and ancestors
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# .parent is the direct parent of the first <a>
print(soup.a.parent)
Output:
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# .parents walks upward through every ancestor of the first <a>
print(list(enumerate(soup.a.parents)))
Output:
[(0, <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>), (1, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>)]
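The last two entries in the output above look identical because the final ancestor is the BeautifulSoup object itself, which prints like its <html> child. Printing only the names makes the chain easier to read. A minimal sketch, assuming a made-up snippet and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
snippet = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

direct_parent = soup.a.parent.name

# .parents walks upward until it reaches the BeautifulSoup object,
# whose name is the special string "[document]".
ancestor_names = [p.name for p in soup.a.parents]

print(direct_parent)
print(ancestor_names)
```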
3.8 Siblings
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# siblings after and before the first <a>, text nodes included
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
Output:
[(0, '
'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' 
            and
            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '
            and they lived at the bottom of a well.
        ')]
[(0, '
            Once upon a time there were three little sisters; and their names were
            ')]
4. Standard selectors
4.1 find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content.

4.1.1 name
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# find_all() returns a list; each item is a Tag
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# each result is a Tag, so find_all() can be called on it again
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
4.1.2 attrs
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# attrs takes a dict of attribute names and the values to match
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# common attributes can be passed as keyword arguments;
# class is a Python keyword, so it is spelled class_
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
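The class_ filter matches any one of a tag's classes, so class_='list' also finds the element whose class is "list list-small". A minimal sketch, assuming a trimmed copy of the markup above and the built-in "html.parser". The limit argument is part of find_all() but is not used elsewhere in this article:

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical variant of the article's markup.
snippet = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo</li></ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

by_one_class = [ul["id"] for ul in soup.find_all("ul", class_="list")]
by_other_class = [ul["id"] for ul in soup.find_all(class_="list-small")]
first_two = soup.find_all("li", limit=2)   # stop after two matches

print(by_one_class, by_other_class, len(first_two))
```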
4.1.3 text
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# text= matches string content and returns the strings, not the tags
# (Beautiful Soup 4.4.0 and later also accept the newer name string=)
print(soup.find_all(text='Foo'))
Output:
['Foo', 'Foo']
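The text filter also accepts a compiled regular expression, and combining it with a tag name returns the tags rather than the strings. A minimal sketch, assuming made-up list markup and the built-in "html.parser". It uses the string= spelling, which Beautiful Soup 4.4.0 and later accept as the newer name for text=:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
snippet = "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")

# Strings whose text starts with "Ba".
matching_strings = soup.find_all(string=re.compile(r"^Ba"))

# With a tag name, the result is the <li> tags whose .string matches.
matching_tags = soup.find_all("li", string="Foo")

print(matching_strings)
print(matching_tags)
```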
4.2 find(name, attrs, recursive, text, **kwargs)
find() returns the first matching element, or None if nothing matches; find_all() returns every match.

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
# there is no <page> tag, so this prints None
print(soup.find('page'))
Output:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
None
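Since find() returns None when nothing matches, chaining an attribute access onto the result raises AttributeError. A minimal sketch of one way to guard against that, assuming made-up markup and the built-in "html.parser". The helper name first_text is hypothetical and not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup

def first_text(soup, tag_name, default=""):
    """Return the text of the first <tag_name>, or default if absent.

    Hypothetical helper for illustration; not a Beautiful Soup API.
    """
    node = soup.find(tag_name)
    return node.get_text() if node is not None else default

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li></ul>", "html.parser")

found = first_text(soup, "li")
not_found = first_text(soup, "page", default="n/a")
print(found, not_found)
```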
4.3 find_parents() and find_parent()
find_parents() returns all ancestors that match the filter; find_parent() returns the nearest one (the direct parent when no filter is given).

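To make the difference between the two parent methods concrete, here is a minimal sketch, assuming made-up nested markup and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
snippet = '<div id="outer"><div id="inner"><ul><li>Foo</li></ul></div></div>'
soup = BeautifulSoup(snippet, "html.parser")
li = soup.find("li")

nearest = li.find_parent().name                        # direct parent
nearest_div = li.find_parent("div")["id"]              # closest matching ancestor
all_divs = [d["id"] for d in li.find_parents("div")]   # nearest first

print(nearest, nearest_div, all_divs)
```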
4.4 find_next_siblings() and find_next_sibling()
find_next_siblings() returns all matching siblings after the node; find_next_sibling() returns the first one.

4.5 find_previous_siblings() and find_previous_sibling()
find_previous_siblings() returns all matching siblings before the node; find_previous_sibling() returns the closest one.

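The sibling methods in both directions can be sketched as follows, assuming a made-up four-item list and the built-in "html.parser". The string= argument used to locate the starting node is the newer spelling of text= (Beautiful Soup 4.4.0 and later):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the spaces between items become text-node siblings.
snippet = "<ul><li>Foo</li> <li>Bar</li> <li>Jay</li> <li>Zed</li></ul>"
soup = BeautifulSoup(snippet, "html.parser")
jay = soup.find("li", string="Jay")

# The plain .next_sibling attribute would return the whitespace string here;
# given a tag name, the find_* methods only return matching tags.
next_one = jay.find_next_sibling("li").string
all_after = [li.string for li in jay.find_next_siblings("li")]
prev_one = jay.find_previous_sibling("li").string
all_before = [li.string for li in jay.find_previous_siblings("li")]

print(next_one, all_after, prev_one, all_before)
```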
4.6 find_all_next() and find_next()
find_all_next() returns every matching node that comes after the current node in the document; find_next() returns the first one.

4.7 find_all_previous() and find_previous()
find_all_previous() returns every matching node that comes before the current node in the document; find_previous() returns the closest one.

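Unlike the sibling methods, the next/previous methods follow document order, so they can reach nested tags and tags outside the current parent. A minimal sketch, assuming made-up markup and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
snippet = "<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul><p>End</p></div>"
soup = BeautifulSoup(snippet, "html.parser")
ul = soup.find("ul")

first_after = ul.find_next("li").string
# A list of names matches any of them; results are in document order.
all_after = [t.name for t in ul.find_all_next(["li", "p"])]

first_before = ul.find_previous("h4").string
# Going backwards, the closest match comes first.
all_before = [t.name for t in ul.find_all_previous(["h4", "div"])]

print(first_after, all_after, first_before, all_before)
```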
5. CSS selectors
Pass a CSS selector string directly to select() to make a selection.

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# class selectors, tag selectors, and id selectors; each call returns a list
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))
Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# select() can be called again on each selected Tag
for ul in soup.select('ul'):
    print(ul.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
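Beyond the descendant selectors shown above, select() also accepts child combinators and attribute selectors, and select_one() returns just the first match (or None). A minimal sketch, assuming a trimmed made-up variant of the markup and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical variant of the article's markup.
snippet = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_li = soup.select_one("#list-1 > li")        # first match only
small_items = soup.select("ul.list-small > li")   # child combinator
with_id = soup.select("ul[id]")                   # attribute selector
nothing = soup.select_one("table")                # no match

print(first_li.get_text(), [li.get_text() for li in small_items], len(with_id), nothing)
```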
5.1 Getting attributes
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# both forms read the id attribute of each selected <ul>
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Output:
list-1
list-1
list-2
list-2
5.2 Getting content
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# get_text() returns all the text inside each <li>
for li in soup.select('li'):
    print(li.get_text())
Output:
Foo
Bar
Jay
Foo
Bar
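When get_text() is called on a container rather than on each item, the whitespace between tags ends up in the result. The separator and strip arguments, or the stripped_strings generator, tidy that up. A minimal sketch, assuming a made-up list and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
snippet = """
<ul id="list-1">
    <li class="element"> Foo </li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(snippet, "html.parser")
ul = soup.select_one("#list-1")

# strip=True trims each string and drops the whitespace-only ones.
joined = ul.get_text(separator=", ", strip=True)
pieces = list(ul.stripped_strings)

print(joined)
print(pieces)
```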
 https://blog.csdn.net/qq_42554007/article/details/90675142

Original article: https://www.cnblogs.com/wangbin2020/p/13696529.html