MiniCrawler

GitHub path:

https://github.com/LixinZhang/miniCrowler

Introduction:

MiniCrawler is a simple web crawler implemented in Python.
It uses a thread pool to speed up page fetching.
To configure the crawler, edit config.py, then start the crawl with python run.py.
Fetched web pages are stored in the pages folder.
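
The thread-pool idea can be sketched roughly as follows. This is not the project's actual code: it is a minimal illustration that assumes Python 3 and the standard library's concurrent.futures, and the names fetch, fetch_all, fetcher, and max_workers are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch every URL on a pool of worker threads.

    Returns a dict mapping each URL to its content, or to the
    exception that was raised while fetching it.
    """
    urls = list(urls)  # allow any iterable, including generators

    def safe_fetch(url):
        try:
            return fetcher(url)
        except Exception as exc:  # one bad URL should not stop the job
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Because fetching is I/O-bound, the worker threads spend most of their time waiting on the network, so several downloads can overlap even under Python's global interpreter lock.
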
check_status.py lets you check the job's status, for example:

Rank            Hostname        Times   
----------------------------------------
   1             buaa.edu.cn        40  
   2             baixing.com        32  
   3             cnblogs.com        29  
   4              hao123.com         5  
   5           xinhuanet.com         2  
   6          visionplaza.cn         2  
   7           people.com.cn         2  
   8                  org.cn         2  
   9                 news.cn         2  
  10             most.gov.cn         2
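
A ranking like the one above could be produced by tallying the hostnames of the fetched URLs. The sketch below is only an assumption about how such a report might be built, not the code in check_status.py: rank_hosts and format_report are hypothetical names, and it groups by the full hostname, which may differ from how the real script groups domains.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_hosts(urls, top=10):
    """Return (hostname, count) pairs for the most frequent hosts."""
    hosts = (urlparse(url).hostname for url in urls)
    # Strings without a recognizable host yield None and are skipped.
    counts = Counter(host for host in hosts if host)
    return counts.most_common(top)


def format_report(ranking):
    """Render the ranking as a fixed-width text table."""
    row = "{:<8}{:>20}{:>10}"
    lines = [row.format("Rank", "Hostname", "Times"), "-" * 38]
    for rank, (host, times) in enumerate(ranking, start=1):
        lines.append(row.format(rank, host, times))
    return "\n".join(lines)
```

Calling print(format_report(rank_hosts(urls))) on a list of fetched URLs prints one row per host, most frequent first.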

More Details

You can find more details in my Chinese-language blog post, "Python 多线程抓取网页" (Fetching Web Pages with Multiple Threads in Python).