Ruby on HttpWatch script

HttpWatch official page: http://www.httpwatch.com/rubywatir/

Ruby on HttpWatch example: http://www.httpwatch.com/rubywatir/site_spider.zip (the official site may update this example)

After downloading the example I added some comments and trimmed some of the code. The main changes are:

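1. Above url = gets.chomp! I added ($*[0].nil?)?(url = url):(url = $*[0]), so the URL can now either be passed on the command line or be hard-coded in the script. Command-line usage: ruby script_name site_name. See the comments in the script for details. Note: do not put http:// in front of the URL.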

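2. Commented out the two break statements. They cause no problem on Ruby 1.8.6, but they raise an error on newer versions such as Ruby 1.9.2, so they have to be commented out.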

3. Commented out plugin.Container.Quit(); so that IE is not closed. After the run finishes, the tester needs to inspect the results in the browser.
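
Changes 1 and 2 could also be written as two small helpers. This is only a sketch: choose_url and supported_version? are hypothetical names that do not exist in the original script, and stripping a leading http:// is an assumption made here so that the "no http://" rule cannot be broken by accident. Newer Ruby versions reject a break that is not inside a loop or block, so exit or abort is the usual way to stop a script at the top level.

```ruby
# Hypothetical helpers (not part of the original site_spider.rb).

# Return the URL to test: the first command-line argument when present,
# otherwise the default hard-coded in the script. A leading http:// or
# https:// is dropped, because the script expects the URL without it.
def choose_url(argv, default_url)
  url = argv.empty? ? default_url : argv[0]
  url.strip.sub(%r{\Ahttps?://}i, '')
end

# True when the HttpWatch version string reports major version 6 or later.
def supported_version?(version)
  version.to_s.split('.').first.to_i >= 6
end

# Possible use in the script (control is the HttpWatch controller object):
#   url = choose_url(ARGV, url)
#   abort "HttpWatch 6.0 or later is required" unless supported_version?(control.Version)
```

One detail worth knowing: when a URL is passed on the command line, a bare gets tries to read from a file named after that argument, so any "Press Enter" pause should use $stdin.gets.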

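Runtime issue: if the test machine has a slow network connection, a page load may time out and the script exits with an error like this: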

C:/Ruby192/lib/ruby/gems/1.9.1/gems/watir-classic-3.0.0/lib/watir-classic/ie-class.rb:374:in `method_missing': (in OLE method `navigate': ) (WIN32OLERuntimeError)
    OLE error code:800C000E in <Unknown>
      <No Description>
    HRESULT error code:0x80020009
      发生意外。
        from C:/Ruby192/lib/ruby/gems/1.9.1/gems/watir-classic-3.0.0/lib/watir-classic/ie-class.rb:374:in `goto'
        from C:/Documents and Settings/Administrator/桌面/site_spider/site_spider.rb:55:in `<main>'
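
One possible way to survive an occasional timeout is to retry the page load a few times before giving up. The helper below is a minimal sketch, not something the original script contains: with_retries is a hypothetical name, and the attempt count and delay are arbitrary. It rescues StandardError, which covers the WIN32OLERuntimeError shown above.

```ruby
# Hypothetical helper: run the block, retrying up to `attempts` times
# when it raises, and pausing `delay` seconds between tries.
def with_retries(attempts, delay = 0)
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError => e
    raise if tries >= attempts      # out of attempts: re-raise the last error
    puts "Attempt #{tries} failed (#{e.message}), retrying..."
    sleep delay
    retry                           # run the begin block again
  end
end

# Possible use in the spider (ie and nextUrl come from site_spider.rb):
#   with_retries(3, 5) { ie.goto(nextUrl) }
```

If every attempt fails, the last error is still raised, so the script stops just as it does today.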

site_spider.rb

# A Site Spider that uses HttpWatch, Ruby and Watir
#
# For more information about this example please refer to http://www.httpwatch.com/rubywatir/
#
MAX_NO_PAGES = 200    # maximum number of pages visited in one run

require 'win32ole'        # win32ole drives HttpWatch; versions of HttpWatch below 6.0 cannot be called this way
require 'rubygems'
require 'watir'
require './url_ops.rb'    # url_ops.rb must be in the same directory as this script (run the script from that directory)
url = "www.gaopeng.com/?ADTAG=beijing_from_beijing"        # URL to test; can also be given on the command line. Do not prefix it with http://

# Create HttpWatch
control = WIN32OLE.new('HttpWatch.Controller')
httpWatchVer = control.Version
if httpWatchVer[0...1] == "4" or httpWatchVer[0...1] == "5"
    puts "\nERROR: You are running HttpWatch #{httpWatchVer}. This sample requires HttpWatch 6.0 or later. Press Enter to exit...";  $stdout.flush
    $stdin.gets   # plain gets would try to read from a file named after the command-line argument
    #break        # no problem on Ruby 1.8.6, but an error on newer versions such as Ruby 1.9.2, so it is commented out
    exit          # stop here instead of carrying on with an unsupported version
end

# Get the domain name to spider
($*[0].nil?)?(url = url):(url = $*[0])  # a URL given on the command line takes priority
#url = gets.chomp!   # must stay commented out when the line above is used
puts "Site to check: #{url}";  $stdout.flush
hostName = url.HostName
if hostName.empty?
    puts "\nPlease enter a valid domain name. Press Enter to exit...";  $stdout.flush
    $stdin.gets
    #break        # no problem on Ruby 1.8.6, but an error on newer versions such as Ruby 1.9.2, so it is commented out
    exit          # nothing to spider without a host name
end

# Start IE
ie = Watir::IE.new
ie.logger.level = Logger::ERROR

# Attach HttpWatch to the IE window
plugin = control.ie.Attach(ie.ie)

# Start recording HTTP traffic
plugin.Clear()
plugin.Log.EnableFilter(false)
plugin.Record()

url = url.CanonicalUrl
urlsVisited = Array.new;  urlsToVisit = Array.new( 1, url )

# Start visiting pages
while urlsToVisit.length > 0 && urlsVisited.length < MAX_NO_PAGES

    nextUrl = urlsToVisit.pop
    puts "Loading " + nextUrl + "...";   $stdout.flush

    ie.goto(nextUrl)              # get WATIR to load URL
    urlsVisited.push( nextUrl )   # store this URL in the list that has been visited

    begin
        # Look at each link on the page and decide if it needs to be visited
        ie.links().each() do |link|

            linkUrl = link.href.CanonicalUrl
            # Skip the link if it is from a different domain, if it is a download,
            # or if it has already been queued or visited
            if !url.IsSubDomain( linkUrl.HostName ) ||
               linkUrl.Path.include?( ".exe" ) || linkUrl.Path.include?( ".zip" ) || linkUrl.Path.include?( ".csv" ) ||
               linkUrl.Path.include?( ".pdf" ) || linkUrl.Path.include?( ".png" ) ||
               urlsToVisit.find{ |aUrl| aUrl == linkUrl } != nil ||
               urlsVisited.find{ |aUrl| aUrl == linkUrl } != nil
                # Don't add this URL to the list
                next
            end
            # Add this URL to the list
            urlsToVisit.push(linkUrl)
        end
    rescue
        # Interpolate $! because an exception cannot be appended to a String with +
        puts "Failed to find links in #{nextUrl} #{$!}";  $stdout.flush
    end

end

if ( urlsVisited.length == MAX_NO_PAGES )
    puts "\nThe spider has stopped because #{MAX_NO_PAGES} pages have been visited. (Change MAX_NO_PAGES if you want to increase this limit)";   $stdout.flush
end

# Stop Recording HTTP data in HttpWatch
plugin.Stop()

puts "\nAnalyzing HTTP data..";   $stdout.flush


# Look at each HTTP request in the log to compile list of URLs
# for each error
errorUrls = Hash.new
plugin.Log.Entries.each do |entry|
    if (!entry.Error.empty? && entry.Error != "Aborted") || entry.StatusCode >= 400
        if !errorUrls.has_key?( entry.Result )
            errorUrls[entry.Result] = Array.new( 1, entry.Url )
        else
            if errorUrls[entry.Result].find{ |aUrl| aUrl == entry.Url } == nil
                errorUrls[entry.Result].push( entry.Url )
            end
        end
    end
end

# Display summary statistics for whole log
summary = plugin.Log.Entries.Summary

printf "Total time to load page (secs):      %.3f\n", summary.Time
printf "Number of bytes received on network: %d\n", summary.BytesReceived

printf "HTTP compression saving (bytes):     %d\n", summary.CompressionSavedBytes
printf "Number of round trips:               %d\n", summary.RoundTrips
printf "Number of errors:                    %d\n", summary.Errors.Count

# Print out errors
summary.Errors.each do |error|
    numErrors = error.Occurrences
    description = error.Description
    puts "#{numErrors} URL(s) caused a #{description} error:"
    errorUrls[error.Result].each do |aUrl|
        puts "-> #{aUrl}"
    end

end

# Quit IE. Commented out so that the tester can inspect the results after the run
#plugin.Container.Quit();

puts "\r\nPress Enter to exit";  $stdout.flush
#gets
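
The page-visiting loop in site_spider.rb can be tried out without IE or HttpWatch. The sketch below is an illustration under assumptions, not the original code: crawl and SKIPPED_EXTENSIONS are hypothetical names, the block stands in for ie.goto plus ie.links, a Set replaces the two Array#find scans, and the domain check (IsSubDomain) is left out. File.extname only looks at the end of the string, so a download link followed by a query string would not be skipped here.

```ruby
require 'set'

# Hypothetical list of extensions treated as downloads rather than pages.
SKIPPED_EXTENSIONS = %w[.exe .zip .csv .pdf .png]

# Visit pages starting from start_url, at most max_pages of them.
# The block receives a URL and returns the links found on that page.
def crawl(start_url, max_pages)
  visited  = []
  queued   = Set.new([start_url])   # every URL ever queued, for fast duplicate checks
  to_visit = [start_url]

  while !to_visit.empty? && visited.length < max_pages
    current = to_visit.pop          # last in, first out, like urlsToVisit.pop
    visited << current
    yield(current).each do |link|
      next if SKIPPED_EXTENSIONS.include?(File.extname(link).downcase)
      next if queued.include?(link)
      queued << link
      to_visit << link
    end
  end
  visited
end

# Example with a made-up link graph in place of a real site:
#   graph = { "a" => ["b", "c", "file.zip"], "b" => ["a"], "c" => [] }
#   crawl("a", 10) { |url| graph.fetch(url, []) }
```

Because pop takes the most recently queued URL, the crawl goes depth-first, which is also how the original loop behaves.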

url_ops.rb

# Helper functions used to parse URLs
class String
  # Host part of the URL, lower-cased; "" when none is found
  def HostName
    matches = scan(/^(?:https?:\/\/)?([^\/]*)/)
    if matches.length > 0 && matches[0].length > 0
      return matches[0][0].downcase
    else
      return ""
    end
  end

  # True when hostName equals this URL's host (ignoring a leading "www.")
  # or ends with it
  def IsSubDomain( hostName)
    thisHostName = self.HostName
    if thisHostName.slice(0..3) == "www."
      thisHostName = thisHostName.slice(4..-1)
    end
    if thisHostName == hostName ||
      (hostName.length > thisHostName.length &&
       hostName.slice( -thisHostName.length ..-1) == thisHostName)
      return true
    end
    return false
  end

  # "http://" or "https://"; defaults to "http://" when the URL has no scheme
  def Protocol
    matches = scan(/^(https?:\/\/)/)
    if matches.length > 0 && matches[0].length > 0
      return matches[0][0].downcase
    else
      return "http://"
    end
  end

  # Everything after the first "/" following the host, lower-cased;
  # "" when there is no path or when the URL contains a "#"
  def Path
    if scan(/^(https?:\/\/)/).length > 0
      matches = scan(/^https?:\/\/[^\/]+\/([^#]+)$/)
    else
      matches = scan(/^[^\/]+\/([^#]+)$/)
    end
    if matches != nil && matches.length == 1 && matches[0].length == 1
      return matches[0][0].downcase
    else
      return ""
    end
  end

  # Scheme + host + "/" + path, all lower-cased
  def CanonicalUrl
    return self.Protocol + self.HostName + "/" + self.Path
  end
end
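
For comparison, a similar canonical form can be built with Ruby's standard uri library instead of extending String. This is a hypothetical alternative, not what the HttpWatch sample does: canonical_url is an invented name, the port is ignored, and a fragment is simply dropped (url_ops.rb returns an empty path for any URL containing "#"). URI.parse raises URI::InvalidURIError on malformed input, so links taken from real pages may need a rescue.

```ruby
require 'uri'

# Hypothetical alternative to String#CanonicalUrl using the standard library.
def canonical_url(url)
  # Default to http:// when no scheme is given, as Protocol does in url_ops.rb
  url = "http://#{url}" unless url =~ %r{\Ahttps?://}i
  uri = URI.parse(url)
  path  = uri.path.empty? ? '/' : uri.path
  query = uri.query ? "?#{uri.query}" : ''
  # Lower-case everything, as url_ops.rb does; the fragment is left out
  "#{uri.scheme.downcase}://#{uri.host.downcase}#{path.downcase}#{query.downcase}"
end

# Example:
#   canonical_url("www.gaopeng.com/?ADTAG=beijing_from_beijing")
```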

Put both scripts in the same directory (url_ops.rb is unchanged from the original) and run site_spider.rb from cmd.