Python for Data Science

Chapter 6 - Data Sourcing via Web

Segment 4 - Web scraping

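# Beautiful Soup parses the HTML; urllib.request fetches the page.
from bs4 import BeautifulSoup
import urllib.request
from IPython.display import HTML
import re

# Download the raw HTML (bytes) and parse it with the lxml parser.
r = urllib.request.urlopen('https://analytics.usa.gov/').read()
soup = BeautifulSoup(r, "lxml")
type(soup)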
bs4.BeautifulSoup
print(soup.prettify()[:100])
<!DOCTYPE html>
<html lang="en">
 <!-- Initalize title and data source variables -->
 <head>
  <!--
# Print the href of every anchor tag (get() returns None when a tag has no href).
for link in soup.find_all('a'):
    print(link.get('href'))
/
#explanation
https://analytics.usa.gov/data/
https://open.gsa.gov/api/dap/
data/
#top-pages-realtime
#top-pages-7-days
#top-pages-30-days
https://analytics.usa.gov/data/live/all-pages-realtime.csv
https://analytics.usa.gov/data/live/all-domains-30-days.csv
https://www.digitalgov.gov/services/dap/
https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4
https://support.google.com/analytics/answer/2763052?hl=en
https://analytics.usa.gov/data/live/second-level-domains.csv
https://analytics.usa.gov/data/live/sites.csv
mailto:DAP@support.digitalgov.gov
https://analytics.usa.gov/data/
https://open.gsa.gov/api/dap/
mailto:DAP@support.digitalgov.gov
https://github.com/GSA/analytics.usa.gov/issues
https://github.com/GSA/analytics.usa.gov
https://github.com/18F/analytics-reporter
http://www.gsa.gov/
https://www.digitalgov.gov/services/dap/
https://cloud.gov/
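The hrefs printed above mix absolute URLs with relative paths (`/`, `data/`) and page fragments (`#explanation`). A minimal sketch of turning them all into absolute URLs with the standard library's `urllib.parse.urljoin` follows. The `BASE_URL` constant and the `resolve_links` helper are illustrative names that do not appear in the original notebook, and the sample hrefs are copied from the output above.

```python
from urllib.parse import urljoin

# Assumption: the page was fetched from this address, as in the urlopen call.
BASE_URL = "https://analytics.usa.gov/"

def resolve_links(hrefs, base=BASE_URL):
    """Return absolute URLs for each href, skipping None and mailto: links."""
    resolved = []
    for href in hrefs:
        if href is None or href.startswith("mailto:"):
            continue
        resolved.append(urljoin(base, href))
    return resolved

# A few hrefs taken from the printed list, plus None for an anchor without href.
sample_hrefs = ["/", "#explanation", "data/", "https://cloud.gov/",
                "mailto:DAP@support.digitalgov.gov", None]
absolute_links = resolve_links(sample_hrefs)
for url in absolute_links:
    print(url)
```

In the notebook itself, the input list could be built with `[link.get('href') for link in soup.find_all('a')]`.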
print(soup.get_text())














analytics.usa.gov | The US government's web traffic.





















analytics.usa.gov


About this site
Data | API


Select an agency

All Participating Websites
Agency for International Development
Department of Agriculture
Department of Commerce
Department of Defense
Department of Education
Department of Energy
Department of Health and Human Services
Department of Homeland Security
Department of Housing and Urban Development
Department of Justice
Department of Labor
Department of State
Department of Transportation
Department of Veterans Affairs
Department of the Interior
Department of the Treasury
Environmental Protection Agency
Executive Office of the President
General Services Administration
National Aeronautics and Space Administration
National Archives and Records Administration
National Science Foundation
Nuclear Regulatory Commission
Office of Personnel Management
Postal Service
Small Business Administration
Social Security Administration








...
people on government websites now

Visits Today
Eastern Time





Visits in the Past 90 Days

          There were ... visits over the past 90 days.

Devices




            Based on rough network segmentation data, we estimate that less than 5% of all traffic across all agencies comes from US federal government networks.

            Much more detailed data is available in downloadable CSV and JSON. This includes data on combined browser and OS usage.


Browsers




Internet Explorer




Operating Systems




Windows






Visitor Locations Right Now

Cities





Countries




United States & Territories



International







Top Pages

Now
7 Days
30 Days


              People on a single, specific page now. We only count pages with at least 10 people on the page.
              Download the full dataset.




Visits over the last week to domains, including traffic to all pages within that domain.




              Visits over the last month to domains, including traffic to all pages within that domain. We only count pages with at least 1,000 visits in the last month.
              Download the full dataset.





Top Downloads
Total file downloads yesterday on government domains.







About this Site

            These data provide a window into how people are interacting with the government online.
             The data come from a unified Google Analytics account for U.S. federal government agencies known as the Digital Analytics Program.
              This program helps government agencies understand how people find, access, and use government services online. The program does not track individuals,
               and anonymizes the IP addresses of visitors.

            Not every government website is represented in these data. 
            Currently, the Digital Analytics Program collects web traffic from around 400 executive branch government domains,
             across about 5,700 total websites,
              including every cabinet department.
               We continue to pursue and add more sites frequently; to add your site, email the Digital Analytics Program.


Download the data
You can download the data here. Available in JSON and CSV format.
 Additionally, you can access data via our  API project (currently in Beta).
A note on sampling
Due to varying Google Analytics API sampling thresholds and the sheer volume of data in this project, some non-realtime reports may be subject to sampling. 
             The data are intended to represent trends and numbers may not be precise.





Have a question or problem? 
              
              Get in touch.


                  Suggest a feature or report an issue




              View our code on GitHub

              View our code for the data on GitHub









Analytics.usa.gov is a project of GSA’s Digital Analytics Program.
This website is hosted on cloud.gov.
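`get_text()` keeps every newline that sits between tags, which is why the output above is mostly blank lines. A small sketch of tidying such text by trimming each line and dropping the empty ones is shown here. `compact_text` is a hypothetical helper, and `raw_text` is a short stand-in for the real `soup.get_text()` result.

```python
def compact_text(text):
    """Strip each line and drop the ones that end up empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

# Stand-in for soup.get_text(); a few lines modelled on the output above.
raw_text = (
    "\n\n\nanalytics.usa.gov\n\n\nAbout this site\nData | API\n\n\n"
    "   Visits Today\nEastern Time\n\n"
)
print(compact_text(raw_text))
```

Beautiful Soup can do something similar on its own: `soup.get_text("\n", strip=True)` strips each text fragment and joins the non-empty ones with newlines.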











print(soup.prettify()[0:1000])
<!DOCTYPE html>
<html lang="en">
 <!-- Initalize title and data source variables -->
 <head>
  <!--

    Hi! Welcome to our source code.

    This dashboard uses data from the Digital Analytics Program, a US
    government team inside the General Services Administration.

    For a detailed tech breakdown of how 18F and friends built this site:

    https://18f.gsa.gov/2015/03/19/how-we-built-analytics-usa-gov/

    This is a fully open source project, and your contributions are welcome.

    Frontend static site: https://github.com/18F/analytics.usa.gov
    Backend data reporting: https://github.com/18F/analytics-reporter

    -->
  <meta charset="utf-8"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="NjbZn6hQe7OwV-nTsa6nLmtrOUcSGPRyFjxm5zkmCcg" name="google-site-verification"/>
  <link href="/css/vendor/css/uswds.v0.9.6.css" rel="stylesheet"/>
  <link href="/css/public_analytics.css" rel="stylesheet"/>
  <link href="/images/analytics-favicon.ico" rel="ic
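# Keep only anchors whose href starts with "http" (absolute links).
# find_all is the current name of the older findAll alias.
for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
    print(link)
type(link)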
<a href="https://analytics.usa.gov/data/">Data</a>
<a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank">API</a>
<a href="https://analytics.usa.gov/data/live/all-pages-realtime.csv">Download the full dataset.</a>
<a href="https://analytics.usa.gov/data/live/all-domains-30-days.csv">Download the full dataset.</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4">does not track individuals</a>
<a class="external-link" href="https://support.google.com/analytics/answer/2763052?hl=en">anonymizes the IP addresses</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/second-level-domains.csv">400 executive branch government domains</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/sites.csv">about 5,700 total websites</a>
<a href="https://analytics.usa.gov/data/">download the data here.</a>
<a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank"> API project</a>
<a class="usa-button usa-button-secondary-inverse" href="https://github.com/GSA/analytics.usa.gov/issues">
<img alt="Github Icon" class="github-icon" src="/images/github-logo-white.svg"/>
                  Suggest a feature or report an issue
            </a>
<a href="https://github.com/GSA/analytics.usa.gov">
<img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
              View our code on GitHub</a>
<a href="https://github.com/18F/analytics-reporter">
<img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
              View our code for the data on GitHub</a>
<a href="http://www.gsa.gov/">
<img alt="GSA" src="/images/gsa-logo.svg"/>
</a>
<a href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a href="https://cloud.gov/">cloud.gov</a>





bs4.element.Tag
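The `attrs={'href': re.compile("^http")}` filter keeps only anchors whose `href` begins with `http`. According to the Beautiful Soup documentation, a compiled pattern is applied to each attribute value with its `search()` method, so it is the `^` anchor that restricts the match to the start of the string. The sketch below runs the same test on plain strings. The hrefs are taken from the earlier output, and `absolute_only` is an illustrative name.

```python
import re

http_pattern = re.compile("^http")

hrefs = ["/", "#explanation", "https://analytics.usa.gov/data/", "data/",
         "mailto:DAP@support.digitalgov.gov", "http://www.gsa.gov/"]

# search() with a leading ^ succeeds only when the string starts with "http".
absolute_only = [h for h in hrefs if http_pattern.search(h)]
print(absolute_only)
```

Relative paths, fragments and `mailto:` links are dropped, which matches the difference between the first href listing and the filtered tag listing above.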
# Save each matching link on its own line; the with block closes the file.
with open("parsed_data.txt", "w") as file:
    for link in soup.find_all('a', attrs={'href': re.compile("^http")}):
        soup_link = str(link)
        print(soup_link)
        file.write(soup_link + "\n")
<a href="https://analytics.usa.gov/data/">Data</a>
<a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank">API</a>
<a href="https://analytics.usa.gov/data/live/all-pages-realtime.csv">Download the full dataset.</a>
<a href="https://analytics.usa.gov/data/live/all-domains-30-days.csv">Download the full dataset.</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a class="external-link" href="https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4">does not track individuals</a>
<a class="external-link" href="https://support.google.com/analytics/answer/2763052?hl=en">anonymizes the IP addresses</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/second-level-domains.csv">400 executive branch government domains</a>
<a class="external-link" href="https://analytics.usa.gov/data/live/sites.csv">about 5,700 total websites</a>
<a href="https://analytics.usa.gov/data/">download the data here.</a>
<a href="https://open.gsa.gov/api/dap/" rel="noopener" target="_blank"> API project</a>
<a class="usa-button usa-button-secondary-inverse" href="https://github.com/GSA/analytics.usa.gov/issues">
<img alt="Github Icon" class="github-icon" src="/images/github-logo-white.svg"/>
                  Suggest a feature or report an issue
            </a>
<a href="https://github.com/GSA/analytics.usa.gov">
<img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
              View our code on GitHub</a>
<a href="https://github.com/18F/analytics-reporter">
<img alt="Github Icon" class="github-icon" src="/images/github-logo.svg"/>
              View our code for the data on GitHub</a>
<a href="http://www.gsa.gov/">
<img alt="GSA" src="/images/gsa-logo.svg"/>
</a>
<a href="https://www.digitalgov.gov/services/dap/">Digital Analytics Program</a>
<a href="https://cloud.gov/">cloud.gov</a>
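Writing the raw tag strings keeps the HTML markup in the saved file. If only the link text and target are needed, one option is a CSV file with one row per link. This is a sketch using the standard library's `csv` module, not the notebook's own method. The `save_links` helper, the `links` sample and the `links.csv` file name are all hypothetical; in the notebook the pairs could come from `link.get_text(strip=True)` and `link.get('href')`.

```python
import csv
import os
import tempfile

def save_links(rows, path):
    """Write (text, href) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "href"])
        writer.writerows(rows)

# Sample pairs copied from the output above.
links = [("Data", "https://analytics.usa.gov/data/"),
         ("cloud.gov", "https://cloud.gov/")]

# A temporary directory keeps the example from touching the working directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "links.csv")
    save_links(links, csv_path)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        saved_rows = list(csv.reader(handle))

print(saved_rows)
```

A CSV file like this loads directly into a spreadsheet or a pandas DataFrame, whereas the tag strings would need to be parsed again.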
%pwd
'/home/ericwei/Ex_Files_Python_Data_Science_EssT_Pt_1/Exercise Files/06_04_begin'
Original article: https://www.cnblogs.com/keepmoving1113/p/14286921.html