Crawler Practice 2: Amazon

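import requests

# Request the product page with the default requests headers
r = requests.get('https://www.amazon.cn/dp/B01MYH8A99')
print(r.status_code)              # HTTP status code of the response
r.encoding = r.apparent_encoding  # use the encoding detected from the body
print(r.text)                     # response body
print(r.request.headers)          # headers that requests actually sent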

503

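Partial excerpt of the response body (the page asks the visitor to type the characters shown, to confirm that the request does not come from an automated program):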

<div class="a-box-inner">
<i class="a-icon a-icon-alert"></i>
<h4>请输入您在下方看到的字符</h4>
<p class="a-last">抱歉,我们只是想确认一下当前访问者并非自动程序。为了达到最佳效果,请确保您浏览器上的 Cookie 已启用。</p>
</div>
</div>

{'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

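The User-Agent 'python-requests/2.18.4' marks this as a crawler request, as explained in Practice 1, so the server rejected it. As in Practice 1, we now change the request headers to imitate a browser.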
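The modified request itself is not shown here, so the following is only a minimal sketch of what it might look like, assuming the same `'user-agent': 'Mozilla/5.0'` header that the code framework later in this post uses. The helper names `preview_headers` and `fetch_as_browser` are hypothetical. `preview_headers` prepares the request without sending it, so the final headers can be inspected offline.

```python
import requests

URL = 'https://www.amazon.cn/dp/B01MYH8A99'
# Assumed header value, taken from the code framework in this post
BROWSER_HEADERS = {'user-agent': 'Mozilla/5.0'}


def preview_headers(url, headers=None):
    """Prepare (but do not send) a GET request and return its final headers."""
    session = requests.Session()
    request = requests.Request('GET', url, headers=headers or BROWSER_HEADERS)
    # prepare_request merges the session's default headers with ours;
    # header names are matched case-insensitively, so 'user-agent'
    # replaces the default 'User-Agent'.
    return session.prepare_request(request).headers


def fetch_as_browser(url, headers=None):
    """Send the GET request with the browser-like User-Agent."""
    r = requests.get(url, headers=headers or BROWSER_HEADERS, timeout=10)
    r.encoding = r.apparent_encoding
    return r
```

Calling `r = fetch_as_browser(URL)` and then printing `r.status_code`, `r.request.headers`, and `r.text` should produce output of the kind shown next, although the actual response depends on the site.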

200
{'user-agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
<!--[if IE 7]> <html lang="zh-CN" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
<!--[if IE 8]> <html lang="zh-CN" class="a-no-js a-lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="a-no-js" lang="zh-CN"><!--<![endif]--><head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title dir="ltr">Amazon CAPTCHA</title>
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="https://images-na.ssl-images-amazon.com/images/G/01/AUIClients/AmazonUI-3c913031596ca78a3768f4e934b1cc02ce238101.secure.min._V1_.css">
<script>

if (true === true) {
var ue_t0 = (+ new Date()),
ue_csm = window,
ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
ue_furl = "fls-cn.amazon.cn",
ue_mid = "AAHKV2X7AFYLW",
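Note that the title in the output above is "Amazon CAPTCHA", so a 200 status code alone does not show that the product page came back. One possible way to check for this is sketched below. The function names are hypothetical, and the marker strings are simply copied from the two response excerpts in this post, so this is a heuristic that may stop working if the site changes its pages.

```python
import re

# Markers copied from the response excerpts in this post (assumption:
# the verification page keeps using one of these strings).
CAPTCHA_MARKERS = ('Amazon CAPTCHA', '请输入您在下方看到的字符')


def page_title(html):
    """Return the text of the first <title> element, or '' if none is found."""
    match = re.search(r'<title[^>]*>(.*?)</title>', html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ''


def looks_like_captcha(html):
    """Heuristic: True if the page contains any known CAPTCHA marker."""
    return any(marker in html for marker in CAPTCHA_MARKERS)
```

A crawler could call `looks_like_captcha(r.text)` after a successful request and treat a `True` result as a blocked request rather than as page content.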

Code framework

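import requests

def getHtmlText(url):
    try:
        # Present a browser-like User-Agent instead of python-requests/x.y.z
        kv = {'user-agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=kv)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding
        return r.text[1000:2000]  # return only a 1000-character slice
    except requests.RequestException:
        return 'An exception occurred'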

if __name__ == '__main__':
    url = 'https://www.amazon.cn/gp/product/B01MTMZYBE/ref=s9_acss_bw_cg_Kindle_11a1_w?pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-1&pf_rd_r=T8Y1JWWVNAA1AM9KE1SZ&pf_rd_t=101&pf_rd_p=ac9fd05e-c480-475b-a825-83c445252a6d&pf_rd_i=1991234071'
    print(getHtmlText(url))
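
The framework relies on `raise_for_status()` to turn an error status into an exception, which is why a 503 response like the first one in this post ends up in the `except` branch. The sketch below illustrates that behavior without any network access by building a `requests.Response` by hand; the helper name `status_error` is hypothetical and exists only for this illustration.

```python
import requests


def status_error(status_code):
    """Return the HTTPError raised for status_code, or None if none is raised."""
    resp = requests.Response()      # built by hand, so nothing is sent
    resp.status_code = status_code
    try:
        resp.raise_for_status()     # raises HTTPError for 4xx and 5xx codes
    except requests.HTTPError as exc:
        return exc
    return None
```

Here `status_error(200)` returns `None`, while `status_error(503)` returns a `requests.HTTPError`, which is a subclass of `requests.RequestException`.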
Original article: https://www.cnblogs.com/tingtin/p/12904620.html