The difference between scrapy.FormRequest and FormRequest.from_response

This post draws on GitHub, on my own tests and notes, and on https://blog.csdn.net/qq_43546676/article/details/89043445.

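1. scrapy.FormRequest: suited to the following three cases

(1) No POST or login is needed and the content is fetched with GET: use it directly. Without formdata it behaves like a plain scrapy.Request.

(2) A login is needed, but there is no login form (no page with username and password inputs).

(3) A POST is needed, but there is no form; the POST is submitted via Ajax.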

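2. FormRequest.from_response

Suited to the following cases:

(1) Submitting a form that has visible input fields and is used to POST data.

(2) The scenario the official documentation specifically recommends it for: login pages. It fits best when loading the login page (GET) and submitting the username and password (POST) use the same URL.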

    <form action="/login" method="post" accept-charset="utf-8" >
        <input type="hidden" name="csrf_token" value="zrfpdvAFSoVQGYHsLRtBgXKZuDENhbqwOkCmMnTeIWJUlxaijycP"/>
        <div class="row">
            <div class="form-group col-xs-3">
                <label for="username">Username</label>
                <input type="text" class="form-control" id="username" name="username" />
            </div>
        </div>
        <div class="row">
            <div class="form-group col-xs-3">
                <label for="username">Password</label>
                <input type="password" class="form-control" id="password" name="password" />
            </div>
        </div>
        <input type="submit" value="Login" class="btn btn-primary" />        
    </form>

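As the example above shows, besides the visible username and password inputs, some login pages also carry hidden fields that serve as a CSRF defence. When a spider simulates the login, it has to submit the hidden input data along with the credentials.

(1) With scrapy.FormRequest, you have to scrape the csrf_token value first and then submit csrf_token, username, and password together, which is more work.

(2) With FormRequest.from_response, you can ignore csrf_token: from_response picks it up automatically and submits it together with the username and password.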
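To show what "picks it up automatically" amounts to, here is a rough conceptual sketch using only the Python standard library. It is not Scrapy's implementation; FormFieldCollector and build_formdata are hypothetical helpers written for this post, and the token value is a placeholder. The idea is simply: read every named input in the form, then let the caller's values override the matching fields.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collects name/value pairs from <input> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attr = dict(attrs)
        name = attr.get('name')
        # Inputs without a name (such as the submit button here) are skipped.
        if name:
            self.fields[name] = attr.get('value') or ''


def build_formdata(html, overrides):
    """Return the form's pre-populated fields with overrides applied on top."""
    collector = FormFieldCollector()
    collector.feed(html)
    data = dict(collector.fields)
    data.update(overrides)
    return data


# Trimmed copy of the login form; the token value is a placeholder.
LOGIN_HTML = """
<form action="/login" method="post">
    <input type="hidden" name="csrf_token" value="abc123"/>
    <input type="text" name="username"/>
    <input type="password" name="password"/>
    <input type="submit" value="Login"/>
</form>
"""

formdata = build_formdata(LOGIN_HTML, {'username': 'eoin', 'password': 'eoin'})
print(formdata)
```

The hidden csrf_token ends up in the submitted data even though the caller only supplied the username and password.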

Here is the explanation from the official documentation:

https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-userlogin

It is usual for web sites to provide pre-populated form fields through <input type="hidden"> elements,
such as session related data or authentication tokens (for login pages). When scraping,
you’ll want these fields to be automatically pre-populated and only override a couple of them, such as the user name and password.
You can use the FormRequest.from_response() method for this job.

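In short: FormRequest.from_response() fills in the hidden fields automatically, so you only need to supply the username and password before submitting.
Compare the following two ways of submitting: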

    import scrapy
    from scrapy.http import FormRequest


    class LoginSpider(scrapy.Spider):
        name = 'login2'
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/login']

        def parse(self, response):
            # Approach 1: from_response copies the hidden csrf_token from the
            # form and only overrides the fields given in formdata.
            return FormRequest.from_response(
                response,
                formdata={'username': 'eoin', 'password': 'eoin'},
                callback=self.parse_after_login,
            )

        def parse_bak(self, response):
            # Approach 2 (rename to parse to try it): extract csrf_token by
            # hand and post it together with the username and password.
            token = response.xpath('//*[@name="csrf_token"]/@value').get()
            yield FormRequest(
                'http://quotes.toscrape.com/login',
                formdata={
                    'csrf_token': token,
                    'username': 'eoin',
                    'password': 'eoin',
                },
                callback=self.parse_after_login,
            )

        def parse_after_login(self, response):
            print('Finished!')
            # The check assumes a logout link is only shown after a successful login.
            if response.xpath('//a[@href="/logout"]'):
                self.log(response.xpath('//a[@href="/logout"]/text()').get())
                self.log("you managed to login yipee!!")
                print('Login succeeded!')



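Of course, some sites, GitHub for example, use different URLs for the login page (GET) and the form submission (POST). from_response takes the POST target from the form's action attribute, resolved against the page URL, so it only finds the right address when that attribute is present and correct. If the POST address cannot be identified correctly that way, you have to use FormRequest and set the address yourself.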

GitHub:

    self.login_url = 'https://github.com/login'
    self.post_url = 'https://github.com/session'
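As a small illustration of how two such addresses relate: an HTML form's POST target is its action attribute resolved against the URL of the page that contains it. The sketch below uses only urljoin from the standard library. The '/session' action value is an assumption made for illustration, not something read from GitHub's page, so inspect the real form before relying on it.

```python
from urllib.parse import urljoin

# URL of the page that contains the login form.
login_url = 'https://github.com/login'

# Assumed action value for illustration; check the real page to confirm it.
form_action = '/session'

# Resolve the relative action against the page URL to get the POST target.
post_url = urljoin(login_url, form_action)
print(post_url)
```

When the target cannot be derived from the form like this, for example because a script sets it, passing the POST address explicitly to FormRequest is the fallback.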

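To sum up:

(1) FormRequest.from_response is the simpler option. You can still pass formdata to fill in and submit the form and so simulate a login; in effect, it works out the POST request for you.

(2) scrapy.FormRequest is more flexible. If FormRequest.from_response cannot handle a simulated login, use scrapy.FormRequest instead: the POST target URL is set by hand, which is more precise than automatic detection.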