Generator testing

IP proxy

Proxy IPs are stored in the clawer database, in the smart_proxy_proxyip table.

select count(*) from smart_proxy_proxyip [where is_valid=1];
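The bracketed where clause in the query above is optional: without it the query counts every stored proxy, with it only the valid ones. A minimal sketch of the same two counts from Python follows. It uses an in-memory SQLite table as a stand-in for the real clawer database; only the table name and the is_valid column come from these notes, while the other columns, the sample rows, and the count_proxies helper are assumptions made for illustration.

```python
import sqlite3

# Stand-in for the real database: an in-memory SQLite table.
# Only the table name and the is_valid column come from the notes;
# the remaining columns and the sample rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE smart_proxy_proxyip ("
    "id INTEGER PRIMARY KEY, ip_port TEXT, is_valid INTEGER)"
)
conn.executemany(
    "INSERT INTO smart_proxy_proxyip (ip_port, is_valid) VALUES (?, ?)",
    [("10.0.0.1:8080", 1), ("10.0.0.2:3128", 0), ("10.0.0.3:80", 1)],
)


def count_proxies(connection, only_valid=False):
    """Count all proxy rows, or only the rows with is_valid = 1."""
    sql = "SELECT COUNT(*) FROM smart_proxy_proxyip"
    if only_valid:
        sql += " WHERE is_valid = 1"
    return connection.execute(sql).fetchone()[0]


total = count_proxies(conn)
valid = count_proxies(conn, only_valid=True)
print("total=%d valid=%d" % (total, valid))
```

Comparing the two numbers before and after a crawl or polling run gives a quick check that the steps below actually changed the table.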

1. Crawl proxy IPs

from smart_proxy.cramer_proxy_ip import Cramer

cramer=Cramer()

cramer.run()

2. Poll the proxy IPs

import smart_proxy.round_proxy_ip

smart_proxy.round_proxy_ip.run()

3. Verify the proxy IP request mechanism

from smart_proxy.api import Proxy

proxy=Proxy()

# No arguments

print proxy.get_proxy()

# Specify the number of proxies to request and the region

print proxy.get_proxy(num=5,province='Beijing')

# Fetch invalid IPs

print proxy.get_proxy(is_valid=False)
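When checking the three calls above, it can help to have an independent model of the expected filtering. The sketch below is not the smart_proxy implementation. It is a hypothetical in-memory filter that takes the same keyword names (num, province, is_valid); the record layout, the default values, and the select_proxies name are all assumptions.

```python
# Hypothetical records; the real data lives in the smart_proxy_proxyip table.
PROXIES = [
    {"ip_port": "10.0.0.1:8080", "province": "Beijing", "is_valid": True},
    {"ip_port": "10.0.0.2:3128", "province": "Beijing", "is_valid": False},
    {"ip_port": "10.0.0.3:80", "province": "Shanghai", "is_valid": True},
    {"ip_port": "10.0.0.4:8888", "province": "Beijing", "is_valid": True},
]


def select_proxies(records, num=1, province=None, is_valid=True):
    """Return at most num records matching the province and validity filters.

    The defaults (num=1, is_valid=True) are assumptions for this sketch.
    """
    matches = [
        record for record in records
        if record["is_valid"] == is_valid
        and (province is None or record["province"] == province)
    ]
    return matches[:num]


print(select_proxies(PROXIES))
print(select_proxies(PROXIES, num=5, province="Beijing"))
print(select_proxies(PROXIES, is_valid=False))
```

The points worth checking against the real API are the same ones the sketch makes explicit: no more than num results, every result in the requested province, and every result carrying the requested validity flag.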

4. Verify the IP validity check

import smart_proxy.round_proxy_ip

print smart_proxy.round_proxy_ip.change_valid(10)  # the argument is the id of the proxy IP

Generator

Before running git pull

git add clawer/settings_local.py

git add ../confs/dev/run_local.sh

git commit -m "save settings_local and run_local"

git pull

Preparation:

1. Edit the following variables in /confs/dev/run_local.sh:

WORKDIR=~/Projects/cr-clawer/clawer

PY=~/Projects/env/bin/python

RQWORKER=~/Projects/env/bin/rqworker

2. Edit cr-clawer/clawer/clawer/settings_local.py:

PYTHON="/home/princetechs/Projects/env/bin/python"

3. Start the Redis server

Install Redis:

yum install epel-release

yum install redis # install Redis

service redis start # start Redis

redis-cli # check that Redis started correctly

exit # leave redis-cli

After the change:

redis-server # start Redis

redis-cli # start the client
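The redis-cli check can also be scripted so that it runs before the worker is started. The sketch below is a rough equivalent that needs no Redis client library: it sends an inline PING over a plain socket and expects the +PONG reply. The host, the default port 6379, and the function names are assumptions; a server that requires a password answers with an error instead, which this sketch simply reports as "not up".

```python
import socket


def is_pong(reply):
    """True if the raw reply bytes are the Redis answer to PING."""
    return reply.strip() == b"+PONG"


def redis_is_up(host="127.0.0.1", port=6379, timeout=1.0):
    """Send an inline PING to Redis; return False if it cannot be reached."""
    sock = None
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.sendall(b"PING\r\n")
        return is_pong(sock.recv(64))
    except (socket.error, socket.timeout):
        # Connection refused, timed out, or reset: treat as "not running".
        return False
    finally:
        if sock is not None:
            sock.close()


if __name__ == "__main__":
    print("redis up: %s" % redis_is_up())
```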

4. Start the worker

Run the /confs/dev/run_local.sh script:

./run_local.sh rq

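5. Start the MongoDB server:

Enter the MongoDB directory: cd ~/mongodb

On first use, create the set and log directories (mkdir set; mkdir log), then create the startup configuration file: vim mongo.conf

Enter the following in mongo.conf: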

port=27017

dbpath=set/

logpath=log/mongo.log

logappend=true

Command to start mongod:

./bin/mongod -f mongo.conf
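The first-time setup above (two directories plus mongo.conf) can be scripted so it is repeatable on a fresh machine. The sketch below writes exactly the four settings listed in these notes; the prepare_mongo_dir name is hypothetical, and the demo runs in a temporary directory rather than ~/mongodb so that it has no side effects.

```python
import os
import tempfile

# The same four settings listed in the notes above.
MONGO_CONF = """port=27017
dbpath=set/
logpath=log/mongo.log
logappend=true
"""


def prepare_mongo_dir(base_dir):
    """Create set/ and log/ under base_dir and write mongo.conf there."""
    for name in ("set", "log"):
        path = os.path.join(base_dir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
    conf_path = os.path.join(base_dir, "mongo.conf")
    with open(conf_path, "w") as handle:
        handle.write(MONGO_CONF)
    return conf_path


# Demo in a throwaway directory; on the real machine base_dir would be ~/mongodb.
demo_dir = tempfile.mkdtemp()
conf_path = prepare_mongo_dir(demo_dir)
print(conf_path)
```

Because dbpath and logpath are relative, mongod has to be started from the directory that contains mongo.conf, as in the command above.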

Activate the Python virtual environment every time before using Python:

source ~/Projects/env/bin/activate # activate the virtual environment

deactivate # leave the virtual environment

Steps for invoking the generator:

1. Create a job

cd ~/Projects/cr-clawer/clawer # enter the directory that contains manage.py

python manage.py makemigrations

python manage.py test collector.tests.test_generator.TestMongodb.test_job_save

cd ~/mongodb/bin/

./mongo # open the mongo client

show dbs

use source

db.getCollectionNames()

db.job.find().pretty() # show the contents and structure of the job collection

2. Data preprocessing

python manage.py shell

from collector.utils_generator import DataPreprocess

dp = DataPreprocess('job_id_string')

1. Enter the URI directly and specify the scheme:

schemes=['http']

dp.save(text='http://www.baidu.com',settings={'schemes':schemes})

# In the mongo client, check that the task was added correctly

db.getCollectionNames()

db.crawler_task.find().pretty()
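To decide what to expect in crawler_task after a direct-entry save, it helps to state the assumed behavior of the schemes setting explicitly. The sketch below assumes that schemes acts as a whitelist applied to each line of the input text; that is a guess about DataPreprocess, not a description of its code, and filter_by_scheme is a hypothetical helper.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2


def filter_by_scheme(text, schemes):
    """Split text into lines and keep the URIs whose scheme is allowed."""
    allowed = set(scheme.lower() for scheme in schemes)
    kept = []
    for line in text.splitlines():
        uri = line.strip()
        if uri and urlsplit(uri).scheme.lower() in allowed:
            kept.append(uri)
    return kept


text = "http://www.baidu.com\nftp://example.com/file\n\nhttps://example.org"
print(filter_by_scheme(text, ["http"]))
```

If the real preprocessing behaves differently (for example, by storing rejected URIs with an error flag), the test case should record that difference rather than rely on this model.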

2. Provide a Python script that prints the URI as JSON, together with a cron schedule:

script="""import json
print json.dumps({'uri':"http://www.baidu.com",'sdf':"sdfdf"})"""

cron="*/3 * * * *"

dp.save(script=script,settings={'cron':cron,'code_type':1})

script="""import json
print json.dumps({'uri':"http://www.newbaidu.com"})"""

cron="* * * * *"

dp.save(script=script,settings={'cron':cron,'code_type':1})

crontab.task_generator_install()  # crontab is the CrawlerCronTab instance created in step 3

# Shell script

script = """#!/bin/bash
echo "{'uri':'http://www.shell.com'}" """

cron = "* * * * *"

dp.save(script=script,settings={'cron':cron,'code_type':2})
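Before saving a generator script with dp.save, it is worth running it once by hand to confirm that its output parses. The sketch below assumes that a generator script prints one JSON object per line, each with a uri key, as the Python examples above do. How the project itself executes stored scripts is not shown in these notes, so run_generator_script is a hypothetical test helper; it is written with the print function so that it runs on Python 3 as well.

```python
import json
import os
import subprocess
import sys
import tempfile


def run_generator_script(script_text):
    """Run a Python generator script and parse each output line as JSON."""
    handle, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(handle, "w") as script_file:
            script_file.write(script_text)
        # Use the current interpreter so the virtual environment is respected.
        output = subprocess.check_output([sys.executable, path])
    finally:
        os.remove(path)
    lines = output.decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


script = 'import json\nprint(json.dumps({"uri": "http://www.baidu.com"}))\n'
results = run_generator_script(script)
print(results)
```

Note that json.loads requires double-quoted keys and strings, so output written with single quotes, as in the shell example above, would not pass this particular check.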

Import from a txt or csv file:

file_object=open('/home/princetechs/桌面/filename')  # 桌面 is the Desktop directory; replace filename with the actual file name

all_the_text=file_object.read()

dp.save(text=all_the_text,settings={'schemes':schemes})
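For the txt and csv cases it can be useful to normalize the file contents before passing them as text. The sketch below (Python 3) assumes one URI per line for txt files and the URI in the first column for csv files; both layouts and both helper names are assumptions, since the notes do not specify the file format.

```python
import csv
import io


def uris_from_txt(text):
    """One URI per line; blank lines and surrounding whitespace are dropped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def uris_from_csv(text, column=0):
    """Take the URI from one column of each CSV row (column 0 is an assumption)."""
    reader = csv.reader(io.StringIO(text))
    return [
        row[column].strip()
        for row in reader
        if len(row) > column and row[column].strip()
    ]


txt_text = "http://www.baidu.com\n\n  http://example.com  \n"
csv_text = "http://www.baidu.com,search\nhttp://example.com,demo\n"

print(uris_from_txt(txt_text))
print(uris_from_csv(csv_text))
```

The resulting list can be joined with newlines and passed as the text argument. Reading the file inside a with open(...) block also ensures the file handle is closed, which the snippet above leaves open.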

Note: test each input variant listed in the test cases one by one: manual entry, txt import, csv import, and Python script upload.

3. Update the generator scripts

Create a new file named testFilename under /cr-clawer/confs/dev/.

from collector.utils_generator import CrawlerCronTab

filename="/home/princetechs/Projects/cr-clawer/confs/dev/testFilename"

crontab = CrawlerCronTab(filename= filename)

# filename is a string: the path of the file that crontab information is read from or saved to.

# Periodically update the jobs and generator scripts

crontab.task_generator_install()

4. Run the task dispatch command from the crontab

crontab.task_generator_run()
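To judge whether tasks are dispatched at the right times, recall what the cron strings used in step 2 mean: the first field is the minute, so "*/3 * * * *" fires every third minute and "* * * * *" fires every minute. The sketch below expands only the minute field and supports only the forms '*', '*/n', 'a', 'a-b', 'a-b/n', and comma lists. It is a hand-written illustration, not the parser used by CrawlerCronTab.

```python
def minute_matches(field, minute):
    """Check a cron minute field ('*', '*/n', 'a', 'a-b', 'a-b/n', comma list)."""
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/", 1)
            step = int(step_text)
        if part == "*":
            low, high = 0, 59
        elif "-" in part:
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
        else:
            low = high = int(part)
        if low <= minute <= high and (minute - low) % step == 0:
            return True
    return False


def minutes_for(cron_expression):
    """List the minutes of an hour at which the expression fires."""
    minute_field = cron_expression.split()[0]
    return [m for m in range(60) if minute_matches(minute_field, m)]


print(minutes_for("*/3 * * * *")[:5])
print(len(minutes_for("* * * * *")))
```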

5. 4000 jobs with generator scripts:

python manage.py test collector.tests.test_generator.TestPreprocess.insert_4000_jobs_with_generators

Note: path of the implementation code

~/Projects/cr-clawer/clawer/collector/utils_generator.py

Reference for input method setup:

http://blog.csdn.net/alex_my/article/details/38223449

Downloader: the downloader crawls the URI enterprice://重庆/重庆钢铁集团电视台/50010410003471
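The custom URI above can be taken apart with the standard library, which is handy when writing downloader test cases. The sketch below (Python 3) only splits the URI into its scheme, the part after the double slash, and the remaining path segments. Reading those parts as region, company name, and registration number is an interpretation of this one example, and parse_enterprise_uri is a hypothetical helper.

```python
from urllib.parse import urlsplit


def parse_enterprise_uri(uri):
    """Split an enterprice:// URI into scheme, region, and path segments.

    The meaning assigned to each part is an assumption based on one example.
    """
    parts = urlsplit(uri)
    segments = [segment for segment in parts.path.split("/") if segment]
    return {"scheme": parts.scheme, "region": parts.netloc, "segments": segments}


info = parse_enterprise_uri("enterprice://重庆/重庆钢铁集团电视台/50010410003471")
print(info["scheme"], info["region"], info["segments"])
```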