Exporting Nested JSON with Scrapy

How can Scrapy export JSON with a structure like the following?

[
{
	"pingPai": ["ALPINA"],
	"carTypes": [{
		"carType": ["ALPINA"],
		"carNames": {
			"carName": ["ALPINA B4",
			"ALPINA B3",
			"ALPINA D5",
			"ALPINA B7",
			"ALPINA XD3"]
		}
	}],
	"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M05/AB/2E/100x100_f40_autohomecar__wKgHHls8hiKADrqGAABK67H4HUI503.png"
},
{
	"pingPai": ["ABT"],
	"carTypes": [{
		"carType": ["ABT"],
		"carNames": {
			"carName": ["ABT A3",
			"ABT A5",
			"ABT TT"]
		}
	}],
	"picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M07/B0/47/100x100_f40_autohomecar__wKgHPls9vLOAHILAAAAWGGhA_W0282.png"
}    
]

Core idea

Exporting nested JSON from Scrapy is essentially a matter of manipulating lists.

Background

  • To merge several Python lists, simply use "+"
list1 = [1,2,3]
list2 = [5,6,7]
print(list1+list2)
# Output:
# [1, 2, 3, 5, 6, 7]
  • To merge several Python dicts, use "update()" (other methods exist; only update is covered here, and it is not actually used in this article)
dic1 = {'a':'1','b':'2'}
dic2 = {'c':'3','d':'4'}
dic1.update(dic2)
print(dic1)
# Output:
# {'a': '1', 'b': '2', 'c': '3', 'd': '4'}
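  • The result of an xpath match is actually a list
    For example, if xpath matches one row of data, the result is "[X]", where X is the matched value
    If xpath matches several rows of data, the result is "[X,Y,Z....]"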
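
The list-shaped results described in the last bullet can be illustrated without Scrapy. The sketch below is only a stand-in: it uses the standard library's xml.etree.ElementTree (which supports a small XPath subset) rather than Scrapy selectors, and the HTML fragment is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for part of a crawled page.
html = "<ul><li>ABT A3</li><li>ABT A5</li><li>ABT TT</li></ul>"
root = ET.fromstring(html)

# Several matches: a list with several values.
many = [li.text for li in root.findall("./li")]
# A single match: still a list, with one value.
one = [li.text for li in root.findall("./li[1]")]
# No match: an empty list, not None.
none = [li.text for li in root.findall("./p")]

print(many)
print(one)
print(none)
```

In Scrapy itself, the analogous list of strings comes from calling .extract() on the result of .xpath().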

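Solution

Looking at the nested JSON above, the first-level nodes are the three fields "pingPai", "carTypes" and "picUrl". Because of how Scrapy items are declared in items.py, we only need to define these three first-level nodes.
Open items.py and add the following code: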

import scrapy


class CarModelItem(scrapy.Item):
    pingPai = scrapy.Field()  # brand
    carTypes = scrapy.Field()  # car types
    picUrl = scrapy.Field()  # brand image

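Note

You can also skip defining items.py and serialize directly with json.dumps in the exporting pipelines.py file. (Serialization is the process of converting an object into a byte sequence; deserialization is the process of restoring the object from that byte sequence. json.dumps converts a Python data structure to JSON, and json.loads converts a JSON-encoded string back into a Python data structure.) The principle is the same, only with explicit serialization, so it is not discussed in this article.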
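
For readers curious about that alternative, here is a minimal sketch of what such a pipeline might look like. The class name JsonWriterPipeline and the output path cars.json are hypothetical; the method names open_spider, process_item and close_spider follow Scrapy's item pipeline interface, and the class would still have to be enabled through ITEM_PIPELINES in settings.py.

```python
import json


class JsonWriterPipeline:
    """Hypothetical pipeline: collects items, writes one JSON array on close."""

    def __init__(self, path="cars.json"):
        self.path = path  # assumed output path, not from the article
        self.items = []

    def open_spider(self, spider):
        # Start with an empty buffer for each crawl.
        self.items = []

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and plain dicts alike.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # ensure_ascii=False keeps Chinese brand names readable in the file.
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.items, f, ensure_ascii=False, indent=2)
```

Buffering every item in memory is acceptable for a small crawl like this one; a larger crawl would probably write items incrementally.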

To produce nested JSON, we only need to add another list inside the carTypes list (a list whose elements are dicts, "{}").
For example, suppose we want to crawl the page at "https://www.autohome.com.cn/grade/carhtml/A.html".
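
As a rough illustration of that idea, independent of Scrapy, the sketch below builds one record by hand and serializes it. The values are taken from the ABT entry of the target JSON, except for picUrl, which is a placeholder.

```python
import json

# carTypes is a list; each element is a dict with an initially empty "carNames".
car_types = []
car_types += [{"carType": ["ABT"], "carNames": {}}]

# Fill in the nested dict once the model names are known.
car_types[0]["carNames"] = {"carName": ["ABT A3", "ABT A5", "ABT TT"]}

record = {
    "pingPai": ["ABT"],
    "carTypes": car_types,
    "picUrl": "https://example.com/abt.png",  # placeholder URL
}

# A list of such records serializes to the nested JSON shape shown earlier.
text = json.dumps([record], ensure_ascii=False, indent=2)
print(text)
```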

In the page source, the image and the first-level node are inside "dl/dt".

The second-level and third-level nodes are inside "dl/dd".

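This is the tricky part: to meet the requirement, the data structure generally has to be a list nested inside a list, with dicts as its elements.
Here is the code: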

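import scrapy
from urllib import parse

# Assumed import path (typical Scrapy project layout); adjust to your project.
from ..items import CarModelItem


class GetcarmodelSpider(scrapy.Spider):
    name = "GetCarModel"
    allowed_domains = ["www.autohome.com.cn"]
    # Only "A" is crawled here; uncomment the other letters to crawl more pages.
    chars = [
        "A",
        # "B", "C", "D", "F", "G", "H", "J", "K", "L", "M", "N",
        # "O", "P", "Q", "R", "S", "T", "W", "X", "Y", "Z",
    ]
    start_urls = [
        "https://www.autohome.com.cn/grade/carhtml/%s.html" % i2 for i2 in chars
    ]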

    def parse(self, response):
        dtArray = response.xpath("//dl[@id]")
        for dt in dtArray:
            pingPai = dt.xpath("./dt/div/a/text()").extract()
            pingPaiPicArr = dt.xpath("./dt/a/img/@src").extract()
            pingPaiPic = ""
            # There is actually only one image here
            for cti in pingPaiPicArr:
                # carTypeImg = "http://" + cti[2:]
                pingPaiPic = parse.urljoin(response.url, cti)

            carTypesTemp = dt.xpath("./dd/div[@class='h3-tit']")
            carTypes = []
            for pp in carTypesTemp:
                print(">>>>>>>>>>>>>>>>>>>>>>", pp.xpath("./a/text()").extract())
                carTypes += [
                    {"carType": pp.xpath("./a/text()").extract(), "carNames": {}}
                ]

            # Get the individual model names
            carNameArray = dt.xpath("./dd/ul[@class='rank-list-ul']")
            carNames = []
            for cn in carNameArray:
                # Build a list of dicts directly, so the value for the X-th iteration is carNames[X]
                carNames += [{"carName": cn.xpath("./li/h4/a/text()").extract()}]
                print(".......", [{"carName": cn.xpath("./li/h4/a/text()").extract()}])

            for i in range(len(carTypes)):
                try:
                    carTypes[i]["carNames"] = carNames[i]
                except Exception as e:
                    print(e)


            print("pingPai:", pingPai)
            print("pingPaiPic:", pingPaiPic)
            print("carTypes:", carTypes)
            print("carNames:", carNames)

            carModel = CarModelItem()
            carModel["pingPai"] = pingPai
            carModel["carTypes"] = carTypes
            carModel["picUrl"] = pingPaiPic
            yield carModel
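
The step that attaches each carNames entry to its carTypes entry by index can also be written without try/except. The helper below is a sketch only: the function name pair_types_with_names is hypothetical, and it assumes the two lists line up by position, as the spider above does.

```python
def pair_types_with_names(car_types, car_names):
    """Attach car_names[i] to car_types[i]; missing names become {}."""
    paired = []
    for i, car_type in enumerate(car_types):
        # Fall back to an empty dict when there is no matching names entry,
        # which mirrors the spider leaving "carNames" as {} in that case.
        names = car_names[i] if i < len(car_names) else {}
        paired.append({"carType": car_type, "carNames": names})
    return paired


types = [["一汽-大众奥迪"], ["Audi Sport"]]
names = [{"carName": ["奥迪A3", "奥迪A4L"]}]
result = pair_types_with_names(types, names)
print(result)
```

The explicit length check makes the mismatch case visible instead of hiding it behind a caught exception.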

Note: how to export JSON is not explained here; plenty of guides are available online.
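
For reference, one common approach is Scrapy's built-in feed export, configured in settings.py. The fragment below is a sketch under assumptions: the output file name cars.json is made up, and the FEEDS setting requires a reasonably recent Scrapy version.

```python
# settings.py (sketch)

# Write non-ASCII text (such as Chinese brand names) as-is instead of \uXXXX escapes.
FEED_EXPORT_ENCODING = "utf-8"

# Export all scraped items to a single JSON file; "cars.json" is an assumed name.
FEEDS = {
    "cars.json": {"format": "json", "indent": 1},
}
```

Alternatively, the output file can be given on the command line, for example: scrapy crawl GetCarModel -o cars.json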
Running the code produces the following exported JSON file:

[{
	"pingPai": ["奥迪"],
	"carTypes": [{
		"carType": ["一汽-大众奥迪"],
		"carNames": {
			"carName": ["奥迪Q2L新能源",
			"奥迪A3",
			"奥迪A4L",
			"奥迪A6L",
			"奥迪Q2L",
			"奥迪Q3",
			"奥迪Q5L",
			"奥迪A6L新能源",
			"奥迪Q4",
			"奥迪A4",
			"奥迪A6",
			"奥迪Q5"]
		}
	},
	{
		"carType": ["Audi Sport"],
		"carNames": {
			"carName": ["奥迪RS 3",
			"奥迪RS 4",
			"奥迪RS 5",
			"奥迪RS 6",
			"奥迪RS 7",
			"奥迪R8",
			"奥迪TT RS",
			"奥迪RS Q3",
			"奥迪RSQ e-tron"]
		}
	},
	{
		"carType": ["奥迪(进口)"],
		"carNames": {
			"carName": ["奥迪e-tron",
			"奥迪A3(进口)",
			"奥迪S3",
			"奥迪A4(进口)",
			"奥迪A5",
			"奥迪S4",
			"奥迪S5",
			"奥迪A6(进口)",
			"奥迪S6",
			"奥迪A7",
			"奥迪S7",
			"奥迪A8",
			"奥迪Q7",
			"奥迪Q7新能源",
			"奥迪TT",
			"奥迪TTS",
			"奥迪A0",
			"奥迪A1",
			"奥迪S1",
			"e-tron Concept",
			"奥迪AI:ME",
			"奥迪A6新能源(进口)",
			"奥迪A7新能源",
			"奥迪Aicon",
			"奥迪e-tron GT",
			"Prologue",
			"奥迪A8新能源",
			"奥迪A9",
			"奥迪S8",
			"allroad",
			"奥迪Q2",
			"奥迪SQ2",
			"奥迪Q3(进口)",
			"奥迪Q4(进口)",
			"奥迪Q4新能源(进口)",
			"奥迪TT offroad",
			"h-tron quattro",
			"奥迪Elaine",
			"奥迪Q5(进口)",
			"奥迪Q5新能源(进口)",
			"奥迪SQ5",
			"奥迪Q8",
			"奥迪SQ7",
			"奥迪Q9",
			"e-tron Vision Gran Turismo",
			"quattro",
			"奥迪PB18",
			"奥迪R18",
			"奥迪Urban",
			"奥迪A2",
			"奥迪80",
			"奥迪A3新能源(进口)",
			"奥迪Coupe",
			"奥迪100",
			"Crosslane Coupe",
			"奥迪Cross",
			"Nanuk"]
		}
	}],
	"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"
},
{
	"pingPai": ["阿斯顿·马丁"],
	"carTypes": [{
		"carType": ["阿斯顿·马丁"],
		"carNames": {
			"carName": ["Rapide",
			"V8 Vantage",
			"Vanquish",
			"阿斯顿·马丁DB11",
			"阿斯顿·马丁DBS",
			"Cygnet",
			"Rapide E",
			"阿斯顿·马丁DBX",
			"V12 Vantage",
			"阿斯顿·马丁DB9",
			"AM-RB 003",
			"Heritage EV",
			"Virage",
			"Vulcan",
			"阿斯顿·马丁CC100",
			"阿斯顿·马丁DB10",
			"阿斯顿·马丁DB5",
			"阿斯顿·马丁DP-100",
			"战神",
			"拉共达Taraf",
			"Ulster",
			"V12 Zagato",
			"阿斯顿·马丁DB6",
			"阿斯顿·马丁One-77"]
		}
	}],
	"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M06/AE/B5/100x100_f40_autohomecar__wKgHEVs9u6GAPWN8AAAYsmBsCWs847.png"
},
{
	"pingPai": ["AC Schnitzer"],
	"carTypes": [{
		"carType": ["AC Schnitzer"],
		"carNames": {
			"carName": ["AC Schnitzer 3系",
			"AC Schnitzer M4",
			"AC Schnitzer 7系",
			"AC Schnitzer X6",
			"AC Schnitzer X5"]
		}
	}],
	"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M01/B0/62/100x100_f40_autohomecar__ChcCQFs9vBKAO3YSAAAW0WOWvRc555.png"
},
{
	"pingPai": ["安凯客车"],
	"carTypes": [{
		"carType": ["安凯客车"],
		"carNames": {
			"carName": ["宝斯通"]
		}
	}],
	"picUrl": "https://car2.autoimg.cn/cardfs/series/g29/M00/AB/C8/100x100_f40_autohomecar__ChcCSFs8riCAYVA2AAApQLgf8a0969.png"
},
{
	"pingPai": ["阿尔法·罗密欧"],
	"carTypes": [{
		"carType": ["阿尔法·罗密欧"],
		"carNames": {
			"carName": ["Giulia",
			"Stelvio",
			"MiTo",
			"Giulietta",
			"Tonale",
			"ALFA 4C",
			"Disco Volante",
			"Gloria",
			"ALFA 147",
			"ALFA 156",
			"ALFA 159",
			"ALFA 166",
			"ALFA 2uettottanta",
			"ALFA 8C",
			"ALFA GT",
			"ALFA S.Z.",
			"ALFA TZ3"]
		}
	}],
	"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M05/B0/29/100x100_f40_autohomecar__ChcCP1s9u5qAemANAABON_GMdvI451.png"
},
{
	"pingPai": ["ALPINA"],
	"carTypes": [{
		"carType": ["ALPINA"],
		"carNames": {
			"carName": ["ALPINA B4",
			"ALPINA B3",
			"ALPINA D5",
			"ALPINA B7",
			"ALPINA XD3"]
		}
	}],
	"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M05/AB/2E/100x100_f40_autohomecar__wKgHHls8hiKADrqGAABK67H4HUI503.png"
},
{
	"pingPai": ["ABT"],
	"carTypes": [{
		"carType": ["ABT"],
		"carNames": {
			"carName": ["ABT A3",
			"ABT A5",
			"ABT TT"]
		}
	}],
	"picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M07/B0/47/100x100_f40_autohomecar__wKgHPls9vLOAHILAAAAWGGhA_W0282.png"
},
{
	"pingPai": ["AEV ROBOTICS"],
	"carTypes": [{
		"carType": ["AEV ROBOTICS"],
		"carNames": {
			"carName": ["Modular Vehicle System"]
		}
	}],
	"picUrl": "https://car2.autoimg.cn/cardfs/series/g3/M02/58/D3/autohomecar__ChcCRVw0TJaAM8BmAAAS-7AD7DQ372.png"
},
{
	"pingPai": ["Agile Automotive"],
	"carTypes": [{
		"carType": ["Agile Automotive"],
		"carNames": {
			"carName": ["Agile Automotive SC122",
			"Agile Automotive SCX"]
		}
	}],
	"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M09/AF/8C/100x100_f40_autohomecar__wKgHHVs9r62AIbiYAAAvAsqdpoA594.png"
},
{
	"pingPai": ["Apollo"],
	"carTypes": [{
		"carType": ["Apollo"],
		"carNames": {
			"carName": ["Apollo N",
			"Arrow",
			"Intensa Emozione"]
		}
	}],
	"picUrl": "https://car3.autoimg.cn/cardfs/series/g28/M06/B0/C6/100x100_f40_autohomecar__ChcCR1s90RGASBRgAACz67wh_68723.png"
},
{
	"pingPai": ["Arash"],
	"carTypes": [{
		"carType": ["Arash"],
		"carNames": {
			"carName": ["AF8 Cassini",
			"Arash AF10"]
		}
	}],
	"picUrl": "https://car3.autoimg.cn/cardfs/series/g30/M05/AA/D4/100x100_f40_autohomecar__wKgHHFs8n1CAVhcNAAAV3xEAiDM531.png"
},
{
	"pingPai": ["ARCFOX"],
	"carTypes": [{
		"carType": ["北汽新能源"],
		"carNames": {
			"carName": ["ARCFOX-1",
			"ARCFOX ECF Concept",
			"ARCFOX-7",
			"ARCFOX-GT"]
		}
	}],
	"picUrl": "https://car3.autoimg.cn/cardfs/series/g27/M02/AB/F7/100x100_f40_autohomecar__ChcCQFs8nA6AP-h5AABsvxhHw3E709.png"
},
{
	"pingPai": ["Aria"],
	"carTypes": [{
		"carType": ["Aria"],
		"carNames": {
			"carName": ["Aria FXE"]
		}
	}],
	"picUrl": "https://car3.autoimg.cn/cardfs/series/g28/M0B/B0/0D/100x100_f40_autohomecar__wKgHI1s9r2iAJwIXAAAIBShzq60456.png"
},
{
	"pingPai": ["ATS"],
	"carTypes": [{
		"carType": ["ATS"],
		"carNames": {
			"carName": ["ATS GT"]
		}
	}],
	"picUrl": "https://car2.autoimg.cn/cardfs/series/g26/M08/D7/D3/autohomecar__ChsEe1wYwKmAY2p9AAA1NP0jCHk594.png"
},
{
	"pingPai": ["Aurus"],
	"carTypes": [{
		"carType": ["Aurus"],
		"carNames": {
			"carName": ["Senat"]
		}
	}],
	"picUrl": "https://car2.autoimg.cn/cardfs/series/g27/M07/F3/E1/autohomecar__ChcCQFuN6WiAcztKAAAsLfBmU9g074.png"
},
{
	"pingPai": ["艾康尼克"],
	"carTypes": [{
		"carType": ["艾康尼克ICONIQ Motors"],
		"carNames": {
			"carName": ["MUSE",
			"艾康尼克七系"]
		}
	}],
	"picUrl": "https://car2.autoimg.cn/cardfs/series/g29/M0A/A9/EC/100x100_f40_autohomecar__wKgHG1s8iP6ASbjTAAAOIwskkzo314.png"
},
{
	"pingPai": ["爱驰"],
	"carTypes": [{
		"carType": ["爱驰汽车"],
		"carNames": {
			"carName": ["爱驰U5",
			"爱驰U7",
			"RG Nathalie"]
		}
	}],
	"picUrl": "https://car3.autoimg.cn/cardfs/series/g29/M09/A9/9B/100x100_f40_autohomecar__wKgHG1s8fwqAOp3IAAALEeTkn6c536.png"
}]
Original article: https://www.cnblogs.com/zh672903/p/11018891.html