A Small Python Script for Splitting a Large File into Multiple Smaller Files

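Scenario

Sometimes the data you need to process arrives as a single file, and when the data volume is large the file itself becomes very large.
Once a file reaches 1 GB, 10 GB, or more, simply opening it in vim or another editor is slow.
Sometimes you also want to process it concurrently: split the large file into several smaller files and process them on different nodes at the same time to speed things up.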

Approach

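Source file: split_file.py
Input: python split_file.py <file name> <number of parts>
Output: the split files, e.g. xxx_1.txt xxx_2.txt xxx_3.txt when xxx.txt is split into 3 parts
Steps:

  1. Parse the file name and the number of parts from the command-line arguments
  2. Read the file line by line to count the total number of lines
  3. Compute the number of lines per small file from the total line count and the number of parts, then read the file line by line again and write each line to the current small file
    Note: each step records and prints its elapsed time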
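
The arithmetic in step 3 can be sketched on its own. This is a minimal illustration, not the script itself: the function name `part_sizes` is hypothetical, and it assumes floor division and a file with at least as many lines as parts. Under those assumptions every part except the last gets the same number of lines, and the last part absorbs the remainder.

```python
def part_sizes(total, parts):
    """Return how many lines each output file receives (hypothetical helper)."""
    if parts < 1:
        raise ValueError("parts must be >= 1")
    # Floor division: the size of every part except the last one.
    per_size = total // parts
    sizes = [per_size] * (parts - 1)
    # The last part takes whatever is left over.
    sizes.append(total - per_size * (parts - 1))
    return sizes


print(part_sizes(10, 3))  # [3, 3, 4]
```

Because the remainder goes entirely to the last file, that file can be up to `parts - 1` lines longer than the others.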

Code

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Split a file: cuts a large file into multiple smaller files, e.g.:
# python split_file.py xxx.txt 3
# splits xxx.txt into xxx_1.txt, xxx_2.txt, xxx_3.txt
# author:cdfive
import sys
import os
import time

start_time = time.time()

file_name = sys.argv[1]
file_dir = os.path.abspath('.')
file_path = file_dir + os.path.sep + sys.argv[1]
split_num = int(sys.argv[2])
print ("file_path=%s,split_num=%s" % (file_path, split_num))

total = 0
with open(file_path) as f:
    for line in f:
        total += 1

# Floor division keeps per_size an integer on both Python 2 and Python 3
per_size = total // split_num
print ("total line=%s,per_size=%s,count_cost=%.2fs" % (total, per_size, time.time() - start_time))

file_name_real = os.path.splitext(file_name)[0]
file_name_suffix = os.path.splitext(file_name)[1]
index = 0
per_index = 0
per_file_num = 1
write_flush_size = 10000
write_file_name = file_dir + os.path.sep + file_name_real + "_" + str(per_file_num) + file_name_suffix
w = open(write_file_name, "w")
split_start_time = time.time()
with open(file_path) as f:
    for line in f:
        index += 1
        per_index += 1
        # Can be enabled when running locally; on a server, printing every line of a large file slows things down, so it is commented out
        # print ("%s=>%s" % (index, line)),
        w.write(line)
        if index % write_flush_size == 0:
            w.flush()
        if (per_file_num < split_num and per_index >= per_size) or index >= total:
            w.close()
            print ("split file %s done,file_path=%s,size=%s,cost=%.2fs,total_cost=%.2fs"
                   % (per_file_num, write_file_name, per_index
                      , time.time() - split_start_time, time.time() - start_time))
            per_index = 0
            per_file_num += 1
            write_file_name = file_dir + os.path.sep + file_name_real + "_" + str(per_file_num) + file_name_suffix
            if index < total:
                w = open(write_file_name, "w")
                split_start_time = time.time()

print ("total cost=%.2fs" % (time.time() - start_time))
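
For reuse from other code, the same idea can be wrapped in a function. The following is a sketch under stated assumptions rather than the author's script: the names `count_lines` and `split_file` are hypothetical, files are opened in binary mode so lines are copied byte for byte without decoding, and `itertools.islice` takes a fixed number of lines per part. Unlike the script above, it always creates exactly `parts` files, so some may be empty when the input has fewer lines than parts.

```python
import os
from itertools import islice


def count_lines(path):
    """Count lines by streaming the file, so memory use stays small."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def split_file(path, parts):
    """Split `path` into `parts` files named <stem>_<n><ext> next to it.

    The first parts-1 files get total // parts lines each; the last file
    gets all remaining lines. Returns the list of output paths.
    """
    if parts < 1:
        raise ValueError("parts must be >= 1")
    total = count_lines(path)
    per_size = total // parts
    stem, ext = os.path.splitext(path)
    out_paths = []
    with open(path, "rb") as src:
        for n in range(1, parts + 1):
            out_path = "%s_%d%s" % (stem, n, ext)
            # A limit of None makes islice consume every remaining line.
            limit = per_size if n < parts else None
            with open(out_path, "wb") as dst:
                dst.writelines(islice(src, limit))
            out_paths.append(out_path)
    return out_paths
```

With these definitions, `split_file("xxx.txt", 3)` would write xxx_1.txt, xxx_2.txt, and xxx_3.txt beside the input file and return their paths.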

Original article: https://www.cnblogs.com/cdfive2018/p/14101482.html