Reading Notes 4: Reading and Saving Data

I. Reading from Files

pandas can read several file types: CSV, general delimited text files, Excel files, JSON, HTML tables, HDF5 and Stata.

1. Comma-separated value (CSV) files can be read using read_csv:

>>> from pandas import read_csv
>>> csv_data = read_csv('FTSE_1984_2012.csv')
>>> csv_data = csv_data.values
>>> csv_data[:4]
array([['2012-02-15', 5899.9, 5923.8, 5880.6, 5892.2, 801550000L, 5892.2],
['2012-02-14', 5905.7, 5920.6, 5877.2, 5899.9, 832567200L, 5899.9],
['2012-02-13', 5852.4, 5920.1, 5852.4, 5905.7, 643543000L, 5905.7],
['2012-02-10', 5895.5, 5895.5, 5839.9, 5852.4, 948790200L, 5852.4]], dtype=object)
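
The object dtype in the output above arises because the date column holds strings while the other columns are numeric. A minimal sketch of one way to get a purely numeric array is to parse the date column into the index, so that only numbers remain in the values. The inline CSV text and its column names are made up for illustration; they are not the contents of FTSE_1984_2012.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: a header row plus two data rows.
csv_text = (
    "Date,Open,Close\n"
    "2012-02-15,5899.9,5892.2\n"
    "2012-02-14,5905.7,5899.9\n"
)

# Mixed string/number columns give an object array, as in the session above.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.values.dtype)        # object

# Parsing the dates into the index leaves only float columns in the values.
parsed = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)
print(parsed.values.dtype)     # float64
print(parsed.index[0])         # 2012-02-15 00:00:00
```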

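2. Excel files

Use the read_excel function with two arguments: the file name and the sheet name. By default the first row is taken as the column names, so it does not appear in the data.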

from pandas import read_excel

# The second argument is the sheet name; the first row becomes the column names.
exceldate = read_excel('score.xlsx', 'Sheet1')
exceldate = exceldate.values
print(type(exceldate))
print(exceldate.shape)
exceldate[0, :]

<type 'numpy.ndarray'>
(4L, 7L)

Out[6]:

array([15, 65, 45, 48, 43, 26, 35], dtype=int64)

3. Stata files

>>> from pandas import read_stata
>>> stata_data = read_stata('FTSE_1984_2012.dta')
>>> stata_data = stata_data.values
>>> stata_data[:4,:2]
array([[ 0.00000000e+00, 4.09540000e+04],
[ 1.00000000e+00, 4.09530000e+04],
[ 2.00000000e+00, 4.09520000e+04],
[ 3.00000000e+00, 4.09490000e+04]])

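4. Reading file contents without pandas

Excel files can be read with xlrd: the xlrd module reads Excel files, and the xlwt module writes them. Note that xlrd 2.0 and later read only the legacy .xls format, so opening an .xlsx file as in the code here requires an xlrd version older than 2.0.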

import numpy as np
import xlrd

wb = xlrd.open_workbook('score.xlsx')
sheetnames = wb.sheet_names()
sheet = wb.sheet_by_name(sheetnames[0])
# Collect every row of the sheet, including the first one.
exceldate = []
for i in range(sheet.nrows):
    exceldate.append(sheet.row_values(i))
print('%d rows, %d columns' % (len(exceldate), len(exceldate[0])))

# Copy the first column into a NumPy array.
adate = np.empty(len(exceldate))
for i in range(len(exceldate)):
    adate[i] = exceldate[i][0]
print(adate.shape)
print(adate)


5 rows, 7 columns
(5L,)
[ 12.  15.  51.  65.  45.]
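
The same row-by-row pattern works for CSV files without pandas. A minimal sketch using the standard-library csv module; the file name score_demo.csv and its contents are hypothetical, and the file is written to a temporary directory so that the example is self-contained.

```python
import csv
import os
import tempfile

import numpy as np

# Write a small hypothetical CSV file to read back.
path = os.path.join(tempfile.mkdtemp(), "score_demo.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([[12, 15, 51], [65, 45, 48], [43, 26, 35]])

# Read every row, converting each field from text to float.
rows = []
with open(path, newline="") as f:
    for record in csv.reader(f):
        rows.append([float(field) for field in record])

csv_array = np.array(rows)
print(csv_array.shape)    # (3, 3)
print(csv_array[:, 0])    # first column: 12, 65, 43
```

Unlike read_csv, csv.reader does not treat the first row as a header, so every row ends up in the array.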

II. Saving Data

1. Saving data in NumPy's own npz format

savez_compressed compresses the data as it saves it.

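import numpy as np

x = np.arange(10)
y = np.zeros((100, 100))
# Positional arrays are stored under the automatic names arr_0, arr_1, ...
np.savez_compressed('date1', x, y)
date = np.load('date1.npz')
print(date['arr_0'])

# Keyword arguments are stored under the given names.
np.savez_compressed('date2', x=x, ontherDate=y)
date2 = np.load('date2.npz')
print(date2['x'])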

[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
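
To see which names an npz file contains, the object returned by np.load has a files attribute. A short sketch, assuming a temporary directory for the hypothetical file demo.npz; it mixes one positional array with one keyword array to show both naming rules.

```python
import os
import tempfile

import numpy as np

x = np.arange(10)
y = np.zeros((100, 100))

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez_compressed(path, x, other=y)

# Using np.load as a context manager closes the file when the block ends.
with np.load(path) as archive:
    names = sorted(archive.files)
    x_loaded = archive["arr_0"]
    y_loaded = archive["other"]

print(names)                          # ['arr_0', 'other']
print(np.array_equal(x, x_loaded))    # True
print(y_loaded.shape)                 # (100, 100)
```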

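2. Saving to a CSV file with the np.savetxt function

Note: by default, the read_csv and read_excel functions in pandas treat the first row as the header, so that row is not part of the returned data. This is why the array read back here has 9 rows instead of 10. Passing header=None keeps the first row as data.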

import numpy as np
from pandas import read_csv

x = np.random.randn(10, 10)
np.savetxt('date1.csv', x, delimiter=',')
date = read_csv('date1.csv')   # the first row is consumed as the header
date = date.values

print(x.shape)
print(date.shape)
print(x)
print(date[0])

(10L, 10L)
(9L, 10L)
[[ 1.77015084 -1.80554159  1.28403537  0.2009891   0.26291606  0.08448012
   1.66140115  0.17728159  0.88959083  0.56291309]
 [ 0.58518743  1.44373927  0.54993558  0.01054313  0.59017053 -0.35133822
  -0.42014888 -0.3079049   0.94373013  1.35954942]
 [-0.54426668  0.04622141 -0.66634713  0.45793767 -0.63685413  0.99976971
  -0.39326027 -0.93163258 -0.79656236  0.72966639]
 [-0.39963295 -1.79753906  0.32433359  0.82947734  1.54987769  2.77115954
   0.22080235 -0.60776182  2.57004264  0.59011931]
 [-0.19130441 -0.12465107  1.40619987 -0.61049826 -0.39827838 -1.25752483
  -0.91058091  0.36020845 -0.10908816  1.45316786]
 [ 0.47408008 -0.28463786 -1.92910625 -0.50288128 -0.06007105 -0.12408027
  -0.84164768 -0.42411635  0.69954835 -0.41664136]
 [ 0.42336169  0.23625584  1.11511232 -1.08894244 -0.79186067 -1.71206423
  -0.02372556 -0.71933255 -1.33979181 -0.41698675]
 [-0.06578197  1.04509307  0.1279905   1.03185255  1.15403322 -0.18110707
  -0.60340346 -0.33581049  0.02637558 -1.06997906]
 [-1.84514777  1.19496964 -1.70550266  1.30863094 -1.48711603  1.55044598
   0.64066525  0.39086305  0.15076543  1.42276444]
 [-1.23244051 -0.03354092  0.84729912  0.15254869 -0.33402971 -0.59486921
  -0.28056973 -1.72189462 -0.0156615  -1.22688771]]
[ 0.58518743  1.44373927  0.54993558  0.01054313  0.59017053 -0.35133822
 -0.42014888 -0.3079049   0.94373013  1.35954942]
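
The 9-row result above comes from read_csv consuming the first data row as column names. A minimal sketch of two ways to read the file back without losing that row: passing header=None to read_csv, or using np.loadtxt. The file name roundtrip.csv is hypothetical and lives in a temporary directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

x = np.random.randn(10, 10)
path = os.path.join(tempfile.mkdtemp(), "roundtrip.csv")
np.savetxt(path, x, delimiter=",")

# Default: the first row becomes the header, so one row is lost.
with_header = pd.read_csv(path).values
# header=None: every row is treated as data.
no_header = pd.read_csv(path, header=None).values
# np.loadtxt never assumes a header row.
loaded = np.loadtxt(path, delimiter=",")

print(with_header.shape)           # (9, 10)
print(no_header.shape)             # (10, 10)
print(np.allclose(x, no_header))   # True
print(np.allclose(x, loaded))      # True
```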

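III. Numeric Precision

Every system has finite numeric precision. Python floats are IEEE 754 double-precision numbers, whose machine epsilon is 2.2204 × 10^−16: the gap between 1.0 and the next larger representable number. The precision is relative, so two numbers whose relative difference is well below this value are treated as the same number. The most negative and the largest finite values that can be represented are about −1.7977 × 10^308 and 1.7977 × 10^308.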

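import numpy as np

x1 = 1.0
eps = np.finfo(float).eps
# eps/10 is too small to change 1.0, so the sum rounds back to 1.0.
x2 = x1 + eps / 10
x1 == x2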

Out[4]:
True
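
The role of eps can be checked directly. A short sketch using np.finfo: adding the full eps to 1.0 produces a different number, adding a tenth of it does not, and the same tiny quantity is still distinguishable from zero because the precision is relative rather than absolute.

```python
import numpy as np

info = np.finfo(float)
eps = info.eps

print(eps)                      # 2.220446049250313e-16
print(1.0 + eps == 1.0)         # False: eps is the gap between 1.0 and the next float
print(1.0 + eps / 10 == 1.0)    # True: the increment is lost to rounding
print(eps / 10 == 0.0)          # False: much smaller numbers are representable near zero
print(info.max)                 # 1.7976931348623157e+308

# For practical comparisons, a tolerance is usually safer than ==.
print(0.1 + 0.2 == 0.3)             # False
print(np.isclose(0.1 + 0.2, 0.3))   # True
```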