21/10: structured data in numpy
NumPy supports structured arrays, which are the nearest thing to R's data.frame class. Data are organized into fields and records: each field (column) has a name and a data type, and each record (row) holds one value per field. Columns are indexed by name, and rows are indexed by integers. Record arrays (recarrays) can be built from nested Python iterables using numpy.rec.fromrecords:
>>> D = [('fair',6.0,1), ('good',12,2)]
>>> D = numpy.rec.fromrecords(D, names='quality,price,size')
>>> D
rec.array([('fair', 6.0, 1), ('good', 12.0, 2)],
dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])
>>> D['quality']
rec.array(['fair', 'good'],
dtype='|S4')
>>> D[0]
('fair', 6.0, 1)
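As a rough, self-contained sketch of the same access patterns (assuming a recent NumPy on Python 3, where the string field comes out as unicode rather than '|S4'; the variable names are made up for illustration):

```python
import numpy as np

# Same toy records as in the session above.
records = [("fair", 6.0, 1), ("good", 12, 2)]
D = np.rec.fromrecords(records, names="quality,price,size")

# Columns are selected by field name ...
prices = D["price"]
# ... or, on a recarray, as attributes.
qualities = D.quality
# Caveat: a field called 'size' is shadowed by the ndarray attribute of the
# same name, so D.size is the number of records, not the column.
n_records = D.size

# Rows are selected by integer position; a single row is a record whose
# fields can again be looked up by name.
first = D[0]

# A boolean mask over one column filters rows, much like subsetting in R.
expensive = D[D["price"] > 10]

print(prices, qualities, n_records)
print(first["quality"], expensive["quality"])
```

The attribute shortcut is convenient interactively, but name-based indexing (D['size']) is the safer habit because it never collides with ndarray attributes.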
Note that the 'price' field has a float data type because one of the records has a float value, so the whole field is promoted to the most general data type. For more precise control over field data types, fromrecords() takes a formats argument, which is a comma-delimited list of format strings. For instance, to force 'price' to be an integer, call
>>> D = numpy.rec.fromrecords(D, names='quality,price,size', formats='S4,i4,i4')
For reading and writing recarrays, use matplotlib.mlab.rec2csv() and matplotlib.mlab.csv2rec(). The format of each field can be specified using a dictionary. Both functions take a number of arguments that control how the data are read or written (e.g. the delimiter, or whether the first row is a list of field names), most of which are documented. The rec2csv() function always writes the field names as a header row. To avoid this behavior, or to avoid having a dependency on matplotlib, use numpy.savetxt():
>>> from matplotlib import mlab
>>> formatd = {'quality' : mlab.FormatString(), 'price' : mlab.FormatFloat(2),}
>>> mlab.rec2csv(D, 'test.csv', formatd=formatd)
>>> mlab.csv2rec('test.csv')
rec.array([('fair', 6.0, 1), ('good', 12.0, 2)],
dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])
>>> numpy.savetxt('test.csv', D, delimiter=',', fmt=('%s','%3.2f','%d'))
>>> numpy.loadtxt('test.csv', delimiter=',', dtype={'names': ('quality','price','size'), 'formats' : ('S4', 'f8', 'i4')})
array([('fair', 6.0, 1), ('good', 12.0, 2)],
dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])
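For a NumPy-only round trip that still keeps the header row, something like the following sketch may work. It assumes a recent NumPy on Python 3 and leans on the header and comments arguments of numpy.savetxt() and on numpy.genfromtxt() with names=True; the temporary path and variable names are illustrative only.

```python
import os
import tempfile

import numpy as np

D = np.rec.fromrecords(
    [("fair", 6.0, 1), ("good", 12.0, 2)],
    names="quality,price,size",
)

# Write into a scratch directory so nothing in the working dir is touched.
path = os.path.join(tempfile.mkdtemp(), "test.csv")

# header= writes the field names as the first row; comments="" stops
# savetxt from prefixing that row with "# ".
np.savetxt(
    path,
    D,
    delimiter=",",
    fmt=("%s", "%.2f", "%d"),
    header=",".join(D.dtype.names),
    comments="",
)

# names=True takes the field names from the first row, and dtype=None asks
# genfromtxt to infer a type for each column separately.
loaded = np.genfromtxt(
    path, delimiter=",", names=True, dtype=None, encoding="utf-8"
)

# genfromtxt returns a plain structured array; view it as a recarray to get
# attribute access back.
loaded_rec = loaded.view(np.recarray)
print(loaded_rec)
```

Letting genfromtxt infer the types is handy for exploration; passing an explicit dtype, as in the loadtxt call above, is the more predictable choice when the column types are known in advance.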
ep wrote:
is mlab the fastest read-in fcn around?