23/10: numpy array broadcasting

Category: numpy
Posted by: dmeliza

One of the features I appreciate most about numpy is array broadcasting, probably because I hated having to use repmat() in my MATLAB days. With MATLAB, if you want to add the values in a vector to each column of a matrix, you have to replicate the vector to the same size as the matrix first, which is a colossal waste of memory (MATLAB does support something called lazy evaluation, but I'm not sure it applies to these sorts of expressions). In numpy you can simply add the vector to the matrix; however, the rules are a little tricky. There's an excellent FAQ on broadcasting here - how to specify which dimension should be matched between the two arrays, and when broadcasting is actually more inefficient.

21/10: structured data in numpy

Category: numpy
Posted by: dmeliza
Numpy supports structured arrays, which are the nearest thing to R's data.frame class. Data are organized into fields and records. Each field (column) has a name and data type, and each record (row) has a value for all the fields. Columns are indexed by name, and rows are indexed by integers. Recarray objects can be generated from nested Python iterable objects using numpy.rec.fromrecords:


>>> D = [('fair',6.0,1), ('good',12,2)]
>>> D = numpy.rec.fromrecords(D, names='quality,price,size')
>>> D

rec.array([('fair', 6.0, 1), ('good', 12.0, 2)],
      dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])

>>> D['quality']

rec.array(['fair', 'good'],
      dtype='|S4')

>> D[0]

('fair', 6, 1)
 


Note that the 'price' field has a float data type because one of the records has a float value, and the field is promoted to the most general data type. For more precise control over field data types, fromrecords() takes a format argument, which is a comma-delimited list of format strings. For instance, to force 'price' to be an integer, call D = numpy.rec.fromrecords(D, names='quality,price,size', formats='S4,i4,i4')

For reading and writing recarrays, use matplotlib.mlab.rec2csv() and matplotlib.mlab.csv2rec(). The format of each field can be specified using a dictionary. There are a number of arguments to both functions that can be used to control how the data is read in (e.g. delimiter, is the first row a list of field names, etc), most of which are documented. The rec2csv() function always outputs field names as headers. To avoid this behavior, or to avoid having a dependency on matplotlib, use numpy.savetxt()


>>> from matplotlib import mlab

>>> formatd = {'quality' : mlab.FormatString(), 'price' : mlab.FormatFloat(2),}
>>> mlab.rec2csv(D, 'test.csv', formatd=formatd)
>>> mlab.csv2rec('test.csv')

rec.array([('fair', 6.0, 1), ('good', 12.0, 2)],
      dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])

>>> numpy.savetxt('test.csv', D, delimiter=',', fmt=('%s','%3.2f','%d'))
>>> numpy.loadtxt('test.csv', delimiter=',', dtype={'names': ('quality','price','size'), 'formats' : ('S4', 'f8', 'i4')})

array([('fair', 6.0, 1), ('good', 12.0, 2)],
      dtype=[('quality', '|S4'), ('price', '<f8'), ('size', '<i4')])

 

25/08: accumarray in numpy

Category: numpy
Posted by: dmeliza
A useful and underappreciated MATLAB function is accumarray(), which aggregates a vector of data elements into a multi-dimensional array based on a separate set of index vectors. It's essentially the same as the R function xtabs(), except that the indices in accumarray() have to be numeric. I found myself needing something like this function in python when I was writing an algorithm for calculating reassigned spectrograms (more on this later; I'm working on a little monograph on time-frequency representations of birdsong). I couldn't find anything, and wound up writing the algorithm in weave/blitz -- which I probably would have done anyway to tweak the performance.

Later I discovered that there is an equivalent to accumarray() in scipy: it's the scipy.sparse.coo_matrix() constructor. Here's the example from the docstring:


row  = array([0,0,1,3,1,0,0])
col  = array([0,2,1,3,1,0,0])
data = array([1,1,1,1,1,1,1])
coo_matrix( (data,(row,col)), shape=(4,4)).todense()

matrix([[3, 0, 1, 0],
            [0, 2, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 1]])

 


The main downside to this method is that it only works with 2D arrays. I also don't know anything about the performance. Probably if your matrix is fairly sparse (i.e. lots of zeros) this is the way to go. If you've got more or fewer dimensions, or the output is dense, you're probably better off just writing the loop in C++.

18/03: reading and writing data in numpy

Category: numpy
Posted by: dmeliza
I've been using functions in the scipy.io module to read and write binary and ASCII data, specifically the fread(), fwrite() and write_array() functions. These have been deprecated due to some new functionality in numpy. I'm not sure how bleeding-edge this is, since I have been pretty erratic about installing numpy/scipy from svn and from official releases, but since there's no easily locatable documentation on the web at present, here are my notes. In each example, fp represents a file object (opened in the correct mode, of course), and the first line is the old scipy call that should be replaced with the new numpy call in the second line.

Writing binary data:


scipy.io.fwrite(fp, data.size, data.astype('int16').squeeze())
data.astype('int16').squeeze().tofile(fp, sep="")
 


Writing ASCII (i.e. tabular) data. I do this all the time when exporting large data sets to R. Write a header
line to the file first, and adjust the format string for your data.


scipy.io.write_array(fp, data.astype('i'))
numpy.savetxt(fp, data, fmt="%i")
 


Reading binary data (PCM sound data in this case):


data = scipy.io.fread(fp, frames, 'h', 'h')
data = numpy.fromfile(fp, dtype='h', count=frames)
 


An alternative is to use numpy's builtin memory-mapped file access, described here. This method supports complex and structured data types. For instance, I could replace my function for reading PCM data with the following line:


data = numpy.memmap("data.pcm", dtype='h', mode='r')
 


Python should handle the garbage collection on the object (data) when its reference count drops to zero, but I haven't tested this extensively. And of course you have to be careful about writing to the array, because it will affect what's on the disk. Setting the object to read-only mode (as above) will prevent this from happening, but a RuntimeError will be thrown if you try to modify the array.