The motivation for using scons to automate analysis workflows is to formally specify dependencies in how the analysis gets done. This is the idea behind "reproducible research" (a term clearly coined by non-biologists; I prefer "reproducible analysis"). You run one command and it turns your raw data into figures that are nearly ready for publication, removing a lot of potential sources of error.
I've used Sweave for a couple of projects. It's a system that allows you to embed R code chunks in a LaTeX document, and the calculations and figures generated from the embedded code get placed directly into the resulting PDF file. It's a powerful idea, but it doesn't lend itself well to the development of multi-stage analysis workflows where the output of one step in data analysis flows into the next step. Change one thing and you have to rerun the entire script. It makes a lot more sense to design each step as a modular component and formally specify the dependencies between the steps.
In the previous post, I described how to use a custom builder to add python scripts to a scons workflow. You can do the same thing for R scripts and Sweave documents. With a little regular expression kung-fu you can even get scons to recognize which R scripts and data sources are being imported.
import os,re,itertools

# match source("file.R"), load("file.RData") and read.table("file"), read.csv("file"), etc.
source_re = re.compile(r'^source\([\'\"](\S+)[\'\"]\)',re.M)
load_re = re.compile(r'^load\([\'\"](\S+?)[\'\"]\)',re.M)
table_re = re.compile(r'read\.\S+\([\'\"](\S+?)[\'\"]',re.M)

def fix_rel(f):
    # a leading '#' tells scons the path is relative to the top-level SConstruct directory
    return f if f.startswith('/') else ('#' + f)

def rfile_scan(node, env, path):
    # get_text_contents() returns str, so the patterns also work under Python 3
    txt = node.get_text_contents()
    return [fix_rel(f) for f in itertools.chain(source_re.findall(txt),
                                                load_re.findall(txt),
                                                table_re.findall(txt))]

rbuild = Builder(action='R -q --vanilla $SCRIPTOPTS < $SOURCE')
sweavebuild = Builder(action='R CMD Sweave $SOURCE',
                      suffix = '.tex',
                      src_suffix = '.Rnw')
rscan = Scanner(function = rfile_scan,
                skeys = ['.R','.Rnw'])
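To see what the three patterns actually pick up, here is a small standalone sketch that runs the same matching logic on a made-up R script outside of scons. The sample script, its file names, and the helper name find_r_dependencies are all hypothetical, and the dot in the read.* pattern is escaped here, which is an assumption about the intent rather than something scons requires.

```python
import itertools
import re

# Same patterns as the scanner; the dot in "read." is escaped (assumed intent).
source_re = re.compile(r'^source\([\'\"](\S+)[\'\"]\)', re.M)
load_re = re.compile(r'^load\([\'\"](\S+?)[\'\"]\)', re.M)
table_re = re.compile(r'read\.\S+\([\'\"](\S+?)[\'\"]', re.M)

def fix_rel(f):
    # Absolute paths pass through; relative ones get the '#' prefix that
    # scons reads as "relative to the top-level SConstruct directory".
    return f if f.startswith('/') else ('#' + f)

def find_r_dependencies(txt):
    # Hypothetical helper: the body of the scanner function, minus the scons node.
    return [fix_rel(f) for f in itertools.chain(source_re.findall(txt),
                                                load_re.findall(txt),
                                                table_re.findall(txt))]

# A made-up R script used only for illustration.
sample = '''source("common/plotting.R")
load('cache/units.RData')
spikes <- read.table("/data/spikes.tbl", header=TRUE)
'''

deps = find_r_dependencies(sample)
print(deps)
```

Note that source() and load() only match at the start of a line, while read.table() and friends can appear anywhere, since they usually sit on the right-hand side of an assignment.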
You still have to manually specify what each script outputs (except for the Sweave builder, which knows the output will be a .tex file), for instance:
env = Environment()
env.Append(BUILDERS = {'RBuild' : rbuild,
                       'SWeave' : sweavebuild})
env.Append(SCANNERS = rscan)
unit_tbl = env.RBuild('unit_stats.tbl','unit_analysis.R')
I've started using scons to manage my data analysis workflows. A lot of the work is done by python scripts that read in a bunch of data from one or more sources, crunch it, and spit out a new table. So there is a dependency not only on the input data but also on the script and any modules it imports. You can get scons to run a script using a simple Command builder. For instance, say you have some "script.py" that expects a command line like "script.py input1 input2 output".
env.Command('output.tbl', ['script.py', 'input1.tbl', 'input2.tbl'], 'python $SOURCES $TARGETS')
This works fairly well, but if script.py imports another module (e.g. one with functions common to a bunch of scripts) you have to specify that dependency manually. With a little extra code you can get scons to scan python scripts for import statements automatically and include any imports from the local directory as dependencies. I also like to use an Emitter that moves the script to the front of the list of dependencies, so I don't have to worry about what order I specify them in.
import os,re

import1_re = re.compile(r'^from\s+(\S+)\s+import',re.M)
# anchored at the start of the line so "from x import y" is not matched a second time
import2_re = re.compile(r'^import\s+(.+)$',re.M)

def pyfile_scan(node, env, path):
    """ returns any imported modules that live in the same directory as the script """
    imports = []
    search_path = os.path.dirname(str(node))
    # get_text_contents() returns str, so the patterns also work under Python 3
    text = node.get_text_contents()
    for item in (import1_re.findall(text) + import2_re.findall(text)):
        for x in item.split(','):
            test_file = x.strip() + '.py'
            if os.path.exists(os.path.join(search_path, test_file)):
                imports.append(test_file)
    return imports

def py_targets(target,source,env):
    """ moves the python script to the front of the source list so it comes first on the command line """
    out = []
    for x in source:
        if str(x).endswith('.py'):
            out.insert(0,x)
        else:
            out.append(x)
    return target,out

pybuild = Builder(action='python $SOURCES $TARGETS $SCRIPTOPTS',
                  emitter=py_targets)
pyscan = Scanner(function = pyfile_scan,
                 skeys = ['.py'])
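If you want to check the scanning and reordering logic without running scons at all, the following is a minimal sketch that exercises both on throwaway files. The names local_imports and script_first, the helpers.py module, and the tablelib import are all hypothetical; the second pattern is anchored at the start of the line here so that "from x import y" lines are not matched twice, which is an assumption about the intended behavior.

```python
import os
import re
import tempfile

import1_re = re.compile(r'^from\s+(\S+)\s+import', re.M)
# Anchored at line start (an assumption) so "from x import y" is only matched once.
import2_re = re.compile(r'^import\s+(.+)$', re.M)

def local_imports(script_path):
    # Hypothetical stand-in for the scanner: reads the file directly
    # instead of going through a scons node.
    search_path = os.path.dirname(script_path)
    with open(script_path) as fh:
        text = fh.read()
    found = []
    for item in import1_re.findall(text) + import2_re.findall(text):
        for x in item.split(','):
            test_file = x.strip() + '.py'
            # Only modules that exist next to the script count as dependencies.
            if os.path.exists(os.path.join(search_path, test_file)):
                found.append(test_file)
    return found

def script_first(sources):
    # Hypothetical stand-in for the emitter: moves .py files to the front.
    out = []
    for x in sources:
        if str(x).endswith('.py'):
            out.insert(0, x)
        else:
            out.append(x)
    return out

with tempfile.TemporaryDirectory() as tmp:
    # helpers.py is a local module; os and tablelib are not in this directory.
    with open(os.path.join(tmp, 'helpers.py'), 'w') as fh:
        fh.write('def f(): pass\n')
    script = os.path.join(tmp, 'script.py')
    with open(script, 'w') as fh:
        fh.write('import os, helpers\nfrom tablelib import read_table\n')
    imports = local_imports(script)

ordered = script_first(['input1.tbl', 'script.py', 'input2.tbl'])
print(imports)
print(ordered)
```

Only helpers.py is reported, because the standard library and installed packages are not in the script's directory. One limitation worth knowing: a line like "import helpers as h" is not picked up by this approach, since the text after "import" is used as the file name as-is.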
Now you can use the custom builder as follows, and scons will recognize any modules script.py depends on.
# add the python builder to the environment
env = Environment()
env.Append(BUILDERS = {'PyBuild' : pybuild})
env.Append(SCANNERS = pyscan)
env.PyBuild('output.tbl',['script.py','input1.tbl','input2.tbl'])