Sake by tonyfischetti

What is it?

Sake is a way to easily design, share, build, and visualize workflows with intricate interdependencies. Sake is self-documenting because the instructions for building a project also serve as the documentation of the project's workflow. The first time it's run, sake will build all of the components of a project in an order that automatically satisfies all dependencies. For all subsequent runs, sake will only rebuild the parts of the project that depend on changed files. This cuts down on unnecessary re-building and lets the user concentrate on their work rather than memorizing the order in which commands have to be run.
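The "only rebuild what changed" behavior can be pictured with a short Python sketch. This is an illustration of the general idea, not sake's actual implementation: the function names `file_digest` and `needs_rebuild`, and the `recorded` dictionary of hashes saved after the previous build, are assumptions made for this example.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-1 hex digest of a file's contents."""
    sha = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            sha.update(chunk)
    return sha.hexdigest()


def needs_rebuild(dependencies, outputs, recorded):
    """Decide whether one target must be rebuilt.

    dependencies, outputs: lists of file paths for the target.
    recorded: dict mapping path -> digest saved after the last build
              (a hypothetical store, used here for illustration only).
    """
    # Rebuild if any expected output file is missing.
    if any(not os.path.exists(path) for path in outputs):
        return True
    # Rebuild if any dependency is new or differs from its recorded digest.
    for path in dependencies:
        if recorded.get(path) != file_digest(path):
            return True
    return False
```

Under these assumptions, a target whose outputs exist and whose dependencies all match their recorded digests is skipped; editing any dependency makes `needs_rebuild` return `True` again.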

Sake is free, open-source, cross-platform software written in Python and released under a very permissive license (MIT Expat). Sake is stable.

Example

Consider this example workflow that examines correlates of DUI arrests with various adolescent-related data by state.

---
# Macros
#! TEEN_STATS_URL=http://mathforum.org/workshops/sum96/data.collections/datalibrary/US_TeenStats.XL.zip.xls

fetch teen stats:
    help: fetches various teen statistics from the web
    # no dependencies
    formula: >
        curl -o teenstats.xls $TEEN_STATS_URL;
    output:
        - teenstats.xls

formatting:
    help: formatting and conversion steps
    convert teen stats to csv:
        help: uses gnumerics ssconvert to convert ugly xls to csv and cleans it
        dependencies:
            - teenstats.xls
            - convert.sh
        formula: >
            ./convert.sh;
        output:
            - teenstats.csv
    format dui stats:
        help: format raw (copy and pasted) dui/state data using perl
        dependencies:
            - rawdata.txt
        formula: >
            perl -pe 's/^(\D+)\s+([\d,]+)\s+([\d,]+)\s*/\1\t\2\t\3\n/'
            rawdata.txt | sed 's/,//g' > duistats.tsv;
        output:
            - duistats.tsv

find correlates:
    help: calls R script that finds correlates of DUI arrest in various teen statistics
    dependencies:
        - duistats.tsv
        - teenstats.csv
        - dui-correlates.R
    formula: >
        ./dui-correlates.R
    output:
        - Rplots.pdf
        - lmcoeffs.txt
...

This is a Sakefile that builds and documents a process: it processes and formats two data files from the web, feeds them into an R script that searches for correlations, and ultimately produces an output table and a graphic.

The entire process can be performed, start to finish, by running the following command:

sake
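How a build order that satisfies every dependency can be derived is sketched below in Python. The target names and files are copied from the example Sakefile, but the `TARGETS` dictionary layout and the `build_order` function are assumptions for illustration; sake's own code may work differently.

```python
from collections import deque

# Targets from the example Sakefile: the files each one reads and writes.
TARGETS = {
    "fetch teen stats": {"deps": [], "outputs": ["teenstats.xls"]},
    "convert teen stats to csv": {
        "deps": ["teenstats.xls", "convert.sh"],
        "outputs": ["teenstats.csv"],
    },
    "format dui stats": {"deps": ["rawdata.txt"], "outputs": ["duistats.tsv"]},
    "find correlates": {
        "deps": ["duistats.tsv", "teenstats.csv", "dui-correlates.R"],
        "outputs": ["Rplots.pdf", "lmcoeffs.txt"],
    },
}


def build_order(targets):
    """Order targets so producers run before consumers (Kahn's algorithm)."""
    # Map each output file to the target that produces it.
    producer = {
        out: name for name, spec in targets.items() for out in spec["outputs"]
    }
    # A target waits on every target that produces one of its dependencies.
    # Files nobody produces (e.g. convert.sh) are plain inputs and are ignored.
    waits_on = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in targets.items()
    }
    ready = deque(sorted(n for n, needed in waits_on.items() if not needed))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for name in sorted(waits_on):
            needed = waits_on[name]
            if current in needed:
                needed.remove(current)
                if not needed:
                    ready.append(name)
    if len(order) != len(targets):
        raise ValueError("circular dependency between targets")
    return order


for step in build_order(TARGETS):
    print(step)
```

Any order in which "fetch teen stats" precedes "convert teen stats to csv", and both formatting targets precede "find correlates", is valid; the sketch returns one such order.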

The mandatory "help" fields are used internally by sake to produce the following output when

sake help

is run:

You can 'sake' one of the following...

"find correlates":
  - calls R script that finds correlates of DUI arrest in various teen statistics

formatting:
  - formatting and conversion steps

    "convert teen stats to csv":
      -  uses gnumerics ssconvert to convert ugly xls to csv and cleans it
    "format dui stats":
      -  format raw (copy and pasted) dui/state data using perl

"fetch teen stats":
  - fetches various teen statistics from the web

clean:
  -  remove all targets' outputs and start from scratch

visual:
  -  output visual representation of project's dependencies
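As a rough picture of how "help" fields can be turned into a listing like the one above, here is a small Python sketch. It assumes the Sakefile has already been parsed into nested dictionaries (for example by a YAML parser); the `collect_help` function is hypothetical and does not try to reproduce sake's exact ordering, quoting, or the built-in "clean" and "visual" entries.

```python
def collect_help(sakefile, depth=0):
    """Return indented help lines from a parsed, Sakefile-like dict."""
    lines = []
    indent = "    " * depth
    for name, spec in sakefile.items():
        if not isinstance(spec, dict):
            # Skip scalar and list fields such as "help", "formula", "output".
            continue
        lines.append('{}"{}":'.format(indent, name))
        lines.append("{}  - {}".format(indent, spec.get("help", "")))
        # Nested dicts are the sub-targets of a meta-target like "formatting".
        lines.extend(collect_help(spec, depth + 1))
    return lines


example = {
    "formatting": {
        "help": "formatting and conversion steps",
        "format dui stats": {
            "help": "format raw (copy and pasted) dui/state data using perl",
            "dependencies": ["rawdata.txt"],
            "output": ["duistats.tsv"],
        },
    },
}
print("\n".join(collect_help(example)))
```

The point of the sketch is that the listing needs no extra input: everything it prints already lives in the target definitions.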

Finally, a visual representation of the dependency diagram can be produced, automagically, by running the following command:

sake visual

This produces the following image:

[Image: Automagic dependency visualization]
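Graphviz, one of sake's listed dependencies, draws graphs from a plain-text description in its DOT language. The Python sketch below shows how target definitions could be turned into DOT text. The `to_dot` function and the dictionary layout are assumptions for illustration, and the picture sake itself draws may be laid out differently.

```python
def to_dot(targets):
    """Render targets and their files as Graphviz DOT text."""
    lines = ["digraph workflow {"]
    for name, spec in targets.items():
        # Draw targets as boxes; files keep Graphviz's default shape.
        lines.append('    "{}" [shape=box];'.format(name))
        for dep in spec.get("dependencies", []):
            lines.append('    "{}" -> "{}";'.format(dep, name))
        for out in spec.get("output", []):
            lines.append('    "{}" -> "{}";'.format(name, out))
    lines.append("}")
    return "\n".join(lines)


example = {
    "format dui stats": {
        "dependencies": ["rawdata.txt"],
        "output": ["duistats.tsv"],
    },
    "find correlates": {
        "dependencies": ["duistats.tsv", "teenstats.csv", "dui-correlates.R"],
        "output": ["Rplots.pdf", "lmcoeffs.txt"],
    },
}
print(to_dot(example))
```

Saving the printed text as workflow.dot (a file name chosen for this example) and running `dot -Tpng workflow.dot -o workflow.png` renders it with Graphviz.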

This is a really simple example, sure, but it's easy to see that, even for the most labyrinthine of pipelines, a visualization like this can really help get a sense of all the actions involved in a workflow. The key points here are (a) that no extra effort had to be expended by the operator/writer of the workflow to generate 'help' and visualization of dependencies, and (b) that the documentation of the workflow occurs as a result of designing and writing it.

The coupling of writing the automation and documentation makes sake a sound choice for:

Scientists who want to share their scientific workflow with other researchers. Sake helps facilitate open science and reproducibility.
Data analysts with steps in their pipeline that take hours or days to finish running.
Business teams that want to share and visualize workflows among their members, even if they are using different computing platforms.

Installation

This project has four dependencies:

Python 2.7, 3.5, 3.6, 3.7, or 3.8
The Python module networkx
The Python module PyYAML
Graphviz

The easiest way to install sake is via pip.

Assuming you have Python and easy_install installed, just run:

[sudo] easy_install pip
[sudo] pip install master-sake

OS-specific instructions are available in the documentation linked below.
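Before or after installing, a quick check of the four dependencies can save some debugging. The Python 3 sketch below is not part of sake; the `check_prerequisites` function is hypothetical, and it assumes that PyYAML is importable as `yaml` and that a Graphviz installation puts the `dot` program on the PATH.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Report which of the four listed dependencies appear to be present."""
    return {
        # The interpreter running this script (the check itself needs Python 3).
        "python": sys.version_info >= (2, 7),
        "networkx": importlib.util.find_spec("networkx") is not None,
        # PyYAML is imported under the module name `yaml`.
        "PyYAML": importlib.util.find_spec("yaml") is not None,
        # Graphviz is detected by looking for its `dot` executable.
        "graphviz": shutil.which("dot") is not None,
    }


for name, found in check_prerequisites().items():
    print("{:<10} {}".format(name, "found" if found else "MISSING"))
```

Anything reported as MISSING can be installed with pip (for the Python modules) or your platform's package manager (for Graphviz).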

Documentation

PDF documentation may be accessed here

HTML documentation may be accessed here

Support or Contact

If you're having trouble using sake, have a question, or want to contribute, please email me at tony.fischetti@gmail.com