{"id":25,"date":"2020-08-04T12:05:43","date_gmt":"2020-08-04T12:05:43","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/04\/reproducible-analysis-through-automated-jupyter-notebook-pipelines\/"},"modified":"2020-08-04T12:05:43","modified_gmt":"2020-08-04T12:05:43","slug":"reproducible-analysis-through-automated-jupyter-notebook-pipelines","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/04\/reproducible-analysis-through-automated-jupyter-notebook-pipelines\/","title":{"rendered":"Reproducible Analysis Through Automated Jupyter Notebook Pipelines"},"content":{"rendered":"<div id=\"content\">\n<h2>Reproducible Analysis Through Automated Jupyter Notebook Pipelines<\/h2>\n<p><em>Amanda Birmingham (abirmingham at ucsd.edu)<\/em><\/p>\n<p>Replicability and reproducibility* have been important components of the scientific method since Boyle and Huygens argued over their vacuum experiments in the 17th century. Since repeating a process the same way every time is one of the things computers do <em>best<\/em>, one might expect computational biology would outshine lab biology in this critical area. 
However, early findings incorporating bioinformatic analyses were often published with copious details on the wet lab work but barely a mention of the computational efforts.<\/p>\n<hr>\n<p><strong>In a 1986 paper from the Proceedings of the National Academy of Sciences, the fitting code used wasn\u2019t even mentioned in the methods\u2013though it did rank an acknowledgment (after \u201chelpful discussions\u201d)!<\/strong><\/p>\n<p><img src=\"https:\/\/i2.wp.com\/compbio.ucsd.edu\/wp-content\/uploads\/2017\/02\/2017-01-31_12.16.46_PM.png\" alt=\"\" data-recalc-dims=\"1\"><\/p>\n<p><img src=\"https:\/\/i0.wp.com\/compbio.ucsd.edu\/wp-content\/uploads\/2017\/02\/2017-01-31_3.12.29_PM.png\" alt=\"\" data-recalc-dims=\"1\"><\/p>\n<hr>\n<p>Fortunately, the field has improved, but the road from computational \u2018methods\u2019 like \u201cAlignments were run\u201d to \u201cAlignments were run with BLAST\u201d to \u201cAlignments were run with BLASTN version 2.2.6 against human\u201d to \u201cAlignments were run with NCBI BLASTN v.2.2.9 using the command <code>blastn -W 7 -q -1 -F F<\/code> against the NCBI RefSeq release 80 human transcriptome\u201d has been a long one. Comprehensive methods sections in manuscripts, moreover, really <em>shouldn\u2019t<\/em> be the end of this road for bioinformaticists. Here is another area where computational work should exceed the limits of wet labs, since a perfectly configured and analysis-ready computing environment could be disseminated with a new published work.<\/p>\n<h3 id=\"toc_2\">Jupyter Notebooks: Friend or Foe?<\/h3>\n<p>One option for such distribution is the <a href=\"https:\/\/jupyter.org\/\">Jupyter Notebook<\/a>, a new evolution of the IPython project. It describes itself as \u201ca web-based interactive computing platform\u201d and has been widely hailed as a powerful new weapon in the war for reproducibility. 
For example, the non-profit <a href=\"http:\/\/www.datacarpentry.org\/\">Data Carpentry<\/a>, which provides data analysis training to researchers, offers an entire workshop on \u201c<a href=\"https:\/\/reproducible-science-curriculum.github.io\/rr-jupyter-workshop\/\">Reproducible Research using Jupyter Notebooks<\/a>\u201d. (Giving a primer on notebooks themselves is beyond the scope of this post, so if you haven\u2019t experienced them, skim the gentle introduction at <a href=\"https:\/\/www.packtpub.com\/books\/content\/basics-jupyter-notebook-and-python\">Basics of Jupyter Notebook and Python<\/a>, then visit <a href=\"http:\/\/try.jupyter.org\/\">http:\/\/try.jupyter.org\/<\/a> and click on the *.ipynb file for your favorite language\u2013e.g., <code>Welcome to Python.ipynb<\/code>\u2013to try one out yourself.)<\/p>\n<p>Jupyter Notebooks can alternate static code (in a variety of programming languages) with human-readable text, in the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Literate_programming\">literate programming<\/a> paradigm. However, Jupyter inventor Fernando Perez created them to support the more expansive concept of <a href=\"http:\/\/blog.fperez.org\/2013\/04\/literate-computing-and-computational.html\">literate computing<\/a>, in which the iterative and interactive exploratory computing so critical to investigating a new data set is captured along with a narrative describing its motivations and results. As notebooks are easily savable, modifiable, and extensible, they also offer a convenient tool for rerunning or tweaking previous data analyses. 
In fact, here at CCBB, we decided to deliver our analyses to customers in Jupyter Notebooks: not merely a faithful record of the precise commands we executed, the notebooks are themselves software tools that can be easily rerun if necessary.<\/p>\n<hr>\n<p><strong>Not merely a faithful record of the precise commands we executed, Jupyter Notebooks are themselves software tools that can be easily rerun if necessary. \u2026 Well, <em>almost<\/em>.<\/strong><\/p>\n<hr>\n<p>Well, <strong><em>almost<\/em><\/strong>. In our experience, notebooks\u2019 greatest strength\u2013their exceptional interactivity\u2013can sometimes feel like their greatest flaw. Two particular issues we encountered were:<\/p>\n<ol>\n<li>\n<p><strong>Subsequent analysis runs need automation more than interactivity<\/strong>. When running a custom analysis pipeline developed in a Jupyter Notebook over new data, we want to leverage the code+narrative aspect of notebooks (to ensure methods are correctly documented in a human-readable way for every pipeline run) but don\u2019t want to have to open, update, and step through the relevant notebook by hand every time new data comes in.<\/p>\n<\/li>\n<li>\n<p><strong>Official records need immutability more than interactivity<\/strong>. Different cells in a Jupyter Notebook can depend on each other: type <code>x = 3+3<\/code> into cell A, run it, and then type <code>print(x)<\/code> into cell B and run it, and you\u2019ll see the output <code>6<\/code>. However, changing the code in cell A to <code>x = 3+25<\/code> doesn\u2019t trigger automatic rerunning of cell B\u2013so you <em>still<\/em> see an output of <code>6<\/code>. While this behavior is by design, its side-effect is that careless interactions with a notebook can destroy its value as a computational record (especially since by default notebooks auto-save all changes every few minutes!) 
See an illustration below, shown as a looping gif: <img src=\"https:\/\/i0.wp.com\/compbio.ucsd.edu\/wp-content\/uploads\/2017\/02\/false_friend_jupyter_notebook.gif\" alt=\"\" data-recalc-dims=\"1\"><\/p>\n<\/li>\n<\/ol>\n<p>Of course, one could argue that this just means Jupyter Notebooks aren\u2019t the right tool for codifying analysis pipelines and\/or keeping official records (don\u2019t drain pasta through a baseball glove and then complain that you burnt your fingers!) However, given notebooks\u2019 other advantages\u2013like the ability to <a href=\"http:\/\/blog.revolutionanalytics.com\/2016\/01\/pipelining-r-python.html\">mix Python and R in the same notebook<\/a>, easy integration with useful tools both inside and outside bioinformatics (like <a href=\"https:\/\/docs.continuum.io\/anaconda\/jupyter-notebook-extensions\">Anaconda<\/a> and <a href=\"https:\/\/github.com\/igvteam\/igv.js-jupyter\">the Integrated Genomics Viewer (IGV)<\/a>), and the option to <a href=\"https:\/\/github.com\/Anaconda-Platform\/nbpresent\">present live, interactive notebooks as slides<\/a>, to name just a few\u2013we were strongly motivated to look for a way around the flaws we had encountered without leaving the Jupyter Notebook ecosystem.<\/p>\n<h3 id=\"toc_4\">Building a Self-Documenting Analysis Pipeline<\/h3>\n<p>Fortunately, we weren\u2019t the first to encounter these issues. In fact, #1 (need to automate notebooks) has already been addressed with the introduction of helpful APIs and extensions like <code>nbconvert<\/code>, <code>nbformat<\/code>, and <code>nbparameterise<\/code> (note the British spelling!). The first two allow you to read and execute Jupyter Notebooks from external Python scripts, while the last provides the critical ability to inject modified variable values before each execution. 
These tools make it possible to serve two masters: keeping an analysis pipeline\u2019s core functionality in notebooks\u2013where it is easy for non-coders and external reviewers to examine it\u2013while still allowing a bioinformaticist to run the pipeline automatically from a script. Here\u2019s sample code to inject variables and run a notebook from a Python script, writing out the results to a new notebook file:<\/p>\n<div>\n<pre><code class=\"language-python\"># standard libraries\r\nimport os\r\n\r\n# third-party libraries\r\nimport nbformat\r\nimport nbparameterise\r\nfrom nbconvert.preprocessors import ExecutePreprocessor\r\n\r\n\r\n# modified from https:\/\/nbconvert.readthedocs.io\/en\/latest\/execute_api.html\r\ndef execute_notebook(notebook_filename, notebook_filename_out, params_dict, \r\n    run_path=\"\", timeout=6000000):\r\n    \r\n    notebook_fp = os.path.join(run_path, notebook_filename)\r\n    nb = read_in_notebook(notebook_fp)\r\n    new_nb = set_parameters(nb, params_dict)\r\n    ep = ExecutePreprocessor(timeout=timeout, kernel_name='python3')\r\n\r\n    try:\r\n        ep.preprocess(new_nb, {'metadata': {'path': run_path}})\r\n    except:\r\n        msg = 'Error while executing: \"{0}\".\\n\\n'.format(notebook_filename)\r\n        msg = '{0}See notebook \"{1}\" for traceback.'.format(\r\n                msg, notebook_filename_out)\r\n        print(msg)\r\n        raise\r\n    finally:\r\n        with open(notebook_filename_out, mode='wt') as f:\r\n            nbformat.write(new_nb, f)\r\n\r\n\r\ndef read_in_notebook(notebook_fp):\r\n    with open(notebook_fp) as f:\r\n        nb = nbformat.read(f, as_version=4)\r\n    return nb\r\n\r\n\r\ndef set_parameters(nb, params_dict):\r\n    orig_parameters = nbparameterise.extract_parameters(nb)\r\n    params = nbparameterise.parameter_values(orig_parameters, **params_dict)\r\n    new_nb = nbparameterise.replace_definitions(nb, params, execute=False)\r\n    return new_nb<\/code><\/pre>\n<\/div>\n<p>\u2026 
and here\u2019s the super-simple, one-line call to execute this functionality, assuming the notebook to be run (named <code>mynotebook.ipynb<\/code>) has two variables named <code>x<\/code> and <code>y<\/code> that I want to set:<\/p>\n<div>\n<pre><code class=\"language-none\">execute_notebook(\"mynotebook.ipynb\",\"my_new_notebook.ipynb\",{\"x\":6,\"y\":\"blue\"})<\/code><\/pre>\n<\/div>\n<p>By the way, if this isn\u2019t your first rodeo, you might be wondering \u201cwait, where and how does one define which variables a notebook has?\u201d Good catch! <code>nbparameterise<\/code>, which is doing all the heavy lifting on injecting those variable values, looks at the <strong>first code cell in the notebook<\/strong>; if that cell contains <strong>only<\/strong> simple assignment statements, it takes the variables therein as the ones it can modify. Here\u2019s an example:<\/p>\n<p><img src=\"https:\/\/i2.wp.com\/compbio.ucsd.edu\/wp-content\/uploads\/2017\/02\/2017-02-01_10.30.46_AM.png\" alt=\"\" data-recalc-dims=\"1\"><\/p>\n<p>Here the cell containing the \u201cx\u201d and \u201cy\u201d variables is the first <em>code<\/em> cell in the notebook, and it contains only assignments (and comments), so <code>nbparameterise<\/code> will recognize all its contents as variables that can be injected. (Don\u2019t get carried away and try to put more complex assignments\u2013like the call <code>your_handle = \"{0}-{1}\".format(y, x)<\/code>\u2013into that first code cell. 
If <code>nbparameterise<\/code> sees anything other than simple assignments of strings or numbers, it will give up on the whole cell and decide there are no variables it can inject!)<\/p>\n<p>Running the <code>execute_notebook<\/code> command given above produces a new <code>my_new_notebook.ipynb<\/code> file in the same directory as the first; the new notebook looks like this:<\/p>\n<p><img src=\"https:\/\/i1.wp.com\/compbio.ucsd.edu\/wp-content\/uploads\/2017\/02\/2017-02-01_10.35.55_AM.png\" alt=\"\" data-recalc-dims=\"1\"><\/p>\n<p>You can see that all the explanatory text remains, but the variable values in the first code cell have been swapped for the ones we injected\u2013and the subsequent code has been rerun with those new variable values. Nifty!<\/p>\n<p>But wait, we\u2019ve only solved half the problems! Issue #1 is fixed, since we can easily automate a notebook-based pipeline (or a notebook<strong>s<\/strong>-based pipeline, although stringing together the execution of multiple notebooks is left as an exercise for the reader\u2013hint: passing values generated by an earlier notebook into a later one <em>can\u2019t<\/em> be done with <code>nbparameterise<\/code> but <em>can<\/em> be done with a temporary file). What about issue #2, the potential for corruption of a notebook-as-record?<\/p>\n<p>Well, I have hopes that someday this will be addressed directly through the notebook, and one option is already available from that direction. If you\u2019re using Python in a Jupyter Notebook, you can invoke the <code>%logstart<\/code> magic with the <code>-o<\/code> option to record sequentially all Python commands run in the notebook\u2013and their output\u2013to a log file. 
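For instance, logging can be switched on with a single cell near the top of the notebook. (These are IPython magics, not plain Python, and the log file name here is arbitrary:)

```
# IPython magics -- run these in a notebook cell, not a plain Python script
%logstart -o my_analysis_record.py   # begin logging commands plus their output
# ... analysis cells run here ...
%logstop                             # optionally stop logging before the kernel exits
```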
The log record is faithful even when the notebook is corrupted; this is demonstrated by the log file for our earlier corrupted notebook example, which correctly shows that I actually assigned x <em>twice<\/em>, once before and once after running the print statement:<\/p>\n<div>\n<pre><code class=\"language-none\"># IPython log file\r\n\r\nx = 3+3\r\nprint(x)\r\nx = 3+25<\/code><\/pre>\n<\/div>\n<p>(Why didn\u2019t this record the output of <code>print(x)<\/code>? Logging <a href=\"https:\/\/stackoverflow.com\/questions\/28241891\/ipython-magic-logstart-option-o-not-working\">doesn\u2019t include print outputs<\/a>.) Of course, it also doesn\u2019t include any of the other goodies in the notebook\u2013descriptive text, images, nice formatting\u2013basically all the stuff that makes notebooks notebooks and not just Python scripts \ud83d\ude42 . So until Fernando and company come up with some niftier option, what to do?<\/p>\n<p>We punted. Rather than trying to lock down the notebooks we deliver\u2013which would remove their utility for rerunning analyses, anyway\u2013we now deliver both a notebook <em>and<\/em> an HTML export of that notebook with each analysis. The former file is the tool and the latter is the record. The exported HTML files have everything the notebooks do\u2013nice formatting and scrolling, descriptive text, images, and of course the relevant code\u2013EXCEPT that the code is no longer runnable, which is exactly what we want in a record. (Of course, an HTML file could be explicitly changed by someone motivated to misbehave, but our main concern is preventing <em>accidental<\/em> corruption.) 
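As for the multi-notebook chaining hinted at earlier, the temporary-file hand-off can be sketched with nothing but the standard library. This is a hypothetical illustration, not part of <code>nbparameterise</code> or <code>nbconvert</code>: the helper names and the file name below are invented for the example.

```python
import json
import os
import tempfile

def save_handoff(values, handoff_fp):
    # Final cell of the upstream notebook: persist whatever the next notebook needs.
    with open(handoff_fp, "w") as f:
        json.dump(values, f)

def load_handoff(handoff_fp):
    # Driver script: read the upstream outputs back in as a params_dict.
    with open(handoff_fp) as f:
        return json.load(f)

# Simulated hand-off between two pipeline stages via a temp file.
handoff_fp = os.path.join(tempfile.mkdtemp(), "upstream_outputs.json")
save_handoff({"num_reads": 123456, "align_dir": "alignments"}, handoff_fp)
downstream_params = load_handoff(handoff_fp)
# The driver would then inject these values into the next notebook, e.g.:
# execute_notebook("downstream.ipynb", "downstream_run.ipynb", downstream_params)
```

Because the injected values must be simple strings or numbers (per the first-code-cell rule above), JSON is a natural serialization format for the hand-off.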
Here again, <code>nbconvert<\/code> comes to the rescue: adding one new function, called by one new line at the end of our <code>execute_notebook<\/code> function, automatically generates the HTML output alongside the new notebook every time we run a notebook-based pipeline (the other functions presented earlier remain unchanged).<\/p>\n<div>\n<pre><code class=\"language-python\"># additional third-party libraries\r\nfrom nbconvert import HTMLExporter\r\n\r\n# modified from https:\/\/nbconvert.readthedocs.io\/en\/latest\/execute_api.html\r\ndef execute_notebook(notebook_filename, notebook_filename_out, params_dict, \r\n    run_path=\"\", timeout=6000000):\r\n    \r\n    notebook_fp = os.path.join(run_path, notebook_filename)\r\n    nb = read_in_notebook(notebook_fp)\r\n    new_nb = set_parameters(nb, params_dict)\r\n    ep = ExecutePreprocessor(timeout=timeout, kernel_name='python3')\r\n\r\n    try:\r\n        ep.preprocess(new_nb, {'metadata': {'path': run_path}})\r\n    except:\r\n        msg = 'Error while executing: \"{0}\".\\n\\n'.format(notebook_filename)\r\n        msg = '{0}See notebook \"{1}\" for traceback.'.format(\r\n                msg, notebook_filename_out)\r\n        print(msg)\r\n        raise\r\n    finally:\r\n        with open(notebook_filename_out, mode='wt') as f:\r\n            nbformat.write(new_nb, f)\r\n        export_notebook_to_html(new_nb, notebook_filename_out)\r\n\r\n\r\ndef export_notebook_to_html(nb, notebook_filename_out):\r\n    html_exporter = HTMLExporter()\r\n    body, resources = html_exporter.from_notebook_node(nb)\r\n    out_fp = notebook_filename_out.replace(\".ipynb\", \".html\")\r\n    with open(out_fp, \"w\", encoding=\"utf8\") as f:\r\n        f.write(body)<\/code><\/pre>\n<\/div>\n<p>And hey, presto, instant HTML records!<\/p>\n<h3 id=\"toc_5\">Conclusions<\/h3>\n<p>The public excitement around Jupyter Notebooks is largely deserved: they\u2019re an extremely versatile new tool for data analysis. 
For example, here at CCBB we\u2019ve used the techniques discussed in this post to build a self-documenting custom software pipeline for quantifying the results of dual-CRISPR genetic screens. To see this real-world example, check out <code>mali_pipeliner.py<\/code> and its subsidiary scripts in CCBB\u2019s <a href=\"https:\/\/github.com\/ucsd-ccbb\/jupyter-genomics\/blob\/master\/src\/crispr\/\">jupyter-genomics repository on GitHub<\/a> (and of course don\u2019t miss the notebooks themselves, in the parallel <a href=\"https:\/\/github.com\/ucsd-ccbb\/jupyter-genomics\/tree\/master\/notebooks\/crispr\">notebooks<\/a> directory). Our experience has taught us that notebooks are no magic bullet against irreproducible research\u2013but with care on the part of the notebook designer, they can be a strong new weapon nonetheless!<\/p>\n<p>* These two terms mean different things, but there is some confusion about exactly how. While the distinctions aren\u2019t relevant to this discussion, they are elucidated with examples in an excellent post from UPenn entitled <a href=\"http:\/\/languagelog.ldc.upenn.edu\/nll\/?p=21956\">Replicability vs. 
reproducibility \u2014 or is it the other way around?<\/a>\n<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>http:\/\/compbio.ucsd.edu\/reproducible-analysis-automated-jupyter-notebook-pipelines\/<\/p>\n","protected":false},"author":0,"featured_media":26,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/25"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=25"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/25\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/26"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=25"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=25"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=25"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}