Open preprint reviews by Titus Brown

Squeakr: An Exact and Approximate k-mer Counting System

Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro

This paper introduces the squeakr system for exact and inexact k-mer counting.

The paper is well written, and I have been able to obtain and execute
the software, although I have not spent any time trying to reproduce
the benchmarks.

The long and glorious history of approximate and exact k-mer counting
in bioinformatics (for many different purposes) is particularly well
discussed, from my (admittedly biased) perspective. I particularly
appreciate the discussion of De Bruijn graph traversal which is usually
omitted in k-mer counting papers.

squeakr implements both exact and inexact k-mer counting. squeakr
appears to perform better than all other k-mer counting systems in
both inexact and exact modes, although I am unable to decipher figure
2's shading (see below). squeakr also excels at point queries in
graph-traversal mode and at batch queries applied to large collections
of already-loaded k-mers. As such, it is perhaps the most significant
advance in k-mer counting I've seen in the last few years. I should say
that we are already planning to integrate the underlying CQF into our
own khmer software for these reasons, and we have found it to be
relatively straightforward; our preliminary performance benchmarks
match the ones in this paper, which is reassuring.

squeakr also has the very nice property that it uses an approximate
membership query structure from which k-mers can be removed, which is
unusual among such data structures.
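To illustrate why removal is noteworthy: a plain Bloom filter cannot support deletion, but a counting structure can. Here is a minimal counting-filter sketch in Python (my illustration of the general delete-capable idea only, not the CQF's actual design, which is far more sophisticated and supports exact counting):

```python
import hashlib

class CountingFilter:
    """A tiny counting-Bloom-style filter supporting add/remove/count.

    A sketch of the delete-capable idea only; NOT the CQF used by squeakr.
    """

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counts = [0] * size

    def _slots(self, kmer):
        # Derive `num_hashes` distinct slots deterministically from the k-mer.
        slots, seed = set(), 0
        while len(slots) < self.num_hashes:
            digest = hashlib.md5(f"{seed}:{kmer}".encode()).hexdigest()
            slots.add(int(digest, 16) % self.size)
            seed += 1
        return slots

    def add(self, kmer):
        for s in self._slots(kmer):
            self.counts[s] += 1

    def remove(self, kmer):
        # Deletion simply reverses add, slot by slot.
        for s in self._slots(kmer):
            self.counts[s] -= 1

    def count(self, kmer):
        # An upper bound on the true count (hash collisions can inflate it).
        return min(self.counts[s] for s in self._slots(kmer))
```

Because each slot holds a count rather than a single bit, removal is just the inverse of insertion; a plain Bloom filter cannot do this because its membership bits are shared irreversibly.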
# Paper details:

I cannot distinguish KMC2 from squeakr inexact in the figures.

I could not figure out how the exact k-mer counting worked; the
wording in the section just above results (section 3) could be
improved here, in my view.

# Technical issues that should be addressed:

What commands were used to execute the benchmarks and measure the
results? Please specify in gory detail.

What version of the code was used? Please cut a release and give it a DOI
(perhaps via Zenodo).

Please describe (briefly in the paper, or perhaps in more detail on the
github repo) what kind of testing approach you used. How do we know
that the k-mer counts are correct, basically, and how can we check for
ourselves in newer versions?

I have been able to run this on my own data (hurray!) but I am unclear
as to how to work with squeakr's k-mer counts. Right now it looks like
I basically need to hook in at the C++ level - true? If so, it should
be made clear that this is a nice (and very fast) proof of concept that
is not yet directly usable at the command line for k-mer counting.
(This is a documentation issue.)

# Technical upgrades:

Here are some optional issues you should think about, now or later --

1. On AWS, these are the magic commands needed to get squeakr compiled on
the following image:

ubuntu/images/hvm/ubuntu-wily-15.10-amd64-server-20160222 (ami-05384865)

sudo apt-get update
sudo apt-get install -y libboost-dev libssl-dev zlib1g-dev libbz2-dev make libboost-system-dev libboost-thread-dev

2. 'fallocate' doesn't exist on Mac OS X, so I was unable to try
squeakr on Mac OS X. This will inhibit a lot of bioinformaticians
from making use of squeakr.

# Miscellaneous questions

We have been thinking about using a rolling hash function in khmer;
this seems like it could make squeakr much faster if it replaced
murmurhash. Thoughts?
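For concreteness, here is what I mean by a rolling hash, as a minimal Rabin-Karp-style sketch (illustrative only; ntHash and murmurhash differ in the details): after the first window, each subsequent k-mer hash costs O(1) instead of O(k).

```python
# A Rabin-Karp-style polynomial rolling hash over DNA; a sketch of the
# general idea, not the ntHash algorithm or squeakr's murmurhash.
BASE = 4
MOD = (1 << 61) - 1               # a large Mersenne prime
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def rolling_kmer_hashes(seq, k):
    """Yield a hash for every k-mer of `seq`, updating in O(1) per step
    after the first window."""
    top = pow(BASE, k - 1, MOD)   # weight of the outgoing (leftmost) base
    h = 0
    for base in seq[:k]:          # hash the first window in O(k)
        h = (h * BASE + ENC[base]) % MOD
    yield h
    for i in range(k, len(seq)):
        h = (h - ENC[seq[i - k]] * top) % MOD  # drop the leftmost base
        h = (h * BASE + ENC[seq[i]]) % MOD     # shift in the new base
        yield h
```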


C. Titus Brown


BIDS Apps: Improving ease of use, accessibility and reproducibility of neuroimaging data analysis methods

Krzysztof J. Gorgolewski, Fidel Alfaro-Almagro, Tibor Auer, Pierre Bellec, Mihai Capota, Mallar Chakravarty, Nathan W. Churchill, R. Cameron Craddock, Gabriel Devenyi, Anders Eklund, Oscar Esteban, Guillaume Flandin, J. Swaroop Guntupalli, Mark Jenkinson, Anisha Keshavan, Gregory Kiar, Pradeep Reddy Raamana, David Raffelt, Christopher J. Steele, Pierre-Olivier Quirion, Robert E. Smith, Stephen Strother, Gael Varoquaux, Tal Yarkoni, Yida Wang, Russell Poldrack

This is a well written and seemingly comprehensive paper about the idea of using containerization technology (Docker & Singularity) and a slightly custom framework to distribute/provide apps for neuroimaging.

The writeup is well done, and with two exceptions, we have no major comments.

The major exception is that the Singularity discussion needs to be revisited. Right now it is somewhat too aspirational and does not clearly articulate the (major) drawbacks of Singularity; we had to read between the lines and do some digging on our own to figure out where the failure points were.

A few critical aspects to Singularity are either not mentioned or glossed over:

* it seems you still need root access on HPCs to *install* Singularity containers.

* the Docker-to-Singularity conversion approach is very inconvenient looking, and we feel it is a major drawback.

* the imposition of "read-only" mode on container execution is understandable but again should be highlighted as inconvenient.

* Singularity *may* be installable *in theory* on many HPCs, but we don't have any idea of its adoption in practice. This could be addressed by an explicit comment that it's still early days but that at least the situation is likely to be better than it is with Docker.

We think these points need to be made more clearly in the paper.

Our second major concern - depending on the Docker Hub for archiving and versioning is not a good idea, and (at the very least) some sort of caveat should be applied and some longer-term directions suggested.

Minor correction -- 'particiapant_label' is misspelled.


C. Titus Brown
Luiz Irber


Tools and techniques for computational reproducibility

Stephen R Piccolo, Adam B Lee, Michael B Frampton

In this paper, Piccolo et al. do a nice (and I think comprehensive?) job of outlining six strategies for computational reproducibility. The point is well made that science is increasingly dependent on computational reproducibility (and that in theory we should be able to do computational reproducibility easily and well) and hence we should explore effective approaches that are actually being used.

I know of no other paper that covers this array of material, and this is a quite nice exposition that I would recommend to many. I can't evaluate how broadly it will appeal to a diverse audience but it seems very readable to me.

The following comments are offered as helpful suggestions, not criticisms -- make of them what you will.

The paper almost completely ignores HPC. I'm good with that, but it's a bit surprising (since many computational scientists seem to think that reproducible orchestration of many processors is an unachievable task). Noted in passing.

I was somewhat surprised by the lack of emphasis of version control systems. These are really critical in programming for ensuring reproducibility. I also found a missing citation! You should look at (yes, sorry, I'm on the paper).

Speaking of which, I appreciate the completeness of references (and even the citation of my blog post ;) but it would be interesting to see if Millman and Perez have anything to offer: Certainly a good citation (I think you hit the book, but this is a particularly good chapter.)

I would suggest (in the section that mentions version control systems, ~line 170 of p9) recommending that authors "tag" specific versions for the publication, even if they later recommend using updated versions. (Too many people say "use this repo!" without specifying a revision.)

The section on literate programming could usefully mention that these
literate programming environments do not offer good mechanisms for
long-running programs, so they may not be appropriate for things that
take more than a few minutes to run.

Also, and perhaps most important, these literate programming environments provide a REPL and can thus track exploratory data analysis and "harden" it when it works and the author moves on to another data analysis - so even if the authors don't want to clean up their notebook before publication, you can track exactly how they got their final results. I think this is important for practical reproducibility. I don't know quite what to suggest in the context of the paper, but it seems like an important point to me.

Both the virtual machine and container sections should mention the challenges of raw data bundling, which is one of the major drawbacks here - not only is the VM large, but (unless you are partnering with e.g. Amazon to "scale out") you must distribute potentially large data sets. I think this is one of the biggest practical issues facing data intensive sciences. (There was a nice commentary recently by folk in human genomics begging the NIH to make human genomic data available via the cloud; I can track it down if the authors haven't seen it.)

I think it's important to emphasize how transparent most Dockerfiles are (and how this is a different culture than the VM deployment scene, where configuration systems are often not particularly emphasized except in the devops community). I view this as one of the most important cultural differences driving container adoption, and for once it's good for science!

The docker ecosystem also seems quite robust, which is important, I think.

[ ... specific typos etc omitted ... ]


Deeply sequenced metagenome and metatranscriptome of a biogas-producing microbial community from an agricultural production-scale biogas plant

In this nice data paper, the authors provide a deep Illumina metagenomic and metatranscriptomic data set, assembly, and high-level analysis of a biogas reactor microbial community.

The paper is well written and the data seems to be of good quality, based on their reporting. The paper is also highly reproducible, coming with a version-controlled workflow (in a Makefile), on github, with an associated Dockerfile; using Docker is a great idea but is still imperfectly executed (see below).

All of the raw data is deposited publicly and was available to me.

Major comments:

The authors should include the number of reads that mapped back to the assembly at their 1 kb cutoff, as this would help us gauge the inclusivity of the assembly for both the DNA and RNA reads.
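For reference, mapping rates can be read straight off the SAM FLAG field (bit 0x4 = unmapped); a sketch, assuming the authors have SAM/BAM output from their mapper available (samtools flagstat reports the same numbers):

```python
# Compute the fraction of reads that mapped, from SAM text lines.
# A sketch only; the authors may prefer `samtools flagstat`.
UNMAPPED, SECONDARY, SUPPLEMENTARY = 0x4, 0x100, 0x800

def mapping_rate(sam_lines):
    """Fraction of primary records that mapped."""
    total = mapped = 0
    for line in sam_lines:
        if line.startswith("@"):                 # skip header lines
            continue
        flag = int(line.split("\t")[1])          # FLAG is the second column
        if flag & (SECONDARY | SUPPLEMENTARY):   # count each read once
            continue
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return mapped / total if total else 0.0
```

Reporting this number for both the DNA and RNA read sets, against the >= 1 kb contigs, would directly address the inclusivity question.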

Minor quibbles --

I cannot evaluate the claims of priority. Isn't it sufficient to say deep Illumina metagenomes are rare and leave it at that?

For the assembly, how were these parameters picked, and is there any evaluation of sensitivity or specificity?

The GitHub and Docker URLs in the PDF have an ] at the end that blocks just clicking on them.

I would suggest deprecating the Docker discussion a bit; it didn't work for me. I also have other suggestions for modification. Details below. I might suggest putting a tag on the repo so that you can link to the last time you actually ran the Docker container.


""" docker run -v /path/to/output/directory:/home/biogas/output 2015-biogas-cebitec """

didn't work, presumably due to docker version upgrades; I needed to put metagenomics/2015-biogas-cebitec before the last bit.


The raw data is downloaded into the docker container, which can be a bit of a problem, because on AWS (where I tried to run this docker container) the containers were stored on the root disk. There are two possible solutions that I can see --

  • do as I did in this blog post, and put the data on the host disk and then mirror it into the container:

  • use a data volume:

I think the first solution works better, but in any case, something needs to be done about putting large amounts of data in an opaque and potentially trashable container that consumes all the available disk space as part of the Makefile :).

Third, and most troublingly, my attempt to run the docker container failed with a missing 'unzip' command. This is probably easily fixable but does indicate a mismatch between the workflow and the container.

In sum, apart from minorly revising the docker container and/or discussion, and providing mapping rates, this looks great!

Titus Brown, UC Davis

Quality of written English: Acceptable

Declaration of competing interests: I declare that I have no competing interests.

