Review for "Streaming algorithms for identification of pathogens and antibiotic resistance potential from real-time MinION™ sequencing"

Completed on 10 Feb 2016 by Mick Watson.

Comments to author

Author response.

Cao et al present streaming algorithms for the identification of pathogens and antibiotic resistance genes using "real time" MinION sequencing. The paper has some interesting points in terms of proof-of-principle, though there is a lot of work needed before this could be implemented in practice.

A major point of discussion needs to be the fact that base-calling, via the cloud-based base caller Metrichor, is a major bottleneck to the pipeline and can add hours to the process. By using pre-basecalled data the authors bypass this issue; in a real setting, however, online base-calling would be a problem. Rapid matching on "squiggle data" could be discussed as a potential solution.

In order to investigate the potential bottleneck introduced by base-calling on the cloud, we ran our pipeline in actual 'real-time' on a clinical isolate, not just emulated real-time. We have added a section in the paper to report our results ("Real-time analysis of a clinical isolate"). We found that 45 minutes of sequence data took 3 hours to be returned base-called from Metrichor. As a result, our pipeline took 40 minutes to identify the species, even though the identification was based on only 7 minutes of sequence data. In more detail, we report the following in the manuscript.

We observed a delay from the base-calling of the data; the first read was sequenced on the MinION within one minute from starting the run, but the base-called data were received after 6 minutes. The delay tended to increase as more data were generated. We found the base-called data returned during the three hour run of the Metrichor service were actually sequenced within 45 minutes on the MinION. This highlights the need for a local base-calling step to improve real-time analysis. Figures~\ref{F:realtime}a and \ref{F:realtime}b show the timing (from the start of the MinION run) of sample identification using our pipeline. The pipeline reported \kp{} as the only species in the sample within 10 minutes, and reached a confidence interval of less than 0.1 in 40 minutes when about 200 reads were analysed. We noticed that these 200 reads were actually sequenced in 7 minutes by the MinION.
For strain identification, our pipeline initially reported ST1199 but after 2.5 hours, reported ST258 as the sequence type for this isolate. It is worth noting that the two strains are highly similar; their MLST profiles differ by only one SNP in the seven house-keeping genes. By sequencing the isolate on the Illumina MiSeq as described above, we confirmed that the sequence type for the strain is ST258. On the other hand, the sample identification from Metrichor initially reported \kp{} 1084 (ST23), but finally reported two strains, namely \kp{} JM45 (ST11) and \kp{} HS11286 (ST11), after 3 hours (Figure~\ref{F:realtime}c). During the three hour run with fewer than 4000 reads (16Mb of data), our pipeline reported two antibiotic resistance genes, namely sul2 (sulphonamide) and tetA (tetracycline). Our analysis of the Illumina data for this strain confirmed the presence of these two genes. Finally, we re-analysed the data from this run using the emulation described previously, and obtained the same results as from the real-time analysis.

I don't think the "pipeline" itself is described in sufficient detail and I would suggest a flowchart showing information flow through the pipeline, including software names/versions. I am also unsure the pipeline is genuinely an example of "streaming", which usually refers to the fact that data are not written to disk, simply piped from one process to another. However, the FAST5 files are written to disk by Metrichor, sequence data are extracted using npReader and (I assume) written to disk, picked up by BWA, and then the output of BWA is streamed to other processes. The authors may or may not be aware, but there is an API to MinKNOW that allows genuine streaming of data from the MinION device. Again, these points should be discussed.

Figure 1 shows a conceptual streaming analysis pipeline of Nanopore sequencing data. As described in the three paragraphs of the "Real-time analysis framework" section (page 2), our pipeline is streaming except for the base-calling step using Metrichor, which requires saving data to disk; these data were, however, picked up almost immediately by npReader. As discussed above, base-calling on the Metrichor cloud does introduce a delay; however, a couple of open-source base-callers have since been released, which we will work to include in the pipeline to avoid this delay.

All other steps (npReader, bwa, identification etc.) are performed in streaming fashion as explained in the associated documentation. We had not been aware of the API, as it was not released when we originally implemented this tool. The versions of all the tools we use are now given in the Methods section.

We have included the following two sentences to clarify the pipeline in the results and discussion section.

In each step of this pipeline, data is piped from one process to the next without being written to disk, with the exception of base-calling via Metrichor in which each read is written to disk once it has been base-called, and then read almost immediately by npReader.

and the following in the discussion:

The only step in our pipeline in which data is written to, and then re-read from, disk is the base-calling step using Metrichor. npReader immediately identifies new reads as they are generated by Metrichor; however, some delay can occur due to waiting for base-called data to be returned from Metrichor. Oxford Nanopore Technologies have recently opened up the Application Programming Interface to extract raw data directly from the MinION. This, together with the recent development of open-source base-calling algorithms~\cite{DavidDY2016, BozaBV2016} to run on the local machine, will allow future development of a completely streaming pipeline, in the sense of never saving data to disk.
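The sense of "streaming" at issue here, data piped from one process to the next without an intermediate file, can be illustrated with a minimal sketch. It is not the authors' pipeline: the two stage scripts (`PRODUCER`, `CONSUMER`) and the function `run_streaming_pipeline` are hypothetical stand-ins for tools such as a read extractor and an aligner, and only the Python standard library is assumed.

```python
import subprocess
import sys

# Hypothetical stage 1: emits one "read" per line on stdout.
PRODUCER = """
for i in range(5):
    print("read_%d ACGT" % i)
"""

# Hypothetical stage 2: consumes reads from stdin as they arrive
# and reports each read name with its sequence length.
CONSUMER = """
import sys
for line in sys.stdin:
    name, seq = line.split()
    print(name, len(seq))
"""

def run_streaming_pipeline():
    """Connect two processes with an OS pipe; no intermediate file is written."""
    producer = subprocess.Popen(
        [sys.executable, "-c", PRODUCER], stdout=subprocess.PIPE
    )
    consumer = subprocess.Popen(
        [sys.executable, "-c", CONSUMER],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close our copy of the pipe so the producer is signalled if the consumer exits.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output.splitlines()

if __name__ == "__main__":
    for line in run_streaming_pipeline():
        print(line)
```

A step that writes each read to disk and has the next tool pick it up, as described for Metrichor and npReader, would replace the pipe between the two stages with a file write followed by a file read.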

The authors state:

"We developed a novel strain typing method to identify the bacterial strain from the MinION sequence reads based on patterns of gene presence and absence."

I would like to know how this differs from metagenomic profilers such as Kraken (and many others), and indeed why the authors couldn't use one of these existing pipelines.

Our focus was on identifying species and strain as fast as possible, and hence with as little data as possible. We have compared our results to those produced by Kraken, which uses a k-mer based approach and is part of the What's In My Pot pipeline, and shown that our pipeline produces more accurate results with less data. Another important point of difference is that, unlike other existing metagenomic profilers, we continually update the uncertainty attached to our estimates as reads arrive. This is an important part of our pipeline, as it means the user can tell when a confident result has been obtained.
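The idea of updating an uncertainty estimate after every read, and stopping once it is tight enough, can be sketched as follows. This is an illustration only and not the method used in the manuscript: it swaps in a Wilson score interval for a single proportion (the fraction of reads assigned to one species), and it assumes that the "less than 0.1" criterion quoted above refers to the width of the interval. The names `wilson_interval` and `reads_until_confident` are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return 0.0, 1.0  # no data yet: the proportion could be anything
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def reads_until_confident(read_hits, max_width=0.1):
    """Consume per-read hits (True if the read matched the species) one at a
    time and return (reads_used, interval) as soon as the interval is narrower
    than max_width, or None if the stream ends before that happens."""
    successes = 0
    for n, hit in enumerate(read_hits, start=1):
        successes += bool(hit)
        low, high = wilson_interval(successes, n)
        if high - low < max_width:
            return n, (low, high)
    return None

if __name__ == "__main__":
    # Every read matches the same species: the interval narrows quickly.
    print(reads_until_confident([True] * 200))
```

Because the interval is recomputed after each read, the caller can report a result the moment the stopping criterion is met rather than after a fixed amount of sequencing.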

Finally, though the results are interesting, the conclusions are limited as the authors use pure cultures and I would be very interested to see how the platform performs on genuine clinical samples. The authors should also be aware of Phelim et al, which has a section on use of MinION for AMR typing.

Two of the samples used in our experiments (ATCC BAA-2146 and ATCC 700603) are isolates from clinical samples which were cultured and stored by ATCC. We also now include the results from a fourth K. pneumoniae strain, which is again a clinical isolate, collected in Greece in xxxx, in which the pipeline was genuinely run in real time.

At the time of submission of the manuscript, the paper by Bradley et al. had not yet been published in Nature Communications. Our manuscript had a brief discussion of its bioRxiv version. We have now updated it to cite the Nature Communications version. It's worth noting that this paper also used an isolate, collected in 2014, which would be considered a 'pure culture'. As far as we are aware, no lab has sequenced directly from a clinical sample using the MinION, as there are substantial challenges to be overcome in isolating/enriching bacterial DNA from human contaminant. We are now moving towards working with clinical samples directly, with a minimal culture step or none at all.