Supporting data for "Chiron: Translating nanopore raw signal directly into nucleotide sequence using deep learning".
===================================================================================================================

Teng, H; Cao, M, D; Hall, M, B; Duarte, T; Wang, S; Coin, L, J (2018): GigaScience Database. http://dx.doi.org/10.5524/100425

Summary:
--------

Sequencing by translocating DNA fragments through an array of nanopores is a rapidly maturing technology which offers faster and cheaper sequencing than other approaches. However, accurately deciphering the DNA sequence from the noisy and complex electrical signal is challenging. Here, we report Chiron, the first deep learning model to achieve end-to-end basecalling: directly translating the raw signal to DNA sequence without the error-prone segmentation step. Trained with only a small set of 4000 reads, we show that our model provides state-of-the-art basecalling accuracy even on previously unseen species. Chiron achieves basecalling speeds of over 2000 bases per second using desktop computer graphics processing units, making it competitive with other deep-learning basecalling algorithms.

Files:
-----

Read_Accuracy_Benchmark.tar.gz - Benchmark dataset of read accuracy across different basecallers for 4 species.

Assembly_benchmark.tar.gz - Benchmark dataset of assembly identity rate and relative length ratio across basecallers.

train.tar.gz - Training dataset of E. coli and Lambda phage.

eval.tar.gz - Evaluation dataset of E. coli and Lambda phage; files are in the same format as the training dataset.

Chiron-master.zip - Archival copy of the GitHub repository https://github.com/haotianteng/chiron downloaded 7-March-2018. A basecaller for Oxford Nanopore Technologies' sequencers. RRID:SCR_015950

NA12878-master.zip - Archival copy of the GitHub repository https://github.com/nanopore-wgs-consortium/NA12878 downloaded 7-March-2018. Oxford Nanopore Human Reference Datasets; the data are hosted on AWS. These data are licensed under CC-BY 4.0.