Learning bioinformatics mapping pipelines

As a statistical epidemiologist and biostatistician, I need to understand how my data were generated and where they come from, and I can only confirm that if the data are reproducible. In my experience, bioinformatics genetic mapping pipelines are often not reproducible: they are typically developed individually and in an ad hoc manner, which leads to reproducibility issues across the field of bioinformatics.

In an effort to address this, the Common Workflow Language (CWL) has been developed. Its reference runner, cwltool (https://github.com/common-workflow-language/cwltool), can be combined with Docker to allow for reproducible pipelines.
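To make the CWL plus Docker idea concrete, here is a minimal sketch in Python that builds a one-step CWL tool description and writes it to disk. CWL documents can be written as JSON as well as YAML, so the standard library is enough. The function names, the file name `echo-tool.cwl`, the `echo` command, and the `debian:stable-slim` image are all illustrative assumptions of mine, not taken from any particular pipeline.

```python
import json


def make_echo_tool(docker_image="debian:stable-slim"):
    """Build a minimal CWL CommandLineTool description as a plain dict.

    The image name is an illustrative assumption; a real pipeline would
    name the container that ships its aligner or variant caller.
    """
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        # The command that runs inside the container.
        "baseCommand": "echo",
        # Ask the runner to execute the step inside this Docker image.
        "requirements": {"DockerRequirement": {"dockerPull": docker_image}},
        # One string input, passed as the first positional argument.
        "inputs": {
            "message": {"type": "string", "inputBinding": {"position": 1}},
        },
        # Capture standard output as the single output file.
        "outputs": {"out": {"type": "stdout"}},
        "stdout": "message.txt",
    }


def write_tool(path, tool):
    """Write the tool description as JSON, which CWL runners also accept."""
    with open(path, "w") as handle:
        json.dump(tool, handle, indent=2)


if __name__ == "__main__":
    # Print the description so it can be inspected before saving it.
    print(json.dumps(make_echo_tool(), indent=2))
```

Assuming cwltool and Docker are installed, a file saved with `write_tool("echo-tool.cwl", make_echo_tool())` should be runnable with something like `cwltool echo-tool.cwl --message "hello"`. The point is that the command, its inputs, and the container it runs in are all written down in one file rather than living in someone's shell history.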

In a further effort to make pipelines fully deterministic, there is software that hashes the data (MD5) and keeps a common repository of data together with their hashes, so you can verify that the data you are working with are exactly the data that were recorded (https://guix-hpc.bordeaux.inria.fr/blog/2019/01/creating-a-reproducible-workflow-with-cwl/).
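The hashing idea can be sketched in a few lines of Python. This is not the software from the linked post, only a minimal illustration under my own assumptions: the function names and the manifest layout (a dict mapping file path to MD5 digest) are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large sequencing files fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record the current digest of every file (hypothetical manifest layout)."""
    return {str(path): md5_of_file(path) for path in paths}


def changed_files(manifest):
    """Return the paths whose current digest differs from the recorded one."""
    return [
        path
        for path, expected in manifest.items()
        if md5_of_file(path) != expected
    ]
```

In use, you would call `build_manifest` once when the data are first received and store the result alongside them. Before each pipeline run, `changed_files` should return an empty list. Anything it does return is a file that no longer matches what was originally recorded.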

Going forward, I plan to implement bioinformatics pipelines. As a start, I have been working through the CWL tutorial by Andrew Jesaitis (https://andrewjesaitis.com/2017/02/common-workflow-language—a-tutorial-on-making-bioinformatics-repeatable/).

Data Science Consultant

Dr. Brooke PhD MPH MSCE
