Issues and principles in the analysis of large genomic datasets.

Francis Clark1, Susan Lilley2
1fc@maths.uq.edu.au, Advanced Computational Modelling Centre, University of Queensland, Australia; 2s364202@student.uq.edu.au, School of Information Technology & Electrical Engineering, University of Queensland, Australia.

The construction of "research pipelines" for the study and analysis of genomic datasets (or similar) is a markedly different problem to that of constructing "production pipelines". The latter task is ideally performed by a software engineer, as the input data and required output are well defined. A research pipeline is a different sort of beast: it often involves working with poorly understood data to answer questions that are, initially, simplistic. This poster gives an overview of some of the strategies and best practices that may be employed in such work, including handling and appraisal of the data, choice of appropriate thresholds, extrapolation, and checking for reasonableness.