Swimming in a sea of data

Lessons learned from a large-scale genomics project


When I joined the project that would eventually produce this publication, way back in 2015, we were a long way from the finishing line. At that stage, the laborious construction of the sequencing libraries had finally been completed – but the computational analysis had barely begun. I’d already had experience with this kind of close experimental/computational collaboration during my PhD, but the next five years taught me there were many obstacles I hadn’t yet encountered! Here are a few things that, with hindsight, it would’ve been good to know right from the very beginning:

  1. Generating data is great, but your analysis guides your follow-up experiments. One of the things that attracted me to this project was the wealth of genomics data that had been collected – whole-genome bisulphite sequencing for genome-wide methylation, ChIP-seq for histone marks, and deeply sequenced, strand-specific total RNA-seq – but it took us a considerable amount of time to figure out which analyses were interesting or useful. When you have such a rich dataset, the temptation is to run endless exploratory analyses and try out that new tool you just read about; but if those analyses don’t produce results that lead to hypotheses you can then experimentally test, they’re (mostly?) a waste of time.
  2. Sharing makes things easier – and better – for everyone! One pitfall of a complex, large-scale computational project is that analysis becomes siloed: person A analyses data type A, person B does B, and so on, and ne’er the twain doth meet. Which is fine until person A wants to use some of data B in their analysis…! What you probably wouldn’t realise on reading the paper is that there was a huge, behind-the-scenes effort to create a data-handling infrastructure in R that was accessible to everyone. Similarly, sharing code and output files allowed analyses to be quickly adapted to a new question, or compared against past results. All of this meant that we could try ideas out relatively quickly, rather than having to wait on someone else’s schedule. And lastly but, perhaps, most importantly: everyone should use the same metadata/annotation files. It’s absolutely no fun, at all, to find out that an analysis might be wrong because it was using subtyping information two years out of date…!
  3. In a large, inter-disciplinary team, everyone should have input. Besides myself, our core team consisted of both computational and bench researchers. And while we were all very good at what we did, each side needed the other’s viewpoint: biological knowledge guided the design and interpretation of the in silico analyses, while the experimental work was informed by the computational results. Successfully bridging those two worlds was key to keeping everyone focused on the same goals.
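To make the shared-metadata point concrete, here’s a minimal sketch (in Python, for illustration – our actual infrastructure was in R, and the file contents, subtype labels, and version number below are all invented) of the idea: every analysis loads annotations through one shared entry point, so a stale subtyping table fails loudly instead of silently skewing results.

```python
import csv
import io

# Hypothetical contents of a single, canonical annotation file that every
# analysis loads, instead of each person keeping their own stale copy.
METADATA_CSV = """sample_id,subtype,batch
S1,proneural,1
S2,mesenchymal,1
S3,proneural,2
"""

CURRENT_VERSION = 3  # bumped whenever the subtyping is revised


def load_sample_metadata(text=METADATA_CSV, expected_version=CURRENT_VERSION):
    """Single shared entry point for sample annotations.

    Because every analysis calls this one function, a subtype revision
    propagates to everyone at once, and an analysis pinned to an old
    version errors out rather than silently mixing old and new calls.
    """
    if expected_version != CURRENT_VERSION:
        raise ValueError(
            f"metadata v{expected_version} is stale; current is v{CURRENT_VERSION}"
        )
    rows = csv.DictReader(io.StringIO(text))
    return {row["sample_id"]: row for row in rows}


meta = load_sample_metadata()
print(meta["S2"]["subtype"])  # -> mesenchymal
```

The design choice is simply that the annotation file has one owner and one loader; everything else (a version column inside the file, a checksum, a package-level data object in R) is a variation on the same theme.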

So these are some things to bear in mind the next time you find yourself in a similar “Big Data”, inter-disciplinary project: be aware of the distinct challenges you’re going to face. And finally, but maybe most importantly of all: always take meeting notes, and make sure you know who’s doing what, and when!

 
