A novel machine learning method helps to unlock the genomic code in FFPE tumour samples


In hospitals worldwide, formalin is commonly used to fix tissues from patients to help preserve the tissue samples in a life-like structure. In addition, these formalin-fixed and paraffin-embedded (FFPE) biopsies can be used for diagnostic purposes. However, formalin can cause damage to a cell’s DNA and change its sequence, making it difficult to interpret the genome.

To circumvent this challenge, our recently published paper describes a new machine learning method that removes these changes, enabling accurate profiling of the mutational processes within FFPE tumour biopsies. Our work can help to unlock the genomic code in this invaluable resource.

Although I was already aware of the difficulties of analysing FFPE samples, I did not anticipate the extent of the challenge until I attempted to interpret the mutational data from our sequenced FFPE samples. I still remember an early comment from Trevor, one of my PhD supervisors: “it works great in fresh tissues but may not in FFPE samples.” Later on, I discovered that Trevor’s words echoed the voices of the hundreds of biologists who have repeatedly reported their concerns about whether these samples can be analysed accurately.

Well…to me, this started to sound like FFPE is the sample-which-must-not-be-used; the ‘Lord Voldemort’ of clinical genomic profiling. Unfortunately, this turned out to be true! For example, when we first sequenced a pair of FFPE samples from a colorectal cancer patient, we discovered that it was not only the mutational profiles in these FFPE samples that were unexpected but also the allele frequency distribution of the stringently filtered mutations. That was when we realised that even rigorous mutation calling and filtering pipelines would not be sufficient to eliminate all FFPE artefacts.

At this point, we decided to find an alternative way to mitigate FFPE artefacts. The idea of using aggregated mutation profiles, rather than focusing on individual mutations, came to mind immediately. This approach was initially developed to study the mutational processes in fresh tumour samples. However, we believed FFPE samples would be a perfect application for it if we considered the observed FFPE genomics data as a combination of formalin-origin (the noise) and biology-origin (the signal) mutational processes. We could therefore learn the artefactual mutation pattern and subtract this component to reveal the biological mutation spectrum. In other words, we attempt to ‘correct’ the FFPE artefacts, as Ville, my primary PhD supervisor, describes it.
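
To make the signal-plus-noise idea concrete, here is a minimal toy sketch in Python. It is not the algorithm from our paper or the FFPEsig code: it assumes that both the artefact pattern and the biological pattern are already known (in practice the biological pattern is the unknown), it uses six made-up mutation channels instead of the usual 96, and every name and number in it is hypothetical.

```python
import numpy as np

# Toy patterns over six hypothetical mutation channels; each sums to 1.
# All values are invented for illustration only.
ffpe_signature = np.array([0.05, 0.05, 0.60, 0.20, 0.05, 0.05])        # formalin noise
biological_signature = np.array([0.30, 0.25, 0.05, 0.05, 0.20, 0.15])  # tumour signal


def correct_profile(observed, noise_sig, signal_sig):
    """Estimate each pattern's contribution, then subtract the noise part.

    Assumes observed ~= a * noise_sig + b * signal_sig with a, b >= 0.
    Ordinary least squares is a stand-in here, chosen for brevity.
    """
    signatures = np.column_stack([noise_sig, signal_sig])
    exposures, *_ = np.linalg.lstsq(signatures, observed, rcond=None)
    exposures = np.clip(exposures, 0.0, None)  # contributions cannot be negative
    corrected = np.clip(observed - exposures[0] * noise_sig, 0.0, None)
    return exposures, corrected


# Simulate an FFPE sample: 400 artefactual plus 600 biological mutations.
observed = 400.0 * ffpe_signature + 600.0 * biological_signature
exposures, corrected = correct_profile(observed, ffpe_signature, biological_signature)

print("estimated contributions (noise, signal):", np.round(exposures, 1))
print("corrected profile:", np.round(corrected, 1))
```

In this noise-free toy case the subtraction returns the biological component, because the two patterns are clearly distinct. Real FFPE data are noisier, and the biological pattern has to be inferred rather than supplied.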

I thought our challenge would lie in deriving mathematically stable solutions for our model. But (hold on tight) it was, in fact, the real-world data from FFPE samples that were the real villain in this scenario, as these clinically collected samples varied widely. We found that pre-analytical factors, e.g. fixation time, DNA extraction method and bioinformatics pipeline, could cause strong batch effects in the sequencing data. A benchmark dataset would therefore be ideal for working out the specific causes of these batch effects.

Despite the variability in FFPE data, we still found mutational profiles of formalin fixation that were shared across samples. Surprisingly, the artefactual mutation patterns are remarkably similar to two previously established biological mutational processes: SBS1 and SBS30, associated with a patient’s age and DNA base excision repair deficiency, respectively. Without this prior knowledge, one could easily over-interpret the effect of these known mutational processes within an FFPE sample. This mistake would then propagate when deriving other clinically important signatures. Hence, we created FFPEsig, a tool to subtract the FFPE noise and recover any masked biological profiles.

I am a biologist turned data scientist, so the greatest joy in my work has always been explaining the biological mechanisms behind our observations. So when we discovered that the FFPE signatures crossed paths with the tumour mutation processes, my biologist mode was activated to seek a rationale. We found that both are initiated by cytosine deamination: one occurs in vitro, while the other happens in vivo. These BINGO moments keep me excited and proud to be a scientist.

I appreciated the collaborative environment that supported me throughout this scientific journey. I am part of a diverse team of people from different backgrounds, ranging from clinicians and wet-lab scientists to bioinformaticians and mathematicians. This unique cross-disciplinary work environment is an excellent incubator for resolving challenges innovatively. Working with such a mixed group certainly inspired me to think differently about solutions whenever it felt our work had reached a bottleneck.

In the future, it would be interesting to use our method to characterise mutational processes in large collections of FFPE samples. We believe this will provide clinically relevant insights into the causes of individual cancers. We are also looking forward to studies focusing on predicting clinically actionable signatures in FFPE samples, for instance, in tumours with homologous recombination deficiency or high microsatellite instability. Overall, we are delighted that our work can help to unlock helpful information that has otherwise been ‘fixed’ in routinely collected clinical samples.
