Gene fusions are formed by the juxtaposition of parts of two genes, resulting from structural rearrangements such as deletions and translocations. In cancer cells, many gene fusions are driver mutations that play important roles in carcinogenesis. However, detecting them in single-cell RNA sequencing (scRNA-seq) data is difficult: the data have a high noise level and contain various technical artifacts that can lead to spurious fusion discoveries. Our recent study, published in Nature Communications, describes a single-cell gene fusion detection method named scFusion, which employs a statistical model and a deep learning model to control for false positives, and demonstrates its applications in cancer single-cell studies.
How it started
I had just earned my bachelor’s degree in mathematics when I began my Ph.D. studies in Prof. Ruibin Xi’s lab at Peking University in 2017. I was immediately immersed in the world of bioinformatics, studying data processing methods and exploring the connections between bioinformatics and mathematics. Our lab had rich experience in developing algorithms for detecting structural variation (SV) and copy number variation (CNV). From our data analysis experience, we knew that SV detection based on single-cell whole-genome sequencing (scWGS) data is challenging. Bulk SV methods detect SVs by analyzing discordant reads and chimeric reads. Compared with bulk data, however, scWGS data contain far too many discordant and chimeric reads, most of which arise from experimental artifacts rather than from true SVs. Similarly, we observed that full-length scRNA-seq data contain a large number of chimeric reads, making gene fusion detection challenging. Although many bulk SV and gene fusion detection methods had been developed, single-cell methods were still lacking.
Developing single-cell SV detection methods and developing single-cell fusion detection methods are both important and interesting research topics. However, we felt that single-cell SV detection was the more challenging of the two, for the following reasons. (1) The scWGS data of an individual cell are usually much larger than its scRNA-seq data. (2) In a single cell, DNA usually has only two copies, whereas RNA can have many more. (3) scRNA-seq datasets usually contain many more cells than scWGS datasets. Therefore, we decided to develop a single-cell fusion detection algorithm first, even though our lab did not have much experience in developing fusion detection algorithms.
We thought that our research could be divided into four parts: a statistical model to combine the information from all single cells, a simulation study to show the performance of our method, a spike-in experiment with known true fusions to show the sensitivity and specificity of the method on real data, and a few application studies. Little did we know that it would be harder than it seemed.
How it’s going
scFusion applies a pipeline similar to that of bulk methods to find fusion candidates. Its most important step is the modeling that removes the large number of artifacts among those candidates. We noticed that only a very small proportion of fusion candidates had many supporting split-mapped reads, while most candidates had very few. Candidates with few supporting reads were treated as background noise. A statistical model was needed to fit the distribution of this background noise and to assess the significance of each fusion candidate. We found that the background noise level was related to gene expression and GC content. Since the zero-inflated negative binomial (ZINB) distribution is commonly used in single-cell quantification analysis, we used a ZINB distribution with adaptive mean, zero probability, and over-dispersion parameters in the statistical model. The model used the supporting-read information from all samples and was easy to fit, but its performance did not satisfy us.
We found that some fusion candidates with many supporting reads were called repeatedly in several different datasets and appeared to be false positives, so we suspected that some features of the reads themselves could cause false fusion discoveries. To solve this problem, we developed a bi-LSTM deep learning model to discover and learn such patterns, and then filtered out the fusion candidates supported by reads containing them. Preparing the training data for this neural network was difficult because known fusions are very rare (~3,500 from the public dataset, compared with over 100,000 fusion candidates). We applied a proxy strategy to circumvent the problem and validated the results using real data.
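To make the idea of scoring a candidate against a ZINB background more concrete, here is a minimal sketch in Python. It is not the scFusion implementation: the function names, the mean/size parameterization, and the fixed parameter values are all assumptions made for illustration. In the model described above, the mean, zero probability, and over-dispersion adapt to gene expression and GC content, whereas this sketch simply takes them as given numbers.

```python
import math


def nb_log_pmf(k: int, mu: float, size: float) -> float:
    """Log-probability of count k under a negative binomial with mean `mu` > 0
    and dispersion parameter `size` > 0 (variance = mu + mu**2 / size)."""
    return (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )


def zinb_pmf(k: int, mu: float, size: float, pi_zero: float) -> float:
    """Probability of count k under a zero-inflated negative binomial.
    `pi_zero` is the extra probability mass placed on zero."""
    nb = math.exp(nb_log_pmf(k, mu, size))
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * nb
    return (1.0 - pi_zero) * nb


def zinb_upper_tail(k: int, mu: float, size: float, pi_zero: float) -> float:
    """P(X >= k): the chance that background noise alone yields at least
    k supporting reads, usable as a p-value for a fusion candidate."""
    if k <= 0:
        return 1.0
    below = sum(zinb_pmf(j, mu, size, pi_zero) for j in range(k))
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    # Hypothetical background parameters, chosen only for illustration.
    mu, size, pi_zero = 0.5, 1.0, 0.6
    for reads in (1, 2, 5, 20):
        print(reads, zinb_upper_tail(reads, mu, size, pi_zero))
```

Under this kind of model, a candidate with one or two supporting reads is hard to distinguish from background, while a candidate with many supporting reads gets a very small tail probability. As the text notes, a small p-value alone was not enough in practice, which is why a second, read-level filter was added.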
Next, we needed to validate our method in a simulation study and a real-data study. The simulation was harder than we had expected: a well-designed simulation should mimic the structure of real-world data and contain false positives such as technical artifacts, and existing simulations offered little to learn from. We analyzed many real datasets and calculated statistics of their chimeric reads to generate background noise. We also designed several patterns to mimic the large number of technical artifacts seen in real data. The simulation took us four months to finish.
Gene fusions alter the genomic structure of two genes, so the functions and/or features of these genes may also change. In multiple myeloma data, we detected IgH-WHSC1; WHSC1 is an oncogene that promotes cell proliferation. Patients with this gene fusion had worse overall survival, so we speculated that the fusion overactivated WHSC1. We compared the sequencing coverage of WHSC1 upstream and downstream of the fusion breakpoints and found that expression was up-regulated downstream of the breakpoints, which confirmed our speculation. We also applied scFusion to three other datasets to show its performance.
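The upstream-versus-downstream coverage comparison can be sketched as follows. This is a simplified illustration, not the analysis code from the study: the function name, the use of a plain mean, and the toy depth values are assumptions, and the numbers are invented rather than taken from the multiple myeloma data.

```python
from statistics import mean


def downstream_to_upstream_ratio(depth: list, breakpoint: int) -> float:
    """Ratio of mean coverage downstream of a breakpoint to mean coverage
    upstream of it. `depth` holds per-position coverage along the gene in
    transcription order; `breakpoint` is the index of the first downstream
    position."""
    upstream = depth[:breakpoint]
    downstream = depth[breakpoint:]
    if not upstream or not downstream:
        raise ValueError("breakpoint must split the gene into two non-empty parts")
    up = mean(upstream)
    down = mean(downstream)
    # With no upstream coverage at all, the ratio is unbounded.
    return down / up if up > 0 else float("inf")


if __name__ == "__main__":
    # Invented toy coverage: low before the breakpoint, higher after it.
    toy_depth = [2, 2, 2, 2, 8, 8, 8, 8]
    print(downstream_to_upstream_ratio(toy_depth, breakpoint=4))
```

A ratio well above 1 in cells carrying the fusion, compared with cells that do not, would be consistent with up-regulation of the downstream part of the gene. A real analysis would also need to account for coverage biases along the transcript and for differences in sequencing depth between cells.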
The detection power of scFusion is limited for rare fusions in highly heterogeneous tumor samples. Such inherent limitations can be overcome only by sequencing more cells and/or sequencing each cell more deeply. Meanwhile, scRNA-seq technology continues to improve, both in the number of genes captured and in the evenness of coverage across each transcript. As its data quality approaches that of bulk RNA-seq data, it will enable a more comprehensive profiling of fusion transcripts. We expect that scFusion and the statistical/machine learning framework introduced therein will find useful applications in future single-cell studies.
Gene fusion detection can also be applied to single-molecule sequencing. One of the biggest advantages of this technology is its extremely long reads: the average read length is 1,000 bases or more, compared with about 150 for RNA-seq. Long reads facilitate the accurate detection of gene fusions, the discovery of genome structure, and the detection of three-gene fusions. Once single-cell sequencing can be fully combined with third-generation sequencing, statistical and machine learning models like ours may have another chance to show their power.
Accurate detection of SVs based on scWGS data is another problem that should be solved. SVs can also trigger cancer development, so understanding the SVs in each cell or cell group can facilitate the analysis of tumorigenesis and sample heterogeneity.
Finally, as scFusion is applied to more and more samples, it is our hope that more algorithms will be developed to improve fusion detection power.