Inferring genome organisation and gene regulation from 6-base sequencing data

Credits

Fabio Puddu¹
Annelie Johansson¹
Aurélie Modat¹
Jamie Scotcher¹
Riccha Sethi¹
Shirong Yu¹ ²
Nick Harding¹
Mark S. Hill¹
Ermira Lleshi¹
Casper Lumby¹
Jean Teyssandier¹
Michael Wilson¹ ³
Robert Crawford¹
Tom Charlesworth¹
Robert J Osborne¹
Shankar Balasubramanian¹ ⁴ ⁵
Páidí Creed¹

1 biomodal Ltd, The Trinity Building, Chesterford Research Park, Cambridge, UK.

2 Current address: Tagomics Ltd, The Cori Building, Little Abington, Cambridge, UK.

3 Current address: Department of Astrophysical Sciences, Princeton University, New Jersey, US.

4 Cancer Research UK Cambridge Institute, University of Cambridge, Cambridge, UK.

5 Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK.

Introduction

A key challenge in genomics is the generation and integration of data across different modalities. This typically requires several assays to be performed, which is costly and time-consuming. Moreover, integrating the resulting data across assays is technically challenging. Here, we leverage an assay (Figure 1, [1]) that reads the complete genetic sequence together with the DNA modifications 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) from low-nanogram amounts of DNA, providing 6-base genomic data.

Given that 5mC and 5hmC play key roles in gene regulation and chromatin organisation, we aimed to explore how these multimodal data could further elucidate key biological processes and yield novel insight. To this end, we trained and evaluated a series of machine-learning models to predict gene expression, chromatin accessibility, and enhancer state from 6-base sequence data.

Figure 1 | duet multiomics solution evoC is a 6-base calling technology that reads all four canonical bases plus 5mC and 5hmC.

Correlation with chromatin accessibility & mRNA

Figure 2

IGV plot at two regions of the genome showing how 5mC and 5hmC modifications vary in open and closed chromatin as defined by ATAC-seq (teal bars), and how this pattern of variation also reflects newly synthesised RNA (TT-seq[2], blue bars). Cytosine modifications correlate with gene expression and these data are consistent with 5mC and 5hmC having opposing effects.
Relationship between 5mC (left) and 5hmC (right) and categorical gene expression levels (high, intermediate, low). The top panels show the mean methylation fractions across 60 kb regions around the TSS of genes in the E14 mESC cell line. The bottom panels show the same information for each individual gene rather than as a combined summary. These plots highlight the different dynamics of 5mC and 5hmC around the TSS and into the gene body of genes with differing expression levels (quantified with nascent RNA sequencing data). In particular, the 5hmC signal clearly separates genes into high, intermediate and low expression states, whereas the 5mC signal is not clearly differentiated across these same classes.

Gene expression models

Figure 3 | Using machine learning to relate 5hmC and 5mC patterns of variation to RNA sequencing (RNA-seq) and nascent RNA sequencing (newly synthesised RNA, TT-seq).

Here, we split the genome into a series of genomic regions (2 kb upstream, 250 bp around the TSS, 5′ UTRs, first introns and exons, introns, exons, 3′ UTRs, and 5 kb downstream), and computed the mean 5mC and 5hmC fraction in each region from duet evoC measurements. These features, along with the number of CpGs and the region length, were used to train an XGBoost regression model on published E14 mRNA data [2,3]. All chromosomes except chr. 8 were used for training, and the withheld chr. 8 was used for testing. For RNA-seq data (A), we found good agreement (R²~0.75) between predicted and actual expression. For nascent RNA data (TT-seq, B), the model predicted expression better, with a higher R² of ~0.85. This finding highlights that 5mC and 5hmC constitute highly dynamic signals that capture aspects of nascent transcriptional states, yielding insight into real-time transcriptional dynamics.
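The feature construction and chromosome hold-out described above can be sketched in plain Python. This is only an illustration under stated assumptions: the record layout (a `chrom` key per gene, per-CpG `(5mC, 5hmC)` fraction pairs per region), the function names, and the choice to score empty regions as 0.0 are all hypothetical, and the XGBoost model itself is not reproduced. The poster does not say how R² was computed; the coefficient of determination is assumed here.

```python
from statistics import mean

def region_features(cpgs, region_length):
    """Summarise one genomic region from its per-CpG modification fractions.

    cpgs: list of (mc_fraction, hmc_fraction) pairs, one per CpG
          (hypothetical layout, not the duet pipeline's output format).
    """
    n = len(cpgs)
    return {
        # Assumption: a region without CpGs gets 0.0 for both means.
        "mean_5mC": mean(mc for mc, _ in cpgs) if n else 0.0,
        "mean_5hmC": mean(hmc for _, hmc in cpgs) if n else 0.0,
        "n_cpg": n,
        "length": region_length,
    }

def split_by_chromosome(genes, test_chrom="chr8"):
    """Hold out every gene on test_chrom; train on all other chromosomes."""
    train = [g for g in genes if g["chrom"] != test_chrom]
    test = [g for g in genes if g["chrom"] == test_chrom]
    return train, test

def r_squared(actual, predicted):
    """Coefficient of determination between observed and predicted values."""
    mu = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mu) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

Splitting by whole chromosome, rather than sampling genes at random, keeps neighbouring genes that share local methylation context from appearing on both sides of the train/test boundary.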

Chromatin accessibility models

Figure 4 | Deep learning models to predict chromatin accessibility across sequences.

We trained dilated residual convolutional neural networks to predict base-resolved chromatin accessibility across 2,000 bp sequences centred on TSS regions. These models take a combined encoding of genomic bases (mm10 reference), 5mC and 5hmC (duet evoC) as input, and we used public ATAC-seq data from E14 mESC as the target output. During training we held out chromosomes 6 and 7 for validation and chromosomes 8 and 9 as the test set.
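The poster does not specify how the combined encoding is laid out. One plausible form, shown here purely as an assumption, is a per-base vector of six channels: a one-hot code for A, C, G, T followed by the 5mC and 5hmC fractions at that position. The function names and the handling of ambiguous bases (all-zero one-hot) are hypothetical; the chromosome assignment follows the split stated above.

```python
BASES = "ACGT"

def encode_6base(seq, mc, hmc):
    """Encode a sequence as rows of [A, C, G, T, 5mC, 5hmC].

    seq: reference bases; mc / hmc: per-position modification fractions
    (0.0 where no modification is measured). Assumed layout, not the
    encoding used on the poster.
    """
    if not (len(seq) == len(mc) == len(hmc)):
        raise ValueError("seq, mc and hmc must have the same length")
    rows = []
    for base, m, h in zip(seq.upper(), mc, hmc):
        # Bases outside ACGT (e.g. N) become an all-zero one-hot vector.
        one_hot = [1.0 if base == b else 0.0 for b in BASES]
        rows.append(one_hot + [float(m), float(h)])
    return rows

def encode_4base(seq):
    """Sequence-only baseline: the first four channels of the 6-base encoding."""
    zeros = [0.0] * len(seq)
    return [row[:4] for row in encode_6base(seq, zeros, zeros)]

def assign_split(chrom):
    """Chromosomes 6 and 7 for validation, 8 and 9 for test, the rest for training."""
    if chrom in ("chr6", "chr7"):
        return "validation"
    if chrom in ("chr8", "chr9"):
        return "test"
    return "train"
```

Keeping the 4-base encoding as a strict subset of the 6-base one means the two model families differ only in whether the modification channels are present.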

Globally, we find that models trained only on the genomic sequence (4-base encoding, R²=0.54) perform worse than models that also had access to 5mC and 5hmC as part of their inputs (6-base encoding, R²=0.62). The panels to the left show four example regions from the test dataset with ensembled predictions (10 random initialisations) from 4-base (blue) and 6-base (red) models.
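The poster does not state how the 10 randomly initialised models were combined. A common choice, assumed here, is to average their per-base predictions. The function name is hypothetical.

```python
def ensemble_mean(predictions):
    """Average per-base predictions across independently initialised models.

    predictions: one list of per-base values per model, all of equal length.
    Averaging is an assumption; the poster only says "ensembled".
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    length = len(predictions[0])
    if any(len(track) != length for track in predictions):
        raise ValueError("all prediction tracks must have the same length")
    n_models = len(predictions)
    # zip(*predictions) groups the values that each model predicts per position.
    return [sum(values) / n_models for values in zip(*predictions)]
```

For instance, `ensemble_mean([[0.0, 1.0], [1.0, 3.0]])` returns `[0.5, 2.0]`.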

Enhancer state models

Enhancer states:

Active: actively enhances gene expression (H3K4me1 & H3K27ac)
Primed: is ready to activate gene expression (H3K4me1 but not H3K27ac)
Repressed: carries none of the above marks; such enhancers are also known to be specific to various late-stage tissues.
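The three definitions above amount to a small decision rule on two histone marks, sketched below. The function name and boolean inputs are hypothetical. The case of H3K27ac without H3K4me1 is not covered by the definitions, so the sketch returns `None` for it rather than guessing.

```python
def enhancer_state(h3k4me1, h3k27ac):
    """Map presence of two histone marks to an enhancer state label.

    h3k4me1, h3k27ac: booleans indicating whether each mark is present
    at the enhancer (how presence is called is outside this sketch).
    """
    if h3k4me1 and h3k27ac:
        return "active"
    if h3k4me1:
        return "primed"
    if not h3k27ac:
        return "repressed"
    # H3K27ac alone is not defined by the three states listed above.
    return None
```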

Figure 5 | Classification of enhancer states in the 5mC vs. 5hmC fraction space.

Enhancers are cis-acting regulatory regions that regulate cell-type-specific gene expression programs. Context-specific enhancer states are typically characterised by histone modifications in flanking nucleosomes [5]. Here we used the enhancer classes defined in [5] as labels and trained an SVM model to predict these labels using only the 5mC and 5hmC fractions across each enhancer region.

We found that active enhancers typically have low 5mC and 5hmC levels, primed enhancers typically have moderate 5mC and high 5hmC levels, and repressed enhancers typically have high 5mC and low 5hmC levels. Our trained model predicted enhancer states with 85.5% accuracy on the 20% of the data that was held out during training. In the plot, the different background shades correspond to the regions separated by the SVM decision boundaries.

Conclusion

We have shown that resolved methylation and genomic data, combined with machine learning, can generate accurate inference of gene expression (both steady-state and nascent), chromatin accessibility, and enhancer state, demonstrating the key role of 5mC and 5hmC in gene regulation. Moreover, there is the potential for a compounding effect whereby 6-base genomic assays yield not only direct data but also the foundations for multiple other inferred modalities. Looking ahead, these approaches could enable novel insights into core biological processes and speed up iteration in experimental projects, where one can obtain multifaceted insights and even conduct pilot experiments in silico using predictive models.
