Tetrahymena Gene Network Explorer (TGNE)

TGNE Graphical Abstract

The Problem & Motivation:

The model organism Tetrahymena thermophila has been foundational to many biological discoveries but is difficult to study with high-throughput genetic screens. While a wealth of transcriptomic data (both legacy microarray and modern RNA-seq) exists, it has remained fragmented across unconnected datasets. This project was motivated by the need to unify these disparate data types under a single, robust bioinformatic pipeline and so make functional gene discovery more accessible to researchers.

Key Features:

Data Integration Pipeline: Implements a rigorous pipeline that processes, quality-checks, filters, and normalizes disparate microarray and RNA-seq datasets into a common analytical framework, including RMA normalization, NUSE quality assessment, and batch effect removal.

Validated Co-Expression Clustering: Uses a Pareto-optimized parameter scan (over distance metrics and numbers of nearest neighbors) to build a high-dimensional UMAP graph, which is then partitioned with the Leiden community detection algorithm to identify statistically significant co-expression modules.

Interactive HTML Dashboard (TGNE): A standalone, maintenance-free web application that allows researchers to search for genes, visualize expression heatmaps, explore UMAP embeddings, and analyze functional enrichment (GO, KEGG) for any gene cluster.

Hypothesis Validation: The methodology was successfully validated by recovering over 80% of previously known genes involved in mucocyst biogenesis and experimentally confirming that four newly identified, co-expressed genes are involved in the pathway.
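The selection step behind the Pareto-optimized parameter scan mentioned under Validated Co-Expression Clustering can be illustrated with a minimal sketch. It assumes two objectives to maximize per parameter setting; the objective names (`modularity`, `enrichment`), the metric names, and every number below are invented for illustration and are not results from the project.

```python
from typing import NamedTuple


class ScanResult(NamedTuple):
    metric: str        # distance metric tested (illustrative names)
    n_neighbors: int   # nearest-neighbor count tested
    modularity: float  # objective 1, higher is better (assumed objective)
    enrichment: float  # objective 2, higher is better (hypothetical objective)


def dominates(a: ScanResult, b: ScanResult) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    at_least_as_good = a.modularity >= b.modularity and a.enrichment >= b.enrichment
    strictly_better = a.modularity > b.modularity or a.enrichment > b.enrichment
    return at_least_as_good and strictly_better


def pareto_front(results: list[ScanResult]) -> list[ScanResult]:
    """Keep every setting that no other setting dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Toy scan results (made-up scores, for demonstration only).
scan = [
    ScanResult("manhattan", 3, 0.62, 0.40),
    ScanResult("manhattan", 5, 0.58, 0.47),
    ScanResult("euclidean", 3, 0.55, 0.38),
    ScanResult("euclidean", 5, 0.50, 0.45),
    ScanResult("cosine", 4, 0.60, 0.47),
]

front = pareto_front(scan)
for r in front:
    print(r.metric, r.n_neighbors, r.modularity, r.enrichment)
```

Settings on the front represent trade-offs where improving one objective would cost the other; a final choice among them still requires a judgment call or an additional criterion.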

My Contributions:

As a lead developer on this project, my responsibilities covered the entire pipeline except for the initial microarray data processing. This included:

Developing and implementing the RNA-seq quantification pipeline using kallisto.
Designing and executing the core co-expression clustering methodology, including the Pareto-optimized parameter scan, UMAP graph construction, and Leiden community detection.
Building the interactive Bokeh dashboard (TGNE) from the ground up for data visualization and hypothesis generation.
Conducting the computational validation, including the implementation of scrambled and simulated negative controls to verify the statistical significance of our clusters.
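One way to picture a scrambled negative control is the sketch below. It assumes a particular scheme, shuffling each gene's expression values across conditions independently, and a particular statistic, mean pairwise Pearson correlation within a cluster; both are illustrative choices rather than a record of the exact procedure used in the project, and the expression values are synthetic.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all gene pairs in a cluster."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def scramble(profiles, rng):
    """Shuffle each gene's values across conditions independently,
    which preserves each gene's value distribution but breaks co-expression."""
    return [rng.sample(p, len(p)) for p in profiles]


rng = random.Random(0)

# Synthetic cluster: six genes following a shared trend plus small noise.
trend = [0.0, 1.0, 3.0, 6.0, 4.0, 2.0, 1.0, 0.5]
cluster = [[v + rng.gauss(0, 0.2) for v in trend] for _ in range(6)]

observed = mean_pairwise_correlation(cluster)
null = [mean_pairwise_correlation(scramble(cluster, rng)) for _ in range(200)]

# Empirical p-value with the usual +1 correction.
p_value = (1 + sum(s >= observed for s in null)) / (1 + len(null))
print(f"observed={observed:.3f}, max null={max(null):.3f}, p={p_value:.4f}")
```

A cluster whose observed statistic sits well above the scrambled distribution is unlikely to be an artifact of the clustering procedure alone.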

Tech Stack & Implementation:

Data Processing & Normalization: kallisto was used for RNA-seq transcript quantification. The legacy microarray pipeline was handled by collaborators using R (oligo, limma).

Clustering & Analysis: Python with pandas and numpy for data manipulation. scikit-learn and scipy were used for distance matrix calculations and statistical controls.

Network & Community Detection: Graphs were constructed using umap-learn and clustered using the leidenalg library. Parameter optimization was guided by networkx modularity scores.

Visualization & Dashboard: Bokeh was used to generate all plots and the final interactive, standalone HTML dashboard.
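For readers unfamiliar with the modularity score mentioned under Network & Community Detection, the sketch below writes out Newman modularity for an undirected, unweighted graph in plain Python. For such graphs it should agree with `networkx.algorithms.community.modularity` at its default resolution; the toy graph and the gene names are hypothetical.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Newman modularity: sum over communities of
    (intra-community edges / m) - (community degree sum / 2m)^2."""
    m = len(edges)
    membership = {node: i for i, comm in enumerate(communities) for node in comm}
    intra = defaultdict(int)       # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            intra[membership[u]] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph: two triangles of hypothetical genes joined by a single edge.
edges = [
    ("geneA", "geneB"), ("geneB", "geneC"), ("geneA", "geneC"),
    ("geneD", "geneE"), ("geneE", "geneF"), ("geneD", "geneF"),
    ("geneC", "geneD"),
]

good = modularity(edges, [{"geneA", "geneB", "geneC"}, {"geneD", "geneE", "geneF"}])
bad = modularity(edges, [{"geneA", "geneB", "geneD"}, {"geneC", "geneE", "geneF"}])
print(f"triangles as communities: {good:.4f}")
print(f"mixed-up partition:       {bad:.4f}")
```

The partition that follows the two triangles scores higher than the mixed-up one, which is the property that makes modularity useful for comparing candidate parameter settings.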

Resources & Citation

Live Dashboards and Guide

Michael A. Bertagna, Lydia J. Bright, Fei Ye, Yu-Yang Jiang, Debolina Sarkar, Ajay Pradhan, Santosh Kumar, Shan Gao, Aaron P. Turkewitz, Lev M. Z. Tsypin, “Inferring gene-pathway associations from consolidated transcriptome datasets: an interactive gene network explorer for Tetrahymena thermophila,” NAR Genomics and Bioinformatics, Volume 7, Issue 2, June 2025, lqaf067. doi:10.1093/nargab/lqaf067
