On-boarding

Hello and welcome! This document collects the steps to get set up in the lab, along with some useful tools and tasks that the lab is interested in. For a collection of how-tos, see here. Background reading for the lab can be found here and publications here.

Getting started

What do we do?

Our work focuses on characterizing gene function and developing methods to understand gene function and dysfunction in disease.
We particularly look to gene expression and co-expression networks derived from thousands of expression datasets and samples.

Functional genomics

An important focus (and challenge) in functional genomics is gene function prediction and assessment.

Guilt-by-association

The guilt-by-association principle states that genes with similar functions tend to share similar properties. This allows previously unknown functions of a gene to be statistically inferred from prior knowledge about other genes. Applying it requires four components:

  1. A set of candidate genes: All genes in the genome or a more focused set such as those in a candidate genetic locus.

  2. One or more target gene groups of interest: Typically defined around a function, such as those from the Gene Ontology (GO).

  3. Data with associations or similarities among the target and candidate genes: These data are often represented or thought of as a network, and can include coexpression, protein interactions, genetic interactions, sequence similarities, phylogenetic profiles, and phenotype and disease association profiles.

  4. An algorithm: This is to transfer (or infer) functional labels from the target genes to the previously unlabeled candidate genes.
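The four components above can be combined in a simple neighbor-voting scheme. The sketch below is a minimal illustration under stated assumptions: a toy symmetric network, a binary label vector, and a score defined as the fraction of each gene's total connection weight that goes to known members of the target group. The function name `neighbor_vote` and the data are hypothetical, not the lab's implementation.

```python
import numpy as np

def neighbor_vote(network, labels):
    """Score each gene by the share of its connection weight going to labeled genes.

    network: (n, n) symmetric array of non-negative association weights.
    labels:  length-n array of 0/1, with 1 marking known members of the target group.
    """
    network = np.asarray(network, dtype=float).copy()
    np.fill_diagonal(network, 0.0)           # a gene does not vote for itself
    labels = np.asarray(labels, dtype=float)
    votes = network @ labels                  # weight linking each gene to labeled genes
    degree = network.sum(axis=1)              # total connection weight per gene
    # Genes with no connections get a score of 0 instead of a division error.
    return np.divide(votes, degree, out=np.zeros_like(votes), where=degree > 0)

# Toy example (made-up data): genes 0-2 form one module, genes 3-4 another.
network = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
])
labels = np.array([1, 1, 0, 0, 0])   # genes 0 and 1 are known members; gene 2 is a candidate
scores = neighbor_vote(network, labels)
```

Here the unlabeled gene 2 scores highest because all of its neighbors are known members, while genes 3 and 4 score zero. In practice such scores are usually assessed by hiding some known labels and checking how well they are recovered.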

Differential expression, co-expression and differential co-expression

The transcriptome is all the RNA molecules expressed from the genes of an organism.

Generally, three approaches are taken to analyse transcriptional data: differential expression, co-expression and differential co-expression.

Co-expression is meant to reflect co-regulation, co-functionality and co-variation. We have shown the utility of co-expression, in particular meta-analytic co-expression, in a variety of applications.
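As a minimal sketch of how a co-expression network can be derived from expression data, the example below computes gene-gene Spearman correlations with NumPy and averages the networks from several datasets as a crude stand-in for meta-analysis. The function names, the choice of Spearman correlation, the averaging step, and the random toy data are all assumptions made for illustration.

```python
import numpy as np

def rank_rows(x):
    """Rank values within each row (no tie handling; fine for continuous data)."""
    return np.argsort(np.argsort(x, axis=1), axis=1).astype(float)

def coexpression(expr):
    """Gene-by-gene Spearman correlation for a (genes x samples) matrix."""
    return np.corrcoef(rank_rows(expr))

def aggregate(networks):
    """Average several co-expression networks that share the same gene order."""
    return np.mean(networks, axis=0)

rng = np.random.default_rng(0)
datasets = []
for _ in range(3):
    shared = rng.normal(size=20)          # signal shared by genes 0 and 1
    expr = rng.normal(size=(4, 20))       # 4 genes x 20 samples of noise
    expr[0] += 3 * shared
    expr[1] += 3 * shared
    datasets.append(expr)

meta_net = aggregate([coexpression(d) for d in datasets])
```

Genes 0 and 1 share a signal in every toy dataset, so their aggregated correlation is high, while correlations among the noise-only genes tend to average towards zero across datasets.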

Neighbor-voting

Multifunctionality

Gene multifunctionality is a pervasive bias in functional genomics. Prior work has evaluated how this bias produces biologically non-specific results, as well as highly fragile significance values, in a variety of fields.
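One way to make the bias concrete is to score each gene by how many gene sets it belongs to. The sketch below computes a simple count and a weighted variant in which each set contributes 1/(members × non-members). The weighting is an assumption modeled on schemes used in the multifunctionality literature, and the function name and toy annotation matrix are hypothetical.

```python
import numpy as np

def multifunctionality(annotations):
    """Return (count, weighted) multifunctionality scores per gene.

    annotations: (genes x gene_sets) binary matrix, 1 if the gene is in the set.
    """
    annotations = np.asarray(annotations, dtype=float)
    n_genes = annotations.shape[0]
    n_in = annotations.sum(axis=0)            # members per set
    n_out = n_genes - n_in                    # non-members per set
    usable = (n_in > 0) & (n_out > 0)         # skip empty sets and sets containing every gene
    weights = np.zeros_like(n_in)
    weights[usable] = 1.0 / (n_in[usable] * n_out[usable])
    return annotations.sum(axis=1), annotations @ weights

# Toy annotation matrix (made-up): 4 genes x 3 gene sets.
annotations = np.array([
    [1, 1, 1],   # gene 0 is in every set
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],   # gene 3 is unannotated
])
counts, weighted = multifunctionality(annotations)
ranking = np.argsort(-weighted)   # genes ordered from most to least multifunctional
```

A ranking like this can serve as a baseline: if a method's predictions are matched by simply ranking genes by multifunctionality, the results are likely to be non-specific.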

Tools and techniques

Microarray

Notes here

RNA-sequencing

Some useful notes here and here

Bulk

List of tools here. Others, such as fastX and FastQC, are good for QC.

Single-cell

List of all tools here. Some key tools include Seurat and scanpy. Comprehensive tutorials like the Hemberg lab's course are particularly useful.

Alignment tools

https://sarbal.github.io/howdoI/workflows/howtos_alignment.html

STAR

Github here and manual. Reference here and here.

Kallisto

Github here and tutorial. Reference here.

Salmon

Github here and manual. Reference here. The single-cell version (Alevin) can be found here and ref.

Bowtie2

Source and manual. References here, here and here.

Gene set enrichment tools

ermineJ

GSEA

DAVID

GEO2Enrichr

Genomic tools

GATK. Also see best practices workflows.

Samtools

BEDtools

Tabix

IGVtools

UCSC tools. This also hosts genomic data of interest (like cross species alignments).

Databases and repositories

Gene expression data

Databases

GEO

SRA and human metadata at metaSRA.

ArrayExpress

ENA

Single Cell Expression Atlas

Expression Atlas

Allen Brain Atlas

Processed expression data

Recount2

GEMMA

ARCHS4

BioJupies

Core datasets

GTEx

GEUVADIS

ENCODE

BrainSpan

Co-expression databases

COXPRESdb

HumanBase

Genomic data and annotations

Ensembl. Note that it contains data on multiple species.

GENCODE, specifically for human and mouse; modENCODE for fly and worm.

Other types of data (such as sequence, protein, gene IDs) can be accessed from NCBI RefSeq through their FTP site.

Ontologies

Ontologies have vocabularies (term -> term) and annotation (gene -> term) relationships. Some useful and key ontologies:

Gene Ontology (GO). Related work.

Human phenotype ontology (HPO)

Cell ontologies

Experimental Factor Ontology (EFO)

Others (and data) can be found here at the Harmonizome. The OBO foundry has the standard vocabularies.
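The two relationship types can be held in plain dictionaries. The sketch below, using made-up term and gene names, propagates direct gene -> term annotations up the term -> term hierarchy so that a gene annotated to a specific term is also counted under its ancestors. This propagation step is a common convention (for example along `is_a` links in GO) and is included here as an assumption, not something this page prescribes.

```python
def ancestors(term, parents):
    """Collect every ancestor of a term by walking child -> parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def propagate(direct, parents):
    """Expand each gene's direct annotations to include all ancestor terms."""
    full = {}
    for gene, terms in direct.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full

# Hypothetical vocabulary (term -> parent terms) and annotations (gene -> terms).
parents = {
    "synaptic signaling": ["cell signaling"],
    "cell signaling": ["biological process"],
    "dna repair": ["biological process"],
}
direct = {"geneA": ["synaptic signaling"], "geneB": ["dna repair"]}
full = propagate(direct, parents)
```

After propagation, `geneA` is annotated to "synaptic signaling", "cell signaling" and "biological process", which is why broad terms near the root end up with very large gene sets.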

Pathways

KEGG

Reactome

BioCarta. The original site appears to be down.

Protein data

BioGRID

STRING

I2D

HIPPIE

InterPro

HuRI (human reference or CCSB)

Human Protein Atlas

Variant data

1000 genomes

ExAC

gnomAD

TOPMed, which can also be accessed here.

dbGaP

Gene score lists such as RVIS and pLI are derived from versions of the above.

Gene lists

MSigDB

Imprinted genes and here.

X-escapers and here.

Housekeeping genes from RNA-seq data here, taken from this. Microarray data version here. Newer studies using single-cell data here and here.

Essential and non-essential gene lists. Haploinsufficiency

Brain lists include synaptic genes, FMRP, chromatin remodellers.

Model organisms

Orthology

HomoloGene

OrthoDB

BUSCO genes

Species of interest

  • Mouse (Mus musculus, 10090) at JAX
  • Yeast (Saccharomyces cerevisiae, 4932) at yeastgenome, or (Schizosaccharomyces pombe, 284812) at pombase
  • Fly (Drosophila melanogaster, 7227) at flybase
  • Maize (Zea mays, 4577) at maizegdb, gramene, or at ensembl
  • Arabidopsis (Arabidopsis thaliana, 3702) here or AtGDB
  • Worm (Caenorhabditis elegans, 6239) at wormbase
  • Zebrafish (Danio rerio, 7955) at zfin
  • Frog (Xenopus laevis, 8355) at xenbase
  • Armadillo (Dasypus novemcinctus, 9361) here and at ensembl
  • Naked mole rat (Heterocephalus glaber, 10181) here

Other

Meta-science

Meta-research collection