On-boarding
Hello and welcome! This document collects the steps to get set up in the lab, along with some useful tools and tasks that the lab is interested in. For a collection of how-tos, see here. Lab background reading can be found here and publications here.
Getting started
- Set up your CSHL account (this is given to you in your starting package).
- Go to the IT portal and request access to the wiki/confluence.
- Join the Gillis lab slack channel and say hi!
- Make sure you can connect to the servers listed below, using an SSH client such as PuTTY or Cygwin.
- Set up the VPN on your laptop or home computer.
Gillis lab sites
- Lab website: http://gillislab.labsites.cshl.edu
- Slack: https://gillislab.slack.com
- Github: https://github.com/gillislab/
CSHL sites
- Intranet: http://intranet.cshl.edu/
- CSHL email: https://email.cshl.edu/
- IT portal: https://jira.cshl.edu/
- Wiki/confluence: https://wiki.cshl.edu
- Leading strand (meetings videos): http://leadingstrand.cshl.edu/
- Library: http://library.cshl.edu/
- Meetings and courses: http://intranet.cshl.edu/education/meetings-courses/meetings-a-courses
- Labwide calendar: http://intranetv2.cshl.edu/calendar/
Servers
These are the local servers that can only be accessed within the lab (or through a VPN).
Local servers
- dactyl.cshl.edu
- tyrone.cshl.edu
- rugen<1-6>.cshl.edu
Web server
- milton.cshl.edu
HPCC
Also known as Black and Blue, bnb or “the cluster”. There are two development nodes, bnbdev1 and bnbdev2.
- bnbdev1.cshl.edu
- bnbdev2.cshl.edu
- filezone1.cshl.edu
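To check that you can reach the servers listed in this section, a small script can expand the host list and test whether each machine accepts a TCP connection. This is a minimal sketch rather than a lab-provided tool: the hostnames come from the lists above, but the SSH port (22), the 3-second timeout and all function names are assumptions made for illustration.

```python
import socket

# Hostnames as listed in this document; rugen<1-6> is expanded below.
LOCAL_SERVERS = ["dactyl.cshl.edu", "tyrone.cshl.edu"]
WEB_SERVERS = ["milton.cshl.edu"]
HPCC_NODES = ["bnbdev1.cshl.edu", "bnbdev2.cshl.edu", "filezone1.cshl.edu"]


def expand_rugen(first=1, last=6):
    """Expand the rugen<1-6> shorthand into individual hostnames."""
    return [f"rugen{i}.cshl.edu" for i in range(first, last + 1)]


def all_hosts():
    """Return every server named in this section, in document order."""
    return LOCAL_SERVERS + expand_rugen() + WEB_SERVERS + HPCC_NODES


def can_connect(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    Port 22 (SSH) is an assumption. A False result may simply mean
    you are outside the lab network and not on the VPN.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(hosts=None):
    """Print one line per host showing whether it is reachable."""
    for host in hosts or all_hosts():
        status = "ok" if can_connect(host) else "unreachable"
        print(f"{host:24s} {status}")
```

Calling `report()` from inside the lab, or with the VPN connected, prints one line per host. A host that shows as unreachable from home is usually a sign that the VPN is not up.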
What do we do?
Our work centers on characterizing gene networks to understand gene function, cell identity and disease. We particularly focus on gene networks derived from expression data through the combination of hundreds or thousands of experiments. Some of the issues surrounding network analysis are discussed in this opinion piece.
Gene function prediction
An important focus (and challenge) in functional genomics is gene function prediction.
Guilt-by-association
The guilt-by-association principle states that genes with similar functions will tend to possess similar properties. This allows previously unknown functions of a gene to be statistically inferred given some prior knowledge about other genes. A guilt-by-association analysis typically needs four ingredients:
- A set of candidate genes: All genes in the genome or a more focused set such as those in a candidate genetic locus.
- One or more target gene groups of interest: Typically defined around a function, such as those from the Gene Ontology (GO).
- Data with associations or similarities among the target and candidate genes: These data are often represented or thought of as a network, and can include coexpression, protein interactions, genetic interactions, sequence similarities, phylogenetic profiles, and phenotype and disease association profiles.
- An algorithm: This is to transfer (or infer) functional labels from the target genes to the previously unlabeled candidate genes.
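The pieces listed above (candidate genes, a target gene group, association data and an algorithm) can be tied together in a few lines. The sketch below uses neighbor voting as the algorithm: each unlabeled candidate is scored by the fraction of its total edge weight that connects to genes in the target group. The toy network, gene names and weights are invented for illustration, and this is one simple formulation rather than the lab's exact implementation.

```python
from collections import defaultdict


def build_network(edges):
    """Turn (gene_a, gene_b, weight) tuples into a symmetric adjacency map."""
    network = defaultdict(dict)
    for a, b, weight in edges:
        network[a][b] = weight
        network[b][a] = weight
    return network


def neighbor_voting(network, candidates, target_group):
    """Score each unlabeled candidate by its weighted share of labeled neighbors.

    score(gene) = sum of edge weights to target-group genes
                  / sum of all edge weights of that gene
    """
    scores = {}
    for gene in candidates:
        if gene in target_group:
            continue  # already labeled, nothing to predict
        neighbors = network.get(gene, {})
        total = sum(neighbors.values())
        votes = sum(w for n, w in neighbors.items() if n in target_group)
        scores[gene] = votes / total if total > 0 else 0.0
    return scores


# Toy association data (hypothetical genes and weights).
edges = [
    ("A", "B", 1.0), ("A", "C", 0.8), ("B", "C", 0.9),
    ("C", "D", 0.2), ("D", "E", 0.7), ("D", "F", 0.9), ("E", "F", 0.6),
]
network = build_network(edges)
candidates = ["A", "B", "C", "D", "E", "F"]
target_group = {"A", "B"}  # genes already known to share the function

scores = neighbor_voting(network, candidates, target_group)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy example gene C ranks first because most of its edge weight goes to A and B. In practice, predictions like these are usually evaluated by hiding some known labels and checking how well they are recovered.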
Differential expression, co-expression and differential co-expression
The transcriptome is all the RNA molecules expressed from the genes of an organism.

Generally, three approaches are taken to analyse transcriptional data: differential expression, co-expression and differential co-expression.

Co-expression is meant to reflect co-regulation, co-functionality and co-variation. We have shown the utility of co-expression, in particular meta-analytic co-expression, in a variety of applications.
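As a rough illustration of meta-analytic co-expression, the sketch below builds one co-expression network per dataset using Spearman correlation and then averages the edge weights across datasets. The expression values and gene names are invented, and plain averaging of correlations is an assumption chosen for brevity; aggregation procedures used in practice (for example, rank-standardizing each network first) may differ.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def coexpression_network(expression):
    """Map each gene pair to its Spearman correlation across samples.

    expression: dict of gene -> list of expression values, one per sample.
    """
    return {
        (g1, g2): spearman(expression[g1], expression[g2])
        for g1, g2 in combinations(sorted(expression), 2)
    }


def aggregate(networks):
    """Average edge weights over the gene pairs present in every network."""
    shared = set.intersection(*(set(n) for n in networks))
    return {pair: sum(n[pair] for n in networks) / len(networks) for pair in shared}


# Two toy experiments (hypothetical values) measuring the same three genes.
dataset1 = {"g1": [1, 2, 3, 4, 5], "g2": [2, 4, 5, 8, 9], "g3": [5, 4, 3, 2, 1]}
dataset2 = {"g1": [3, 1, 2, 4], "g2": [6, 2, 5, 9], "g3": [1, 4, 3, 2]}

meta = aggregate([coexpression_network(d) for d in (dataset1, dataset2)])
for pair in sorted(meta):
    print(pair, round(meta[pair], 2))
```

Here g1 and g2 rise and fall together in both experiments, so their aggregate edge is strongly positive, while g1 and g3 are anti-correlated. Combining many experiments in this way is what makes the network less dependent on any single dataset.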


Neighbor-voting

Multifunctionality
Gene multifunctionality is a pervasive bias in functional genomics. Our work has focused on evaluating how this bias produces biologically non-specific results, as well as highly fragile significance values, in a variety of fields.
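One way to make this bias concrete is to score each gene by how many functional groups it belongs to, down-weighting large groups. The sketch below weights each annotation by 1 / (genes in the group × genes outside the group) and sums over the groups a gene belongs to. Treat the exact weighting as an assumption made for illustration; the term and gene names are invented.

```python
def multifunctionality(annotations, all_genes):
    """Score genes by their weighted number of annotations.

    annotations: dict of term -> collection of annotated genes.
    all_genes:   list of every gene under consideration.
    Each membership contributes 1 / (n_in * n_out), where n_in is the
    number of genes in the term and n_out the number outside it.
    """
    gene_set = set(all_genes)
    scores = {gene: 0.0 for gene in all_genes}
    for members in annotations.values():
        members = set(members) & gene_set
        n_in = len(members)
        n_out = len(gene_set) - n_in
        if n_in == 0 or n_out == 0:
            continue  # empty or all-inclusive terms carry no information
        for gene in members:
            scores[gene] += 1.0 / (n_in * n_out)
    return scores


# Toy annotation data (hypothetical terms and genes).
all_genes = ["A", "B", "C", "D", "E"]
annotations = {
    "term1": {"A", "B"},
    "term2": {"A", "C", "D"},
    "term3": {"A"},
}

mf_scores = multifunctionality(annotations, all_genes)
mf_ranked = sorted(all_genes, key=mf_scores.get, reverse=True)
print(mf_ranked)
```

Gene A belongs to every term and so tops the ranking, while E has no annotations and scores zero. A ranking like this is a useful baseline: if a method's predictions for a specific function look much like the multifunctionality ranking, the result is probably not specific to that function.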

Databases and repositories
Gene expression data
Databases
SRA and human metadata at metaSRA.
Processed expression data
Core datasets
Co-expression databases
Genomic data and annotations
ENSEMBL. Note, contains data on multiple species.
GENCODE, specifically for human and mouse, and modENCODE for fly and worm.
Other types of data (such as sequence, protein and gene IDs) can be accessed from NCBI RefSeq through their FTP site.
Ontologies
Ontologies have vocabularies (term -> term) and annotation (gene -> term) relationships. Some useful and key ontologies:
Gene Ontology
Related work.
Human phenotype ontology (HPO)
Others (and data) can be found here at the Harmonizome. The OBO foundry has the standard vocabularies.
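The two kinds of relationship described at the start of this section (term -> term and gene -> term) are often combined by propagating annotations up the hierarchy: a gene annotated to a term is also treated as annotated to all of that term's ancestors. The sketch below shows this on a toy hierarchy. The term IDs and gene names are invented, and the code assumes the term -> term map has no cycles.

```python
def ancestors(term, parents, cache=None):
    """Return all ancestors of a term (excluding the term itself).

    parents: dict of term -> list of direct parent terms (term -> term).
    Assumes the hierarchy is a directed acyclic graph.
    """
    if cache is None:
        cache = {}
    if term in cache:
        return cache[term]
    result = set()
    for parent in parents.get(term, ()):
        result.add(parent)
        result |= ancestors(parent, parents, cache)
    cache[term] = result
    return result


def propagate(gene_terms, parents):
    """Extend direct gene -> term annotations with every ancestor term."""
    cache = {}
    full = {}
    for gene, terms in gene_terms.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents, cache)
        full[gene] = expanded
    return full


# Toy vocabulary (hypothetical IDs): T0 is the root, T1 and T3 are its
# children, and T2 is a child of T1.
parents = {"T1": ["T0"], "T2": ["T1"], "T3": ["T0"]}

# Toy direct annotations (gene -> term).
gene_terms = {"geneX": {"T2"}, "geneY": {"T3"}}

full = propagate(gene_terms, parents)
for gene in sorted(full):
    print(gene, sorted(full[gene]))
```

After propagation, geneX carries T2, T1 and T0, while geneY carries T3 and T0. Whether annotations have been propagated matters when comparing gene set sizes or counting annotations per gene.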
Pathways
Biocarta. The original site seems to be dead/down.
Protein data
HuRI (the Human Reference Interactome, from CCSB)
Variant data
TOPMed, which can also be accessed here.
Gene score lists such as RVIS and pLI are derived from versions of the above.
Gene lists
Imprinted genes and here.
X-escapers and here.
Housekeeping genes from RNA-seq data here, taken from this. Microarray data version here. Newer studies using single-cell data here and here.
Essential and non-essential gene lists. Haploinsufficiency.
Brain lists include synaptic genes, FMRP, chromatin remodellers.
Tools and techniques
Microarray
Notes here
RNA-sequencing
Some useful notes here and here
Bulk
List of tools here. Others, such as FASTX and FastQC, are good for QC.
Single-cell
List of all tools here. Some key tools include Seurat and scanpy. Comprehensive tutorials like Hemberg lab’s course are particularly useful.

Alignment tools
https://sarbal.github.io/howdoI/workflows/howtos_alignment.html
STAR
Github here and manual. Reference here and here.
Kallisto
Github here and tutorial. Reference here.
Salmon
Github here and manual. Reference here. The single cell version (Alevin) can be found here and ref.
Bowtie2
Source and manual. References here, here and here.
Gene set enrichment tools
Genomic tools
GATK. Also see best practices workflows.
UCSC tools. This also hosts genomic data of interest (like cross species alignments).
Model organisms
Orthology
Species of interest
- Mouse (Mus musculus, 10090) at JAX
- Yeast (Saccharomyces cerevisiae, 4932) at yeastgenome or (Schizosaccharomyces pombe, 284812) at pombase
- Fly (Drosophila melanogaster, 7227) at flybase
- Maize (Zea mays, 4577) at maizegdb, gramene, or at ensembl
- Arabidopsis (Arabidopsis thaliana, 3702) here or at AtGDB
- Worm (Caenorhabditis elegans, 6239) at wormbase
- Zebrafish (Danio rerio, 7955) at zfin
- Frog (Xenopus laevis, 8355) at xenbase
- Armadillo (Dasypus novemcinctus, 9361) here and at ensembl
- Naked mole-rat (Heterocephalus glaber, 10181) here