ZENCODE-ITN ATC1 - Vaquerizas Lab

In this training session we aim to get you familiarised with genome browsers. The exercises below are mostly Ensembl-based, but please feel free to try other browsers of your choice.

First, we will go through some walkthrough exercises from http://www.ensembl.org

1. Looking at loci in detail view
https://drive.google.com/open?id=1nCas3oncrA4Mh4UL7GmoyebEake9JCVp8kI7kiHpK68

2. Gene and transcript information
https://drive.google.com/open?id=1moFTtSBF0pgVf4vcxwZDughkYYKoToHUkGy050xNpZQ

3. Using pre-computed comparative genomics
https://drive.google.com/open?id=1b6tRtEFDAZ94TL73WmAmjyvw2VpP0aq43nPUbVZOo58

4. Regulatory tracks and epigenetics
https://drive.google.com/open?id=1fGw7oVhoOCAecQAix4IIMb_8u4RCfpgaoCGzIJ6CmgE

5. Mining Ensembl (BioMart)
http://www.ebi.ac.uk/seqdb/confluence/download/attachments/27918627/walk_through_biomart_e76.docx?version=1&modificationDate=1406897114000&api=v2

Once you have familiarised yourself with these resources, please proceed to answer the following questions.

Exercise 1 – Exploring a gene

(a) Search for the zebrafish tead1a gene. On which chromosome is this gene located? How many transcripts (splice variants) has Ensembl annotated for it? Are these transcribed from the forward or from the reverse strand of the genome assembly?

(b) What is the longest transcript? How long is the protein it encodes? How many exons does it have? Are any of the exons completely or partially untranslated? How do the transcripts differ?

(c) Have a look at the General identifiers for one of the tead1a transcripts. Click on some of the links. What is the function of tead1a?

(d) Which PFAM domains do the proteins encoded by tead1a contain?

(e) Is there a human ortholog predicted for the zebrafish tead1a gene? What ‘type’ does it have? Why?

(f) If you have a gene of interest of your own, explore what information Ensembl displays about it!

Advanced questions:
(1) What does ZFIN say about tead1a?
(2) What are the paralogs of tead1a?

Exercise 2 – Exploring a region

(a) Go to the region from bp 33100000 to 33350000 on zebrafish chromosome 13. How many contigs make up this portion of the assembly? Contigs are contiguous stretches of DNA sequence that have been assembled solely from direct sequencing information; in the zebrafish assembly they are either finished clones or whole genome shotgun contigs.

(b) Make the tilepath clones visible (i.e. the BAC clones that were sequenced to generate the sequence for the zebrafish genome assembly). Note that these clones are not shown by default! What are the clone names in this region? Which clone library does the clone containing the btbd6a gene come from?

(c) Zoom in on the btbd6a transcript, including a bit of flanking sequence on both sides. Which markers are located close by? Do the markers appear anywhere else in the genome?

(d) Export the genomic sequence of the region you are looking at in FASTA format.

(e) Is this region being worked on by the Genome Reference Consortium?

(f) If you have a genomic region of interest of your own, explore what information Ensembl displays about it!
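
For part (d), the region can also be fetched without the browser. The following is a minimal sketch, assuming the public Ensembl REST server at https://rest.ensembl.org and its sequence/region endpoint; the helper name region_fasta_url is made up for illustration. It only builds the request URL and does not download anything. Coordinates refer to whichever zebrafish assembly the server currently hosts, which may differ from the one shown in your browser session.

```python
from urllib.parse import quote

# Assumption: the public Ensembl REST server (not mentioned in the worksheet).
REST_SERVER = "https://rest.ensembl.org"


def region_fasta_url(species, chromosome, start, end, strand=1):
    """Build a URL that requests a genomic region in FASTA format.

    Hypothetical helper for illustration; it performs no network access.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    # Region syntax used by the REST endpoint: chromosome:start..end:strand
    region = f"{chromosome}:{start}..{end}:{strand}"
    return (
        f"{REST_SERVER}/sequence/region/{quote(species)}/{region}"
        "?content-type=text/x-fasta"
    )


# The region from Exercise 2(a), forward strand.
url = region_fasta_url("danio_rerio", 13, 33100000, 33350000)
print(url)
```

Opening the printed URL in a web browser, or passing it to a download tool, should return the sequence as FASTA text.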

Exercise 3 – Mining Ensembl

Generate a list of all zebrafish protein-coding genes on chromosome 1 that have a ZFIN ID, have more than one splice variant, and cause the caudal fin to be absent when mutated. Download the peptide sequences and make sure the header states the Ensembl ID, a description, the associated gene name and the associated gene DB.

Exercise 4 – Mining Ensembl

BioMart is a very handy tool when you want to map IDs between different databases. The following is a list of 29 IDs of human proteins from the RefSeq database of NCBI (http://www.ncbi.nlm.nih.gov/projects/RefSeq/):

NP_001218, NP_203125, NP_203124, NP_203126, NP_001007233, NP_150636, NP_150635, NP_001214, NP_150637, NP_150634, NP_150649, NP_001216, NP_116787, NP_001217, NP_127463, NP_001220, NP_004338, NP_004337, NP_116786, NP_036246, NP_116756, NP_116759, NP_001221, NP_203519, NP_001073594, NP_001219, NP_001073593, NP_203520, NP_203522

Generate a list that shows to which Ensembl Gene IDs and to which HGNC symbols these RefSeq IDs correspond. Which of these genes have a zebrafish ortholog?
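
The first part of this exercise can also be expressed as a BioMart XML query instead of clicking through the web interface. The sketch below only builds the query text; it does not contact any server. The dataset name (hsapiens_gene_ensembl), the filter name (refseq_peptide) and the attribute names are assumptions based on common Ensembl BioMart naming, so please check them against the current BioMart interface before relying on them. The function name build_biomart_query is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The 29 RefSeq protein IDs listed in the exercise.
REFSEQ_IDS = """
NP_001218, NP_203125, NP_203124, NP_203126, NP_001007233, NP_150636,
NP_150635, NP_001214, NP_150637, NP_150634, NP_150649, NP_001216,
NP_116787, NP_001217, NP_127463, NP_001220, NP_004338, NP_004337,
NP_116786, NP_036246, NP_116756, NP_116759, NP_001221, NP_203519,
NP_001073594, NP_001219, NP_001073593, NP_203520, NP_203522
""".replace(",", " ").split()


def build_biomart_query(dataset, filter_name, ids, attributes):
    """Return a BioMart XML query string (hypothetical helper, no network)."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_el = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    # Restrict the results to the supplied IDs (comma-separated list).
    ET.SubElement(
        dataset_el, "Filter", {"name": filter_name, "value": ",".join(ids)}
    )
    # One Attribute element per output column.
    for attribute in attributes:
        ET.SubElement(dataset_el, "Attribute", {"name": attribute})
    return ET.tostring(query, encoding="unicode")


# Assumed dataset, filter and attribute names; verify them in BioMart.
query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filter_name="refseq_peptide",
    ids=REFSEQ_IDS,
    attributes=["refseq_peptide", "ensembl_gene_id", "hgnc_symbol"],
)
print(query_xml)
```

The zebrafish ortholog part of the question would need additional homolog attributes, which are left out here because their exact names vary between Ensembl releases.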

Exercise 5 – Mining Ensembl

Generate a list of all zebrafish genes on chromosome 1 that have a human ortholog on human chromosome 13. Display the gene names. Are they the same? Note: this requires you to select an additional dataset.

Exercise 6 – ZF transcription factors

Devise a strategy to identify the set of sequence-specific DNA-binding transcription factors encoded in the zebrafish genome.

You can find a list of sequence-specific DNA-binding domains (DBDs) here:
http://www.nature.com/nrg/journal/v10/n4/extref/nrg2538-s2.txt
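
One possible strategy, offered only as a sketch and not as the expected answer, is to export the protein domain annotations for all zebrafish genes (for example with BioMart) and keep the genes that carry at least one domain from the DBD list. The example below assumes both inputs have already been loaded into Python; the gene names and domain accessions are placeholders invented for illustration, and the function name find_transcription_factors is hypothetical.

```python
def find_transcription_factors(gene_domains, dbd_ids):
    """Return genes with at least one DNA-binding domain.

    gene_domains: dict mapping a gene ID to a list of domain accessions.
    dbd_ids: set of domain accessions considered sequence-specific DBDs.
    The result maps each matching gene to its sorted list of DBD hits.
    """
    hits = {}
    for gene, domains in gene_domains.items():
        shared = sorted(set(domains) & dbd_ids)
        if shared:
            hits[gene] = shared
    return hits


# Placeholder data (not real accessions or real annotations).
dbd_ids = {"DBD_A", "DBD_B"}
gene_domains = {
    "gene1": ["DBD_A", "OTHER_X"],   # one DNA-binding domain
    "gene2": ["OTHER_X"],            # no DNA-binding domain
    "gene3": ["DBD_B", "DBD_A"],     # two DNA-binding domains
    "gene4": [],                     # no domain annotation at all
}

candidates = find_transcription_factors(gene_domains, dbd_ids)
for gene, domains in candidates.items():
    print(gene, ",".join(domains))
```

A domain-based screen like this will likely need manual review, since some proteins with a listed domain may not act as transcription factors and some genuine factors may lack domain annotation.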