README file (readme.txt)

December 1st 2011
Henrik Hornshøj
Aarhus University, Denmark

Last update: February 15th 2012

Sus scrofa genome assembly 10.2 
- Augustus gene predictions and Cuffmerge reference-based transcriptome assembly
--------------------------------------------------------------------------------

To contribute to the annotation of the pig genome assembly 10.2, gene predictions
were performed using the AUGUSTUS software (http://augustus.gobics.de), which can
incorporate genome-aligned mRNA sequence evidence as hints to improve the gene
predictions. This gene prediction pipeline is part of several ongoing
transcriptome studies of porcine tissues in our group that focus on
differential regulation of gene expression. The workflow is briefly described
below.

A large collection of porcine mRNA sequences was established from various
internal and external sources:

- more than one billion local Illumina RNA-seq reads
  (muscle, liver, lung, brain, kidney, spleen, heart)
- 5.3 million local 454 Roche / Sanger sequencing long EST sequences
  (100 different tissues)
- around 400 million Pinky tabasco clone Illumina RNA-seq reads
  (pool of 10 different tissues)
- 19039 Ensembl Known cDNAs (November 2, 2011)
- 3310 NCBI RefSeq mRNAs (Release 49)
- 1.3 million NCBI ESTs (UniGene build 41)

The mRNA sequences were mapped with the TopHat software
(http://tophat.cbcb.umd.edu) to a genome sequence target database built from
the assembled chromosomes (1-18,X,Y) downloaded from:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Sus_scrofa/Sscrofa10.2

TopHat options were adjusted according to the sequence types from the
different sources. The TopHat alignment BAM files were processed with the
Cufflinks software pipeline (http://cufflinks.cbcb.umd.edu) for genome-wide
reference-based mRNA transcript assembly. Cuffmerge was used to merge the
assemblies, from which a single transcriptome hints file was generated.
AUGUSTUS gene prediction was performed separately on each assembled chromosome
(chrID) using the following command line:

augustus --sample=0 --maxDNAPieceSize=200000 --hintsfile=chrID.gff
--alternatives-from-evidence=false --alternatives-from-sampling=false
--progress=true --gff3=on --UTR=on --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg
--noInFrameStop= --noprediction= --uniqueGeneId=true --species=human
--allow_hinted_splicesites=atac --protein=on --introns=on --start=on --stop=on
--cds=on --codingseq=on chrID.fasta
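
The per-chromosome runs can be scripted. The following is a minimal Python
sketch, not the script used for this annotation. It assumes that chrID is
replaced by the chromosome ID, so that the hints and sequence files are named
e.g. 1.gff and 1.fasta. The helper name augustus_command is hypothetical. The
options are copied from the command line above, including the two options that
carry no value in this README.

```python
import shlex

# Chromosome IDs used in this README: 1-18, X and Y.
CHROMOSOMES = [str(i) for i in range(1, 19)] + ["X", "Y"]

# Options copied from the command line in this README, in the same order.
# "--noInFrameStop=" and "--noprediction=" have no value in the README and
# are kept exactly as written; no value is guessed here.
OPTIONS_AFTER_HINTS = [
    "--alternatives-from-evidence=false",
    "--alternatives-from-sampling=false",
    "--progress=true",
    "--gff3=on",
    "--UTR=on",
    "--extrinsicCfgFile=extrinsic.M.RM.E.W.cfg",
    "--noInFrameStop=",
    "--noprediction=",
    "--uniqueGeneId=true",
    "--species=human",
    "--allow_hinted_splicesites=atac",
    "--protein=on",
    "--introns=on",
    "--start=on",
    "--stop=on",
    "--cds=on",
    "--codingseq=on",
]


def augustus_command(chr_id):
    """Return the AUGUSTUS argument list for one chromosome.

    Hypothetical helper: the file names <chr_id>.gff and <chr_id>.fasta are
    an assumption about how chrID is substituted.
    """
    return (
        ["augustus", "--sample=0", "--maxDNAPieceSize=200000",
         f"--hintsfile={chr_id}.gff"]
        + OPTIONS_AFTER_HINTS
        + [f"{chr_id}.fasta"]
    )


if __name__ == "__main__":
    # Print one shell-quoted command per chromosome; nothing is executed.
    for chr_id in CHROMOSOMES:
        print(shlex.join(augustus_command(chr_id)))
```

The sketch only prints the commands, so they can be inspected or submitted to
a job scheduler before anything is run.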

In total, 21077 gene predictions were produced by AUGUSTUS, of which 18328 map
to 14414 NCBI human RefSeq targets and 20864 map to 16754 NCBI Mammalian RefSeq
targets (TeraBLASTNH, e-value cut-off 1e-8).
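
As a small worked example, the counts above can be expressed as fractions of
all predictions. The sketch below uses only the numbers stated in this README;
the variable and function names are chosen for illustration.

```python
# Counts taken from this README.
TOTAL_PREDICTIONS = 21077
HUMAN_REFSEQ_HITS = 18328       # predictions mapping to human RefSeq targets
MAMMALIAN_REFSEQ_HITS = 20864   # predictions mapping to mammalian RefSeq targets


def fraction_of_predictions(hits, total=TOTAL_PREDICTIONS):
    """Return the share of gene predictions that have a homology hit."""
    return hits / total


human_fraction = fraction_of_predictions(HUMAN_REFSEQ_HITS)
mammalian_fraction = fraction_of_predictions(MAMMALIAN_REFSEQ_HITS)

print(f"human RefSeq:     {human_fraction:.1%}")      # prints 87.0%
print(f"mammalian RefSeq: {mammalian_fraction:.1%}")  # prints 99.0%
```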

Cuffmerge reference-based transcript assemblies, AUGUSTUS gene prediction and
BLAST homology annotation files (gzipped):

- ssc10.2.RNA.hints.cuffmerge.merged.gtf.gz [7.9M]
  Cuffmerge reference-based assemblies in GTF format for use as RNA hints with AUGUSTUS 
- ssc10.2.RNA.hints.cuffmerge.merged.gtf.gffread.fa.gz [46M]
  Fasta file of Cuffmerge reference-based transcript assemblies
  (using gffread tool)
- ssc10.2.RNA.hints.cuffmerge.merged.gtf.gffread.vs.hsa_refseq_2011_10_03.txt.gz [2.9M]
  BLAST transcript homology of Cuffmerge transcripts to NCBI RefSeq mRNA targets,
  Homo sapiens [2011-10-03]   
- ssc10.2.RNA.hints.augustus.gff.gz [24M]
  AUGUSTUS output file in GFF file format  
- ssc10.2.RNA.hints.augustus.gff.cds.fna.gz [9.0M] 
  Fasta file with coding sequences extracted from AUGUSTUS output GFF
- ssc10.2.RNA.hints.augustus.gff.prot.faa.gz [5.8M] 
  Fasta file with protein sequences extracted from AUGUSTUS output GFF
- ssc10.2.RNA.hints.augustus.gff.gffreads.fna.gz [12M] 
  Fasta file with transcript sequences extracted from AUGUSTUS output GFF
  (using gffread tool)
- ssc10.2.RNA.hints.augustus.gff.gffreads.fna.vs.hsa_refseqHsa_2011_10_03.txt.gz [990K]
  BLAST transcript homology to NCBI RefSeq mRNA targets, Homo sapiens [2011-10-03]
- ssc10.2.RNA.hints.augustus.gff.gffreads.fna.vs.refseq49_mam.txt.gz [1.1M] 
  BLAST transcript homology to NCBI RefSeq mRNA targets, Mammalian [release 49]
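
The gzipped GFF and GTF files can be read without unpacking them first. The
following Python sketch counts the records per feature type. It assumes the
standard nine tab-separated GFF columns with the feature type in column 3 and
comment lines starting with "#". The function name count_feature_types is
hypothetical, and the exact feature names in the files are not asserted here.

```python
import gzip
from collections import Counter


def count_feature_types(path):
    """Count records per feature type (column 3) in a gzipped GFF/GTF file."""
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip comment lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            # Skip lines that do not have the nine GFF columns.
            if len(fields) < 9:
                continue
            counts[fields[2]] += 1
    return counts
```

For example, count_feature_types("ssc10.2.RNA.hints.augustus.gff.gz") returns a
Counter keyed by feature type, which can be used as a quick sanity check after
download.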


The following assembled chromosomes have been used in the AUGUSTUS gene predictions:

chr.id  chr.accession                chr.name                  chr.length
-------------------------------------------------------------------------
1       gi|345632377|gb|CM000812.4|  Sus scrofa chromosome 1   315321322
2       gi|345632376|gb|CM000813.4|  Sus scrofa chromosome 2   162569375
3       gi|345632375|gb|CM000814.4|  Sus scrofa chromosome 3   144787322
4       gi|345632374|gb|CM000815.4|  Sus scrofa chromosome 4   143465943
5       gi|345632373|gb|CM000816.4|  Sus scrofa chromosome 5   111506441
6       gi|345632372|gb|CM000817.4|  Sus scrofa chromosome 6   157765593
7       gi|345632371|gb|CM000818.4|  Sus scrofa chromosome 7   134764511
8       gi|345632370|gb|CM000819.4|  Sus scrofa chromosome 8   148491826
9       gi|345632369|gb|CM000820.4|  Sus scrofa chromosome 9   153670197
10      gi|345632368|gb|CM000821.4|  Sus scrofa chromosome 10   79102373
11      gi|345632367|gb|CM000822.4|  Sus scrofa chromosome 11   87690581
12      gi|345632366|gb|CM000823.4|  Sus scrofa chromosome 12   63588571
13      gi|345632365|gb|CM000824.4|  Sus scrofa chromosome 13  218635234
14      gi|345632364|gb|CM000825.4|  Sus scrofa chromosome 14  153851969
15      gi|345632363|gb|CM000826.4|  Sus scrofa chromosome 15  157681621
16      gi|345632362|gb|CM000827.4|  Sus scrofa chromosome 16   86898991
17      gi|345632361|gb|CM000828.4|  Sus scrofa chromosome 17   69701581
18      gi|345632360|gb|CM000829.4|  Sus scrofa chromosome 18   61220071
X       gi|345632359|gb|CM000830.4|  Sus scrofa chromosome X   144288218
Y       gi|345632358|gb|CM001155.2|  Sus scrofa chromosome Y     1637650
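
The table above can be parsed programmatically, for example to look up the
GenBank accession or length of a chromosome. The Python sketch below assumes
whitespace-separated rows in the layout shown, with the chromosome name
spanning several tokens. The function name parse_chromosome_row is
hypothetical. The two example rows are copied from the table.

```python
def parse_chromosome_row(row):
    """Split one table row into chr.id, accession, name and length."""
    tokens = row.split()
    accession_field = tokens[1]           # e.g. gi|345632377|gb|CM000812.4|
    return {
        "chr_id": tokens[0],
        "gi": accession_field.split("|")[1],
        "genbank": accession_field.split("|")[3],
        "name": " ".join(tokens[2:-1]),   # name spans several tokens
        "length": int(tokens[-1]),
    }


EXAMPLE_ROWS = [
    "1       gi|345632377|gb|CM000812.4|  Sus scrofa chromosome 1   315321322",
    "Y       gi|345632358|gb|CM001155.2|  Sus scrofa chromosome Y     1637650",
]

for row in EXAMPLE_ROWS:
    record = parse_chromosome_row(row)
    print(record["chr_id"], record["genbank"], record["length"])
```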

================================================================================
Users are encouraged to acknowledge the data source.
Data sources/credits:
- ssc10.2.RNA.hints.cuffmerge.merged.gtf.gz
- ssc10.2.RNA.hints.cuffmerge.merged.gtf.gffread.fa.gz
- ssc10.2.RNA.hints.cuffmerge.merged.gtf.gffread.vs.hsa_refseq_2011_10_03.txt.gz
- ssc10.2.RNA.hints.augustus.gff.gz
- ssc10.2.RNA.hints.augustus.gff.cds.fna.gz
- ssc10.2.RNA.hints.augustus.gff.prot.faa.gz
- ssc10.2.RNA.hints.augustus.gff.gffreads.fna.gz
- ssc10.2.RNA.hints.augustus.gff.gffreads.fna.vs.hsa_refseqHsa_2011_10_03.txt.gz
- ssc10.2.RNA.hints.augustus.gff.gffreads.fna.vs.refseq49_mam.txt.gz

Contributed by Bendixen et al. [christian.bendixen@agrsci.dk]
================================================================================

For further questions contact:
Henrik Hornshøj (henrikh.jensen@agrsci.dk)
Frank Panitz (frank.panitz@agrsci.dk)
Christian Bendixen (christian.bendixen@agrsci.dk)

Aarhus University
Department of Molecular Biology and Genetics 
Faculty of Science and Technology
Blichers Allé 20
DK-8830 Tjele
Denmark