Make GRanges from a GFF/GTF file

makeGRangesFromGFF(
  file,
  level = c("genes", "transcripts"),
  ignoreVersion = TRUE,
  synonyms = FALSE
)

Arguments

file

character(1). File path.

level

character(1). Return as genes or transcripts.

ignoreVersion

logical(1). Ignore identifier (e.g. transcript, gene) versions. When applicable, identifiers containing version numbers are stored in txIdVersion and geneIdVersion, and the variants without versions are stored in txId, txIdNoVersion, geneId, and geneIdNoVersion.

synonyms

logical(1). Include gene synonyms. Queries the Ensembl web server, and is CPU intensive.

Value

GRanges.

Details

Remote URLs and compressed files are supported.
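The example output further down indicates that the import itself is delegated to rtracklayer::import(); how compression is handled internally is not shown on this page. As a minimal base R sketch of the general idea only (the file contents and the temporary path are invented for illustration), a gzip-compressed annotation file can be read without decompressing it by hand:

```r
## Write a tiny gzip-compressed file to a temporary location.
## The two lines below are made-up content, not real annotation data.
tmp <- tempfile(fileext = ".gtf.gz")
con <- gzfile(tmp, open = "wt")
writeLines(
    c(
        "#!genome-build example",
        "1\texample\tgene\t1\t100\t.\t+\t.\tgene_id \"g1\";"
    ),
    con
)
close(con)

## Base R decompresses transparently when reading through gzfile().
con <- gzfile(tmp, open = "rt")
x <- readLines(con)
close(con)
unlink(tmp)
```

In practice, passing the path or URL directly to makeGRangesFromGFF() should be sufficient; the sketch is only meant to show what "compressed files are supported" implies for the caller.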

Note

Updated 2021-03-10.

GFF/GTF specification

The General Feature Format (GFF) consists of one line per feature, each containing 9 tab-separated columns of data, plus optional track definition lines.

The General Transfer Format (GTF) is identical to GFF version 2.

The UCSC website has detailed conventions on the GFF3 format, including the metadata columns.
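To make the nine-column layout concrete, the following base R sketch parses a single GTF-style record into a data frame. The column names follow the UCSC description of GFF/GTF; the record itself and the "example" source label are illustrative, and this is not how makeGRangesFromGFF() imports files internally.

```r
## Column names as given in the UCSC GFF/GTF description.
gffCols <- c(
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute"
)

## One header/comment line plus one illustrative tab-separated record.
lines <- c(
    "#!genome-build example",
    paste(
        "1", "example", "gene", "11869", "14409", ".", "+", ".",
        "gene_id \"ENSG00000223972\"; gene_name \"DDX11L1\";",
        sep = "\t"
    )
)

## quote = "" keeps the double quotes inside the attribute column intact.
gff <- read.delim(
    text = lines,
    header = FALSE,
    comment.char = "#",
    quote = "",
    col.names = gffCols,
    colClasses = c(
        "character", "character", "character", "integer", "integer",
        "character", "character", "character", "character"
    )
)

## Coordinates are 1-based and inclusive, so width is end - start + 1.
featureWidth <- gff[["end"]] - gff[["start"]] + 1L

## Pull the gene identifier out of the key-value attribute column.
geneId <- sub(".*gene_id \"([^\"]+)\".*", "\\1", gff[["attribute"]])
```

The ninth column carries the metadata (key-value pairs in GTF, key=value pairs in GFF3), which is where identifiers such as gene_id come from.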

Feature type

  • CDS: CoDing Sequence. A contiguous sequence that contains a genomic interval bounded by start and stop codons. CDS refers to the portion of a genomic DNA sequence that is translated, from the start codon to the stop codon.

  • exon: Genomic interval containing 5' UTR (five_prime_UTR), CDS, and 3' UTR (three_prime_UTR).

  • mRNA: Processed (spliced) mRNA transcript.
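The relationship between these feature types can be sketched with simple interval arithmetic. The coordinates below are hypothetical and describe a single-exon transcript on the plus strand; real transcripts usually span multiple exons, so treat this only as an illustration of how an exon decomposes into 5' UTR, CDS, and 3' UTR.

```r
## Hypothetical 1-based, inclusive coordinates on the plus strand.
exon <- c(start = 100L, end = 500L)
cds <- c(start = 150L, end = 400L)

## The UTRs are the parts of the exon that flank the CDS.
fivePrimeUtr <- c(start = exon[["start"]], end = cds[["start"]] - 1L)
threePrimeUtr <- c(start = cds[["end"]] + 1L, end = exon[["end"]])

## Helper (illustrative): width of an inclusive interval.
intervalWidth <- function(x) {
    x[["end"]] - x[["start"]] + 1L
}

## The three parts tile the exon exactly.
totalWidth <- intervalWidth(fivePrimeUtr) +
    intervalWidth(cds) +
    intervalWidth(threePrimeUtr)
```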

Supported sources

Currently makeGRangesFromGFF() supports genomes from these sources:

  • Ensembl

  • GENCODE

  • RefSeq

  • UCSC

  • FlyBase

  • WormBase

Ensembl

Note that makeGRangesFromEnsembl() offers native support for Ensembl genome builds and returns additional useful metadata that isn't defined inside a GFF/GTF file.

If you must load a GFF/GTF file directly, then use makeGRangesFromGFF().

Example URLs:

  • Ensembl Homo sapiens GRCh38.p13, release 102 GTF, GFF3

  • Ensembl Homo sapiens GRCh37, release 102 (87) GTF, GFF3

GENCODE

Example URLs:

  • GENCODE Homo sapiens GRCh38.p13, release 36 GTF, GFF3

  • GENCODE Homo sapiens GRCh37, release 36 GTF, GFF3

  • GENCODE Mus musculus GRCm38.p6, release M25 GTF, GFF3

GENCODE vs. Ensembl

Annotations available from Ensembl and GENCODE are very similar.

The GENCODE annotation is made by merging the manual gene annotation produced by the Ensembl-Havana team and the Ensembl-genebuild automated gene annotation. The GENCODE annotation is the default gene annotation displayed in the Ensembl browser. The GENCODE releases coincide with the Ensembl releases, although GENCODE can skip an Ensembl release if there is no update to the annotation with respect to the previous release. In practical terms, the GENCODE annotation is essentially identical to the Ensembl annotation.

However, GENCODE handles pseudoautosomal regions (PARs) differently from Ensembl. The Ensembl GTF file includes this annotation only once, for chromosome X, whereas GENCODE GTF/GFF3 files include the annotation in the PARs of both chromosomes. In GENCODE files, the chromosome Y copies of these genes carry a "_PAR_Y" suffix.

Additionally, GENCODE GFF/GTF files import with a gene identifier containing a version suffix, which differs slightly from the Ensembl GFF/GTF spec (e.g. GENCODE: ENSG00000000003.14; Ensembl: ENSG00000000003).
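The identifier differences described above can be handled with plain string operations. The sketch below is not the package's implementation: stripVersion() is a hypothetical helper, and the second identifier is only an illustrative example of the "_PAR_Y" style.

```r
## Illustrative GENCODE-style gene identifiers.
gencodeIds <- c(
    "ENSG00000000003.14",
    "ENSG00000182378.14_PAR_Y"
)

## Flag the chromosome Y PAR copies by their suffix.
isParY <- grepl("_PAR_Y$", gencodeIds)

## Hypothetical helper: drop the ".<digits>" version and any "_PAR_Y" suffix
## to obtain Ensembl-style identifiers.
stripVersion <- function(x) {
    sub("\\.[0-9]+(_PAR_Y)?$", "", x)
}

noVersion <- stripVersion(gencodeIds)
```

Note that dropping the "_PAR_Y" suffix makes the chromosome Y copy indistinguishable from its chromosome X counterpart, so identifiers processed this way may no longer be unique.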

The GENCODE FAQ has additional details.

RefSeq

Refer to the current RefSeq spec for details.

Example URLs:

  • RefSeq Homo sapiens GRCh38.p12 GTF, GFF3

See also:

  • RefSeq FAQ

  • ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz

UCSC

Example URLs:

Related URLs:

FlyBase

Example URLs:

  • FlyBase Drosophila melanogaster r6.24 GTF, GFF3

WormBase

Example URLs:

  • WormBase Caenorhabditis elegans WS267 GTF, GFF3

Examples

## Some examples here are commented because they are CPU-intensive and
## can cause CI timeouts.

## Ensembl ====
file <- pasteURL(
    "ftp.ensembl.org",
    "pub",
    "release-102",
    "gtf",
    "homo_sapiens",
    "Homo_sapiens.GRCh38.102.gtf.gz",
    protocol = "ftp"
)
genes <- makeGRangesFromGFF(
    file = file,
    level = "genes",
    ignoreVersion = FALSE
)
#> → Making `GRanges` from GFF file (Homo_sapiens.GRCh38.102.gtf.gz).
#> → Getting GFF metadata for Homo_sapiens.GRCh38.102.gtf.gz.
#> → Importing 104a6592d9a95_Homo_sapiens.GRCh38.102.gtf.gz at /opt/koopa/opt/r/cache/AcidGenomes using rtracklayer::`import()`.
#> → Defining names by `geneId` column in `mcols`.
summary(genes)
#> [1] "EnsemblGenes object with 60675 ranges and 9 metadata columns"
## > transcripts <- makeGRangesFromGFF(
## >     file = file,
## >     level = "transcripts",
## >     ignoreVersion = FALSE
## > )
## > summary(transcripts)

## GENCODE ====
## > file <- pasteURL(
## >     "ftp.ebi.ac.uk",
## >     "pub",
## >     "databases",
## >     "gencode",
## >     "Gencode_human",
## >     "release_36",
## >     "gencode.v36.annotation.gtf.gz",
## >     protocol = "ftp"
## > )
## > genes <- makeGRangesFromGFF(file = file, level = "genes")
## > summary(genes)
## > transcripts <- makeGRangesFromGFF(file = file, level = "transcripts")
## > summary(transcripts)

## RefSeq ====
## > file <- pasteURL(
## >     "ftp.ncbi.nlm.nih.gov",
## >     "genomes",
## >     "refseq",
## >     "vertebrate_mammalian",
## >     "Homo_sapiens",
## >     "all_assembly_versions",
## >     "GCF_000001405.39_GRCh38.p13",
## >     "GCF_000001405.39_GRCh38.p13_genomic.gff.gz",
## >     protocol = "ftp"
## > )
## > genes <- makeGRangesFromGFF(file = file, level = "genes")
## > summary(genes)
## > transcripts <- makeGRangesFromGFF(file = file, level = "transcripts")
## > summary(transcripts)

## UCSC ====
## > file <- pasteURL(
## >     "hgdownload.soe.ucsc.edu",
## >     "goldenPath",
## >     "hg38",
## >     "bigZips",
## >     "genes",
## >     "hg38.ensGene.gtf.gz",
## >     protocol = "ftp"
## > )
## > genes <- makeGRangesFromGFF(file = file, level = "genes")
## > summary(genes)
## > transcripts <- makeGRangesFromGFF(file = file, level = "transcripts")
## > summary(transcripts)