Package 'tidyestimate' reference manual

Title:	A Tidy Implementation of 'ESTIMATE'
Description:	The 'ESTIMATE' package infers tumor purity from expression data as a function of immune and stromal infiltrate, but requires writing of intermediate files, is un-pipeable, and performs poorly when presented with modern datasets with current gene symbols. 'tidyestimate' is a fast, tidy, modern reimagination of 'ESTIMATE' (2013) <doi:10.1038/ncomms3612>.
Authors:	Kai Aragaki [aut, cre], Paul Roebuck [cph] (Copyright holder of ESTIMATE package), Kosuke Yoshihara [aut] (Author of original ESTIMATE algorithm), Rahulsimham Vegesna [aut] (Author of original ESTIMATE algorithm), Hoon Kim [aut] (Author of original ESTIMATE algorithm), Roel Verhaak [aut] (Author of original ESTIMATE algorithm)
Maintainer:	Kai Aragaki <[email protected]>
License:	GPL (>= 2)
Version:	1.1.1.9000
Built:	2025-04-02 05:06:49 UTC
Source:	https://github.com/KaiAragaki/tidyestimate

Genes shared between six expression platforms

Description

As the ESTIMATE model was trained on a specific set of genes, only genes within this dataset should be included before running estimate_score.

These are the genes common to 6 platforms:

- Affymetrix HG-U133Plus2.0

- Affymetrix HT-HG-U133A

- Affymetrix Human X3P

- Agilent 4x44K (G4112F)

- Agilent G4502A

- Illumina HiSeq RNA sequence

The Entrez IDs for the original 10412 genes were matched to HGNC symbols using biomaRt. Duplicates and blank entries were filtered. As some have now been discovered to be pseudogenes or have been deprecated, 22 genes (at time of writing, June 2021) that were in the ESTIMATE package do not exist here.

As one gene can have multiple synonyms/aliases, and there is only one alias per line, the number of rows in the data frame (26339) does not reflect the number of unique genes in the dataset (10391).
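The gap between row count and unique gene count can be checked directly. The sketch below uses a small made-up table with the same three columns as common_genes (all values are hypothetical) rather than the packaged data.

```r
# Toy table shaped like common_genes; the values are illustrative only.
toy_common <- data.frame(
  entrezgene_id    = c(1L, 1L, 2L, 3L, 3L, 3L),
  hgnc_symbol      = c("GENEA", "GENEA", "GENEB", "GENEC", "GENEC", "GENEC"),
  external_synonym = c("A1", "A2", "B1", "C1", "C2", "C3"),
  stringsAsFactors = FALSE
)

# One row per alias, so rows outnumber genes
n_rows  <- nrow(toy_common)
n_genes <- length(unique(toy_common$hgnc_symbol))
```

The same two expressions can be run on tidyestimate::common_genes once the package is loaded.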

Usage

common_genes

Format

A data frame with 26339 rows and 3 variables:

entrezgene_id: Entrez ID of the gene
hgnc_symbol: Human Genome Organisation (HUGO) Gene Nomenclature Committee symbol
external_synonym: A synonym/alias a given gene may go by or previously went by

Details

The ESTIMATE model was trained on a set of genes shared between six expression profiling platforms. Those genes are listed in this dataset.

Source

https://r-forge.r-project.org/scm/viewvc.php/pkg/estimate/data/common_genes.RData?root=estimate&view=log

Infer tumor purity using the ESTIMATE algorithm

Description

Infer tumor purity by using single-sample gene-set-enrichment-analysis with stromal and immune cell signatures.

Usage

estimate_score(df, is_affymetrix)

Arguments

`df`	a `data.frame` of expression data, where columns are tumors and rows are genes. Gene names must be in the first column, and in the form of HGNC symbols.
`is_affymetrix`	logical. Is the expression data from an Affymetrix array?

Details

ESTIMATE (and this tidy implementation) infers tumor infiltration using two gene sets: a stromal signature, and an immune signature (see tidyestimate::gene_sets).

Enrichment scores for each sample are calculated using an implementation of single sample Gene Set Enrichment Analysis (ssGSEA). Briefly, expression is ranked on a per-sample basis, and the density and distribution of gene signature 'hits' is determined. An enrichment of hits at the top of the expression ranking confers a positive score, while an enrichment of hits at the bottom of the expression ranking confers a negative score.
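To make the ranking idea concrete, here is a deliberately simplified, ssGSEA-style score for a single sample. It is a sketch for illustration and not the implementation used by tidyestimate; the function name toy_enrichment, the weighting exponent alpha = 0.25, and the gene names are all assumptions.

```r
# Simplified, illustrative ssGSEA-style score for one sample.
# Not the tidyestimate implementation; alpha = 0.25 is an assumed weighting.
toy_enrichment <- function(expr, signature, alpha = 0.25) {
  # Order genes from highest to lowest expression
  ord   <- order(expr, decreasing = TRUE)
  genes <- names(expr)[ord]
  n     <- length(genes)
  # Rank-based weights: the top gene gets weight n^alpha, the bottom gene 1
  w   <- (n:1)^alpha
  hit <- genes %in% signature
  # Running cumulative distributions of hits (weighted) and misses (unweighted)
  p_hit  <- cumsum(ifelse(hit, w, 0)) / sum(w[hit])
  p_miss <- cumsum(!hit) / sum(!hit)
  # Sum of the differences between the two distributions
  sum(p_hit - p_miss)
}

# Ten hypothetical genes, g1 most expressed and g10 least expressed
expr <- setNames(as.numeric(10:1), paste0("g", 1:10))

top_score    <- toy_enrichment(expr, c("g1", "g2"))   # hits at the top
bottom_score <- toy_enrichment(expr, c("g9", "g10"))  # hits at the bottom
```

With hits concentrated at the top of the ranking the score is positive, and with hits at the bottom it is negative, matching the description above. The sketch assumes the signature has at least one gene in the sample and does not cover every gene.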

An 'ESTIMATE' score is calculated by adding the stromal and immune scores together.

For Affymetrix arrays, an equation to convert an ESTIMATE score to a prediction of tumor purity has been developed by Yoshihara et al. (see references). It takes the approximate form of:

$\mathrm{purity} = \cos(0.61 + 0.00015 \times \mathrm{ESTIMATE})$

Values have been rounded to two significant figures for display purposes.
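As a rough illustration, the approximate conversion can be written as a small function. This sketch uses the rounded coefficients shown above, so its results may differ slightly from what the package returns; the function name approx_purity and the input scores are hypothetical.

```r
# Approximate Affymetrix purity conversion using the rounded coefficients
# shown above; the package itself may use more precise values.
approx_purity <- function(stromal, immune) {
  estimate <- stromal + immune  # ESTIMATE score is the sum of both scores
  cos(0.61 + 0.00015 * estimate)
}

# Hypothetical scores for three samples
stromal <- c(-500, 0, 800)
immune  <- c(-300, 0, 1200)
purity  <- approx_purity(stromal, immune)
```

Within the range used here, a higher ESTIMATE score gives a lower predicted purity.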

Value

A data.frame with sample names and stromal, immune, and ESTIMATE scores per tumor. If is_affymetrix = TRUE, purity scores are included as well.

Purity scores can be interpreted absolutely: a purity of 0.9 means that tumor is likely 90% pure. When purity scores are not available (such as in RNAseq), ESTIMATE scores can only be interpreted relatively: a sample with a lower ESTIMATE score than another in the same study can be regarded as more pure, but its absolute purity cannot be inferred, nor can purity be compared across studies.
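For data without purity scores, a relative ordering within one study is the most that can be read from the output. The sketch below ranks samples by ESTIMATE score; the data frame is made up, and its column names (sample, stromal, immune, estimate) are assumed to mirror the output of estimate_score.

```r
# Hypothetical scores in the assumed shape of estimate_score() output
# when is_affymetrix = FALSE; the values are made up for illustration.
scores <- data.frame(
  sample  = c("s1", "s2", "s3"),
  stromal = c(120, -450, 900),
  immune  = c(300, -200, 1500),
  stringsAsFactors = FALSE
)
scores$estimate <- scores$stromal + scores$immune

# Lower ESTIMATE score = relatively purer, within this study only
ranked <- scores[order(scores$estimate), ]
ranked$purity_rank <- seq_len(nrow(ranked))
```

The resulting rank says nothing about absolute purity and should not be compared with ranks from another study.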

References

Barbie et al. (2009) <doi:10.1038/nature08460>

Yoshihara et al. (2013) <doi:10.1038/ncomms3612>

Examples

filter_common_genes(ov, id = "hgnc_symbol", tidy = FALSE, tell_missing = TRUE, find_alias = TRUE) |> 
  estimate_score(is_affymetrix = TRUE)

Remove non-common genes from data frame

Description

As ESTIMATE score calculation is sensitive to the number of genes used, a set of common genes used between six platforms has been established (see ?tidyestimate::common_genes). This function will filter for only those genes.

Usage

filter_common_genes(
  df,
  id = c("entrezgene_id", "hgnc_symbol"),
  tidy = FALSE,
  tell_missing = TRUE,
  find_alias = FALSE
)

Arguments

`df`	a `data.frame` of RNA expression values, with columns corresponding to samples, and rows corresponding to genes. Either rownames or the first column can contain gene IDs (see `tidy`)
`id`	either `"entrezgene_id"` or `"hgnc_symbol"`, whichever `df` contains.
`tidy`	logical. If rownames contain gene identifier, set `FALSE`. If first column contains gene identifier, set `TRUE`
`tell_missing`	logical. If `TRUE`, prints message of genes in common gene set that are not in supplied data frame.
`find_alias`	logical. If `TRUE` and `id = "hgnc_symbol"`, will attempt to find whether genes in `common_genes` that are missing from `df` are present under an alias. See details for more information.
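The two input layouts that `tidy` distinguishes can be sketched in base R. The gene names, sample names, and values below are made up; only the layouts themselves follow the argument descriptions above.

```r
# Made-up expression values for three hypothetical genes and two samples
expr <- matrix(
  c(5.1, 7.3, 2.2, 4.8, 6.9, 3.0),
  nrow = 3,
  dimnames = list(c("GENEA", "GENEB", "GENEC"), c("tumor1", "tumor2"))
)

# Layout for tidy = FALSE: gene identifiers live in the rownames
df_rownames <- as.data.frame(expr)

# Layout for tidy = TRUE: gene identifiers live in the first column
df_first_col <- cbind(
  data.frame(hgnc_symbol = rownames(expr), stringsAsFactors = FALSE),
  as.data.frame(expr)
)
rownames(df_first_col) <- NULL
```

With the package installed, df_rownames would be passed with tidy = FALSE and df_first_col with tidy = TRUE.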

Details

The find_alias argument will attempt to find aliases for HGNC symbols that are in tidyestimate::common_genes but missing from the provided dataset. This will only run if find_alias = TRUE and id = "hgnc_symbol".

This algorithm is very conservative: It will only make a match if the gene from the common genes has only one alias that matches with only one gene from the provided dataset, and the gene from the provided dataset with which it matches only matches with a single gene from the list of common genes. (Note that a single gene may have many aliases). Once a match has been made, the gene in the provided dataset is updated to the gene name in the common gene list.
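The one-to-one rule described above can be sketched in base R. This is an illustration of the matching logic as stated in this section, not the package's own code; the function name match_aliases, the alias table, and the gene names are all hypothetical.

```r
# Illustrative one-to-one alias matching; not the package's implementation.
match_aliases <- function(missing_genes, provided, alias_table) {
  # Candidate rows: aliases of missing common genes that appear in the data
  cand <- alias_table[alias_table$hgnc_symbol %in% missing_genes &
                        alias_table$external_synonym %in% provided, ]
  cand <- unique(cand)
  # Count candidate matches per common gene and per provided gene
  n_per_common   <- table(cand$hgnc_symbol)
  n_per_provided <- table(cand$external_synonym)
  # Keep only pairs that are unambiguous in both directions
  keep <- n_per_common[cand$hgnc_symbol] == 1 &
    n_per_provided[cand$external_synonym] == 1
  cand[as.vector(keep), , drop = FALSE]
}

# Made-up alias table in the shape of common_genes
aliases <- data.frame(
  hgnc_symbol      = c("GENEA", "GENEA", "GENEB", "GENEC", "GENED"),
  external_synonym = c("A1", "A2", "B1", "SHARED", "SHARED"),
  stringsAsFactors = FALSE
)

matched <- match_aliases(
  missing_genes = c("GENEA", "GENEB", "GENEC", "GENED"),
  provided      = c("A1", "A2", "B1", "SHARED"),
  alias_table   = aliases
)
```

Here GENEA is skipped because two of its aliases appear in the data, GENEC and GENED are skipped because they compete for the same alias, and only GENEB is matched (to B1).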

While this method is fairly accurate, it is also a heuristic. Therefore, it is disabled by default. Users should check which genes are reassigned to ensure accuracy.

The method of generation of these aliases can be found at ?tidyestimate::common_genes

Value

A tibble, with gene identifiers as the first column

Examples

filter_common_genes(ov, id = "hgnc_symbol", tidy = FALSE, tell_missing = TRUE, find_alias = FALSE)

Gene sets to infer tumor stromal and immune infiltration

Description

Two gene sets, each 141 genes in length, created to infer stromal and immune infiltration

Usage

gene_sets

Format

A data frame with 141 rows and 2 variables:

stromal_signature: Geneset of HGNC symbols used to infer tumor stromal cell infiltration
immune_signature: Geneset of HGNC symbols used to infer tumor immune cell infiltration
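Because each signature is stored as a column, checking how much of a signature a dataset covers is a simple membership test. The sketch below uses a three-row stand-in with made-up symbols instead of the packaged 141-row data frame.

```r
# Toy stand-in for gene_sets: one signature per column (symbols are made up;
# the packaged data frame has 141 rows).
toy_sets <- data.frame(
  stromal_signature = c("STR1", "STR2", "STR3"),
  immune_signature  = c("IMM1", "IMM2", "IMM3"),
  stringsAsFactors = FALSE
)

# Hypothetical gene symbols present in an expression dataset
dataset_genes <- c("STR1", "STR3", "IMM2", "OTHER1")

# Count how many genes of each signature the dataset covers
coverage <- vapply(
  toy_sets,
  function(sig) sum(sig %in% dataset_genes),
  integer(1)
)
```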

Source

https://r-forge.r-project.org/scm/viewvc.php/pkg/estimate/data/SI_geneset.RData?root=estimate&view=log

Ovarian cancer tumor RNA expression

Description

A matrix containing RNA expression of 10 ovarian cancer tumors, measured using the Affymetrix U133Plus2.0 platform. These data have been rounded to the 4th decimal place to reduce file size.

Usage

ov

Format

A matrix with 17256 rows and 10 columns, where each column represents a tumor, and each row represents a gene. Genes are represented by HGNC symbols in the rownames.

Source

https://r-forge.r-project.org/scm/viewvc.php/pkg/estimate/inst/extdata/sample_input.txt?root=estimate&view=log

Plot Affymetrix purity scores against ESTIMATE study purity scores

Description

Plot Affymetrix purity scores against ESTIMATE study purity scores

Usage

plot_purity(scores, is_affymetrix)

Arguments

`scores`	a `data.frame`, usually one output from `estimate_score`
`is_affymetrix`	logical. Are these data from an Affymetrix experiment? Must be `TRUE` - this is essentially a verification from the user

Value

a ggplot

Examples

filter_common_genes(ov, id = "hgnc_symbol", tidy = FALSE, tell_missing = TRUE, find_alias = TRUE) |> 
  estimate_score(is_affymetrix = TRUE) |>
  plot_purity(is_affymetrix = TRUE)

Affymetrix data used to train ESTIMATE algorithm

Description

A data frame containing the ABSOLUTE-measured and ESTIMATE-predicted purity values of 995 tumors. It also includes stromal and immune scores as calculated by ESTIMATE. All tumors were profiled on Affymetrix arrays, and were used to generate the Affymetrix algorithm.

Usage

purity_data_affy

Format

A data frame with 995 rows and 7 variables:

purity_observed: The purity of a tumor given by ABSOLUTE, ranging from 0 (least pure) to 1 (most pure)
stromal: Stromal infiltration score, as measured by ESTIMATE
immune: Immune infiltration score, as measured by ESTIMATE
estimate: ESTIMATE score, calculated by the sum of immune and stromal scores
purity_predicted: Tumor purity inferred using the ESTIMATE algorithm
ci_95_low: Lower bound of a 95% confidence interval of predicted purity scores
ci_95_high: Upper bound of a 95% confidence interval of predicted purity scores
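One way to use these columns is to check how often the observed purity falls inside the reported interval. The rows below are made up and only reuse the column names documented above; they are not taken from purity_data_affy.

```r
# Made-up rows using the column names documented above
toy_purity <- data.frame(
  purity_observed  = c(0.80, 0.55, 0.92, 0.40),
  purity_predicted = c(0.78, 0.60, 0.85, 0.52),
  ci_95_low        = c(0.65, 0.47, 0.72, 0.45),
  ci_95_high       = c(0.91, 0.73, 0.98, 0.60)
)

# Does each observed purity fall inside the reported interval?
inside <- toy_purity$purity_observed >= toy_purity$ci_95_low &
  toy_purity$purity_observed <= toy_purity$ci_95_high

# Fraction of tumors whose observed purity is inside the interval
coverage <- mean(inside)
```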

Source

https://r-forge.r-project.org/scm/viewvc.php/pkg/estimate/data/PurityDataAffy.RData?root=estimate&view=log

tidyestimate: A modern implementation of the ESTIMATE algorithm

Description

tidyestimate is a lightweight, fast, pipe-friendly re-imagination of the ESTIMATE package. It is used to infer tumor purity from expression data.

Authors

Author (tidyestimate):

* Kai Aragaki ([ORCID](http://orcid.org/0000-0002-9458-0426)) (author, maintainer)

Authors (ESTIMATE):

* Kosuke Yoshihara [email protected] (author)

* P. Roebuck [email protected] (author, copyright holder)

Reference

https://www.nature.com/articles/ncomms3612
