Title: | A Tidy Implementation of 'ESTIMATE' |
---|---|
Description: | The 'ESTIMATE' package infers tumor purity from expression data as a function of immune and stromal infiltrate, but requires writing intermediate files, cannot be piped, and performs poorly when presented with modern datasets that use current gene symbols. 'tidyestimate' is a fast, tidy, modern reimagination of 'ESTIMATE' (2013) <doi:10.1038/ncomms3612>. |
Authors: | Kai Aragaki [aut, cre] |
Maintainer: | Kai Aragaki <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.1.1.9000 |
Built: | 2025-02-01 05:13:36 UTC |
Source: | https://github.com/KaiAragaki/tidyestimate |
As the ESTIMATE model was trained on a specific set of genes, only those within this dataset should be included before running estimate_score().
These are the genes common to 6 platforms:
- Affymetrix HG-U133Plus2.0
- Affymetrix HT-HG-U133A
- Affymetrix Human X3P
- Agilent 4x44K (G4112F)
- Agilent G4502A
- Illumina HiSeq RNA sequence
The Entrez IDs for the original 10412 genes were matched to HGNC symbols using biomaRt. Duplicates and blank entries were filtered. As some have now been discovered to be pseudogenes or have been deprecated, 22 genes (at time of writing, June 2021) that were in the ESTIMATE package do not exist here.
As one gene can have multiple synonyms/aliases, and there is only one alias per line, the number of rows in the data frame (26339) does not reflect the number of unique genes in the dataset (10391).
common_genes
A data frame with 26339 rows and 3 variables:
Entrez id of the gene
Human Genome Organisation (HUGO) Gene Nomenclature Committee symbol
A synonym/alias a given gene may go by or previously went by
The ESTIMATE model was trained on a set of genes shared between six expression profiling platforms. Those genes are listed in this dataset.
Infer tumor purity by using single-sample gene-set-enrichment-analysis with stromal and immune cell signatures.
estimate_score(df, is_affymetrix)
df |
a |
is_affymetrix |
logical. Is the expression data from an Affymetrix array? |
ESTIMATE (and this tidy implementation) infers tumor infiltration using two gene sets: a stromal signature and an immune signature (see tidyestimate::gene_sets).
Enrichment scores for each sample are calculated using an implementation of single sample Gene Set Enrichment Analysis (ssGSEA). Briefly, expression is ranked on a per-sample basis, and the density and distribution of gene signature 'hits' is determined. An enrichment of hits at the top of the expression ranking confers a positive score, while an enrichment of hits at the bottom of the expression ranking confers a negative score.
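The ranking and running-sum idea described above can be sketched in base R. This is a simplified illustration in the spirit of Barbie et al. (2009), not the code tidyestimate actually runs; the function name ssgsea_sketch, the weighting exponent alpha = 0.25, and the toy data are assumptions made for the example.

```r
# Simplified single-sample enrichment score (illustrative sketch only).
# expr:     named numeric vector of expression values for ONE sample
# gene_set: character vector of signature genes
# alpha:    exponent applied to ranks when weighting hits (assumed value)
ssgsea_sketch <- function(expr, gene_set, alpha = 0.25) {
  ranks <- rank(expr, ties.method = "average")  # higher expression -> larger rank
  ord <- order(ranks, decreasing = TRUE)        # most highly expressed gene first
  genes <- names(expr)[ord]
  weights <- ranks[ord]^alpha

  is_hit <- genes %in% gene_set
  stopifnot(any(is_hit), any(!is_hit))

  # Running fraction of (weighted) hits and of misses while walking down the ranking
  p_hit <- cumsum(ifelse(is_hit, weights, 0)) / sum(weights[is_hit])
  p_miss <- cumsum(!is_hit) / sum(!is_hit)

  # Sum of the differences between the two running fractions
  sum(p_hit - p_miss)
}

# Toy sample: g1 is the most expressed gene, g10 the least
expr <- setNames(10:1, paste0("g", 1:10))

es_top <- ssgsea_sketch(expr, c("g1", "g2", "g3"))      # hits at the top
es_bottom <- ssgsea_sketch(expr, c("g8", "g9", "g10"))  # hits at the bottom
```

In this toy sample, a signature concentrated among the most highly expressed genes yields a positive score, and one concentrated among the least expressed genes yields a negative score.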
An 'ESTIMATE' score is calculated by adding the stromal and immune scores together.
For Affymetrix arrays, an equation to convert an ESTIMATE score to a prediction of tumor purity has been developed by Yoshihara et al. (see references). It takes the approximate form of:
Values have been rounded to two significant figures for display purposes.
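As a hedged illustration of the two steps above (summing the scores, then converting to purity), the sketch below uses the cosine relationship and coefficients reported by Yoshihara et al. (2013) as best recalled here; verify them against the reference before relying on them. The function name estimate_to_purity and the toy scores are hypothetical and not part of tidyestimate.

```r
# Convert an ESTIMATE score to a predicted purity (Affymetrix data only).
# Coefficients are taken from Yoshihara et al. (2013) as recalled; treat them
# as an assumption and check the reference. Rounded to two significant
# figures they read cos(0.60 + 0.00015 * ESTIMATE).
estimate_to_purity <- function(estimate_score) {
  cos(0.6049872018 + 0.0001467884 * estimate_score)
}

# Toy stromal and immune scores for two hypothetical samples
scores <- data.frame(
  sample  = c("s1", "s2"),
  stromal = c(-500, 1200),
  immune  = c(-300, 900)
)

# The ESTIMATE score is the sum of the stromal and immune scores
scores$estimate <- scores$stromal + scores$immune

# Predicted purity falls as the ESTIMATE score rises
scores$purity <- estimate_to_purity(scores$estimate)
```

Within the range of these toy values, the sample with the lower ESTIMATE score receives the higher predicted purity, which matches the interpretation given for the scores.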
A data.frame with sample names and stromal, immune, and ESTIMATE scores per tumor. If is_affymetrix = TRUE, purity scores are included as well.
Purity scores can be interpreted absolutely: a purity of 0.9 means that tumor is likely 90% pure. When purity scores are not available (such as in RNAseq), ESTIMATE scores can only be interpreted relatively: a sample with a lower ESTIMATE score than another in the same study can be regarded as more pure, but its absolute purity cannot be inferred, nor can purity be compared across studies.
Barbie et al. (2009) <doi:10.1038/nature08460>
Yoshihara et al. (2013) <doi:10.1038/ncomms3612>
filter_common_genes(ov, id = "hgnc_symbol", tidy = FALSE, tell_missing = TRUE, find_alias = TRUE) |> estimate_score(is_affymetrix = TRUE)
As ESTIMATE score calculation is sensitive to the number of genes used, a set of common genes used between six platforms has been established (see ?tidyestimate::common_genes). This function will filter for only those genes.
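Conceptually, the filtering step amounts to keeping only the rows whose identifier appears in the common gene list. A minimal base-R sketch of that idea follows; the helper filter_common_sketch, the placeholder gene names, and the toy matrix are hypothetical, and the real function additionally handles tidy input, Entrez IDs, and alias lookup.

```r
# Keep only the rows of an expression matrix whose rownames are common genes.
# expr:           numeric matrix, genes in rows (rownames), samples in columns
# common_symbols: character vector of identifiers to keep
filter_common_sketch <- function(expr, common_symbols) {
  keep <- rownames(expr) %in% common_symbols
  # Report how many of the common genes were found in the data
  message(sum(keep), " of ", length(unique(common_symbols)), " common genes found")
  expr[keep, , drop = FALSE]
}

# Toy data with placeholder gene names
expr <- matrix(
  1:8, nrow = 4,
  dimnames = list(c("GENE_A", "GENE_X", "GENE_B", "GENE_Y"), c("s1", "s2"))
)
common_symbols <- c("GENE_A", "GENE_B", "GENE_C")

filtered <- filter_common_sketch(expr, common_symbols)
```

Here GENE_C is in the common list but absent from the data, which is the situation the tell_missing and find_alias arguments are meant to address.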
filter_common_genes( df, id = c("entrezgene_id", "hgnc_symbol"), tidy = FALSE, tell_missing = TRUE, find_alias = FALSE )
df |
a |
id |
either |
tidy |
logical. If rownames contain gene identifier, set |
tell_missing |
logical. If |
find_alias |
logical. If |
The find_alias argument will attempt to find aliases for HGNC symbols that are in tidyestimate::common_genes but missing from the provided dataset. This will only run if find_alias = TRUE and id = "hgnc_symbol".
This algorithm is very conservative: It will only make a match if the gene from the common genes has only one alias that matches with only one gene from the provided dataset, and the gene from the provided dataset with which it matches only matches with a single gene from the list of common genes. (Note that a single gene may have many aliases). Once a match has been made, the gene in the provided dataset is updated to the gene name in the common gene list.
While this method is fairly accurate, it is also a heuristic. Therefore, it is disabled by default. Users should check which genes are being reassigned to ensure accuracy.
The method used to generate these aliases is described in ?tidyestimate::common_genes.
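The conservative one-to-one matching rule described above can be sketched in base R. This is an illustration of the rule as stated, not the package's implementation; the function match_aliases_sketch, the column name alias, and the toy gene names are assumptions made for the example.

```r
# Find unambiguous alias matches between a common-gene table and a dataset.
# dataset_genes: character vector of gene symbols in the user's data
# common:        data frame with columns hgnc_symbol and alias (one alias per row)
match_aliases_sketch <- function(dataset_genes, common) {
  # Common genes absent from the data, and data genes that are not common genes
  missing <- setdiff(unique(common$hgnc_symbol), dataset_genes)
  unmatched <- setdiff(dataset_genes, common$hgnc_symbol)

  # Candidate pairs: a missing common gene whose alias appears in the data
  cand <- unique(common[
    common$hgnc_symbol %in% missing & common$alias %in% unmatched,
    c("hgnc_symbol", "alias")
  ])

  # Keep a pair only if the common gene matched exactly one data gene
  # and that data gene matched exactly one common gene
  one_alias <- !(cand$hgnc_symbol %in% cand$hgnc_symbol[duplicated(cand$hgnc_symbol)])
  one_symbol <- !(cand$alias %in% cand$alias[duplicated(cand$alias)])
  cand[one_alias & one_symbol, , drop = FALSE]
}

common <- data.frame(
  hgnc_symbol = c("A", "A", "B", "C", "D", "E", "E", "G"),
  alias       = c("a1", "a2", "b1", "x", "x", "e1", "e2", "g1"),
  stringsAsFactors = FALSE
)
dataset_genes <- c("a1", "b1", "x", "e1", "e2", "G", "g1")

matches <- match_aliases_sketch(dataset_genes, common)

# Update the dataset to use the common gene names for accepted matches
renamed <- dataset_genes
renamed[match(matches$alias, renamed)] <- matches$hgnc_symbol
```

In this toy case only A and B are reassigned: C and D compete for the same dataset gene x, E matches two dataset genes, and G is already present, so none of those is touched.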
A tibble, with gene identifiers as the first column
filter_common_genes(ov, id = "hgnc_symbol", tidy = FALSE, tell_missing = TRUE, find_alias = FALSE)
Two gene sets, each 141 genes in length, created to infer stromal and immune infiltration
gene_sets
A data frame with 141 rows and 2 variables:
Geneset of HGNC symbols used to infer tumor stromal cell infiltration
Geneset of HGNC symbols used to infer tumor immune cell infiltration
A matrix containing RNA expression of 10 ovarian cancer tumors, measured using the Affymetrix U133Plus2.0 platform. These data have been rounded to the 4th decimal place to reduce file size.
ov
A matrix with 17256 rows and 10 columns, where each column represents a tumor, and each row represents a gene. Genes are represented by HGNC symbols in the rownames.
Plot Affymetrix purity scores against ESTIMATE study purity scores
plot_purity(scores, is_affymetrix)
scores |
a |
is_affymetrix |
logical. Are these data from an Affymetrix experiment?
Must be |
a ggplot
filter_common_genes(ov, id = "hgnc_symbol", tidy = FALSE, tell_missing = TRUE, find_alias = TRUE) |> estimate_score(is_affymetrix = TRUE) |> plot_purity(is_affymetrix = TRUE)
A data frame containing the ABSOLUTE-measured and ESTIMATE-predicted purity values of 995 tumors, along with stromal and immune scores as calculated by ESTIMATE. All tumors were profiled on Affymetrix arrays and were used to generate the Affymetrix algorithm.
purity_data_affy
A data frame with 995 rows and 7 variables:
The purity of a tumor given by ABSOLUTE, ranging from 0 (least pure) to 1 (most pure)
Stromal infiltration score, as measured by ESTIMATE
Immune infiltration score, as measured by ESTIMATE
ESTIMATE score, calculated by the sum of immune and stromal scores
Tumor purity inferred using the ESTIMATE algorithm
Lower bound of a 95% confidence interval of predicted purity scores
Upper bound of a 95% confidence interval of predicted purity scores
tidyestimate is a lightweight, fast, pipe-friendly re-imagination of the ESTIMATE package. It is used to infer tumor purity from expression data.
Author (tidyestimate):
* Kai Aragaki ([ORCID](http://orcid.org/0000-0002-9458-0426)) (author, maintainer)
Authors (ESTIMATE):
* Kosuke Yoshihara [email protected] (author)
* P. Roebuck [email protected] (author, copyright holder)
https://www.nature.com/articles/ncomms3612