---
title: "Using tidyestimate"
author: "Kai Aragaki"
date: "`r format(Sys.Date(), '%Y-%m%-%d')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using tidyestimate}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(tidyestimate)
```

# Introduction

`tidyestimate` is a tidy implementation of the `estimate` package and paper (Yoshihara et al. 2013).

`estimate` infers tumor purity from gene expression data by performing gene-set-enrichment-analysis (GSEA)(Subramanian et al. 2005) on a single-sample basis (ssGSEA)(Barbie et al. 2009), using both a stromal and immune gene set. By adding these two signatures together, an ESTIMATE score can then be converted into an inferred purity score.

Whereas `estimate` inputs and outputs were largely `.GCT` files, `tidyestimate` simplifies the `estimate` package by taking a data frame, tibble, or matrix as an input, and returning a tibble. This eliminates the need for intermediate files, reduces execution time, and allows for tidier (pipe-compatible) code. Furthermore, `tidyestimate` provides an optional conservative alias-matching algorithm to capture genes that may be listed under a different name within the dataset.

# Use

The dataset we will be using is a set of 10 ovarian cancer tumors, whose expression was profiled using Affymetrix arrays

```{r}
dim(ov)
```

```{r}
head(ov[, 1:5])
```


This dataset is a matrix, but it could just as easily be a `tibble` or `data.frame`. The identifiers can either be HGNC symbols or Entrez IDs - here we have HGNC symbols.

The ESTIMATE algorithm is sensitive to number of genes included. As such, we will filter for genes that are common to six different expression profiling platforms (see `?tidyestimate::common_genes` for more information).

```{r}
filtered <- filter_common_genes(ov, 
                                id = "hgnc_symbol", 
                                tidy = FALSE, 
                                tell_missing = TRUE, 
                                find_alias = TRUE)
```

You will notice there are several arguments:

- The first is the dataset (this allows for piping, if need be - as is the case for all functions in `tidyestimate`). 

- `id` specifies the type of gene ID. As mentioned, it can either be an HGNC symbol (`"hgnc_symbol"`) or an Entrez ID (`"entrezgene_id"`).

- The `tidy` argument allows you to specify if the IDs are rownames, or if they are in the first column of the provided input. Set `tidy = TRUE` if identifiers are in the first column of the data, and `FALSE` if they are encoded in the rownames. This option for an input of a tibble, which does not allow for rownames. It lacks a default value to ensure the operator is aware of the decision they are making, lest one sample be mistaken for names or vice-versa.

- The `tell_missing` argument determines whether a message should be emitted about genes that failed to find a match in the dataset.

- The `find_alias` argument determine whether, upon failing to find some of the common genes in your dataset, it should see if they are listed under an alias. The method used is quite conservative: it will only assume a match is a match if there is a one-to-one match between one gene name and one alias in the provided dataset.

After we have put our dataset on common ground, we can calculate the ESTIMATE score for each sample:

```{r}
scored <- estimate_score(filtered,
                         is_affymetrix = TRUE)
scored
```

Note that the `estimate_score` function takes an `is_affymetrix` argument. This argument determines whether or not a purity score will be calculated for the samples. As the data used to train the model to convert ESTIMATE scores to purity scores were produced by Affymetrix, it is unwise to infer a tumor purity score using the same method for RNAseq data. However, stromal and immune infiltration as well as ESTIMATE scores can be used to measure relative purity in RNAseq samples (versus absolute 0 to 1 purity inferred for Affymetrix samples). For instance, sample `s516` has less stroma than it does immune cell infiltrate. Further, it has more stromal infiltration than sample `s518`. By looking at the ESTIMATE scores, we can see that `s516` is less pure than `s518`. This can be said absolutely with the purity values: `s516` is roughly 83% pure, a metric that can be used both within and across studies.

In the case of Affymetrix samples, where a purity score can be calculated, you may want to see where your sample stands in the model generated by Yoshihara et al. This can be done simply with plotting:

```{r, fig.width=6, fig.height=6}
plot_purity(scored, is_affymetrix = TRUE)
```

The `is_affymetrix` argument is a bit of a false choice - if `FALSE`, the function will remind the user that purity scores for non-Affymetrix data are not supported, then stop.

On this plot, gray circles represent the Affymetrix samples used to train the model. Their tumor purity was measured using the ABSOLUTE method (Carter et al. 2012), and a model was fit using an evolutionary model that takes the form of

$$\cos(0.6049872018 + 0.0001467884 * ESTIMATE)$$

The red circles, then, represent the input samples, and have been labeled by their sample name (column names in the original matrix)