---
title: "case-study"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{case-study}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup, message = FALSE}
library(reclanc)
library(aws.s3)
library(Biobase)
```
# Introduction
Let's consider a relatively full-featured, practical use case for `reclanc`. In this vignette, we'll go over the basics of fitting models, as well as how to leverage `tidymodels` to do more elaborate things like resampling and tuning hyperparameters. We'll fit a final model, then use that to predict subtypes of an entirely new dataset.
This vignette tries to assume very little knowledge about machine learning or `tidymodels`.
# Fitting
## A simple fit
Let's start with the fitting procedure. We first need gene expression data.
The data I'm using is from [Sjödahl et al. (2012)](https://pubmed.ncbi.nlm.nih.gov/22553347/). It contains RNA expression from 308 bladder cancer tumors.
```{r}
lund <- s3readRDS("lund.rds", "reclanc-lund", region = "us-east-2")
lund
```
In their paper, Sjödahl et al. used the transcriptional data to classify the tumors into seven molecular subtypes (MS):
```{r}
table(lund$molecular_subtype)
```
We'd like to apply this subtype framework to other datasets. To do this, we first need to generate centroids. Before we can begin, though, we need to convert our outcomes to factors. In this case, our outcomes are the molecular subtypes:
```{r}
lund$molecular_subtype <- factor(lund$molecular_subtype)
```
In its simplest form, since `clanc` accepts `ExpressionSet` objects, we could do the following and be done with it:
```{r}
simple_centroids <- clanc(lund, classes = "molecular_subtype", active = 5)
head(simple_centroids$centroids)
```
The problem with this method, though, is that we have no idea whether this is a good fit. `active` specifies the number of genes used as distinguishing features for a given class. In this case, each class will select 5 genes whose expression patterns are peculiar to that molecular subtype, and each subtype's centroid will contain 7 (the total number of subtypes) x 5 (the number of active genes) = 35 genes (see my [blog post](https://kai.rbind.io/posts/projects-reclanc/) or - better yet - [the original paper](http://www.ncbi.nlm.nih.gov/pubmed/16174683) for more details). Could we have gotten a better fit with more genes? Are we selecting more genes than we need? How would we know?
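As a rough sanity check of that arithmetic, we can count the distinct genes in the centroid table. This is just a sketch - it assumes the table printed above has a `gene` column with one row per gene/class pair, so adjust the column name to whatever your version of `reclanc` prints:
```{r, eval = FALSE}
# With 7 subtypes and active = 5, we expect 7 * 5 = 35 distinct genes
# (assumes a 'gene' column; adjust to match the printed centroid table)
length(unique(simple_centroids$centroids$gene))
```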
## Setting the stage for more elaborate analyses
Before we can get started on tackling these larger questions, let's take a brief detour to the land of [`tidymodels`](https://www.tidymodels.org/). `tidymodels` is a collection of packages that make running and tuning algorithms like this much less painful and much more standardized.
In order to leverage `tidymodels`, we need to buy in to their data structures.
(Aside: I don't mean to make the buy-in sound begrudging. When I say need, I really mean it: we're going to be specifying very long formulas, which for some reason R really, *really* hates. Emil Hvitfeldt has recently (at the time of writing) [allowed `tidymodels` to handle long formulas gracefully](https://github.com/tidymodels/recipes/pull/1283), so using `tidymodels` infrastructure is a gift, not a chore.)
```{r, message = FALSE}
library(tidymodels)
```
Many `tidymodels` workflows begin with a *model specification*. The rationale behind this is to separate the *model specification* step from the *model fitting* step (whereas in base R, the two generally happen at once). `reclanc` makes it easy to specify a model by adding a custom engine to `parsnip::discrim_linear`, so specifying a model looks like this:
```{r}
mod <- discrim_linear() |>
  set_engine(
    engine = "clanc", # Note: "clanc", not "reclanc"
    active = 5
  )
```
This `mod` doesn't do anything - and that's kind of the point: it only specifies the model we will later fit with, but doesn't do any fitting itself. This allows us to reuse the specification across our code.
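For instance, `parsnip::set_args()` modifies arguments on an existing specification, so we can derive a variant with a different number of active genes without re-declaring the whole thing:
```{r}
# Same model type and engine, just a different number of active genes;
# neither specification has touched any data yet
mod |> set_args(active = 10)
```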
The next step is to wrangle our data into a 'wide' format, where one column is the outcome (class), the remaining columns are predictors (genes), and each row is an observation (sample):
```{r}
wrangled <- data.frame(class = lund$molecular_subtype, t(exprs(lund)))
head(wrangled[1:5])
```
Finally, we specify a formula for fitting the model. This uses the `recipes` package from `tidymodels`. While `recipes` is a delightful package that can help you preprocess your data, it's beyond the scope of this vignette. Instead, just think of it as a way to specify a formula that keeps R from blowing up:
```{r}
# Note that the recipe requires 'template data'
recipe <- recipe(class ~ ., wrangled)
```
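If you're curious about what the recipe recorded, `summary()` lists each variable along with the role it was assigned - our single outcome (`class`) plus one predictor per gene:
```{r}
summary(recipe) |> head()
```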
We can bundle our model specification (`mod`) and our preprocessing steps (`recipe`, which is just a formula) into a `workflow`:
```{r}
wf <- workflow() |>
  add_recipe(recipe) |>
  add_model(mod)
wf
```
Now we can fit our model:
```{r}
tidymodels_fit <- fit(wf, data = wrangled)
head(extract_fit_parsnip(tidymodels_fit)$fit$centroids)
```
You'll notice that our results are the same as what we saw previously, demonstrating that while we're using `tidymodels` rather than base R, we're still doing the same thing.
## Measuring fit accuracy with cross-validation
Now that we've dialed in to the `tidymodels` framework, we can do a lot of elaborate things with ease. One of our concerns is whether 5 active genes was a good choice (`active = 5`). A relatively simple way to evaluate that choice is [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Cross-validation tests how good our fit is by training the model on, say, 80% of our data and testing it on the remainder (see the Wikipedia diagram of k-fold cross-validation). This gives us a measure of how good the fit is without having to break out our actual test data - which in general should only be used when we're ready to finalize our model.
Speaking of test data, let's go ahead and split that off now. We'll lock our test data away and only use it once we've fit our final model. Until then, we'll use cross-validation to assess how good the fit is, essentially using our training data as its own testing data.
Of course, `tidymodels` makes this easy too, by using `rsample::initial_split`:
```{r}
set.seed(123)
splits <- initial_split(wrangled, prop = 0.8, strata = class)
train <- training(splits)
test <- testing(splits)
```
`train` and `test` are just subsets of the original data, containing 80% and 20% of it, respectively. Because we set `strata = class`, `initial_split` also tries to maintain the relative proportions of each class within each subset:
```{r}
round(prop.table(table(train$class)), 2)
```
```{r}
round(prop.table(table(test$class)), 2)
```
Creating folds for cross-validation (with `rsample::vfold_cv`) looks much like `initial_split`:
```{r}
folds <- vfold_cv(train, v = 5, strata = class)
folds
```
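Each fold is itself a split: `rsample::analysis()` returns the portion used for fitting, and `rsample::assessment()` returns the held-out portion used to measure the fit:
```{r}
# The first fold: roughly 4/5 of `train` for fitting, 1/5 held out
dim(analysis(folds$splits[[1]]))
dim(assessment(folds$splits[[1]]))
```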
We can reuse our workflow `wf`, which contains our model and formula. The only differences are that we use `fit_resamples` and that we specify a metric to measure how good our fit is (remember that every fold has a chunk of data held out to test the fit). For simplicity, let's use accuracy:
```{r}
fits <- fit_resamples(
  wf,
  folds,
  metrics = metric_set(accuracy)
)
fits
```
We can then extract our accuracy metrics with `collect_metrics`, which roots around in each of our fits, helpfully extracts the metrics, aggregates them, and calculates the standard error:
```{r}
metrics <- collect_metrics(fits)
metrics
```
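If you'd rather see the individual per-fold values that went into that aggregate, `collect_metrics()` accepts `summarize = FALSE`:
```{r}
# One accuracy estimate per fold, before aggregation
collect_metrics(fits, summarize = FALSE)
```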
Our model has an accuracy of about `r round(metrics$mean*100)`%. Applying this model to our testing data:
```{r}
# Fit a model using *all* of our training data
final_fit <- clanc(class ~ ., train, active = 5)
# Use it to predict the (known) classes of our test data
preds <- predict(final_fit, new_data = test, type = "class")
w_preds <- cbind(preds, test)
# Compare known class vs predicted class
metric <- accuracy(w_preds, class, .pred_class)
metric
```
Note that our testing data accuracy (`r round(metric$.estimate*100)`%) approximates the accuracy estimated by cross-validation on the training data (`r round(metrics$mean*100)`%).
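Accuracy is a single summary number; if you want to see *which* subtypes get confused for which, `yardstick::conf_mat()` builds the full confusion matrix from the same predictions:
```{r}
# Rows are predicted classes, columns are true classes
conf_mat(w_preds, class, .pred_class)
```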
## Tuning hyperparameters with `tune`
Now we at least have *some* measure of how good our model fits, but could it be better with more genes? Could we get away with fewer? Running the same command over and over again with different numbers is a drag - fortunately, there's yet another beautiful package to help us: `tune`.
To use `tune`, we need to re-specify our model to let `tune` know what parameters we want to tune:
```{r}
tune_mod <- discrim_linear() |>
  set_engine(
    engine = "clanc",
    active = tune()
  )
```
We could update our previous workflow using `update_model`, but let's just declare a new one:
```{r}
tune_wf <- workflow() |>
  add_recipe(recipe) |>
  add_model(tune_mod)
```
We then have to specify a range of values of `active` to try:
```{r}
values <- data.frame(active = seq(from = 1, to = 50, by = 4))
values
```
We can then fit our `folds` using the spread of values we chose:
```{r, message = FALSE}
# This is going to take some time, since we're fitting 5 folds 13 times each.
tuned <- tune_grid(
  tune_wf,
  folds,
  metrics = metric_set(accuracy),
  grid = values
)
tuned
```
As before, we can collect our metrics - this time, however, we have a summary of metrics for each value of `active`:
```{r}
tuned_metrics <- collect_metrics(tuned)
tuned_metrics
```
Or graphically:
```{r}
ggplot(tuned_metrics, aes(active, mean)) +
  geom_line() +
  coord_cartesian(ylim = c(0, 1)) +
  labs(x = "Number of Active Genes", y = "Accuracy")
```
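The same conclusion is available numerically: `tune::show_best()` ranks the candidate values of `active` by mean accuracy:
```{r}
show_best(tuned, metric = "accuracy")
```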
It looks like we reach maximal accuracy at around 21 genes - let's choose 20 genes for a nice round number:
```{r}
final_fit_tuned <- clanc(class ~ ., data = train, active = 20)
# Use it to predict the (known) classes of our test data:
preds <- predict(final_fit_tuned, new_data = test, type = "class")
w_preds <- cbind(preds, test)
# Compare known class vs predicted class:
metric <- accuracy(w_preds, class, .pred_class)
metric
```
It looks like our accuracy is a little better now that we've chosen an optimal number of active genes.
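As an aside, this last step can also be done entirely within `tidymodels`. A minimal sketch, assuming we're happy to let `select_best()` pick the winning value of `active` rather than rounding to 20 by hand:
```{r, eval = FALSE}
best <- select_best(tuned, metric = "accuracy")
final_wf <- finalize_workflow(tune_wf, best)
final_fit_wf <- fit(final_wf, data = train)
```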
# Predicting
Now we want to apply our classifier to new data. Our second dataset is [RNAseq data from 30 bladder cancer cell lines](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE97768):
```{r, message = FALSE}
library(cellebrate)
cell_rna
```
Predicting is incredibly simple. Since this dataset comes from a different platform (RNAseq rather than the array-based expression profiling used to fit the model), it probably makes sense to use a correlation-based classification rather than the distance-based metric used in the original ClaNC package. We can do that by specifying `type = "numeric"` and then whatever correlation method we prefer.
```{r}
cell_preds <- predict(
  final_fit_tuned,
  cell_rna,
  assay = 2,
  type = "numeric",
  method = "spearman"
)
out <- cbind(colData(cell_rna), cell_preds) |>
  as_tibble()
out
```
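For comparison, the distance-based classification used in the original ClaNC is available by asking for class predictions instead. This is just a sketch - it assumes the `assay` argument behaves the same way when `type = "class"`:
```{r, eval = FALSE}
# Hypothetical comparison: hard class calls using the distance-based rule
predict(final_fit_tuned, cell_rna, assay = 2, type = "class")
```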
```{r, fig.width = 10, fig.height = 7}
plotting_data <- out |>
  pivot_longer(cols = starts_with(".pred"))
plotting_data |>
  ggplot(aes(cell, value, color = name)) +
  geom_point() +
  facet_grid(~clade, scales = "free_x", space = "free_x")
```
In the Sjödahl paper, the seven subtypes were simplified into five by merging pairs of subtypes that activated similar biological pathways. To ease interpretation, we can do that too:
```{r}
table <- plotting_data |>
  summarize(winner = name[which.max(value)], .by = c(cell, clade)) |>
  mutate(
    five = case_when(
      winner %in% c(".pred_MS1a", ".pred_MS1b") ~ "Urobasal A",
      winner %in% c(".pred_MS2a.1", ".pred_MS2a.2") ~ "Genomically unstable",
      winner == ".pred_MS2b.1" ~ "Infiltrated",
      winner == ".pred_MS2b2.1" ~ "Uro-B",
      winner == ".pred_MS2b2.2" ~ "SCC-like"
    )
  ) |>
  relocate(cell, five, clade)
print(table, n = 30)
```