Interpreting RCTD weights

Author

Istvan Kleijn

Published

February 26, 2024

In this document, we run spacexr’s RCTD algorithm on simple synthetic data to infer that the weights matrix should be interpreted as the proportion of RNA molecules originating from each cell type in each spot, rather than the fraction of cells assigned to each cell type.

library(spacexr)

Create mock cell types

As the first step, we create two mock cell types with corresponding “gene” expression. RCTD requires at least 10 differentially expressed genes. I choose to have five genes specific to each of two cell types A and B, and five constitutively expressed genes that have equal counts across each cell. Crucially, we choose cell type B to have a higher total expression than cell type A, because its markers are more highly expressed.

typeA <- c(
  "a1" = 1,
  "a2" = 2,
  "a3" = 3,
  "a4" = 4,
  "a5" = 5,
  "b1" = 0,
  "b2" = 0,
  "b3" = 0,
  "b4" = 0,
  "b5" = 0,
  "c1" = 1,
  "c2" = 2,
  "c3" = 3,
  "c4" = 4,
  "c5" = 5
)
typeB <- c(
  "a1" = 0,
  "a2" = 0,
  "a3" = 0,
  "a4" = 0,
  "a5" = 0,
  "b1" = 10,
  "b2" = 20,
  "b3" = 30,
  "b4" = 40,
  "b5" = 50,
  "c1" = 1,
  "c2" = 2,
  "c3" = 3,
  "c4" = 4,
  "c5" = 5
)

Create RCTD objects

First we create the RCTD reference. At some point down the line, RCTD errors with only one cell per cell type in the reference, so the reference contains two identical cells of both cell types.

reference_counts <- data.frame(
  cell1 = typeA, cell2 = typeA, cell3 = typeB, cell4 = typeB
) |>
  as.matrix()
reference_types <- factor(
  c(
    "cell1" = "typeA", "cell2" = "typeA",
    "cell3" = "typeB", "cell4" = "typeB"
  ),
  levels = c("typeA", "typeB")
)
reference_nUMI <- colSums(reference_counts)
reference <- spacexr::Reference(
  reference_counts,
  reference_types,
  reference_nUMI,
  min_UMI = 10
)

Next, we mock up some spatial data. We let spot 1 contain 1 cell of type A and 1 cell of type B. Spot 2 contains 1 cell A and 3 cells of type B, spot 3 contains 3 cells of type A and one of type B, and spot 4 contains 3 cells of each type.

spatial_coords <- tibble::tribble(
  ~spot, ~x, ~y,
  "spot1", 0, 0,
  "spot2", 1, 0,
  "spot3", 0, 1,
  "spot4", 1, 1
) |>
  tibble::column_to_rownames("spot")
spatial_counts <- data.frame(
  spot1 = typeA + typeB,
  spot2 = typeA + 3 * typeB,
  spot3 = 3 * typeA + typeB,
  spot4 = 3 * typeA + 3 * typeB
) |>
  as.matrix()
nUMI <- colSums(spatial_counts)
puck <- SpatialRNA(spatial_coords, spatial_counts, nUMI)

Finally, we create the RCTD object itself from the mock reference and spatial data. We have to provide a few parameters to enable the small size of the mock data.

myRCTD <- create.RCTD(
  puck, reference,
  max_cores = 1,
  counts_MIN = 1,
  CELL_MIN_INSTANCE = 1
  )
Begin: process_cell_type_info
process_cell_type_info: number of cells in reference: 4
process_cell_type_info: number of genes in reference: 15

typeA typeB 
    2     2 
End: process_cell_type_info
create.RCTD: getting regression differentially expressed genes: 
get_de_genes: typeA found DE genes: 10
get_de_genes: typeB found DE genes: 5
get_de_genes: total DE genes: 15
create.RCTD: getting platform effect normalization differentially expressed genes: 
get_de_genes: typeA found DE genes: 10
get_de_genes: typeB found DE genes: 5
get_de_genes: total DE genes: 15

Note that it finds 10 marker genes for typeA, which means that the constitutively expressed genes are included. Indeed, their relative abundance is higher in cell type A than in cell type B.

Run RCTD to find cell type fractions

Now we finally run the RCTD algorithm to get cell type proportions. I always use full mode on my own data so that is what I do here. For this mock data using doublet mode would be appropriate (it has the same interpretation).

myRCTD <- run.RCTD(myRCTD, doublet_mode = "full")
fitBulk: decomposing bulk
chooseSigma: using initial Q_mat with sigma =  1
Likelihood value: 113.745139064746
Sigma value:  0.84
Likelihood value: 109.153842621005
Sigma value:  0.69
Likelihood value: 104.184989239904
Sigma value:  0.61
Likelihood value: 101.205842339197
Sigma value:  0.53
Likelihood value: 97.9501354713587
Sigma value:  0.45
Likelihood value: 94.3725160205627
Sigma value:  0.37
Likelihood value: 90.4213192913567
Sigma value:  0.29
Likelihood value: 86.0445136616334
Sigma value:  0.21

Interpretation

The moment of truth:

myRCTD@results$weights
4 x 2 sparse Matrix of class "dgCMatrix"
           typeA     typeB
spot1 0.15105219 0.8358883
spot2 0.05609608 0.9349949
spot3 0.34829858 0.6393917
spot4 0.15169237 0.8395874

Clearly, these weights do not correspond to the fraction of cells in a spot that are of each cell type. If so, the matrix would have been equal to

fractions <- tibble::tribble(
  ~spot, ~typeA, ~typeB,
  "spot1", 0.5, 0.5,
  "spot2", 0.25, 0.75,
  "spot3", 0.75, 0.25,
  "spot4", 0.5, 0.5
) |>
  tibble::column_to_rownames("spot") |>
  as.matrix()
fractions
      typeA typeB
spot1  0.50  0.50
spot2  0.25  0.75
spot3  0.75  0.25
spot4  0.50  0.50

Instead, the weights represent the proportion of RNA molecules in a spot that originated in cells of each type.

molsA <- sum(typeA)
molsB <- sum(typeB)
proportions <- matrix(
  c(
    c(molsA, molsB)/(molsA + molsB),
    c(molsA, 3*molsB)/(molsA + 3*molsB), 
    c(3*molsA, molsB)/(3*molsA + molsB),
    c(3*molsA, 3*molsB)/(3*molsA + 3*molsB)
  ),
  ncol = 2,
  byrow = TRUE,
  dimnames = list(
    c("spot1", "spot2", "spot3", "spot4"),
    c("typeA", "typeB")
  )
)
proportions
           typeA     typeB
spot1 0.15384615 0.8461538
spot2 0.05714286 0.9428571
spot3 0.35294118 0.6470588
spot4 0.15384615 0.8461538

As a final note, the weights were very close to being normalised already, but for completion’s sake:

normalize_weights(myRCTD@results$weights)
4 x 2 Matrix of class "dgeMatrix"
           typeA     typeB
spot1 0.15305096 0.8469490
spot2 0.05660033 0.9433997
spot3 0.35263948 0.6473605
spot4 0.15302681 0.8469732
sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United Kingdom.utf8 
[2] LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] spacexr_2.2.1

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       doParallel_1.0.17 cli_3.6.2         knitr_1.45       
 [5] rlang_1.1.3       xfun_0.41         renv_1.0.3        jsonlite_1.8.8   
 [9] glue_1.7.0        htmltools_0.5.7   fansi_1.0.6       rmarkdown_2.25   
[13] quadprog_1.5-8    grid_4.3.2        evaluate_0.23     tibble_3.2.1     
[17] fastmap_1.1.1     yaml_2.3.8        foreach_1.5.2     lifecycle_1.0.4  
[21] compiler_4.3.2    codetools_0.2-19  pkgconfig_2.0.3   lattice_0.21-9   
[25] digest_0.6.34     utf8_1.2.4        pillar_1.9.0      parallel_4.3.2   
[29] magrittr_2.0.3    Matrix_1.6-1.1    tools_4.3.2       iterators_1.0.14