KRSA Vignette • KRSA

library(KRSA)
library(knitr)
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
#> ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
#> ✔ tibble  3.1.8      ✔ dplyr   1.0.10
#> ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
#> ✔ readr   2.1.3      ✔ forcats 0.5.2 
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

library(gt) # can be used to view tables
library(furrr) # can be used for parallel computing 
#> Loading required package: future

Introduction

Background

The Pamstation12 instrument provides a profiling of kinase activity of cell or tissue samples. The device is loaded with either serine/threonine or tyrosine microarray chips. Each chip has 4 wells so four samples can be loaded on a single chip, and the Pamstation12 can accommodate 3 chips per run. The microarray represents 144 (STK chip) or 196 (PTK chip) reporter peptides that can be phosphorylated by serine/threonine or tyrosine kinases. The device measures the degree of the phosphorylation in real time by detecting fluorescently labeled antibodies at different exposure times. The list of peptides present in each microarray can be viewed here: STK chip, PTK chip

knitr::include_graphics("images/pamgene_workflow.png")

knitr::include_graphics("images/pamgene_detectionFig.png")

Run Design

Designing the placement of the samples on the chips and arrays is important to consider due to the variability across different chips and batches. During the run some wells are subject to fail and their data cannot be analyzed and shown below as red.

tt <- tibble(
  num = rep(c(3.5,2.5,1.5,0.5),3),
  chips = rep(c(2,5,8),each=4),
  colr = c(rep(1,12)),
  nms = rep(c("A", "B", "C", "D"), 3)
)
rundes <- ggplot() +
  geom_rect(aes(xmin=1, xmax=3, ymin=0,ymax=4), fill="gray30") +
  geom_rect(aes(xmin=4, xmax=6, ymin=0,ymax=4), fill="gray30") +
  geom_rect(aes(xmin=7, xmax=9, ymin=0,ymax=4), fill="gray30") +
  xlim(0,10) + 
  theme_void() 

rundes + geom_point(data = tt, aes(chips, num,color=factor(colr)), size = 13, show.legend = F) +
  geom_text(data = tt, aes(chips,num,label = nms), size = 2) +
  scale_color_manual(values=c("green3", "red3"))

Results

Image Analysis

The first step of analyzing the run is to convert the images taken by the PamStation of each array at different exposure times to numerical values This is done by the Bionavigator software developed by Pamgene. The software recognizes the grid of the array with the aid of the searching algorithm (Pamgrid) to correctly identify each spot on the array. The numbers produced by this software represent the median value of the foreground pixels minus the median value of the background pixels to produce the median signal minus background (Median_SigmBg).

Reading Data

The first step will be reading the crosstab view bionavigator files (Median_SigmBg and Signal_Saturation) and defining the PamChip type (STK or PTK). The raw data is read and then transformed to be in tidy format for an easier analysis, modeling, and visualizing.


# Define chip type
chipType <- "STK"

# Read crosstab view bionavigator files and tidy data
data <- krsa_read("data_files/example_Median_SigmBg.txt", "data_files/example_Signal_Saturation.txt")

QC Initial Steps and Groups Assignments

We will perform a couple of quality control steps to deal with negative values in the data and adjust based on signal saturation (optional). Next, we will define a new column to represent the grouping. And then, we will extract end point signal values

# qc processing
krsa_qc_steps(data) -> data

# Define sample groups
data %>% mutate(Group = SampleName, SampleName = paste0(SampleName,"_" ,Barcode)) -> data

# extract end level signal values @ all max exposure time (200ms)
krsa_extractEndPointMaxExp(data, chipType) -> data_pw_200

# extract end level signal values end level @ all exposure times
krsa_extractEndPoint(data, chipType) -> data_pw

QC Steps and Model Fitting

We will filter out peptides with low signals. In order to combine the values from different exposure times into a single value, a simple linear regression model of the Medain_SigmBg as a function of exposure time is fitted. The slope of of the model fit and \(R^2\) are then used for quality control and samples comparison. The slope is also scaled by multiplying by 100 and log2 transformed (Slope_Transformed). We then filter out peptides with poor linear fit and references peptides.


# Filter out peptides with low signals
krsa_filter_lowPeps(data_pw_200, 5) -> ppPassAll
#> Filtered out 79 Peptides

# Fit the linear model
# This will produce a list of data frames:
# scaled: the Slope_Transformed values (see above for more info on Slope_Transformed)
# normalized: the Slope_Transformed values but normalized by Chip/Barcode
# Grouped: The mean of Slope_Transformed values for all samples within a group
krsa_scaleModel(data_pw, ppPassAll) -> data_modeled
  
# Filter out peptides weak linear fit
krsa_filter_nonLinear(data_modeled$scaled, 0.8) -> ppPassR2
#> Filtered out 0 Peptides

# Filter out reference peptides
krsa_filter_ref_pep(ppPassR2) -> new_pep
#> Filtered out 10 Peptides

Global Signal Intensity

For a global signal intensity across all samples/groups, few figures can be plotted based on the Slope_Transformed values.

Global CV Plots

We will plot the coefficient of variation on both the normal and normalized fits. This will help us to identify groups with high variation that could be explained by sample outliers.


# Plot a CV figure using the modeled scaled data
krsa_cv_plot(data_modeled$scaled, new_pep)
#> `geom_smooth()` using formula 'y ~ x'


# Plot a CV figure using the modeled normalized data
krsa_cv_plot(data_modeled$normalized, new_pep)
#> `geom_smooth()` using formula 'y ~ x'

Global Violin Plots

We will plot violin figures to examine global signal differences between groups/samples.

# Plot a violin figure and facet by the (Group) variable
krsa_violin_plot(data_modeled$scaled, new_pep, "Group")


# Plot a violin figure and facet by the (Barcode) variable
krsa_violin_plot(data_modeled$scaled, new_pep, "Barcode")

Global Heatmaps

The heatmap represent all the peptides present on the chip except the positive/internal controls and peptides that failed to pass QC. The heatmaps are scaled by row to highlight the peptide signal differences across the samples. A hierarchical unsupervised clustering is applied both on the peptides and the samples to potentially group similar signatures.


# Generates a heatmap using the modeled scaled data
krsa_heatmap(data_modeled$scaled, new_pep, scale = "row")


# Generates a heatmap using the modeled normalized data
krsa_heatmap(data_modeled$normalized, new_pep, scale = "row")


# Generates a heatmap using the modeled grouped data
krsa_heatmap_grouped(data_modeled$grouped, new_pep, scale = "row")

Group Comparison

To compare between samples, a two-group comparison is performed. In this case, the two group comparisons are:

B vs A

The Slope_Transforemed ratio between each group, paired by chip, is calculated to the fold change. Based on the fold change, peptides that pass a certain fold change threshold are considered significant hits. Also, quality control steps applied in each comparison to filter out peptides that do not reach specific criteria:

The Medain_SigmBg at max exposure 200ms must be above a certain value
\(R^2\) of the linear model fit must be above a threshold value

These Filtering Parameters (fold change threshold, QC criteria) can be modified to adjust the stringency of the analysis. The Filtering Parameters that are used for this analysis:

The Medain_SigmBg at max exposure 200ms must be equal or above 5
\(R^2\) of the linear model fit must be above or equal 0.8
Log fold change (LFC) cutoffs at (0.2,0.3)


# Define Groups to be compared
comparisons <- list(Comp1 = c("B", "A"))

# This function will run both QC steps (krsa_filter_lowPeps, krsa_filter_nonLinear) and krsa_filter_ref_pep
krsa_quick_filter(data = data_pw_200, data2 = data_modeled$scaled,
    signal_threshold = 5, r2_threshold = 0.9,
    groups = comparisons$Comp1) -> pep_passed_qc
#> Filtered out 65 Peptides
#> Filtered out 1 Peptides
#> Filtered out 10 Peptides

# This function calculates log2 fold change values between the defined groups
# The byChip argument lets you calculates the log2 fold change the results within each chip
krsa_group_diff(data_modeled$scaled, comparisons$Comp1, pep_passed_qc, byChip = T) -> diff_df

# save LFC table
#write_delim(diff_df, "comp1_LFC_df.txt", delim = "\t")

# Extract top peptides based on the LFC cutoff using average of LFCs across chips
krsa_get_diff(diff_df %>% mutate(LFC_SD = 0),LFC ,c(0.2,0.3,0.4)) %>% list("meanLFC" = .) -> sigPeps

# Extract top peptides based on the LFC cutoff using average of LFCs across chips with a standard deviation cutoff (sd_thr). in this example the sd_thr is set as 0.5 but that could be adjusted by user (uncomment next line to account for LFC sd)
# krsa_get_diff(diff_df,totalMeanLFC ,c(0.2,0.3), sd_thr = 0.5) %>% list("meanLFC" = .) -> sigPeps

# Extract top peptides based on the LFC cutoff using per chip LFCs
krsa_get_diff_byChip(diff_df, LFC, c(0.2,0.3)) -> sigPepsPerChip

# Combine the peptides hits in one list
sigPeps_total <- list(sigPeps, sigPepsPerChip) %>% unlist(recursive = F) %>%  unlist(recursive = F)

B vs A (Heatmap)

After applying the Filtering Parameters for this group comparison, only 56/141 peptides carried forward in the analysis (i.e. 56 hits). Below are some figures to visualize the differences between these samples for considering these hits.


# generates a heatmap using the selected groups and peptides
krsa_heatmap(data_modeled$normalized, sigPeps$meanLFC$`0.2`, groups = comparisons$Comp1, scale = "row")

B vs A (Violin Plot)

# generates a violin plot using the selected groups and peptides
krsa_violin_plot(data_modeled$scaled, sigPeps$meanLFC$`0.2`, "Barcode", groups = comparisons$Comp1)


# generate a grouped violin/boxplot plot using the selected groups and peptides with more options, like a statistical test. check krsa_violin_plot_grouped() arguments for more options
krsa_violin_plot_grouped(data_modeled$scaled, sigPeps$meanLFC$`0.2`, comparisons, 
                         dots = F,
                         groups = comparisons$Comp1, avg_line = T)



krsa_violin_plot_grouped(data_modeled$grouped, sigPeps$meanLFC$`0.2`, comparisons, 
                         test = F, 
                         groups = comparisons$Comp1, avg_line = T)

B vs A (Waterfall Plot)

This waterfall represents the log2 fold changes between the two groups at each peptide.

# generates a waterfall of the log2 fold change values for the selected peptide (top peptides)
krsa_waterfall(diff_df, lfc_thr = 0.2, byChip = T)


# generates a waterfall of the log2 fold change values for the selected peptide (top peptides) with a standard deviation cutoff (sd_thr). This is just for the waterfall not the actual upstream kinase analysis (that must be set with the krsa_get_diff function)
krsa_waterfall(diff_df, lfc_thr = 0.2, byChip = T, sd_thr = 0.5)

B vs A (Curve Plot)

We generate curves plots per peptide using the last cycle data to visualize the linear fit model

# generates a curve plot representing the linear model fit for the selected peptide (top peptides, only showing 10 here)
krsa_curve_plot(data_pw, sigPeps$meanLFC$`0.2`[1:10], groups = comparisons$Comp1)

B vs A (Upstream Kinase Analysis)

The lab carefully curated and mapped the kinases that can act and phosphorylate each peptide present on the chip. This was achieved by using multiple sources including GPS 3.0, Kinexus Phosphonet, PhosphoELM and PhosphoSite Plus. Based on that association between peptides and kinases, a random sampling analysis is performed for these hits. The basic idea of KRSA is: For each iteration (2000 iterations performed in this analysis), the same number of hits are randomly selected from the total 141/or 193 peptides present on the chip. Predicted kinases are then mapped to this sample list of peptides and number of kinases are determined. The kinase count from the actual hits and random sampling is then compared to determine the significance.



# load in chip coverage and kinase-substrate files OR upload your own files
# if PTK chip, use:
# chipCov <- KRSA_coverage_PTK_PamChip_86402_v1
# KRSA_file <- KRSA_Mapping_PTK_PamChip_86402_v1

# STK chip
chipCov <- KRSA_coverage_STK_PamChip_87102_v1
KRSA_file <- KRSA_Mapping_STK_PamChip_87102_v1



# run the KRSA function to do the random sampling analysis, set seed that can be used later to reproduce results, and choose number of iterations
krsa(sigPeps$meanLFC$`0.2`, return_count = T, seed = 123, itr = 2000,
     map_file = KRSA_file, cov_file = chipCov) -> fin

# View the Z score table
kable(head(fin$KRSA_Table,25), digits = 3)

Kinase	Observed	SamplingAvg	SD	Z
DYRK	13	21.376	2.877	-2.912
NMO	0	4.376	1.550	-2.822
DMPK	48	40.877	2.650	2.688
JNK	12	19.186	2.750	-2.613
PLK	11	18.206	2.762	-2.608
PIM	36	28.273	2.985	2.589
PKCH	11	6.436	1.839	2.481
AKT	22	15.884	2.616	2.338
RIPK	9	15.052	2.603	-2.326
CDK	34	39.757	2.621	-2.197
ERK	17	23.363	2.912	-2.186
MAPKAPK	25	19.443	2.726	2.039
P38	15	20.689	2.853	-1.994
RAD53	18	13.145	2.468	1.967
STE11	1	3.928	1.504	-1.947
STE7	15	20.259	2.803	-1.876
PKN	2	0.803	0.695	1.721
RSK	38	32.988	2.985	1.679
PAKB	2	0.821	0.704	1.674
AMPK	13	9.453	2.186	1.623
PAKA	17	13.136	2.490	1.552
IRAK	3	5.556	1.690	-1.512
FRAY	11	7.939	2.036	1.504
PDHK	29	24.715	2.909	1.473
PKA	28	23.789	2.967	1.419


# to save file
#fin$KRSA_Table %>% write_delim("acrossChip_KRSA_FullTable_comp1.txt", delim = "\t")

# find top and bottom kinases
bothways <- c(pull(head(fin$KRSA_Table, 10), Kinase), pull(tail(fin$KRSA_Table, 10), Kinase))

# Use these kinase to generate histogram plots for each selected kinase
krsa_histogram_plot(fin$KRSA_Table, fin$count_mtx, bothways)



# For parallel computing, load the furrr package:
# opens multiple R sessions to run faster
plan(multisession)

# Run the KRSA function across the different sets of peptides using the furrr package for parallel computing
future_map(sigPeps_total, krsa) -> mutiple_krsa_outputs


# For none parallel computing:
# Run KRSA function across the different sets of peptides
#map(sigPeps_total, krsa) -> mutiple_krsa_outputs

# Tidy output
df <- data.frame(matrix(unlist(mutiple_krsa_outputs), ncol = max(lengths(mutiple_krsa_outputs)), byrow = TRUE))
df <- setNames(do.call(rbind.data.frame, mutiple_krsa_outputs), names(mutiple_krsa_outputs$meanLFC.0.2))

df %>% rownames_to_column("method") %>% select(Kinase, Z, method) %>% 
  mutate(method = str_extract(method, "\\w+\\.\\w+\\.\\w+")) %>% 
  mutate(method = gsub("(^\\w+)[\\.]", "\\1>", method)) %>% 
  mutate_if(is.numeric, round, 2) -> df

df %>% 
  pivot_wider(names_from = method, values_from = Z) -> df2
  
# Generates a table of the Z scores across the different sets of peptides
# df2 %>% 
#   gt() %>% tab_spanner_delim(delim = ">")

# Creates an average Z score table using the across chip analysis
df %>% 
  filter(grepl("mean", method)) %>% 
  select(Kinase, Z, method) %>% group_by(Kinase) %>% mutate(AvgZ = mean(Z)) -> AvgZTable

# save file
#AvgZTable %>% write_delim("acrossChip_KRSA_Table_comp1.txt", delim = "\t")

# Creates an average Z score table using the within chip analysis
df %>% 
  filter(!grepl("mean", method)) %>% 
  select(Kinase, Z, method) %>% group_by(Kinase) %>% mutate(AvgZ = mean(Z)) -> AvgZTable2

# save file
#AvgZTable %>% write_delim("withinChip_KRSA_Table_comp1.txt", delim = "\t")

# Extract top kinases based on abs(Z) score
krsa_top_hits(AvgZTable2, 2) -> kinases_hits

#krsa_top_hits(AvgZTable2, 1.75)
#krsa_top_hits(AvgZTable2, 1.5)

# Show the number of peptides per each set in atable
krsa_show_peptides(sigPeps_total)
#> # A tibble: 9 × 2
#>   Method        NumberOfPeptides
#>   <chr>                    <int>
#> 1 meanLFC.0.2                 56
#> 2 meanLFC.0.3                 53
#> 3 meanLFC.0.4                 50
#> 4 710317402.0.2               43
#> 5 710317402.0.3               39
#> 6 710317404.0.2               51
#> 7 710317404.0.3               46
#> 8 710317405.0.2               49
#> 9 710317405.0.3               45

B vs A (Z Scores Plot)

We will plot the individual and averaged Z scores using both the across and within chip analyses.


# Generates Z scores waterfall plots
krsa_zscores_plot(AvgZTable)

krsa_zscores_plot(AvgZTable2)

B vs A (Reverse KRSA Plot)

We will use the reverse KRSA plot function, to plot the log2 fold chnage values for all peptides mapped to kinase hits. This will help us examine the activity of the kinase


# plot the reverse KRSA figure for top kinases to determine their activity levels
krsa_reverse_krsa_plot(chipCov, diff_df, kinases_hits, 0.2, byChip = F)

B vs A (Coverage Plot)

To view the coverage of kinases across the full list of peptides on the chip, we will use the coverage plot function


# generates a kinase covergae plot 
krsa_coverage_plot(chipCov, AvgZTable2, chipType)

B vs A (Ball Model Network)

We will view the ball model network function, to generate a model representing the protein-protein interactions between kinases


# Plot the network ball model
krsa_ball_model(kinases_hits, AvgZTable2, 10, 2.5, 4.8)

Session Info

installed.packages()[names(sessionInfo()$otherPkgs), "Version"]
#>     furrr    future        gt   forcats   stringr     dplyr     purrr     readr 
#>   "0.3.1"  "1.28.0"   "0.7.0"   "0.5.2"   "1.4.1"  "1.0.10"   "0.3.5"   "2.1.3" 
#>     tidyr    tibble   ggplot2 tidyverse     knitr      KRSA 
#>   "1.2.1"   "3.1.8"   "3.3.6"   "1.3.2"    "1.40"   "1.0.0"