Skip to contents

Introduction

This vignette provides users with an example analysis using the serocalculator package by reproducing the analysis for: Estimating typhoid incidence from community-based serosurveys: a multicohort study (Aiemjoy et al. (2022)). We review the methods underlying the analysis and then walk through an example of enteric fever incidence in Pakistan. Note that because this is a simplified version of the analysis, the results here will differ slightly from those presented in the publication.

In this example, users will determine the seroincidence of enteric fever in cross-sectional serosurveys conducted as part of the SeroEpidemiology and Environmental Surveillance (SEES) for enteric fever study in Bangladesh, Nepal, and Pakistan. Longitudinal antibody responses were modeled from 1420 blood culture-confirmed enteric fever cases enrolled from the same countries.

Methods

Load packages

The first step in conducting this analysis is to load our necessary packages. If you haven’t installed already, you will need to do so before loading:

install.packages("serocalculator")

See the Installation instructions for more details.

Once the serocalculator package has been installed, load it into your R environment using library(), along with any other packages you may need for data management; For this example, we will load tidyverse and forcats:

Load data

Pathogen-specific sample datasets, noise parameters, and longitudinal antibody dynamics for serocalculator are available on the Serocalculator Data Repository on Open Science Framework (OSF). We will pull this data directly into our R environment.

Note that each dataset has specific formatting and variable name requirements.

Load and prepare longitudinal parameter data

We will first load the longitudinal curve parameters to set the antibody decay parameters. In this example, these parameters were modeled with Bayesian hierarchical models to fit two-phase power-function decay models to the longitudinal antibody responses among confirmed enteric fever cases.

Formatting Specifications: Data should be imported as a “wide” dataframe with one column for each parameter and one row for each iteration of the posterior distribution for each antigen isotype. Column names must exactly match follow the naming conventions:

Column Name Description
y0 Baseline antibody concentration
y1 Peak antibody concentration
t1 Time to peak antibody concentration
alpha Antibody decay rate
r Antibody decay shape

Note that variable names are case-sensitive

# Import longitudinal antibody parameters from OSF
curves <-
  "https://osf.io/download/rtw5k/" %>%
  load_curve_params()

Visualize curve parameters

We can graph the decay curves with an autoplot() method:

# Visualize curve parameters
curves %>% filter(antigen_iso == "HlyE_IgA" |
                    antigen_iso == "HlyE_IgG") %>%
  autoplot()

Load and prepare cross-sectional data

Next, we load our sample cross-sectional data. We will use a subset of results from the SEES dataset. Ideally, this will be a representative sample of the general population without regard to disease status. Later, we will limit our analysis to cross-sectional data from Pakistan.

We have selected hemolysin E (HlyE) as our target antigen and IgG and IgA as our target immunoglobulin isotypes. Users may select different serologic markers depending on what is available in your data. From the original dataset, we rename our result and age variables to the names required by serocalculator.

Formatting Specifications: Cross-sectional population data should be a “long” dataframe with one column for each variable and one row for each antigen isotype resulted for an individual. So the same individual will have more than one row if they have results for more than one antigen isotype. The dataframe can have additional variables, but the two below are required:

Column Name Description
value Quantitative antibody response
age Numeric age

Note that variable names are case sensitive

#Import cross-sectional data from OSF and rename required variables
xs_data <- readr::read_rds("https://osf.io/download//n6cp3/") %>%
  as_pop_data()

Summarize antibody data

We can compute numerical summaries of our cross-sectional antibody data with a summary() method for pop_data objects:

xs_data %>% summary()
#> 
#> n = 3336 
#> 
#> Distribution of age: 
#> 
#> # A tibble: 1 × 7
#>       n   min first_quartile median  mean third_quartile   max
#>   <int> <dbl>          <dbl>  <dbl> <dbl>          <dbl> <dbl>
#> 1  3336   0.6              5     10  10.5             15    25
#> 
#> Distributions of antigen-isotype measurements:
#> 
#> # A tibble: 2 × 7
#>   antigen_iso   Min `1st Qu.` Median `3rd Qu.`   Max `# NAs`
#>   <fct>       <dbl>     <dbl>  <dbl>     <dbl> <dbl>   <int>
#> 1 HlyE_IgA        0     0.851   1.74      3.66  133.       0
#> 2 HlyE_IgG        0     1.15    2.70      6.74  219.       0

Visualize antibody data

We examine our cross-sectional antibody data by visualizing the distribution of quantitative antibody responses. Here, we will look at the distribution of our selected antigen and isotype pairs, HlyE IgA and HlyE IgG, across participating countries.


#color palette
country_pal <- c("#EA6552", "#8F4B86", "#0099B4FF")

xs_data %>% autoplot(strata = "Country", type = "density") +
  scale_fill_manual(values = country_pal)

We see that across countries, our data is highly skewed with the majority of responses on the lower end of our data with long tails. Let’s get a better look at the distribution by log transforming our antibody response value.

# Create log transformed plots

xs_data %>%
  mutate(Country = fct_relevel(Country, "Bangladesh", "Pakistan", "Nepal")) %>%
  autoplot(strata = "Country", type = "density") +
  scale_fill_manual(values = country_pal) +
  scale_x_log10(labels = scales::label_comma())
#> Warning in scale_x_log10(labels = scales::label_comma()): log-10
#> transformation introduced infinite values.
#> Warning: Removed 18 rows containing non-finite outside the scale range
#> (`stat_density()`).

Once log transformed, our data looks much more normally distributed. In most cases, log transformation will be the best way to visualize serologic data.

Let’s also take a look at how antibody responses change by age.


#Plot antibody responses by age
ggplot(data = xs_data %>%
  mutate(Country = fct_relevel(
    Country, "Bangladesh", "Pakistan", "Nepal"
  )),
aes(x = age, y = value, color = Country)) +
  geom_point(size = 0.6, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_log10() +
  theme_linedraw() +
  labs(title = "Quantitative Antibody Results by Age", x = "Age", y = "Value") +
  scale_color_manual(values = country_pal)

In this plot, a steeper slope indicates a higher incidence. We can see that the highest burden is in Bangladesh. Nepal has a slightly higher incidence in the older group (higher slope).

Load noise parameters

Next, we must set conditions based on some assumptions about the data and errors that may need to be accounted for. This will differ based on background knowledge of the data.

The biological noise, ν\nu (“nu”), represents error from cross-reactivity to other antibodies. It is defined as the 95th percentile of the distribution of antibody responses to the antigen-isotype in a population with no exposure.

Measurement noise, ε\varepsilon (“epsilon”), represents measurement error from the laboratory testing process. It is defined by a CV (coefficient of variation) as the ratio of the standard deviation to the mean for replicates. Note that the CV should ideally be measured across plates rather than within the same plate.

Formatting Specifications: Noise parameter data should be a dataframe with one row for each antigen isotype and columns for each noise parameter below.

Column Name Description
y.low Lower limit of detection of the assay
nu Biologic noise
y.high Upper limit of detection of the assay
eps Measurement noise

Note that variable names are case-sensitive.

# Import noise parameters from OSF

noise <- url("https://osf.io/download//hqy4v/") %>% readRDS()

Estimate Seroincidence

Now we are ready to begin estimating seroincidence. We will conduct two separate analyses using two distinct functions, est.incidence and est.incidence.by, to calculate the overall seroincidence and the stratified seroincidence, respectively.

Overall Seroincidence

Using the function est.incidence, we filter to sites in Pakistan and define the datasets for our cross-sectional data (pop_data), longitudinal parameters (curve_param), and noise parameters (noise_param). We also define the antigen-isotype pairs to be included in the estimate (antigen_isos). Here, we have chosen to use two antigen isotypes, but users can add additional pairs if available.

# Using est.incidence (no strata)

est1 <- est.incidence(
  pop_data = xs_data %>% filter(Country == "Pakistan"),
  curve_param = curves,
  noise_param = noise %>% filter(Country == "Pakistan"),
  antigen_isos = c("HlyE_IgG", "HlyE_IgA")
)

summary(est1)
#> # A tibble: 1 × 10
#>   est.start incidence.rate      SE CI.lwr CI.upr coverage log.lik iterations
#>       <dbl>          <dbl>   <dbl>  <dbl>  <dbl>    <dbl>   <dbl>      <int>
#> 1       0.1          0.128 0.00682  0.115  0.142     0.95  -2376.          4
#> # ℹ 2 more variables: antigen.isos <chr>, nlm.convergence.code <ord>

Stratified Seroincidence

We can also produce stratified seroincidence estimates. Users can select one or more stratification variables in their cross-sectional population dataset. Let’s compare estimates across all countries and by age group.

#Using est.incidence.by (strata)

est_country_age <- est.incidence.by(
  strata = c("Country", "ageCat"),
  pop_data = xs_data,
  curve_params = curves,
  curve_strata_varnames = NULL,
  noise_params = noise,
  noise_strata_varnames = "Country",
  antigen_isos = c("HlyE_IgG", "HlyE_IgA"),
  num_cores = 8 # Allow for parallel processing to decrease run time
)
#> Warning in count_strata(., strata_varnames): The number of observations in `data` varies between antigen isotypes, for at least one stratum. Sample size for each stratum will be calculated as the minimum number of observations across all antigen isotypes.
#> Warning in check_parallel_cores(.): This computer appears to have 4 cores
#> available. `est.incidence.by()` has reduced its `num_cores` argument to 3 to
#> avoid destabilizing the computer.

summary(est_country_age)
#> Seroincidence estimated given the following setup:
#> a) Antigen isotypes   : HlyE_IgG, HlyE_IgA 
#> b) Strata       : Country, ageCat 
#> 
#>  Seroincidence estimates:
#> # A tibble: 9 × 14
#>   Stratum   Country  ageCat     n est.start incidence.rate      SE CI.lwr CI.upr
#>   <chr>     <chr>    <fct>  <int>     <dbl>          <dbl>   <dbl>  <dbl>  <dbl>
#> 1 Stratum 1 Banglad… <5       101       0.1         0.400  0.0395  0.330  0.485 
#> 2 Stratum 2 Banglad… 5-15     256       0.1         0.477  0.0320  0.418  0.544 
#> 3 Stratum 3 Banglad… 16+       44       0.1         0.449  0.0763  0.322  0.627 
#> 4 Stratum 4 Nepal    <5       171       0.1         0.0203 0.00444 0.0132 0.0311
#> 5 Stratum 5 Nepal    5-15     378       0.1         0.0355 0.00311 0.0299 0.0421
#> 6 Stratum 6 Nepal    16+      211       0.1         0.0935 0.00776 0.0795 0.110 
#> 7 Stratum 7 Pakistan <5       126       0.1         0.106  0.0136  0.0823 0.136 
#> 8 Stratum 8 Pakistan 5-15     261       0.1         0.115  0.00845 0.0991 0.132 
#> 9 Stratum 9 Pakistan 16+      107       0.1         0.190  0.0204  0.154  0.235 
#> # ℹ 5 more variables: coverage <dbl>, log.lik <dbl>, iterations <int>,
#> #   antigen.isos <chr>, nlm.convergence.code <ord>

Note that we get a warning about uneven observations between antigen isotypes, meaning some participants do not have results for both HlyE IgA and HlyE IgG. The warning indicates that the “Sample size for each stratum will be calculated as the minimum number of observations across all antigen isotypes”, so only participants with both antigen isotypes are included. To avoid this, filter the dataset to include only records with all specified antigen isotypes.

We set curve_strata_varnames = NULL to avoid stratification in the “curves” dataset because it does not include our strata variables. Without this, a warning appears: “curve_params is missing all strata variables, and will be used unstratified”. To stratify based on variables that exist in a longitudinal curve parameters dataset, specify variables using curve_strata_varnames, similar to how noise_strata_varnames is used for “noise” above.

Finally, let’s visualize our seroincidence estimates by country and age category.

# Plot seroincidence estimates

## Save summary(est_country_age) as a dataframe and sort by incidence rate
est_country_agedf <- summary(est_country_age) %>%
  mutate(
    Country = fct_relevel(Country, "Bangladesh", "Pakistan", "Nepal"),
    ageCat = factor(ageCat)
  )

## Create plot by country and age category
ggplot(est_country_agedf,
       aes(
         y = fct_rev(ageCat),
         x = incidence.rate * 1000, #rescale incidence
         fill = Country
       )) +
  geom_bar(stat = "identity",
           position = position_dodge(),
           show.legend = TRUE) +
  geom_errorbar(
    aes(xmin = CI.lwr * 1000, xmax = CI.upr * 1000), #rescale CIs
    position = position_dodge(width = 0.9),
    width = .2
  ) +
  labs(title = "Enteric Fever Seroincidence by Country and Age",
       x = "Seroincidence rate per 1000 person-years",
       y = "Age Category",
       fill = "Country") +
  theme_linedraw() +
  theme(axis.text.y = element_text(size = 11),
        axis.text.x = element_text(size = 11)) +
  scale_x_continuous(expand = c(0, 10)) +
  scale_fill_manual(values = country_pal)

Conclusions

We find that Bangladesh has the highest overall seroincidence of enteric fever with a rate of 449 per 1000 person-years, as well as the highest seroincidence by age category. In comparison, Nepal has a seroincidence rate over 1 times lower than that of Bangladesh (400 per 1000 person-years) and the lowest age-specific seroincidence rates of the three countries in the study. serocalculator provides an efficient tool to conduct this analysis and produce actionable results.

Acknowledgments

We gratefully acknowledge the study participants for their valuable time and interest in participating in these studies. Special thanks to our collaborators at Sabin Vaccine Institute, Aga Khan University (Karachi, Pakistan), Child Health Research Foundation (Dhaka, Bangladesh), and Dhulikhel Hospital, Kathmandu University Hospital (Dhulikhel, Nepal).

Funding

This project was supported by grants from the National Institutes of Health (NIH) National Institute of Allergy and Infectious Disease (R21AI176416), the NIH Fogarty International Center (K01TW012177) and the Bill and Melinda Gates Foundation.

References

Aiemjoy, K., Seidman J. C., Saha S., Munira S. J., Islam Sajib M. S., and Sarkar Sium S. M. al. 2022. “Estimating Typhoid Incidence from Community-Based Serosurveys: A Multicohort Study.” The Lancet Microbe 3 (8): e578–87. https://doi.org/10.1016/S2666-5247(22)00114-8.