Uncertainty estimation

This vignette shows how to estimate the uncertainty of the efflux and production estimates using bootstrap_error(). We will first generate a dataset of ‘measurement uncertainty’ in the input parameters and then use bootstrap_error() to estimate the resulting uncertainty in the models.

library(ConFluxPro)
#> 
#> Attaching package: 'ConFluxPro'
#> The following object is masked from 'package:stats':
#> 
#>     filter

Uncertainty of input variables

General workflow and uncertainty from `gasdata`

The most relevant source of uncertainty in FGM-based models is that of the input parameters. Especially parameters related to the estimation of the calculation of the diffusion coefficient DS are hard to measure and soil heterogeneity introduces more uncertainty. We’ll use the example dataset to demonstrate this effect.

data("base_dat", package = "ConFluxPro")
mod_pf <- pro_flux(base_dat)

In this dataset, the molar fraction of CO₂ is measured in three replicates at each depth within the soil. In a normal model, we use all replicates to fit a single profile. By randomly sampling from these replicates, we can estimate the uncertainty the concentration profiles measurement introduces into the model. This can be done in bootstrap_error()

set.seed(42) # to get exactly same result every time 
mod_pf_bs_gasdata <- 
  bootstrap_error(mod_pf, 
                  n_samples = 25, 
                  sample_from = "gasdata")
#> 
#> validating datasets
#> id_cols: site, Date, gas, bootstrap_id
#> 600 unique profiles

With n_samples = 25 we created 25 new models. In each of these, a random selection of all measured x_ppm values was created per profile. This is done by resampling the same number of observations at each depth while allowing replacing (meaning that a single measurement may be sampled multiple times!). If we extract the efflux, we now get a second column DELTA_efflux that gives the uncertainty. Notice, that the value of efflux has changed slightly from the original model. This is because it is now estimated as the mean value of all 25 models, while the standard deviation is the estimate of DELTA_efflux.

## after bootstrapping
mod_pf_bs_gasdata %>%
  filter(prof_id == 1) %>%
  efflux()
#> 
#> A cfp_profile object 
#> id_cols: site Date gas 
#> 1  unique profiles 
#> 
#>     site       Date gas    efflux prof_id DELTA_efflux
#> 1 site_a 2021-01-01 CO2 0.8336188       1   0.01934041

## originial model
mod_pf %>%
  filter(prof_id == 1) %>%
  efflux()
#> 
#> A cfp_profile object 
#> id_cols: site Date gas 
#> 1  unique profiles 
#> 
#>     site       Date gas    efflux prof_id
#> 1 site_a 2021-01-01 CO2 0.8347288       1

Similarly, we can extract the production(), where respective columns have also been added.

mod_pf_bs_gasdata %>%
  filter(prof_id == 1) %>%
  production()
#> 
#> A cfp_layered_profile object 
#> id_cols: site Date gas 
#> 1  unique profiles 
#> 
#>     site       Date gas upper lower layer   prod_abs  prod_rel    efflux F0
#> 1 site_a 2021-01-01 CO2     0  -100     1 0.09728528 0.1167024 0.8336188  0
#> 2 site_a 2021-01-01 CO2     5     0     2 0.73633354 0.8832976 0.8336188  0
#>   DELTA_F0 DELTA_prod_abs DELTA_prod_rel
#> 1       NA    0.003887647    0.007371137
#> 2       NA    0.022983446    0.048063678

Uncertainty from `soilphys`

However, not only the gas concentration data carries uncertainty. Indeed, many parameters contained within the soilphys dataset are subject to significant uncertainty due to sampling error and soil heterogeneity. In the present dataset this is not represented at all. So, let’s pretend that we measured TPS in three replicates identified by replicate_id that have some spread around the mean. We will do this by randomly sampling from a normal distribution with a standard deviation of 0.01.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:ConFluxPro':
#> 
#>     n_groups
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
soilphys <- cfp_soilphys(base_dat)

set.seed(42)
soilphys_replicate_TPS <-
soilphys %>%
  # get base TPS info
  select(site, upper, lower, TPS) %>%
  distinct() %>%
  rename(TPS_mean = TPS) %>%
  # repeat each row 3 times 
  cross_join(data.frame(replicate_id = 1:3)) %>%
  # generate new, random TPS values
  mutate(TPS = rnorm(n(), TPS_mean, 0.01)) %>%
  # join with rest of dataset
  left_join(soilphys %>% select(!TPS),
            by = c("site", "upper", "lower"),
            relationship = "many-to-many") %>%
  # recaluclate DS and c_air
  complete_soilphys(DSD0_formula = "a*AFPS^b", quiet = TRUE) %>%
  cfp_soilphys(id_cols = c("site", "Date", "replicate_id"))
#> 
#> added 'gas' to id_cols

Now we can create a new dataset that includes these new values of TPS and create a cfp_pfmod model. Of course you could use real measurements instead and include the variability of other parameters (SWC, t, …) as well. We don’t call pro_flux(), because we don’t want to calulcate the result from each replication individual.

replicate_dat <- 
  cfp_dat(cfp_gasdata(base_dat),
          soilphys_replicate_TPS,
          cfp_layers_map(base_dat))
#> 
#> validating datasets
#> id_cols: site, Date, gas, replicate_id
#> 72 unique profiles

mod_pf_replicate <-
  cfp_pfmod(replicate_dat)

Now we can run bootstrap_error() again. This time, we will tell the function to resample from the soilphys dataset when creating the bootstrapping runs. To do this, we need to tell it which id_cols define(s) the replications of the dataset. In our case, this is replicate_id. Here, only a single value for each profile and depth is sampled for each run. The new call looks like this:

set.seed(42)
mod_pf_bs_soilphys <- 
  bootstrap_error(mod_pf_replicate, 
                  n_samples = 25, 
                  sample_from = "soilphys",
                  rep_cols = "replicate_id")
#> 
#> validating datasets
#> id_cols: site, Date, gas, bootstrap_id
#> 600 unique profiles
#> 
#> validating datasets
#> id_cols: site, Date, gas
#> 24 unique profiles

If we take a look into the result of efflux() again, we can see that the uncertainty of the efflux estimate has increased compared to using the variability in gasdata. This is both because TPS, from which DS is derived, has a strong impact on the flux calculation and because the variability within the gasdata dataset is probably unrealistically low and more representative of an instrument than of a measurement error.

## boostrapping from soilphys
mod_pf_bs_soilphys %>%
  filter(prof_id < 10) %>%
  efflux()
#> 
#> A cfp_profile object 
#> id_cols: site Date gas 
#> 9  unique profiles 
#> 
#>     site       Date gas    efflux prof_id DELTA_efflux
#> 1 site_a 2021-01-01 CO2 0.8284734       1   0.02369963
#> 2 site_b 2021-01-01 CO2 0.6809739       2   0.01377851
#> 3 site_a 2021-02-01 CO2 1.3522946       3   0.04108327
#> 4 site_b 2021-02-01 CO2 1.0733001       4   0.02497279
#> 5 site_a 2021-03-01 CO2 1.7396562       5   0.04583895
#> 6 site_b 2021-03-01 CO2 1.3916118       6   0.04032870
#> 7 site_a 2021-04-01 CO2 4.0887068       7   0.10509266
#> 8 site_b 2021-04-01 CO2 3.3623254       8   0.04875499
#> 9 site_a 2021-05-01 CO2 5.8339197       9   0.16272032

## boostrapping from gasdata
mod_pf_bs_gasdata %>%
  filter(prof_id < 10) %>%
  efflux()
#> 
#> A cfp_profile object 
#> id_cols: site Date gas 
#> 9  unique profiles 
#> 
#>     site       Date gas    efflux prof_id DELTA_efflux
#> 1 site_a 2021-01-01 CO2 0.8336188       1  0.019340412
#> 2 site_b 2021-01-01 CO2 0.6816792       2  0.016291276
#> 3 site_a 2021-02-01 CO2 1.3745789       3  0.017921161
#> 4 site_b 2021-02-01 CO2 1.0611182       4  0.012922019
#> 5 site_a 2021-03-01 CO2 1.7370006       5  0.038777233
#> 6 site_b 2021-03-01 CO2 1.3863835       6  0.009591992
#> 7 site_a 2021-04-01 CO2 4.1470956       7  0.047504882
#> 8 site_b 2021-04-01 CO2 3.3411440       8  0.047524641
#> 9 site_a 2021-05-01 CO2 5.9591679       9  0.063722814

Finally, we can also use the variability from both datasets at the same time. This runs both sampling strategies independent from another.

set.seed(42)
mod_pf_bs_both <- 
  bootstrap_error(mod_pf_replicate, 
                  n_samples = 25, 
                  sample_from = "both",
                  rep_cols = "replicate_id")
#> 
#> validating datasets
#> id_cols: site, Date, gas, bootstrap_id
#> 600 unique profiles
#> 
#> validating datasets
#> id_cols: site, Date, gas
#> 24 unique profiles
mod_pf_bs_both %>%
  filter(prof_id < 10) %>%
  efflux()
#> 
#> A cfp_profile object 
#> id_cols: site Date gas 
#> 9  unique profiles 
#> 
#>     site       Date gas    efflux prof_id DELTA_efflux
#> 1 site_a 2021-01-01 CO2 0.8240803       1   0.03005396
#> 2 site_b 2021-01-01 CO2 0.6831012       2   0.02640363
#> 3 site_a 2021-02-01 CO2 1.3486770       3   0.03307125
#> 4 site_b 2021-02-01 CO2 1.0755834       4   0.02088252
#> 5 site_a 2021-03-01 CO2 1.7209752       5   0.06241706
#> 6 site_b 2021-03-01 CO2 1.3986402       6   0.03195282
#> 7 site_a 2021-04-01 CO2 4.0818567       7   0.10536313
#> 8 site_b 2021-04-01 CO2 3.4030107       8   0.09897170
#> 9 site_a 2021-05-01 CO2 5.8525804       9   0.16324836

Uncertainty of input variables

General workflow and uncertainty from gasdata

Uncertainty from soilphys

General workflow and uncertainty from `gasdata`

Uncertainty from `soilphys`