This vignette shows how to estimate the uncertainty of the efflux and
production estimates using bootstrap_error()
. We will first
generate a dataset of ‘measurement uncertainty’ in the input parameters
and then use bootstrap_error()
to estimate the resulting
uncertainty in the models.
library(ConFluxPro)
#>
#> Attaching package: 'ConFluxPro'
#> The following object is masked from 'package:stats':
#>
#> filter
Uncertainty of input variables
General workflow and uncertainty from gasdata
The most relevant source of uncertainty in FGM-based models is that
of the input parameters. Especially parameters related to the estimation
of the calculation of the diffusion coefficient DS
are hard
to measure and soil heterogeneity introduces more uncertainty. We’ll use
the example dataset to demonstrate this effect.
In this dataset, the molar fraction of CO2 is measured in
three replicates at each depth within the soil. In a normal model, we
use all replicates to fit a single profile. By randomly sampling from
these replicates, we can estimate the uncertainty the concentration
profiles measurement introduces into the model. This can be done in
bootstrap_error()
set.seed(42) # to get exactly same result every time
mod_pf_bs_gasdata <-
bootstrap_error(mod_pf,
n_samples = 25,
sample_from = "gasdata")
#>
#> validating datasets
#> id_cols: site, Date, gas, bootstrap_id
#> 600 unique profiles
With n_samples = 25
we created 25 new models. In each of
these, a random selection of all measured x_ppm
values was
created per profile. This is done by resampling the same number of
observations at each depth while allowing replacing (meaning that a
single measurement may be sampled multiple times!). If we extract the
efflux, we now get a second column DELTA_efflux
that gives
the uncertainty. Notice, that the value of efflux
has
changed slightly from the original model. This is because it is now
estimated as the mean value of all 25 models, while the standard
deviation is the estimate of DELTA_efflux
.
## after bootstrapping
mod_pf_bs_gasdata %>%
filter(prof_id == 1) %>%
efflux()
#>
#> A cfp_profile object
#> id_cols: site Date gas
#> 1 unique profiles
#>
#> site Date gas efflux prof_id DELTA_efflux
#> 1 site_a 2021-01-01 CO2 0.8336188 1 0.01934041
## originial model
mod_pf %>%
filter(prof_id == 1) %>%
efflux()
#>
#> A cfp_profile object
#> id_cols: site Date gas
#> 1 unique profiles
#>
#> site Date gas efflux prof_id
#> 1 site_a 2021-01-01 CO2 0.8347288 1
Similarly, we can extract the production()
, where
respective columns have also been added.
mod_pf_bs_gasdata %>%
filter(prof_id == 1) %>%
production()
#>
#> A cfp_layered_profile object
#> id_cols: site Date gas
#> 1 unique profiles
#>
#> site Date gas upper lower layer prod_abs prod_rel efflux F0
#> 1 site_a 2021-01-01 CO2 0 -100 1 0.09728528 0.1167024 0.8336188 0
#> 2 site_a 2021-01-01 CO2 5 0 2 0.73633354 0.8832976 0.8336188 0
#> DELTA_F0 DELTA_prod_abs DELTA_prod_rel
#> 1 NA 0.003887647 0.007371137
#> 2 NA 0.022983446 0.048063678
Uncertainty from soilphys
However, not only the gas concentration data carries uncertainty.
Indeed, many parameters contained within the soilphys
dataset are subject to significant uncertainty due to sampling error and
soil heterogeneity. In the present dataset this is not represented at
all. So, let’s pretend that we measured TPS
in three
replicates identified by replicate_id
that have some spread
around the mean. We will do this by randomly sampling from a normal
distribution with a standard deviation of 0.01.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:ConFluxPro':
#>
#> n_groups
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
soilphys <- cfp_soilphys(base_dat)
set.seed(42)
soilphys_replicate_TPS <-
soilphys %>%
# get base TPS info
select(site, upper, lower, TPS) %>%
distinct() %>%
rename(TPS_mean = TPS) %>%
# repeat each row 3 times
cross_join(data.frame(replicate_id = 1:3)) %>%
# generate new, random TPS values
mutate(TPS = rnorm(n(), TPS_mean, 0.01)) %>%
# join with rest of dataset
left_join(soilphys %>% select(!TPS),
by = c("site", "upper", "lower"),
relationship = "many-to-many") %>%
# recaluclate DS and c_air
complete_soilphys(DSD0_formula = "a*AFPS^b", quiet = TRUE) %>%
cfp_soilphys(id_cols = c("site", "Date", "replicate_id"))
#>
#> added 'gas' to id_cols
Now we can create a new dataset that includes these new values of TPS
and create a cfp_pfmod
model. Of course you could use real
measurements instead and include the variability of other parameters
(SWC
, t
, …) as well. We don’t call
pro_flux()
, because we don’t want to calulcate the result
from each replication individual.
replicate_dat <-
cfp_dat(cfp_gasdata(base_dat),
soilphys_replicate_TPS,
cfp_layers_map(base_dat))
#>
#> validating datasets
#> id_cols: site, Date, gas, replicate_id
#> 72 unique profiles
mod_pf_replicate <-
cfp_pfmod(replicate_dat)
Now we can run bootstrap_error()
again. This time, we
will tell the function to resample from the soilphys
dataset when creating the bootstrapping runs. To do this, we need to
tell it which id_cols
define(s) the replications of the
dataset. In our case, this is replicate_id
. Here, only a
single value for each profile and depth is sampled for each run. The new
call looks like this:
set.seed(42)
mod_pf_bs_soilphys <-
bootstrap_error(mod_pf_replicate,
n_samples = 25,
sample_from = "soilphys",
rep_cols = "replicate_id")
#>
#> validating datasets
#> id_cols: site, Date, gas, bootstrap_id
#> 600 unique profiles
#>
#> validating datasets
#> id_cols: site, Date, gas
#> 24 unique profiles
If we take a look into the result of efflux()
again, we
can see that the uncertainty of the efflux estimate has increased
compared to using the variability in gasdata
. This is both
because TPS
, from which DS
is derived, has a
strong impact on the flux calculation and because the variability within
the gasdata
dataset is probably unrealistically low and
more representative of an instrument than of a measurement error.
## boostrapping from soilphys
mod_pf_bs_soilphys %>%
filter(prof_id < 10) %>%
efflux()
#>
#> A cfp_profile object
#> id_cols: site Date gas
#> 9 unique profiles
#>
#> site Date gas efflux prof_id DELTA_efflux
#> 1 site_a 2021-01-01 CO2 0.8284734 1 0.02369963
#> 2 site_b 2021-01-01 CO2 0.6809739 2 0.01377851
#> 3 site_a 2021-02-01 CO2 1.3522946 3 0.04108327
#> 4 site_b 2021-02-01 CO2 1.0733001 4 0.02497279
#> 5 site_a 2021-03-01 CO2 1.7396562 5 0.04583895
#> 6 site_b 2021-03-01 CO2 1.3916118 6 0.04032870
#> 7 site_a 2021-04-01 CO2 4.0887068 7 0.10509266
#> 8 site_b 2021-04-01 CO2 3.3623254 8 0.04875499
#> 9 site_a 2021-05-01 CO2 5.8339197 9 0.16272032
## boostrapping from gasdata
mod_pf_bs_gasdata %>%
filter(prof_id < 10) %>%
efflux()
#>
#> A cfp_profile object
#> id_cols: site Date gas
#> 9 unique profiles
#>
#> site Date gas efflux prof_id DELTA_efflux
#> 1 site_a 2021-01-01 CO2 0.8336188 1 0.019340412
#> 2 site_b 2021-01-01 CO2 0.6816792 2 0.016291276
#> 3 site_a 2021-02-01 CO2 1.3745789 3 0.017921161
#> 4 site_b 2021-02-01 CO2 1.0611182 4 0.012922019
#> 5 site_a 2021-03-01 CO2 1.7370006 5 0.038777233
#> 6 site_b 2021-03-01 CO2 1.3863835 6 0.009591992
#> 7 site_a 2021-04-01 CO2 4.1470956 7 0.047504882
#> 8 site_b 2021-04-01 CO2 3.3411440 8 0.047524641
#> 9 site_a 2021-05-01 CO2 5.9591679 9 0.063722814
Finally, we can also use the variability from both datasets at the same time. This runs both sampling strategies independent from another.
set.seed(42)
mod_pf_bs_both <-
bootstrap_error(mod_pf_replicate,
n_samples = 25,
sample_from = "both",
rep_cols = "replicate_id")
#>
#> validating datasets
#> id_cols: site, Date, gas, bootstrap_id
#> 600 unique profiles
#>
#> validating datasets
#> id_cols: site, Date, gas
#> 24 unique profiles
mod_pf_bs_both %>%
filter(prof_id < 10) %>%
efflux()
#>
#> A cfp_profile object
#> id_cols: site Date gas
#> 9 unique profiles
#>
#> site Date gas efflux prof_id DELTA_efflux
#> 1 site_a 2021-01-01 CO2 0.8240803 1 0.03005396
#> 2 site_b 2021-01-01 CO2 0.6831012 2 0.02640363
#> 3 site_a 2021-02-01 CO2 1.3486770 3 0.03307125
#> 4 site_b 2021-02-01 CO2 1.0755834 4 0.02088252
#> 5 site_a 2021-03-01 CO2 1.7209752 5 0.06241706
#> 6 site_b 2021-03-01 CO2 1.3986402 6 0.03195282
#> 7 site_a 2021-04-01 CO2 4.0818567 7 0.10536313
#> 8 site_b 2021-04-01 CO2 3.4030107 8 0.09897170
#> 9 site_a 2021-05-01 CO2 5.8525804 9 0.16324836