Data preparation

This vignette introduces the central data classes in ConFluxPro and shows how to combine all necessary data into a single cfp_dat() object. We start by explaining the concept of a ‘profile’ and how soil data is structured in the package. We then introduce three essential classes cfp_gasdata(), cfp_soilphys() and cfp_layers_map() that frame different data types within ConFluxPro. Finally, we combine these objects into a single cfp_dat() object that holds all information needed to start modeling soil gas fluxes.

library(ConFluxPro)

Soil data and profiles

What is a soil profile?

A profile in ConFluxPro is a distinct observation of the vertical profile of a soil. For example, this can mean the concentration profile of CO₂ at a specific time point, or the vertical distribution of soil moisture. Profiles are identified by their id_cols, column names that together uniquely identify each profile. This can be whatever combination of columns happened to be in your data, but will typically include a datetime column to identify the time point or a column that identifies the site if you measured at multiple locations. Consider the soilphys dataset included in the package.

library(dplyr)
data("soilphys", package = "ConFluxPro")
soilphys %>%
  relocate(site, Date, gas) %>%
  head()
#>     site       Date gas  TPS   a   b depth upper lower        SWC            t
#> 1 site_a 2021-01-01 CO2 0.38 1.2 1.5   -80   -60  -100 0.16135277 -0.789903514
#> 2 site_a 2021-02-01 CO2 0.38 1.2 1.5   -80   -60  -100 0.12392743  0.005814771
#> 3 site_a 2021-03-01 CO2 0.38 1.2 1.5   -80   -60  -100 0.12227893  2.059937865
#> 4 site_a 2021-04-01 CO2 0.38 1.2 1.5   -80   -60  -100 0.14464345  4.725217894
#> 5 site_a 2021-05-01 CO2 0.38 1.2 1.5   -80   -60  -100 0.11446588  8.825682240
#> 6 site_a 2021-06-01 CO2 0.38 1.2 1.5   -80   -60  -100 0.09537391 10.377836383
#>      p      AFPS      DSD0           D0           DS    c_air
#> 1 1013 0.2186472 0.1226866 1.373780e-05 1.685444e-06 44.73588
#> 2 1013 0.2560726 0.1554984 1.381053e-05 2.147515e-06 44.60556
#> 3 1013 0.2577211 0.1570023 1.399908e-05 2.197889e-06 44.27263
#> 4 1013 0.2353565 0.1370158 1.424543e-05 1.951850e-06 43.84798
#> 5 1013 0.2655341 0.1641957 1.462819e-05 2.401886e-06 43.21035
#> 6 1013 0.2846261 0.1822189 1.477426e-05 2.692149e-06 42.97380

This dataset contains information on different soil physical parameters. We will come back to that later. For now, notice that there are the site and Date column. All combinations identify a unique profile, namely twelve dates for two sites "site_a" and "site_b" for a total of 24 profiles. We can create a cfp_profile() object from this dataset. The only thing this does is to add the information which columns identify the profiles to the dataset with the id_cols argument. Internally, the number of distinct profiles is calculated as the unique combinations of the id_cols and can be accessed by n_profiles().

soilphys <- 
  cfp_profile(soilphys, id_cols = c("site", "Date"))
n_profiles(soilphys)
#> [1] 24

The concept of profiles and id_cols applies to almost all objects created in ConFluxPro. You can always see what id_cols an object has by calling the function cfp_id_cols().

cfp_id_cols(soilphys)
#> [1] "site" "Date"

Soil data structure

Within a soil profile, parameters change with depth. In ConFluxPro we differentiate between two types of data. Point data have distinct values at a precise depth in the soil, such as the soil gas concentration. For layered data , the soil is subdivided into layers that are delimited by their upper and lower boundary. Layered data must be without gaps or overlaps, so that the entire profile can be described by distinct values. depth, upper and lower are in centimeters and a higher value indicates the upward direction. Both positive and negative values are allowed. This means you could set the mineral/organic soil boundary to have depth = 0, describe the mineral soil with increasingly negative values of depth and have the organic layer on top have positive values.

We will now present the three main data types that are needed to model gas fluxes.

`cfp_gasdata()`: Gas concentration data

Soil gas concentration data is stored in a data.frame. The data needs to have at least the columns gas, depth and x_ppm, as well as any id_cols that identify the soil profiles. The gas is a character() that identifies which gas was measured, so for example "CO2". depth is as explained above and x_ppm is the mixing ratio in ppm. Consider the example data that ships with the package:

data("gasdata", package = "ConFluxPro")
head(gasdata)
#> # A tibble: 6 × 5
#>   site   Date       depth x_ppm gas  
#>   <chr>  <date>     <dbl> <dbl> <chr>
#> 1 site_a 2021-01-01     5  423. CO2  
#> 2 site_a 2021-01-01     0  552. CO2  
#> 3 site_a 2021-01-01     0  556. CO2  
#> 4 site_a 2021-01-01     0  557. CO2  
#> 5 site_a 2021-01-01   -10  716. CO2  
#> 6 site_a 2021-01-01   -10  713. CO2

Because this data.frame already follows the requirements defined above, we can create a cfp_gasdata() object right away by providing the necessary id_cols. The function checks, whether the data.frame meets the criteria and returns an object of class cfp_gasdata.

# create working cfp_gasdata object
my_gasdata <- cfp_gasdata(gasdata, id_cols = c("site", "Date"))
#> 
#> added 'gas' to id_cols

head(my_gasdata)
#> 
#> A cfp_gasdata object 
#> id_cols: site Date gas 
#> 1  unique profiles 
#> 
#>     site       Date depth    x_ppm gas
#> 1 site_a 2021-01-01     5 422.7419 CO2
#> 2 site_a 2021-01-01     0 552.1412 CO2
#> 3 site_a 2021-01-01     0 555.8525 CO2
#> 4 site_a 2021-01-01     0 556.9315 CO2
#> 5 site_a 2021-01-01   -10 716.4256 CO2
#> 6 site_a 2021-01-01   -10 713.3633 CO2

# won't work, because depth is missing
gasdata_broken <- gasdata %>% select(!depth)
cfp_gasdata(gasdata_broken, id_cols = c("site", "Date"))
#> 
#> added 'gas' to id_cols
#> Error in validate_cfp_gasdata(x): data.frame lacks obligatory columns

The data is in a long format, following the tidy data principles, where each observation is in a new row. If your data is not in this format, you can use the tidyr::pivot_longer() and other functions in the tidyr package. Notice that cfp_gasdata() automatically added the gas column to the id_cols. gas is always a required column in ConFluxPro, so that many functions will automatically add it to the id_cols if its present and will fail if its missing.

`cfp_soilphys()`: Soil physical data

Create a `cfp_soilphys()` object

Soil physical data includes everything else needed to describe the gas transport characteristics of the soil. This data is split into homogeneous layers with constant values for each layer and profile. At the minimum, cfp_soilphys() requires two parameters: DS is the gas-specific apparent diffusion coefficient of the soil in m² s^-1 and c_air is the number density of the soil air in mol m^-3. If these parameters are provided, a valid cfp_soilphys object can be created. This function ensures that each profile is correctly layered without overlaps or gaps and that the required columns are present.

data("soilphys", package = "ConFluxPro")

my_soilphys <- cfp_soilphys(soilphys, id_cols = c("site", "Date"))
#> 
#> added 'gas' to id_cols

Derive parameters with `complete_soilphys()`

Often DS and c_air are derived from other parameters that are easier to measure. This then requires a bit more work to get a working cfp_soilphys() object. Typically, you will need the following information:

Measured parameters:

TPS: total porosity as a volume fraction.
SWC: soil water content as a volume fraction.
t: air temperature in °C.
p: air pressure in hPa.

Derived parameters:

c_air: air number density, derived from t and p with ideal gas law.
AFPS: air-filled pore space, defined as AFPS = TPS - SWC.
DSD0: relative diffusivity of the soil, unit-less. Derived from AFPS with transfer function.
D0: gas-specific free-air diffusion coefficient in m² s^-1. Derived from t and p with empiric function.
DS: gas-specific apparent diffusion coefficient in m² s^-1, derived as DS = D0 * DSD0.

These relationships are invoked within the complete_soilphys() function. Consider the provided example data: It already has every parameter we need. Now let’s remove the derived parameters and run complete_soilphys() again. We have to provide a formula for the calculation of the relative diffusivity DSD0. In our case, we have fitting parameters a and b from a Currie-style model that calculates as DSD0 = a * AFPS ^ b. We provide this formula as a character() to the function:

soilphys_measured_only <- 
  soilphys %>% 
  select(!all_of(c("AFPS", "DSD0", "D0", "DS", "c_air")))

head(soilphys_measured_only)
#>     site  TPS   a   b depth upper lower       Date        SWC            t    p
#> 1 site_a 0.38 1.2 1.5   -80   -60  -100 2021-01-01 0.16135277 -0.789903514 1013
#> 2 site_a 0.38 1.2 1.5   -80   -60  -100 2021-02-01 0.12392743  0.005814771 1013
#> 3 site_a 0.38 1.2 1.5   -80   -60  -100 2021-03-01 0.12227893  2.059937865 1013
#> 4 site_a 0.38 1.2 1.5   -80   -60  -100 2021-04-01 0.14464345  4.725217894 1013
#> 5 site_a 0.38 1.2 1.5   -80   -60  -100 2021-05-01 0.11446588  8.825682240 1013
#> 6 site_a 0.38 1.2 1.5   -80   -60  -100 2021-06-01 0.09537391 10.377836383 1013
#>   gas
#> 1 CO2
#> 2 CO2
#> 3 CO2
#> 4 CO2
#> 5 CO2
#> 6 CO2

my_soilphys_completed <- 
  complete_soilphys(soilphys_measured_only, DSD0_formula = "a*AFPS^b")
#> The following columns were added: AFPS DSD0 D0 DS c_air
#> The following columns were overwritten:

head(my_soilphys_completed)
#>     site  TPS   a   b depth upper lower       Date        SWC            t    p
#> 1 site_a 0.38 1.2 1.5   -80   -60  -100 2021-01-01 0.16135277 -0.789903514 1013
#> 2 site_a 0.38 1.2 1.5   -80   -60  -100 2021-02-01 0.12392743  0.005814771 1013
#> 3 site_a 0.38 1.2 1.5   -80   -60  -100 2021-03-01 0.12227893  2.059937865 1013
#> 4 site_a 0.38 1.2 1.5   -80   -60  -100 2021-04-01 0.14464345  4.725217894 1013
#> 5 site_a 0.38 1.2 1.5   -80   -60  -100 2021-05-01 0.11446588  8.825682240 1013
#> 6 site_a 0.38 1.2 1.5   -80   -60  -100 2021-06-01 0.09537391 10.377836383 1013
#>   gas      AFPS      DSD0           D0           DS    c_air
#> 1 CO2 0.2186472 0.1226866 1.373780e-05 1.685444e-06 44.73588
#> 2 CO2 0.2560726 0.1554984 1.381053e-05 2.147515e-06 44.60556
#> 3 CO2 0.2577211 0.1570023 1.399908e-05 2.197889e-06 44.27263
#> 4 CO2 0.2353565 0.1370158 1.424543e-05 1.951850e-06 43.84798
#> 5 CO2 0.2655341 0.1641957 1.462819e-05 2.401886e-06 43.21035
#> 6 CO2 0.2846261 0.1822189 1.477426e-05 2.692149e-06 42.97380

Notice the gas column. This is essential because different gases have specific diffusion coefficients. We can also remove the gas column and provide a list of gases to complete_soilphys():

soilphys_measured_only %>%
  select(!gas) %>%
  complete_soilphys(DSD0_formula = "a*AFPS^b", gases = c("CO2", "CH4")) %>%
  head()
#> The following columns were added: AFPS DSD0 D0 DS c_air
#> The following columns were overwritten:
#>     site  TPS   a   b depth upper lower       Date       SWC            t    p
#> 1 site_a 0.38 1.2 1.5   -80   -60  -100 2021-01-01 0.1613528 -0.789903514 1013
#> 2 site_a 0.38 1.2 1.5   -80   -60  -100 2021-01-01 0.1613528 -0.789903514 1013
#> 3 site_a 0.38 1.2 1.5   -80   -60  -100 2021-02-01 0.1239274  0.005814771 1013
#> 4 site_a 0.38 1.2 1.5   -80   -60  -100 2021-02-01 0.1239274  0.005814771 1013
#> 5 site_a 0.38 1.2 1.5   -80   -60  -100 2021-03-01 0.1222789  2.059937865 1013
#> 6 site_a 0.38 1.2 1.5   -80   -60  -100 2021-03-01 0.1222789  2.059937865 1013
#>        AFPS      DSD0 gas           D0           DS    c_air
#> 1 0.2186472 0.1226866 CO2 1.373780e-05 1.685444e-06 44.73588
#> 2 0.2186472 0.1226866 CH4 1.941795e-05 2.382322e-06 44.73588
#> 3 0.2560726 0.1554984 CO2 1.381053e-05 2.147515e-06 44.60556
#> 4 0.2560726 0.1554984 CH4 1.952075e-05 3.035445e-06 44.60556
#> 5 0.2577211 0.1570023 CO2 1.399908e-05 2.197889e-06 44.27263
#> 6 0.2577211 0.1570023 CH4 1.978726e-05 3.106646e-06 44.27263

This replicates each row of the data.frame and calculates distinct values for each gas.

From measurements to layered data: `discretize_depth()`

As seen above, a cfp_soilphys object often requires the combination of many different parameters. These parameters are typically measured in very different ways and may be present in different data formats. This presents a hurdle, as cfp_soilphys() requires these datasets to all be present in the same layered format. The function discretize_depth() helps with this by providing a unified way to interpolate different observations to the same layered structure. The basic idea is that you provide a vector of depths that represent the layer boundaries. For $n$ layers, you would need $n+1$ depths. Let’s say we measured a temperature profile at three depths: 0, -30 and -100 cm:

temperature_profile <- 
  data.frame(depth = c(0, -30, -100),
             t = c(25, 20, 16))

If we want to map this to 10 cm-thick layers, we can simply define the layer boundaries and use discretize_depth() to linearly interpolate the observations:

boundaries <- seq(0, -100, by = -10)

discretize_depth(temperature_profile,
                 param = "t", 
                 method = "linear",
                 depth_target = boundaries)
#> 
#> A cfp_layered_profile object 
#> id_cols: 
#> 1  unique profiles 
#> 
#>           t depth upper lower
#> 1  16.28571   -95   -90  -100
#> 2  16.85714   -85   -80   -90
#> 3  17.42857   -75   -70   -80
#> 4  18.00000   -65   -60   -70
#> 5  18.57143   -55   -50   -60
#> 6  19.14286   -45   -40   -50
#> 7  19.71429   -35   -30   -40
#> 8  20.83333   -25   -20   -30
#> 9  22.50000   -15   -10   -20
#> 10 24.16667    -5     0   -10

The function can handle profile data and treats every profile distinctly. So if we adapt our example data to two profiles identified by a date column, the interpolation works in the same way, if we provide the necessary id_cols.

temperature_profile_dates <-
  data.frame(date_id = rep(c("date_1", "date_2"), each = 3),
             depth = rep(c(0, -30, -100), times = 2),
             t = c(25, 20, 16, 15, 10, 8)) %>%
  cfp_profile(id_cols = c("date_id"))

discretize_depth(temperature_profile_dates,
                 param = "t", 
                 method = "linear",
                 depth_target = boundaries)
#> 
#> A cfp_layered_profile object 
#> id_cols: date_id 
#> 2  unique profiles 
#> 
#>    date_id         t depth upper lower
#> 1   date_1 16.285714   -95   -90  -100
#> 2   date_1 16.857143   -85   -80   -90
#> 3   date_1 17.428571   -75   -70   -80
#> 4   date_1 18.000000   -65   -60   -70
#> 5   date_1 18.571429   -55   -50   -60
#> 6   date_1 19.142857   -45   -40   -50
#> 7   date_1 19.714286   -35   -30   -40
#> 8   date_1 20.833333   -25   -20   -30
#> 9   date_1 22.500000   -15   -10   -20
#> 10  date_1 24.166667    -5     0   -10
#> 11  date_2  8.142857   -95   -90  -100
#> 12  date_2  8.428571   -85   -80   -90
#> 13  date_2  8.714286   -75   -70   -80
#> 14  date_2  9.000000   -65   -60   -70
#> 15  date_2  9.285714   -55   -50   -60
#> 16  date_2  9.571429   -45   -40   -50
#> 17  date_2  9.857143   -35   -30   -40
#> 18  date_2 10.833333   -25   -20   -30
#> 19  date_2 12.500000   -15   -10   -20
#> 20  date_2 14.166667    -5     0   -10

We can use different interpolation methods that may be more applicable to other parameters. See the function documentation ?discretize_depth for details.

`cfp_layers_map()`: Create the model layers

The final aspect of a ConFluxPro model is the “layers_map”. This is a layered data.frame that defines for which layers we want to calculate flux rates. We provide a dataset that matches the other example datasets:

data("layers_map", package = "ConFluxPro")

layers_map
#>     site upper lower
#> 1 site_a     5     0
#> 2 site_b     7     0
#> 3 site_a     0  -100
#> 4 site_b     0  -100

As you can see, we divide each of the site into two layers, one for the organic layer with depth > 0 and one for the mineral soil below. Notice, that the upper boundary is different between the sites. This is because they have a different organic layer, that is also reflected in the other datasets:

my_gasdata %>%
  group_by(site) %>%
  summarise(max_depth = max(depth))
#> # A tibble: 2 × 2
#>   site   max_depth
#>   <chr>      <dbl>
#> 1 site_a         5
#> 2 site_b         7

my_soilphys %>%
  group_by(site) %>%
  summarise(max_upper = max(upper))
#> # A tibble: 2 × 2
#>   site   max_upper
#>   <chr>      <dbl>
#> 1 site_a         5
#> 2 site_b         7

Notice, that the extends of the layers_map should match your measured data if you want to include it in the models. You cannot extend the layers_map beyond your measurements.

How many layers to define in the cfp_layers_map() is up to the user and depends on the research question and the resolution of the recorded data. While it may be tempting to define a layer between every observation depth in cfp_gasdata(), this may result in artefacts in the modeled fluxes. Combining these layers may therefore yield more robust results.

To create a valid cfp_layers_map, we still need to add some information. These are only needed for a pro_flux() inverse model and are set to NA if missing.

gas: The gas as a character().
lowlim: The lower limit of production allowed in that layer in µmol m^-3 s^-1
highlim: The upper limit of production allowed in that layer in µmol m^-3 s^-1
layer_couple: Penalty factor, defaults to 0. See ?cfp_layers_map()

We can either do so in the function call or provide distinct values ourselves. The highlim and lowlim values can help to explicitly exclude unrealisitic processes in the optimization in pro_flux(). For example, we can set lowlim to 0, if we want to prevent consumption of CO₂. Again, these values are only relevant in the pro_flux() model. In our case, we want to add the limits for the gas "CO2":

my_layers_map <- 
  cfp_layers_map(
    layers_map,
    id_cols = "site",
    gas = "CO2",
    lowlim = 0,
    highlim = 1000
  )
#> 
#> added 'gas' to id_cols

my_layers_map
#> 
#> A cfp_layers_map object 
#> id_cols: site gas 
#> 2  unique profiles 
#> 
#>     site upper lower gas lowlim highlim layer_couple layer pmap
#> 1 site_a     0  -100 CO2      0    1000            0     1    1
#> 2 site_b     0  -100 CO2      0    1000            0     1    1
#> 3 site_a     5     0 CO2      0    1000            0     2    2
#> 4 site_b     7     0 CO2      0    1000            0     2    2

Note that we only added "site" as id_cols, because we don’t need the model frame to change over time and therefore use the same mapping for every Date. You can always use a subset of the id_cols in cfp_soilphys() in your cfp_layers_map(). At the minimum, you need to provide gas within your id_cols.

With this complete, we now have everything to create a unified dataset for a ConFluxPro model.

`cfp_dat()`: Combining the datasets

As a last step, we combine the three datasets into one unified frame with cfp_dat().

my_dat <- cfp_dat(my_gasdata, my_soilphys, my_layers_map)
#> 
#> validating datasets
#> id_cols: site, Date, gas
#> 24 unique profiles

my_dat
#> 
#> A cfp_dat object to be used as input in ConFluxPro models. 
#> id_cols: site Date gas 
#> number of profiles:  24 
#> number of groups:  2

This function does multiple things. (I) It ensures that the upper and lower boundaries between the datasets match. The highest upper of soilphys and gasdata should match the highest depth in gasdata, with the lower boundary correspondingly. (II) It matches each profile to another and makes a list of all profiles and the corresponding datasets. ConFluxPro does not require the id_cols between the different datasets to match exactly, as long as the different profiles can be mapped to another. This makes it possible, for example, to map the same soilphys profiles to multiple distinct gasdata profiles, without having to replicate the soilphys dataset manually. We already make use of this by providing only one layers_map per site in our example that is not replicated for the different Date of the the other datasets. (III) For each profile, it separates the layers of soilphys, at each depth where there is an observation in gasdata. This does not change the calculation but is needed internally for the pro_flux() model. (IV) It adds the column pmap to the soilphys dataset that indicates which layer, as defined in cfp_layers_map() corresponds to that layer in soilphys.

At its core, a cfp_dat object is simply a list. We can access the different elements, including the generated profiles data.frame like any element in a list with the $ operator.

my_dat$profiles %>% head()
#> 
#> A cfp_profile object 
#> id_cols: prof_id 
#> 6  unique profiles 
#> 
#>     site       Date gas gd_id sp_id group_id prof_id
#> 1 site_a 2021-01-01 CO2     1     1        1       1
#> 2 site_b 2021-01-01 CO2    13    13        2       2
#> 3 site_a 2021-02-01 CO2     2     2        1       3
#> 4 site_b 2021-02-01 CO2    14    14        2       4
#> 5 site_a 2021-03-01 CO2     3     3        1       5
#> 6 site_b 2021-03-01 CO2    15    15        2       6

Within this data.frame we have identifiers for each profile of the different datasets (gd_id, sp_id, group_id) and an identifier of each profile of the combined dataset prof_id.

We now are ready model.

Soil data and profiles

What is a soil profile?

Soil data structure

cfp_gasdata(): Gas concentration data

cfp_soilphys(): Soil physical data

Create a cfp_soilphys() object

Derive parameters with complete_soilphys()

From measurements to layered data: discretize_depth()

cfp_layers_map(): Create the model layers