4 Basic Data Management
“I think, therefore I R.”
- William B. King, Psychologist and R enthusiast
An important characteristic of R is its capacity to efficiently manage and analyze large, complex datasets. In this chapter I list a few functions and approaches useful for data management in base R. Data management considerations for the tidyverse are given in Chapter 5.
4.1 Operations on Arrays, Lists and Vectors
Operators can be applied individually to every row or column of an array, or to every component of a list or atomic vector, using a number of time-saving methods.
4.1.1 The apply Family of Functions
4.1.1.1 apply()
Operations can be performed quickly on rows and columns of two dimensional arrays with the function apply(). The function requires three arguments.
- The first argument, X, specifies an array to be analyzed.
- The second argument, MARGIN, indicates whether rows or columns are to be analyzed. MARGIN = 1 indicates rows, MARGIN = 2 indicates columns, whereas MARGIN = c(1, 2) indicates rows and columns.
- The third argument, FUN, defines a function to be applied to the margins of the object in the first argument.
Example 4.1
Consider the asbio::bats dataset which contains forearm length data, in millimeters, for northern myotis bats (Myotis septentrionalis), along with corresponding bat ages in days.
days forearm.length
1 1 10.5
2 1 11.0
3 1 12.3
4 1 13.7
5 1 14.2
6 1 14.8
Here we obtain minimum values for the days and forearm.length columns.
apply(bats, 2, min)
          days forearm.length
1.0 10.5
It is straightforward to change the third argument in apply() to obtain different summaries, like the mean.
apply(bats, 2, mean)
          days forearm.length
13.579 23.603
or the standard deviation
apply(bats, 2, sd)
          days forearm.length
12.4610 8.4347
Several summary statistical functions exist for numerical arrays that can sometimes be used in place of apply(). These include rowMeans() and colMeans(), which give the sample means of specified rows and columns, respectively, and rowSums() and colSums(), which give the sums of specified rows and columns, respectively. For instance:
colMeans(bats)
          days forearm.length
13.579 23.603
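The remaining shortcut functions behave analogously. A brief self-contained sketch, using a small toy matrix rather than the bats data:

```r
# Toy 2 x 3 matrix, filled column-wise: rows are c(1, 3, 5) and c(2, 4, 6)
m <- matrix(1:6, nrow = 2)
rowSums(m)   # 9 12
colSums(m)   # 3 7 11
rowMeans(m)  # 3 4
```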
\(\blacksquare\)
4.1.1.2 lapply()
The function lapply() allows one to sweep functions through list components. It has two main arguments:
- The first argument, X, specifies a list to be analyzed.
- The second argument, FUN, defines a function to be applied to each element in X.
Example 4.2
Consider the following simple list, whose three components have different lengths.
$a
[1] 1 2 3 4 5 6 7 8
$norm.obs
[1] -0.38580 2.07110 1.74799 1.30948 0.60496 0.24503 -1.20552
[8] 0.83978 0.52344 0.50883
$logic
[1] TRUE TRUE FALSE FALSE
Here we sweep the function mean() through the list:
lapply(x, mean)
$a
[1] 4.5
$norm.obs
[1] 0.62593
$logic
[1] 0.5
Note the Boolean outcomes in logic have been coerced to numeric outcomes. Specifically, TRUE = 1 and FALSE = 0. Here are the 1st, 2nd (median), and 3rd quartiles of x:
lapply(x, quantile, probs = 1:3/4)
$a
25% 50% 75%
2.75 4.50 6.25
$norm.obs
25% 50% 75%
0.31098 0.56420 1.19206
$logic
25% 50% 75%
0.0 0.5 1.0
\(\blacksquare\)
4.1.1.3 sapply()
The function sapply() is a user-friendly wrapper for lapply() that can return a vector or array instead of a list.
sapply(x, quantile, probs = 1:3/4)
        a norm.obs logic
25% 2.75 0.31098 0.0
50% 4.50 0.56420 0.5
75% 6.25 1.19206 1.0
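The contrast between the two functions can be seen with a small list defined inline (a sketch; the list y here is hypothetical):

```r
y <- list(a = 1:4, logic = c(TRUE, FALSE))
lapply(y, mean)   # returns a list: $a is 2.5, $logic is 0.5
sapply(y, mean)   # simplifies to a named vector: a = 2.5, logic = 0.5
```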
4.1.1.4 tapply()
The tapply() function allows summarization of a one-dimensional array (e.g., a column or row from a matrix) with respect to levels of a categorical variable. The function requires three arguments.
- The first argument, X, defines a one-dimensional array to be analyzed.
- The second argument, INDEX, should provide a list of one or more factors (see example below) with the same length as X.
- The third argument, FUN, is used to specify a function to be applied to X for each level in INDEX.
Example 4.3 \(\text{}\)
Consider the dataset asbio::heart, which documents pulse rates for twenty-four subjects at four time periods following administration of an experimental treatment: two active heart medications and a control. Here are average heart rates for the treatments (assuming the heart data have been loaded).
tapply(heart$rate, heart$drug, mean)
  AX23   BWW9   Ctrl
76.281 81.031 71.906
Below are the mean heart rates for the treatments, for each time frame. Note that the second argument is defined as a list with two components, each of which can be coerced to a factor.
tapply(heart$rate, list(drug = heart$drug, time = heart$time), mean)
      time
drug t1 t2 t3 t4
AX23 70.50 80.500 81.000 73.125
BWW9 81.75 84.000 78.625 79.750
Ctrl 72.75 72.375 71.500 71.000
\(\blacksquare\)
The function aggregate() can be considered a more sophisticated extension of tapply(). It allows objects under consideration to be expressed as functions of explanatory factors, and contains additional arguments for data specification and time series analyses.
Example 4.4 \(\text{}\)
Here we use aggregate() to get identical (but reformatted) results to the prior example.
aggregate(rate ~ drug + time, mean, data = heart)
   drug time   rate
1 AX23 t1 70.500
2 BWW9 t1 81.750
3 Ctrl t1 72.750
4 AX23 t2 80.500
5 BWW9 t2 84.000
6 Ctrl t2 72.375
7 AX23 t3 81.000
8 BWW9 t3 78.625
9 Ctrl t3 71.500
10 AX23 t4 73.125
11 BWW9 t4 79.750
12 Ctrl t4 71.000
Importantly, the first argument, rate ~ drug + time, is an object of class formula:
f.rate <- rate ~ drug + time
class(f.rate)
[1] "formula"
The tilde operator, ~, allows expression of the formulaic framework: y ~ model, where y is a response variable and model specifies a system of (generally) one or more predictor variables.
Objects of class formula have base type language:
typeof(f.rate)
[1] "language"
The language base type is used for unevaluated expressions other than constants and names. Examples include formulae and local function assignments.
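A minimal self-contained illustration (the variables in the formula need not exist, since nothing is evaluated):

```r
f <- y ~ x1 + x2   # a formula; y, x1, and x2 are never evaluated
class(f)           # "formula"
typeof(f)          # "language"
```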
\(\blacksquare\)
4.1.2 outer()
Another important function for matrix operations is outer(). This function builds an array by applying a user-specified function to every possible pairing of elements from two atomic vectors or arrays. The outer() function has three required arguments.
- The first two arguments, X and Y, define arrays or atomic vectors. X and Y can be identical if one wishes to examine pairwise operations among the elements of a single object (see example below).
- The third argument, FUN, specifies a function to be used in operations.
Example 4.5 \(\text{}\)
Suppose I wish to find the means of all possible pairs of observations from a numerical vector. I could use the following commands:
x <- c(1, 2, 3, 5, 4)
outer(x, x, FUN = "+")/2
     [,1] [,2] [,3] [,4] [,5]
[1,] 1.0 1.5 2.0 3.0 2.5
[2,] 1.5 2.0 2.5 3.5 3.0
[3,] 2.0 2.5 3.0 4.0 3.5
[4,] 3.0 3.5 4.0 5.0 4.5
[5,] 2.5 3.0 3.5 4.5 4.0
The argument FUN = "+" indicates that we wish to add elements to each other. We divide these sums by two to obtain means. Note that the diagonal of the output matrix contains the original elements of x, because the mean of a number and itself is the original number. The upper and lower triangles are identical because the mean of elements a and b will be the same as the mean of the elements b and a. Note that the matrix outer product of two vectors x and y can be obtained using outer(x, y, "*") or simply outer(x, y) (Section 3.1.2.1).
outer(x, x, "*")
     [,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 5 4
[2,] 2 4 6 10 8
[3,] 3 6 9 15 12
[4,] 5 10 15 25 20
[5,] 4 8 12 20 16
x %o% x
     [,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 5 4
[2,] 2 4 6 10 8
[3,] 3 6 9 15 12
[4,] 5 10 15 25 20
[5,] 4 8 12 20 16
\(\blacksquare\)
4.1.3 stack(), unstack() and reshape()
When manipulating lists and dataframes it is often useful to move between so-called “long” and “wide” data table formats. These operations can be handled with the functions stack() and unstack(). Specifically, stack() concatenates multiple vectors into a single vector along with a factor indicating where each observation originated, whereas unstack() reverses this process.
Example 4.6 \(\text{}\)
Consider the 4 x 4 dataframe below.
dataf <- data.frame(matrix(nrow = 4, data = rnorm(16)))
names(dataf) <- c("col1", "col2", "col3", "col4")
dataf
      col1     col2      col3     col4
1 -0.08447 0.71142 -1.768630 1.27205
2 -1.04033 0.75820 0.082965 -0.34800
3 1.49289 -0.37739 -0.614255 0.62426
4 0.80935 -1.15284 1.032906 1.04655
Here I stack dataf into a long table format.
sdataf <- stack(dataf)
sdataf
      values  ind
1 -0.084470 col1
2 -1.040328 col1
3 1.492894 col1
4 0.809353 col1
5 0.711422 col2
6 0.758196 col2
7 -0.377393 col2
8 -1.152839 col2
9 -1.768630 col3
10 0.082965 col3
11 -0.614255 col3
12 1.032906 col3
13 1.272046 col4
14 -0.348002 col4
15 0.624257 col4
16 1.046555 col4
Here I unstack sdataf.
unstack(sdataf)
      col1     col2      col3     col4
1 -0.08447 0.71142 -1.768630 1.27205
2 -1.04033 0.75820 0.082965 -0.34800
3 1.49289 -0.37739 -0.614255 0.62426
4 0.80935 -1.15284 1.032906 1.04655
The function reshape() can handle both stacking and unstacking operations. Here I stack dataf. The arguments timevar, idvar, and v.names are used to provide recognizable identifiers for the columns in the wide table format, observations within those columns, and responses for those combinations.
reshape(dataf, direction = "long",
varying = list(names(dataf)),
timevar = "Column",
idvar = "Column obs.",
        v.names = "Response")
    Column  Response Column obs.
1.1 1 -0.084470 1
2.1 1 -1.040328 2
3.1 1 1.492894 3
4.1 1 0.809353 4
1.2 2 0.711422 1
2.2 2 0.758196 2
3.2 2 -0.377393 3
4.2 2 -1.152839 4
1.3 3 -1.768630 1
2.3 3 0.082965 2
3.3 3 -0.614255 3
4.3 3 1.032906 4
1.4 4 1.272046 1
2.4 4 -0.348002 2
3.4 4 0.624257 3
4.4 4 1.046555 4
\(\blacksquare\)
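Although only stacking is shown above, reshape() can also widen a long table with direction = "wide". A minimal sketch, using a hypothetical long-format dataframe:

```r
long <- data.frame(id   = rep(1:3, each = 2),
                   time = rep(c("t1", "t2"), times = 3),
                   resp = c(5, 7, 4, 6, 8, 9))
# One row per id; resp values are spread into columns resp.t1 and resp.t2
reshape(long, direction = "wide", idvar = "id", timevar = "time")
```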
4.2 Other Simple Data Management Functions
4.2.1 replace()
One can use the function replace() to replace elements in an atomic vector based, potentially, on Boolean logic. The function requires three arguments.
- The first argument, x, specifies the vector to be analyzed.
- The second argument, list, indicates which elements are to be replaced. A logical expression can be used here as a replacement index.
- The third argument, values, defines the replacement value(s).
Example 4.7 \(\text{}\)
For instance:
Age <- c(21, 19, 25, 26, 18, 19)
replace(Age, Age < 25, "R is Cool")
[1] "R is Cool" "R is Cool" "25"        "26"        "R is Cool" "R is Cool"
Of course, one can also use square brackets for this operation.
Age[Age < 25] <- "R is Cool"
Age
[1] "R is Cool" "R is Cool" "25"        "26"        "R is Cool" "R is Cool"
\(\blacksquare\)
4.2.2 which()
The function which() can be used with logical expressions to obtain the address indices of elements in a data storage object.
Example 4.8
For instance:
Age <- c(21, 19, 25, 26, 18, 19)
w <- which(Age < 22)
w
[1] 1 2 5 6
Elements one, two, five, and six meet this criterion. We can now subset based on the index w.
Age[w]
[1] 21 19 18 19
To find which element in Age is closest to 24 I could do something like:
which.min(abs(Age - 24))
[1] 3
\(\blacksquare\)
4.2.3 sort()
By default, the function sort() arranges data from an atomic vector in an ascending alphanumeric order.
sort(Age)
[1] 18 19 19 21 25 26
Data can be sorted in a descending order by specifying decreasing = TRUE.
sort(Age, decreasing = T)
[1] 26 25 21 19 19 18
4.2.4 rank()
The function rank() gives the ascending alphanumeric ranks of elements in a vector. Ties are given the average of their ranks. This operation is important in rank-based permutation analyses.
rank(Age)
[1] 4.0 2.5 5.0 6.0 1.0 2.5
The second and last observations are tied as the second-smallest values in Age. Thus, they share the average of ranks 2 and 3: 2.5.
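The tie-handling can be seen in miniature, and modified via rank()'s ties.method argument:

```r
rank(c(10, 20, 10, 30))                       # 1.5 3.0 1.5 4.0 (ties averaged)
rank(c(10, 20, 10, 30), ties.method = "min")  # 1 3 1 4 (ties get the minimum rank)
```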
4.2.5 order()
The function order() is similar to which() in that it returns element indices; specifically, it gives the indices that would arrange its argument(s) in an ascending (or descending) alphanumeric order. This allows one to sort a vector, matrix, or dataframe based on one or several ordering vectors.
Example 4.9 \(\text{}\)
Consider the dataframe below, which lists plant percent cover data for four plant species at three sites. In accordance with the field.data example from Ch 3, plant species are identified with four-letter codes, corresponding to the first two letters of each taxon's genus and species names.
field.data <- data.frame(code = c("ACMI", "ELSC", "CAEL", "TACE"),
site1 = c(12, 13, 14, 11),
site2 = c(0, 20, 4, 5),
site3 = c(20, 10, 30, 0))
field.data
  code site1 site2 site3
1 ACMI 12 0 20
2 ELSC 13 20 10
3 CAEL 14 4 30
4 TACE 11 5 0
Assume that we wish to sort the data with respect to an alphanumeric ordering of species codes. Here we obtain the ordering of the codes
o <- order(field.data$code)
o
[1] 1 3 2 4
Now we can sort the rows of field.data based on this ordering.
field.data[o,]
  code site1 site2 site3
1 ACMI 12 0 20
3 CAEL 14 4 30
2 ELSC 13 20 10
4 TACE 11 5 0
\(\blacksquare\)
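order() also accepts several ordering vectors; later vectors break ties in earlier ones, and a numeric key can be negated to sort it in descending order. A brief sketch with the same dataframe:

```r
field.data <- data.frame(code = c("ACMI", "ELSC", "CAEL", "TACE"),
                         site1 = c(12, 13, 14, 11),
                         site2 = c(0, 20, 4, 5),
                         site3 = c(20, 10, 30, 0))
# Sort by descending site1 cover; ties (none here) are broken alphabetically by code
field.data[order(-field.data$site1, field.data$code), ]
```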
4.2.6 unique()
To identify unique values in a dataset we can use the function unique().
Example 4.10
Below is an atomic vector listing species from a bird survey on islands in southeast Alaska. Species codes follow the same coding method used in Example 4.9. Note that there are a large number of repeats.
AK.bird <- c("GLGU", "MEGU", "DOCO", "PAJA", "COLO", "BUFF", "COGO",
"WHSC", "TUSW", "GRSC", "GRTE", "REME", "BLOY", "REPH",
"SEPL", "LESA", "ROSA", "WESA", "WISN", "BAEA", "SHOW",
"GLGU", "MEGU", "PAJA", "DOCO", "GRSC", "GRTE", "BUFF",
"MADU", "TUSW", "REME", "SEPL", "REPH", "ROSA", "LESA",
"COSN", "BAEA", "ROHA")
length(AK.bird)
[1] 38
Applying unique() we obtain a listing of the 24 unique bird species.
unique(AK.bird)
 [1] "GLGU" "MEGU" "DOCO" "PAJA" "COLO" "BUFF" "COGO" "WHSC" "TUSW" "GRSC"
[11] "GRTE" "REME" "BLOY" "REPH" "SEPL" "LESA" "ROSA" "WESA" "WISN" "BAEA"
[21] "SHOW" "MADU" "COSN" "ROHA"
\(\blacksquare\)
4.2.7 match()
Given two vectors, the function match() returns, for each element of its first argument, the position at which that element first occurs in its second argument, or NA if it does not occur. For instance, for two vectors y and x, m <- match(y, x) gives:
[1] 5 NA 3 4 2 1
The number 2 (the 1st element of y) is the 5th element of x; thus the number 5 is put 1st in the vector m created by match(). The number 1 (the 2nd element of y) does not occur in x, and so is matched with NA. The number 4 is the 3rd element of both y and x; thus, the number 3 is given as the third element of m, and so on.
Example 4.11 \(\text{}\)
The usefulness of match() may seem unclear at first, but consider a scenario in which I want to convert species code identifiers in field data into actual species names. The following dataframe is a species list that matches four letter species codes to scientific names. Note that the list contains more species than the field.data dataset used in Example 4.9.
species.list <- data.frame(code = c("ACMI", "ASFO", "ELSC", "ERRY", "CAEL",
"CAPA", "TACE"), names = c("Achillea millefolium", "Aster foliaceus",
"Elymus scribneri", "Erigeron rydbergii",
"Carex elynoides", "Carex paysonis",
"Taraxacum ceratophorum"))
species.list
  code                  names
1 ACMI Achillea millefolium
2 ASFO Aster foliaceus
3 ELSC Elymus scribneri
4 ERRY Erigeron rydbergii
5 CAEL Carex elynoides
6 CAPA Carex paysonis
7 TACE Taraxacum ceratophorum
Here I add a column in the field.data containing the actual species names using match().
m <- match(field.data$code, species.list$code)
field.data.new <- field.data # make a copy of field data
field.data.new$species.name <- species.list$names[m]
field.data.new
  code site1 site2 site3           species.name
1 ACMI 12 0 20 Achillea millefolium
2 ELSC 13 20 10 Elymus scribneri
3 CAEL 14 4 30 Carex elynoides
4 TACE 11 5 0 Taraxacum ceratophorum
\(\blacksquare\)
4.2.8 which() and %in%
We can use the operator %in% in conjunction with the function which() to achieve, in this case, the same results as match().
m <- which(species.list$code %in% field.data$code)
field.data.new$species.name <- species.list$names[m]
field.data.new
  code site1 site2 site3           species.name
1 ACMI 12 0 20 Achillea millefolium
2 ELSC 13 20 10 Elymus scribneri
3 CAEL 14 4 30 Carex elynoides
4 TACE 11 5 0 Taraxacum ceratophorum
Note that the arrangement of arguments is reversed in match() and which(). In the former we have: match(field.data$code, species.list$code). In the latter we have: which(species.list$code %in% field.data$code).
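A caution: the two approaches agree above only because the codes in field.data happen to occur in the same relative order as in species.list. In general, which(... %in% ...) returns sorted positions, while match() preserves the order of its first argument:

```r
codes  <- c("B", "C", "A")   # desired row order
lookup <- c("A", "B", "C")   # reference table order
match(codes, lookup)         # 2 3 1 -- follows the order of codes
which(lookup %in% codes)     # 1 2 3 -- always in lookup order
```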
4.3 Matching, Querying and Substituting in Strings
R contains a number of useful methods for handling character string data. Strings will have class and base type character.
4.3.1 strtrim() and substr()
The functions strtrim() and substr() are useful for extracting subsets from strings or string vectors.
Example 4.12 \(\text{}\)
For the taxonomic codes in the character vector below, the first capital letter indicates whether a species is a flowering plant (anthophyte) or moss (bryophyte) while the last four letters give species codes (see Example 4.9).
plant <- c("A_CAAT", "B_CASP", "A_SARI")
Assume that I want to distinguish anthophytes from bryophytes by extracting the first letter. This can be done by specifying 1 in the second strtrim() argument, width.
phylum <- strtrim(plant, 1)
phylum
[1] "A" "B" "A"
plant[phylum == "A"]
[1] "A_CAAT" "A_SARI"
The function substr() is useful when one wishes to specify the start and end positions of the substring to be extracted. Here I extract string characters 3-4 (the first two letters of the genus name).
substr(plant, 3, 4)
[1] "CA" "CA" "SA"
\(\blacksquare\)
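substr() also has an assignment form that replaces characters in place; a short sketch:

```r
plant <- c("A_CAAT", "B_CASP", "A_SARI")
substr(plant, 1, 1) <- "X"   # overwrite the first character of each string
plant                        # "X_CAAT" "X_CASP" "X_SARI"
```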
4.3.2 strsplit()
The function strsplit() splits a character string into substrings based on user-defined criteria. It contains two important arguments.
- The first argument, x, specifies the character string to be analyzed.
- The second argument, split, is a character criterion that is used for splitting.
Example 4.13 \(\text{}\)
Below I split the character string ACMI in two, based on the space between the words Achillea and millefolium.
ACMI <- "Achillea millefolium"
strsplit(ACMI, " ")[[1]]
[1] "Achillea" "millefolium"
Note that the result is a list. To get back to a vector (now with two components), I can use the function unlist().
unlist(strsplit(ACMI, " "))
[1] "Achillea"    "millefolium"
Here I split based on the letter "l".
strsplit(ACMI, "l")[[1]]
[1] "Achi" "" "ea mi" "" "efo" "ium"
Interestingly, letting the split criterion equal NULL results in the string being split into its individual characters.
strsplit(ACMI, NULL)[[1]]
[1] "A" "c" "h" "i" "l" "l" "e" "a" " " "m" "i" "l" "l" "e" "f" "o" "l"
[18] "i" "u" "m"
We can use this outcome to reverse the order of characters in a string.
paste(rev(strsplit(ACMI, NULL)[[1]]), collapse = "")
[1] "muilofellim aellihcA"
The function rev() provides a reversed version of its first argument, in this case a result from strsplit(). The function paste() can be used to paste together character strings.
\(\blacksquare\)
Criteria for querying strings can include multiple characters in a particular order, and a particular case:
x <- "R is free software and comes with ABSOLUTELY NO WARRANTY"
strsplit(x, "so")[[1]]
[1] "R is free "
[2] "ftware and comes with ABSOLUTELY NO WARRANTY"
Note that the "SO" in "ABSOLUTELY" is ignored because it is upper case.
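If a case-insensitive split is wanted, one option is the Perl-style inline modifier (?i) together with perl = TRUE (a sketch, anticipating the regular expressions introduced below):

```r
x <- "R is free software and comes with ABSOLUTELY NO WARRANTY"
# Now both "so" and the "SO" in "ABSOLUTELY" act as split points
strsplit(x, "(?i)so", perl = TRUE)[[1]]
```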
4.3.3 Regular Expressions
A number of R functions for managing character strings, including grep(), grepl(), gregexpr(), gsub(), and strsplit(), can incorporate regular expressions. In computer programming, a regular expression (often abbreviated as regex) is a sequence of characters that allows pattern matching in text. Regular expressions have developed within a number of programming frameworks, including the POSIX standard (the Portable Operating System Interface standard), developed by the IEEE, and particularly the language Perl. Regular expressions in R include extended regular expressions (the default for most pattern matching and replacement R functions) and Perl-like regular expressions. Useful regex command guidance for these frameworks can be found in the PCRE documentation (https://www.pcre.org).
4.3.3.1 grep() and grepl()
The functions grep() and grepl() can be used to identify which elements in a character vector have a specified pattern. The functions have the same first two arguments.
- The first argument, pattern, specifies a pattern to be matched. This can be a character string, an object coercible to a character string, or a regular expression.
- The second argument, x, is a character vector where matches are sought.
Example 4.14 \(\text{}\)
The function grep() returns indices identifying which entries in a vector contain a queried pattern. In the character vector below, we see that entries five and six have the same genus, Carex.
names = c("Achillea millefolium", "Aster foliaceus",
"Elymus scribneri", "Erigeron rydbergii",
"Carex elynoides", "Carex paysonis",
"Taraxacum ceratophorum")
grep("Carex", names)
[1] 5 6
The function grepl() does the same thing with Boolean outcomes.
grepl("Carex", names)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
Of course, we could use this information to subset names.
names[grep("Carex", names)]
[1] "Carex elynoides" "Carex paysonis"
We can also get grep to return the values directly by specifying value = TRUE.
grep("Carex", names, value = TRUE)
[1] "Carex elynoides" "Carex paysonis"
\(\blacksquare\)
4.3.3.2 gsub()
The function gsub() can be used to substitute text that has a specified pattern. Several of its arguments are identical to grep() and grepl():
- As before, the first argument,
pattern, specifies a pattern to be matched. - The second argument,
replacement, specifies a replacement for the matched pattern. - The third argument,
x, is a character vector wherein matches are sought and substitutions are made.
Example 4.15 \(\text{}\)
Here we substitute "C." for occurrences of "Carex" in names.
gsub("Carex", "C.", names)
[1] "Achillea millefolium"   "Aster foliaceus"
[3] "Elymus scribneri" "Erigeron rydbergii"
[5] "C. elynoides" "C. paysonis"
[7] "Taraxacum ceratophorum"
\(\blacksquare\)
4.3.3.3 gregexpr()
The function gregexpr() identifies the start and end of matching sections in a character vector, potentially using regular expressions.
Example 4.16 \(\text{}\)
Here we examine the first two entries in names, looking for the genus Aster.
gregexpr("Aster", names[1:2])
[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 1
attr(,"match.length")
[1] 5
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
The output list is cryptic and requires some explanation. The first element of each list component gives the position at which the match begins, while the match.length attribute gives the number of characters matched. For the first list component, both are -1 because "Achillea millefolium" does not contain the pattern "Aster". For the second list component, they are 1 and 5 because "Aster" begins at the first character of "Aster foliaceus" and is five characters long.
\(\blacksquare\)
4.3.3.4 Extended Regular Expressions
Default extended regular expressions in R use a POSIX framework for commands, which includes the use of particular metacharacters. These are: \, |, ( ), [ ], ^, $, ., { }, *, +, and ?. These metacharacters vary in meaning depending on whether they occur inside or outside of square brackets, [ and ]. Inside brackets, they are generally treated as part of a character set (see below). Outside of brackets, the metacharacters in the subset below have the following applications (see https://www.pcre.org/original/pcre.txt):
- ^ start of string or line.
- $ end of string or line.
- . match any character except newline.
- | start of alternative branch.
- ( ) start and end of subpattern.
- { } start and end of min/max repetition specification.
Several regular expression metacharacters can be placed at the end of a regular expression to specify types of repetition. For instance, "*" indicates that the preceding pattern should be matched zero or more times, "+" indicates that the preceding pattern should be matched one or more times, "{n}" indicates that the preceding pattern should be matched exactly n times ("Hel{2}o" matches “Hello”), and "{n,}" indicates that the preceding pattern should be matched n or more times. The ? character matches a preceding pattern 0 or 1 times.
Example 4.17 \(\text{}\)
We will use the function regmatches(), which extracts or replaces matched substrings from gregexpr() summaries, to illustrate.
string <- "%aaabaaab"
ID <- gregexpr("a{1}", string)
regmatches(string, ID)[[1]]
[1] "a" "a" "a" "a" "a" "a"
ID <- gregexpr("a?", string)
regmatches(string, ID)[[1]]
[1] "" "a" "a" "a" "" "a" "a" "a" ""
ID <- gregexpr("a{2}", string)
regmatches(string, ID)[[1]]
[1] "aa" "aa"
ID <- gregexpr("a{2,}", string)
regmatches(string, ID)[[1]]
[1] "aaa" "aaa"
\(\blacksquare\)
Example 4.18 \(\text{}\)
Metacharacters can be used together. For instance, the code below demonstrates how one might get rid of unwanted characters and delete one or more extra spaces at the ends of character strings.
string <- c("###Nothing in biology ",
"# makes sense except ",
"#",
" in the light",
"### of evolution (Dobzhansky). ")
out <- gsub(" +$", "", string) # drop extra space(s) at end of strings
out <- gsub("^#*","", out) # drop unwanted pound sign(s)
paste(out, collapse = "")
[1] "Nothing in biology makes sense except in the light of evolution (Dobzhansky)."
\(\blacksquare\)
Example 4.19 \(\text{}\)
Microbial “taxa” identifiers can include cryptic Amplicon Sequence Variant (ASV) codes, followed by a general taxonomic assignment. For example, here is an ASV identifier for a bacterium within the family Comamonadaceae.
asv <- "6abc517aa40e9e7b9c652902fe04bb1a:f__Comamonadaceae"
We can delete the ASV code, which ends in a colon, with:
gsub(".*:", "", asv)
[1] "f__Comamonadaceae"
The regex script in the first argument means: “match any character, occurring zero or more times, up to and including a :”.
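Note that quantifiers are greedy by default, so ".*:" consumes everything up to the last colon in a string; a Perl-style lazy quantifier, .*?, stops at the first instead (shown here with sub(), which replaces only the first match):

```r
gsub(".*:", "", "a:b:c")                # greedy: matches "a:b:", leaving "c"
sub(".*?:", "", "a:b:c", perl = TRUE)   # lazy: matches "a:", leaving "b:c"
```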
\(\blacksquare\)
Marking/grouping sub-expressions
When using gsub(), we can use parentheses ( and ) to mark up to nine string sub-expressions. These components can then be modified individually using numbered back-references, reflecting the order in which the sub-expressions occur in the pattern. In R, the numbering in the back-references requires double backslashes. That is, the back-reference \\1 refers to the first sub-expression.
Example 4.20 \(\text{}\)
Here we create a famous quote by repeating two defined sub-expression from a string.
gsub("(.*) (.*)", "The name is \\2, \\1 \\2.", "James Bond")
[1] "The name is Bond, James Bond."
By specifying "(.*) (.*)" in the first (pattern) argument of gsub(), the first and second sub-expressions are defined to be strings of (essentially) any characters, of any length, separated by a white space. Thus, "James" (from "James Bond") is defined as the first sub-expression, and "Bond" is defined as the second sub-expression. These are manipulated in the second (replacement) argument of gsub() using additional text and numbered back-references.
\(\blacksquare\)
Backslashes and regex
An undesirable side-effect of using regular expressions in R is the need to use double backslashes, \\, or even quadruple backslashes, \\\\, in queries. This is due to the fact that the backslash character is often used programmatically for two (opposite) purposes (Haddock and Dunn 2011).
- First, one can use a backslash to escape a character. That is, to ensure that a program sees the character literally (and not as a metacharacter with a special meaning).
Example 4.21 \(\text{}\)
For instance, to search for the presence of the actual ^ character (which is also a regex metacharacter) in the string "E = mc^2", I would have to do something like:
grepl("\\^", "E = mc^2")
[1] TRUE
The first (inner) backslash escapes the ^ character and the second (outer) backslash escapes the first backslash (which is also a regex metacharacter).
R regex queries for literal backslashes in a string require that those backslashes be doubled in the character vector argument, x, where matches are sought (to escape the escape), and quadrupled in the pattern argument:
gsub("\\\\", " ", "This\\is\\a\\backslash")
[1] "This is a backslash"
\(\blacksquare\)
Example 4.22 \(\text{}\)
As another example, Markdown delimits monospace “code” font using accent grave (backtick) metacharacters, ` `, while the LaTeX language applies this font to text between the expressions \texttt{ and }. Below I convert an R LaTeX-style character vector containing some strings to a Markdown character vector.
char.vec <- c("\texttt{+}", "addition", "$2 + 2$", "\texttt{2 + 2}")
gsub("(\texttt\\{)(.*)(\\})", "`\\2`", char.vec)
[1] "`+`"      "addition" "$2 + 2$"  "`2 + 2`"
In the code above, I separate the strings in char.vec into three potential components using the regex group-marker metacharacters ( and ). First, \texttt\\{ designates the beginning of monospace “code” font in LaTeX. Note that the curly brace metacharacter in the snippet is double escaped. Second, the text to be monospace formatted within \texttt{} is specified, flexibly, to be any character, of any length, using .*. Third, the (double escaped) closing curly brace is given as \\}. The three components of the query, referenced internally as \\1, \\2, and \\3, can be replaced individually using the second (replacement) argument of gsub(). In particular, to replace LaTeX text that has monospace code formatting with equivalent text formatting in Markdown, I specify "`\\2`". This means: take the contents matched by the second component of the first (pattern) argument of gsub() and place them between accent grave metacharacters.
Notably, when defining a regex substitution outside of R, the back-references: \\1, \\2, etc., are 1) generally stated more simply as \1, \2, etc., and 2) generally given as a single statement along with their corresponding sub-expressions.
\(\blacksquare\)
Example 4.23 \(\text{}\)
To reverse this process, i.e., to go from Markdown to LaTeX, I have:
[1] "\texttt{+}" "addition" "$2 + 2$" "\texttt{2 + 2}"
Note that the replacement string is less demanding with respect to escape characters than the pattern argument. Specifically, although double backslashes were required to escape the curly brace metacharacters, { and }, in the pattern argument of the previous example, (\texttt\\{)(.*)(\\}), they were not required to write { and } in the replacement string, \texttt{\\2}, of the current example. Inclusion of extra backslashes generally does not adversely affect regex queries. For instance, I could have used the replacement string \\\texttt\\{\\2\\} above and gotten the same result. Thus, whenever one desires that a character be viewed literally in a regex process, it is not a bad idea to escape it (double escape in R).
\(\blacksquare\)
- Second, one can occasionally use a backslash to impart a special meaning to a character.
For example, some ASCII commands can be initiated by placing a backslash in front of a character. These include \t, which denotes the horizontal tab character (ASCII code 9); \n, the new-line/line-feed character (ASCII code 10); and \e, the ASCII escape character (ASCII code 27) (see Section 12.8). These literals can be called with their single inherent backslash in base R regex procedures (see Example 4.24 below). Additionally, recall (Section 3.9) that UTF-8 characters (which include ASCII characters) can be called in R using a (single) backslash, followed by the character u (upper or lower case), and the Unicode hexadecimal number (see Example 4.25). Unicode characters and hexadecimal encoding are formally considered in Ch 12. In the context of regex, one can define wildcards (special operations that can potentially match a pattern more than once in a string) using a preceding backslash. For example, the regex wildcard operation \d+ would match occurrences of one or more digits in a string (see Section 4.3.3.5 below). In base R regex calls, however, two preceding backslashes are required for wildcards. That is, one would specify \\d+ instead of \d+.
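As a brief sketch of the doubled-backslash wildcard form in base R:

```r
gsub("\\d+", "#", "room 101, floor 25")   # "room #, floor #"
grepl("\\d", c("abc", "a1c"))             # FALSE TRUE
```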
Example 4.24 \(\text{}\)
Here I query tabs in a string.
gsub("\t", "This is a tab. ", "\t\t\t")
[1] "This is a tab. This is a tab. This is a tab. "
\(\blacksquare\)
Example 4.25 \(\text{}\)
Here I print the unicode character for \(\mu\):
cat("The Unicode cipher for the Greek letter \u00B5 is \\u00B5.")
The Unicode cipher for the Greek letter µ is \u00B5.
Note that I doubled the backslashes to print a single backslash using the base “concatenate and print” function cat().
\(\blacksquare\)
Use of sequential backslashes for defining wildcards (and escaping regex metacharacters) will generally be unnecessary when using regular expressions outside of R, for instance, when specifying regex wildcards from a system shell (Section 9.2).
Character set
A regular expression character set is a collection of characters, specifying some query or pattern, placed between square bracket metacharacters, [ and ]. A character set matches any single character from the set in the specified text. The match is negated if the first character inside the brackets is the regular expression caret metacharacter, ^. For example, the expression "[0-9]" matches any single numeric character in a string (the regular expression metacharacter - can be used inside a set to specify a range), whereas "[^abc]" matches anything except the characters "a", "b", or "c".
Example 4.26 \(\text{}\)
Here I ask for a string split based on the appearance of ? (which is a regex metacharacter) and % (which is not), using a character set.
string <- "m?2%b"
strsplit(string, "[\\?%]")[[1]]
[1] "m" "2" "b"
\(\blacksquare\)
Example 4.27 \(\text{}\)
Consider the following examples:
string <- "a1c&m2%b"
strsplit(string, "[0-9]") [[1]]
[1] "a" "c&m" "%b"
strsplit(string, "[^abc]") [[1]]
[1] "a" "c" "" "" "" "b"
\(\blacksquare\)
Example 4.28 \(\text{}\)
This regular expression will match most email addresses:
pattern <- "[-a-z0-9_.%]+\\@[-a-z0-9_.%]+\\.[a-z]+"

The expression literally reads: “1) find one or more occurrences of dashes, characters in a-z, digits in 0-9, underscores, periods, or percent signs, followed by 2) the at sign, @ (matched literally), followed by 3) one or more occurrences of characters from the same set as in 1), followed by 4) a literal period, followed by 5) one or more occurrences of the letters a-z.” Upper case letters are also matched below because we specify ignore.case = TRUE. Here is a string we wish to query:
string <- c("abc_noboby@isu.edu",
"text with no email",
"me@mything.com",
"also",
"you@yourspace.com",
"@you"
)

We confirm that elements 1, 3, and 5 from string are email addresses.
grep(pattern, string, ignore.case = TRUE, value = TRUE)
[1] "abc_noboby@isu.edu" "me@mything.com"     "you@yourspace.com"
\(\blacksquare\)
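As a side note, one can also extract the matched email text itself, rather than the whole matching element, by combining regexpr() with regmatches(). This is my own sketch (the example strings are assumptions, not from the original):

``` r
# regexpr() locates the first match in each element; regmatches() then
# pulls out the matched substrings, silently dropping non-matching elements.
pattern <- "[-a-z0-9_.%]+\\@[-a-z0-9_.%]+\\.[a-z]+"
string  <- c("contact me@mything.com today", "no email here")
m <- regexpr(pattern, string, ignore.case = TRUE)
regmatches(string, m)
# [1] "me@mything.com"
```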
Certain character sets are predefined. These character classes have names that are bounded by two square brackets and colons, and include "[[:lower:]]" and "[[:upper:]]" which identify lower and upper case letters, "[[:punct:]]" which identifies punctuation, [[:alnum:]], which identifies all alphanumeric characters, and "[[:space:]]", which identifies space characters, e.g., tab and newline. Outside of R, one can generally call regex character sets and character classes using the simpler format: [pattern] and [:pattern:], respectively.
grepl("[[:lower:]]", string)
[1] TRUE TRUE FALSE FALSE FALSE
grepl("[[:upper:]]", string)
[1] TRUE FALSE FALSE FALSE FALSE
grepl("[[:punct:]]", string)
[1] FALSE FALSE TRUE TRUE FALSE
grepl("[[:space:]]", string) # item five is a newline
[1] FALSE FALSE FALSE FALSE TRUE
Here I ask R to return elements from string that are three or more characters long.
grep("[[:alnum:]]{3}", string, value = TRUE)
[1] "M2Ab" "def"
Turning off regular expressions
For some pattern matching and replacement jobs it may be best to turn off the default extended regular expressions and use exact matching by specifying fixed = TRUE. For example, R may place periods in the place of spaces in character strings and in the column names of dataframes and arrays.
Example 4.29 \(\text{}\)
Consider the following example:
countries <- c("United.States", "United.Arab.Emirates", "China", "Germany")
gsub(".", " ", countries)
[1] "             "        "                    " "     "
[4] "       "
Note that using gsub(".", " ", countries) results in the replacement of all text with spaces because the period regex metacharacter matches any single character. To get the desired result we could use:
gsub(".", " ", countries, fixed = TRUE)
[1] "United States"        "United Arab Emirates" "China"
[4] "Germany"
Of course we could also (double) escape the period.
gsub("\\.", " ", countries)
[1] "United States"        "United Arab Emirates" "China"
[4] "Germany"
\(\blacksquare\)
4.3.3.5 Perl-like Regular Expressions
The R character string functions grep(), grepl(), regexpr(), gregexpr(), sub(), gsub(), and strsplit() allow Perl-like regular expression pattern matching. This is done by specifying perl = TRUE, which switches regular expression handling to the PCRE (Perl Compatible Regular Expressions) library. Perl allows handling of the POSIX predefined character classes, e.g., "[[:lower:]]", along with a wide variety of other calls which are generally implemented using (double) backslashes, which, when combined with certain characters, create a wildcard command. Here are some examples.
- `\\d` any digit character (equivalent to `[[:digit:]]`).
- `\\D` any character that is not a digit (equivalent to `[^[:digit:]]`).
- `\\h` any horizontal white space character (e.g., tab, space).
- `\\H` any character that is not a horizontal white space character.
- `\\s` any white space character (equivalent to `[[:space:]]`).
- `\\S` any character that is not a white space character (equivalent to `[^[:space:]]`).
- `\\v` any vertical white space character (e.g., newline).
- `\\V` any character that is not a vertical white space character.
- `\\w` any word character (i.e., a-z, A-Z, 0-9, and _). Equivalent to `[[:alnum:]_]`.
- `\\W` any non-word character (equivalent to `[^[:alnum:]_]`).
- `\\b` a word boundary.
- `\\U` upper case character (dependent on context).
- `\\L` lower case character (dependent on context).
Note that reversals in meaning occur for capitalized and uncapitalized commands. Many other Perl-like modifications can be made to R regex functions (see ?regex).
Example 4.30 \(\text{}\)
Here we identify string entries containing numbers.
string <- c("Acidobacteria", "Actinobacteria", "TM7.1", "Gitt-GS-136",
"Chloroflexia", "Bacili")
grep("\\d", string, perl = TRUE)
[1] 3 4
And those containing non-numeric characters (i.e., all of the entries).
grep("\\D", string, perl = TRUE)
[1] 1 2 3 4 5 6
To subset non-numeric entries, one could do something like:
string[-grep("\\d", string, perl = TRUE)]
[1] "Acidobacteria"  "Actinobacteria" "Chloroflexia"   "Bacili"
\(\blacksquare\)
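One caution worth adding (my note, not part of the original example): negative indexing with grep() fails silently when there are no matches, because x[-integer(0)] returns an empty vector rather than the whole vector. Logical negation of grepl() avoids this edge case:

``` r
# Safer subsetting of non-matching elements with a logical index:
# !grepl(...) returns TRUE for every element when nothing matches,
# whereas -grep(...) would drop everything in that case.
string <- c("Acidobacteria", "Actinobacteria", "TM7.1", "Gitt-GS-136",
            "Chloroflexia", "Bacili")
string[!grepl("\\d", string, perl = TRUE)]
# [1] "Acidobacteria"  "Actinobacteria" "Chloroflexia"   "Bacili"
```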
Example 4.31 \(\text{}\)
As a slightly extended example we will count the number of words in the text of the GNU General Public License provided with R (obtained via RShowDoc("COPYING")). Ideas here largely follow from the function DescTools::StrCountW() (Signorell 2023).
Below I use readLines() to read in the GNU copying policy, and print the first six lines.

GNU <- readLines(RShowDoc("COPYING"))
head(GNU)
[1] "\t\t GNU GENERAL PUBLIC LICENSE"
[2] "\t\t Version 2, June 1991"
[3] ""
[4] " Copyright (C) 1989, 1991 Free Software Foundation, Inc."
[5] " 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA"
[6] " Everyone is permitted to copy and distribute verbatim copies"
To search for words in the object GNU, we will actually identify string components that are not words, identified with the Perl regex wildcard \\W. We will also identify word boundaries with the wildcard \\b. We will combine these wildcards as: \\b\\W+\\b. The call \\W+ indicates a non-word match occurring one or more times. Here we apply the regular expression to the first element (line) of GNU.
GNU[1]
[1] "\t\t    GNU GENERAL PUBLIC LICENSE"
gregexpr("\\b\\W+\\b", GNU[1], perl = TRUE)[[1]]
[1] 10 18 25
attr(,"match.length")
[1] 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
Recall that \t in the output above represents the ASCII control character for a tab.
Matches to our query occur at three locations, 10, 18, and 25 in line 1 of GNU. These separate the four words GNU GENERAL PUBLIC LICENSE. Thus, to analyze the entire document we could use:
[1] 3048
There are 3048 total words in the license description.
\(\blacksquare\)
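The per-line counting logic used above can be written out explicitly. The helper below is my own self-contained sketch (the book's exact one-line call may differ): a line's word count is its number of \\b\\W+\\b separators plus one, or zero for lines with no word characters.

``` r
# Count words per line: the number of non-word separators between word
# boundaries, plus one; lines without any word characters hold zero words.
word.count <- function(line) {
  if (!grepl("\\w", line, perl = TRUE)) return(0L)
  m <- gregexpr("\\b\\W+\\b", line, perl = TRUE)[[1]]
  if (m[1] == -1) 1L else length(m) + 1L
}
lines <- c("\t\t    GNU GENERAL PUBLIC LICENSE", "", "four score and seven")
sum(sapply(lines, word.count))
# [1] 8
```

The first line contributes four words, the empty line zero, and the last line four.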
One can identify substrings by number using a Perl-like approach.
Example 4.32 \(\text{}\)
In this example, I subdivide a string into two components, the first character, i.e., "(\\w)", and the remaining zero or more characters: "(\\w*)". These are referred to in the replacement argument of gsub() as items \\1 and \\2, respectively. Capitalization for these substrings is handled in different ways below.
string <- "achillea"
gsub("(\\w)(\\w*)", "\\U\\1\\U\\2", string, perl = TRUE) # all caps
[1] "ACHILLEA"
gsub("(\\w)(\\w*)", "\\L\\1\\U\\2", string, perl = TRUE) # lower, then upper case
[1] "aCHILLEA"
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", string, perl = TRUE) # upper, then lower case
[1] "Achillea"
The functions tolower() and toupper() provide simpler approaches to convert letters to lower and upper case, respectively.
toupper(string)
[1] "ACHILLEA"
\(\blacksquare\)
4.4 Date-Time Classes
There are two basic R date-time classes, POSIXlt and POSIXct. Class POSIXct represents the (signed) number of seconds since the beginning of 1970 (in the UTC time zone) as a numeric vector. An object of class POSIXlt is comprised of a list of vectors with the names sec, min, hour, mday (day of month), mon (month), year, wday (day of week), and yday (day of year).
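A brief sketch contrasting the two internal representations (the date here is chosen arbitrarily for illustration):

``` r
# POSIXct stores seconds since 1970-01-01 00:00:00 UTC as a single number.
x <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
unclass(x)    # one day = 86400 seconds (printed with a tzone attribute)

# POSIXlt stores the same instant as a list of calendar components.
y <- as.POSIXlt(x, tz = "UTC")
y$mday        # the day-of-month component
# [1] 2
```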
POSIX naming conventions include:

- `%m` = Month as a decimal number (01–12).
- `%d` = Day of the month as a decimal number (01–31).
- `%Y` = Year. Designations in 0:9999 are accepted.
- `%H` = Hour as a decimal number (00–23).
- `%M` = Minute as a decimal number (00–59).
Example 4.33 \(\text{}\)
As an example, below are twenty dates and corresponding binary water presence measures (0 = water absent, 1 = water present) recorded at 2.5 hour intervals for an intermittent stream site in southwest Idaho (K. Aho, Derryberry, et al. 2023).
dates <- c("08/13/2019 04:00", "08/13/2019 06:30", "08/13/2019 09:00",
"08/13/2019 11:30", "08/13/2019 14:00", "08/13/2019 16:30",
"08/13/2019 19:00", "08/13/2019 21:30", "08/14/2019 00:00",
"08/14/2019 02:30", "08/14/2019 05:00", "08/14/2019 07:30",
"08/14/2019 10:00", "08/14/2019 12:30", "08/14/2019 15:00",
"08/14/2019 17:30", "08/14/2019 20:00", "08/14/2019 22:30",
"08/15/2019 01:00", "08/15/2019 03:30")
pres.abs <- c(1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1)

To convert the character string dates to a date-time object we can use the function strptime(). We have:

dates.ts <- strptime(dates, format = "%m/%d/%Y %H:%M")
class(dates.ts)
[1] "POSIXlt" "POSIXt"
Note that the dates can now be evaluated numerically.
dates.df <- data.frame(dates = dates.ts, pres.abs = pres.abs)
summary(dates.df)
     dates                        pres.abs
Min. :2019-08-13 04:00:00 Min. :0.00
1st Qu.:2019-08-13 15:52:30 1st Qu.:0.75
Median :2019-08-14 03:45:00 Median :1.00
Mean :2019-08-14 03:45:00 Mean :0.75
3rd Qu.:2019-08-14 15:37:30 3rd Qu.:1.00
Max. :2019-08-15 03:30:00 Max. :1.00
I can also easily extract time series components.
dates.ts$mday # day of month
 [1] 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 14 15 15
dates.ts$wday # day of week
 [1] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4
dates.ts$hour # hour
 [1] 4 6 9 11 14 16 19 21 0 2 5 7 10 12 15 17 20 22 1 3
\(\blacksquare\)
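As a brief follow-up sketch (reusing the first two observations from the example above), date-time objects also support arithmetic directly, which is convenient for checking sampling intervals:

``` r
# Differences between date-time values return a difftime object.
d <- strptime(c("08/13/2019 04:00", "08/13/2019 06:30"),
              format = "%m/%d/%Y %H:%M")
diff(d)
# Time difference of 2.5 hours
```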
Exercises

1. Using the `plant` dataset from Question 5 in the Exercises at the end of Chapter 3, perform the following operations.

    (a) Attempt to simultaneously calculate the column means for plant height and soil % N using `FUN = mean` in `apply()`. Was there an issue? Why?
    (b) Eliminate missing rows in `plant` using `na.omit()` and repeat (a). Did this change the mean for plant height? Why?
    (c) Modify the `FUN` argument in `apply()` to be: `FUN = function(x) mean(x, na.rm = TRUE)`. This will eliminate `NA`s on a column by column basis.
    (d) Compare the results in (a), (b), (c). Which is the best approach? Why?
    (e) Find the mean and variance of plant heights for each Management Type in `plant` using `tapply()`. Use the best practice approach for `FUN`, as deduced in (d).

2. For the questions below, use the list object `list.data`.

    (a) Use `sapply(list.data, FUN = length)` to get the number of components in each element of `list.data`.
    (b) Repeat (a) using `lapply()`. How is the output in (b) different from (a)?

3. A frequently used statistical application is the calculation of all possible mean differences. Assume that we have arithmetic means for the treatments `trt1`, `trt2`, `trt3`, `trt4` and `trt5`, given in the object `means` below.

    (a) Calculate all possible mean differences using `means` as the first two arguments in `outer()`, and letting `FUN = "-"`.
    (b) Extract meaningful and non-redundant differences by using `upper.tri()` or `lower.tri()` (Section 3.4.4). There should be \({5 \choose 2} = 10\) meaningful (not simply a mean subtracted from itself) and non-redundant differences.

    ``` r
    means <- c(trt1 = 20.5, trt2 = 15.3, trt3 = 22.1, trt4 = 30.4, trt5 = 28)
    ```

4. Using the `plant` dataset from Question 5 in the Exercises for Chapter 3, perform the following operations.

    (a) Use the function `replace()` to identify samples with soil N less than 13.5% by coding them as `"Npoor"`.
    (b) Use the function `which()` to identify which plant heights are greater than or equal to 33.2 dm.
    (c) Sort plant heights using the function `sort()`.
    (d) Sort the `plant` dataset with respect to ascending values of plant height using the function `order()`.

5. Using `match()` or `which()` and `%in%`, replace the code column names in the dataset `cliff.sp` from the package asbio with the correct scientific names (genus and specific epithet) from the dataframe `sp.list` below.

    ``` r
    sp.list <- data.frame(
      code = c("L_ASCA", "L_CLCI", "L_COSPP", "L_COUN", "L_DEIN", "L_LCAT",
               "L_LCST", "L_LEDI", "M_POSP", "L_STDR", "L_THSP", "L_TOCA",
               "L_XAEL", "M_AMSE", "M_CRFI", "M_DISP", "M_WECO", "P_MIGU",
               "P_POAR", "P_SAOD"),
      sci.name = c("Aspicilia caesiocineria", "Caloplaca citrina",
                   "Collema spp.", "Collema undulatum",
                   "Dermatocarpon intestiniforme", "Lecidea atrobrunnea",
                   "Lecidella stigmatea", "Lecanora dispersa",
                   "Pohlia sp.", "Staurothele drummondii",
                   "Thelidium species", "Toninia candida",
                   "Xanthoria elegans", "Amblystegium serpens",
                   "Cratoneuron filicinum", "Dicranella species",
                   "Weissia controversa", "Mimulus guttatus",
                   "Poa pattersonii", "Saxifraga odontoloma"))
    ```

6. Using the `sp.list` dataframe from the previous question, perform the following operations.

    (a) Apply `strsplit()` to the column `sp.list$sci.name` to create a two column dataframe with genus and corresponding species names.
    (b) A two character prefix in the column `sp.list$code` indicates whether a taxon is a lichen (prefix = `"L_"`), a marchantiophyte (prefix = `"M_"`), or a vascular plant (prefix = `"P_"`). Use `grep()` to identify marchantiophytes.

7. Use the string vector `string` below to answer the following questions.

    (a) Use regular expressions in the pattern argument of `gsub()` to get rid of extra spaces at the start of string elements while preserving spaces between words.
    (b) Use the predefined character class `[[:alnum:]]` and an accompanying quantifier in the pattern argument from `grep()` to count the number of words whose length is greater than or equal to four characters.

    ``` r
    string <- c(" Statistics is ", " a ", " great topic.")
    ```

8. Remove the numbers from the character vector below using `gsub()` and an appropriate Perl-like regular expression.

    ``` r
    x <- c("enzyme1", "enzyme12", "enzyme3", "tRNA1", "tRNA205",
           "mRNA6", "mRNA17", "mRNA8", "mRNA100")
    ```

9. Consider the character vector `times` below, which has the format: `day-month-year hour:minute:second`.

    (a) Convert `times` into an object of class `POSIXlt` called `time.pos` using the function `strptime()`.
    (b) Extract the day of the week from `time.pos`.
    (c) Sort `time.pos` using `sort()` to verify that `time.pos` is quantitative.

    ``` r
    times <- c("12-12-2023 12:12:20", "12-01-2021 01:12:40",
               "15-10-2021 23:10:15", "25-07-2022 13:09:45")
    ```