4 Basic Data Management
“I think, therefore I R.”
- William B. King, Psychologist and R enthusiast
An important characteristic of R is its capacity to efficiently manage and analyze large, complex datasets. In this chapter I list a few functions and approaches useful for data management in base R. Data management considerations for the tidyverse are given in Chapter 5.
4.1 Operations on Arrays, Lists and Vectors
Operators can be applied individually to every row or column of an array, or to every component of a list or atomic vector, using a number of time-saving methods.
4.1.1 The apply Family of Functions
4.1.1.1 apply()
Operations can be performed quickly on rows and columns of two dimensional arrays with the function apply(). The function requires three arguments.
- The first argument, X, specifies an array to be analyzed.
- The second argument, MARGIN, indicates whether rows or columns are to be analyzed. MARGIN = 1 indicates rows, MARGIN = 2 indicates columns, whereas MARGIN = c(1, 2) indicates rows and columns.
- The third argument, FUN, defines a function to be applied to the margins of the object in the first argument.
Example 4.1
Consider the asbio::bats dataset which contains forearm length data, in millimeters, for northern myotis bats (Myotis septentrionalis), along with corresponding bat ages in days.
days forearm.length
1 1 10.5
2 1 11.0
3 1 12.3
4 1 13.7
5 1 14.2
6 1 14.8
Here we obtain minimum values for the days and forearm.length columns.
apply(bats, 2, min)
          days forearm.length
1.0 10.5
It is straightforward to change the third argument in apply() to obtain different summaries, like the mean.
apply(bats, 2, mean)
          days forearm.length
13.579 23.603
or the standard deviation
apply(bats, 2, sd)
          days forearm.length
12.4610 8.4347
Several summary statistical functions exist for numerical arrays that can sometimes be used in place of apply(). These include rowMeans() and colMeans(), which give the sample means of specified rows and columns, respectively, and rowSums() and colSums(), which give the sums of specified rows and columns, respectively. For instance:
colMeans(bats)
          days forearm.length
13.579 23.603
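The remaining shortcut functions behave analogously. A brief self-contained sketch, using a small toy matrix rather than the bats data:

```r
# Toy 2 x 3 matrix, filled column-wise: rows are c(1, 3, 5) and c(2, 4, 6)
m <- matrix(1:6, nrow = 2)
rowSums(m)   # 9 12
colSums(m)   # 3 7 11
rowMeans(m)  # 3 4
```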
\(\blacksquare\)
4.1.1.2 lapply()
The function lapply() allows one to sweep functions through list components. It has two main arguments:
- The first argument, X, specifies a list to be analyzed.
- The second argument, FUN, defines a function to be applied to each element in X.
Example 4.2
Consider the following simple list, whose three components have different lengths.
$a
[1] 1 2 3 4 5 6 7 8
$norm.obs
[1] -0.38580 2.07110 1.74799 1.30948 0.60496 0.24503 -1.20552
[8] 0.83978 0.52344 0.50883
$logic
[1] TRUE TRUE FALSE FALSE
Here we sweep the function mean() through the list:
lapply(x, mean)
$a
[1] 4.5
$norm.obs
[1] 0.62593
$logic
[1] 0.5
Note the Boolean outcomes in logic have been coerced to numeric outcomes. Specifically, TRUE = 1 and FALSE = 0. Here are the 1st, 2nd (median), and 3rd quartiles of x:
lapply(x, quantile, probs = 1:3/4)
$a
25% 50% 75%
2.75 4.50 6.25
$norm.obs
25% 50% 75%
0.31098 0.56420 1.19206
$logic
25% 50% 75%
0.0 0.5 1.0
\(\blacksquare\)
4.1.1.3 sapply()
The function sapply() is a user-friendly wrapper for lapply() that can return a vector or array instead of a list.
sapply(x, quantile, probs = 1:3/4)
        a norm.obs logic
25% 2.75 0.31098 0.0
50% 4.50 0.56420 0.5
75% 6.25 1.19206 1.0
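The contrast between the two functions can be seen with a small list defined inline (a sketch; the list y here is hypothetical):

```r
y <- list(a = 1:4, logic = c(TRUE, FALSE))
lapply(y, mean)   # returns a list: $a is 2.5, $logic is 0.5
sapply(y, mean)   # simplifies to a named vector: a = 2.5, logic = 0.5
```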
4.1.1.4 tapply()
The tapply() function allows summarization of a one-dimensional array (e.g., a column or row from a matrix) with respect to levels of a categorical variable. The function requires three arguments.
- The first argument, X, defines a one-dimensional array to be analyzed.
- The second argument, INDEX, should provide a list of one or more factors (see example below) with the same length as X.
- The third argument, FUN, is used to specify a function to be applied to X for each level in INDEX.
Example 4.3 \(\text{}\)
Consider the dataset asbio::heart, which documents pulse rates for twenty-four subjects at four time periods following administration of an experimental treatment: two active heart medications and a control. Here are average heart rates for the treatments (assuming the heart data have been loaded).
tapply(heart$rate, heart$drug, mean)
  AX23   BWW9   Ctrl
76.281 81.031 71.906
Below are the mean heart rates for the treatments, for each time frame. Note that the second argument is defined as a list with two components, each of which can be coerced to a factor.
tapply(heart$rate, list(drug = heart$drug, time = heart$time), mean)
      time
drug t1 t2 t3 t4
AX23 70.50 80.500 81.000 73.125
BWW9 81.75 84.000 78.625 79.750
Ctrl 72.75 72.375 71.500 71.000
\(\blacksquare\)
The function aggregate() can be considered a more sophisticated extension of tapply(). It allows objects under consideration to be expressed as functions of explanatory factors, and contains additional arguments for data specification and time series analyses.
Example 4.4 \(\text{}\)
Here we use aggregate() to get identical (but reformatted) results to the prior example.
aggregate(rate ~ drug + time, mean, data = heart)
   drug time   rate
1 AX23 t1 70.500
2 BWW9 t1 81.750
3 Ctrl t1 72.750
4 AX23 t2 80.500
5 BWW9 t2 84.000
6 Ctrl t2 72.375
7 AX23 t3 81.000
8 BWW9 t3 78.625
9 Ctrl t3 71.500
10 AX23 t4 73.125
11 BWW9 t4 79.750
12 Ctrl t4 71.000
Importantly, the first argument, rate ~ drug + time, is an object of class formula:
f.rate <- rate ~ drug + time
class(f.rate)
[1] "formula"
The tilde operator, ~, allows expression of the formulaic framework: y ~ model, where y is a response variable and model specifies a system of (generally) one or more predictor variables.
Objects of class formula have base type language:
typeof(f.rate)
[1] "language"
The language base type is used for unevaluated expressions other than constants and names. Examples include formulae and local function assignments.
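A minimal self-contained illustration (the variables in the formula need not exist, since nothing is evaluated):

```r
f <- y ~ x1 + x2   # a formula; y, x1, and x2 are never evaluated
class(f)           # "formula"
typeof(f)          # "language"
```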
\(\blacksquare\)
4.1.2 outer()
Another important function for matrix operations is outer(). This function builds an array by applying a user-specified function to every possible pairing of elements from two atomic vectors or arrays. The outer() function has three required arguments.
- The first two arguments, X and Y, define arrays or atomic vectors. X and Y can be identical if one wishes to examine pairwise operations among the elements of a single object (see example below).
- The third argument, FUN, specifies a function to be used in operations.
Example 4.5 \(\text{}\)
Suppose I wish to find the means of all possible pairs of observations from a numerical vector. I could use the following commands:
x <- c(1, 2, 3, 5, 4)
outer(x, x, FUN = "+")/2
     [,1] [,2] [,3] [,4] [,5]
[1,] 1.0 1.5 2.0 3.0 2.5
[2,] 1.5 2.0 2.5 3.5 3.0
[3,] 2.0 2.5 3.0 4.0 3.5
[4,] 3.0 3.5 4.0 5.0 4.5
[5,] 2.5 3.0 3.5 4.5 4.0
The argument FUN = "+" indicates that we wish to add elements to each other. We divide these sums by two to obtain means. Note that the diagonal of the output matrix contains the original elements of x, because the mean of a number and itself is the original number. The upper and lower triangles are identical because the mean of elements a and b will be the same as the mean of the elements b and a. Note that the matrix outer product of two vectors x and y can be obtained using outer(x, y, "*") or simply outer(x, y) (Section 3.1.2.1).
outer(x, x, "*")
     [,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 5 4
[2,] 2 4 6 10 8
[3,] 3 6 9 15 12
[4,] 5 10 15 25 20
[5,] 4 8 12 20 16
x %o% x
     [,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 5 4
[2,] 2 4 6 10 8
[3,] 3 6 9 15 12
[4,] 5 10 15 25 20
[5,] 4 8 12 20 16
\(\blacksquare\)
4.1.3 stack(), unstack() and reshape()
When manipulating lists and dataframes it is often useful to move between so-called “long” and “wide” data table formats. These operations can be handled with the functions stack() and unstack(). Specifically, stack() concatenates multiple vectors into a single vector along with a factor indicating where each observation originated, whereas unstack() reverses this process.
Example 4.6 \(\text{}\)
Consider the 4 x 4 dataframe below.
dataf <- data.frame(matrix(nrow = 4, data = rnorm(16)))
names(dataf) <- c("col1", "col2", "col3", "col4")
dataf
      col1     col2      col3     col4
1 -0.08447 0.71142 -1.768630 1.27205
2 -1.04033 0.75820 0.082965 -0.34800
3 1.49289 -0.37739 -0.614255 0.62426
4 0.80935 -1.15284 1.032906 1.04655
Here I stack dataf into a long table format.
sdataf <- stack(dataf)
sdataf
      values  ind
1 -0.084470 col1
2 -1.040328 col1
3 1.492894 col1
4 0.809353 col1
5 0.711422 col2
6 0.758196 col2
7 -0.377393 col2
8 -1.152839 col2
9 -1.768630 col3
10 0.082965 col3
11 -0.614255 col3
12 1.032906 col3
13 1.272046 col4
14 -0.348002 col4
15 0.624257 col4
16 1.046555 col4
Here I unstack sdataf.
unstack(sdataf)
      col1     col2      col3     col4
1 -0.08447 0.71142 -1.768630 1.27205
2 -1.04033 0.75820 0.082965 -0.34800
3 1.49289 -0.37739 -0.614255 0.62426
4 0.80935 -1.15284 1.032906 1.04655
The function reshape() can handle both stacking and unstacking operations. Here I stack dataf. The arguments timevar, idvar, and v.names are used to provide recognizable identifiers for the columns in the wide table format, observations within those columns, and responses for those combinations.
reshape(dataf, direction = "long",
varying = list(names(dataf)),
timevar = "Column",
idvar = "Column obs.",
        v.names = "Response")
    Column  Response Column obs.
1.1 1 -0.084470 1
2.1 1 -1.040328 2
3.1 1 1.492894 3
4.1 1 0.809353 4
1.2 2 0.711422 1
2.2 2 0.758196 2
3.2 2 -0.377393 3
4.2 2 -1.152839 4
1.3 3 -1.768630 1
2.3 3 0.082965 2
3.3 3 -0.614255 3
4.3 3 1.032906 4
1.4 4 1.272046 1
2.4 4 -0.348002 2
3.4 4 0.624257 3
4.4 4 1.046555 4
\(\blacksquare\)
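Although only stacking is shown above, reshape() can also widen a long table with direction = "wide". A minimal sketch, using a hypothetical long-format dataframe:

```r
long <- data.frame(id   = rep(1:3, each = 2),
                   time = rep(c("t1", "t2"), times = 3),
                   resp = c(5, 7, 4, 6, 8, 9))
# One row per id; resp values are spread into columns resp.t1 and resp.t2
reshape(long, direction = "wide", idvar = "id", timevar = "time")
```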
4.2 Other Simple Data Management Functions
4.2.1 replace()
One can use the function replace() to replace elements in an atomic vector based, potentially, on Boolean logic. The function requires three arguments.
- The first argument, x, specifies the vector to be analyzed.
- The second argument, list, indicates which elements are to be replaced. A logical expression can be used here as a replacement index.
- The third argument, values, defines the replacement value(s).
Example 4.7 \(\text{}\)
For instance:
Age <- c(21, 19, 25, 26, 18, 19)
replace(Age, Age < 25, "R is Cool")
[1] "R is Cool" "R is Cool" "25"        "26"        "R is Cool" "R is Cool"
Of course, one can also use square brackets for this operation.
Age[Age < 25] <- "R is Cool"
Age
[1] "R is Cool" "R is Cool" "25"        "26"        "R is Cool" "R is Cool"
\(\blacksquare\)
4.2.2 which()
The function which() can be used with logical expressions to obtain the address indices of elements in a data storage object.
Example 4.8
For instance:
Age <- c(21, 19, 25, 26, 18, 19)
w <- which(Age < 22)
w
[1] 1 2 5 6
Elements one, two, five, and six meet this criterion. We can now subset based on the index w.
Age[w]
[1] 21 19 18 19
To find which element in Age is closest to 24 I could do something like:
which.min(abs(Age - 24))
[1] 3
\(\blacksquare\)
4.2.3 sort()
By default, the function sort() arranges data from an atomic vector in an ascending alphanumeric order.
sort(Age)
[1] 18 19 19 21 25 26
Data can be sorted in a descending order by specifying decreasing = TRUE.
sort(Age, decreasing = T)
[1] 26 25 21 19 19 18
4.2.4 rank()
The function rank() gives the ascending alphanumeric ranks of elements in a vector. Ties are given the average of their ranks. This operation is important in rank-based permutation analyses.
rank(Age)
[1] 4.0 2.5 5.0 6.0 1.0 2.5
The second and last observations are tied as the second-smallest values in Age. Thus, they share the average of ranks 2 and 3: 2.5.
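The tie-handling can be seen in miniature, and modified via rank()'s ties.method argument:

```r
rank(c(10, 20, 10, 30))                       # 1.5 3.0 1.5 4.0 (ties averaged)
rank(c(10, 20, 10, 30), ties.method = "min")  # 1 3 1 4 (ties get the minimum rank)
```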
4.2.5 order()
The function order() is similar to which() in that it returns element indices; specifically, it gives the indices that would arrange its argument(s) in an ascending (or descending) alphanumeric order. This allows one to sort a vector, matrix, or dataframe based on one or several ordering vectors.
Example 4.9 \(\text{}\)
Consider the dataframe below, which lists plant percent cover data for four plant species at three sites. In accordance with the field.data example from Ch 3, plant species are identified with four-letter codes, corresponding to the first two letters of each taxon's genus and species names.
field.data <- data.frame(code = c("ACMI", "ELSC", "CAEL", "TACE"),
site1 = c(12, 13, 14, 11),
site2 = c(0, 20, 4, 5),
site3 = c(20, 10, 30, 0))
field.data
  code site1 site2 site3
1 ACMI 12 0 20
2 ELSC 13 20 10
3 CAEL 14 4 30
4 TACE 11 5 0
Assume that we wish to sort the data with respect to an alphanumeric ordering of species codes. Here we obtain the ordering of the codes
o <- order(field.data$code)
o
[1] 1 3 2 4
Now we can sort the rows of field.data based on this ordering.
field.data[o,]
  code site1 site2 site3
1 ACMI 12 0 20
3 CAEL 14 4 30
2 ELSC 13 20 10
4 TACE 11 5 0
\(\blacksquare\)
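order() also accepts several ordering vectors; later vectors break ties in earlier ones, and a numeric key can be negated to sort it in descending order. A brief sketch with the same dataframe:

```r
field.data <- data.frame(code = c("ACMI", "ELSC", "CAEL", "TACE"),
                         site1 = c(12, 13, 14, 11),
                         site2 = c(0, 20, 4, 5),
                         site3 = c(20, 10, 30, 0))
# Sort by descending site1 cover; ties (none here) are broken alphabetically by code
field.data[order(-field.data$site1, field.data$code), ]
```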
4.2.6 unique()
To identify unique values in a dataset we can use the function unique().
Example 4.10
Below is an atomic vector listing species from a bird survey on islands in southeast Alaska. Species codes follow the same coding method used in Example 4.9. Note that there are a large number of repeats.
AK.bird <- c("GLGU", "MEGU", "DOCO", "PAJA", "COLO", "BUFF", "COGO",
"WHSC", "TUSW", "GRSC", "GRTE", "REME", "BLOY", "REPH",
"SEPL", "LESA", "ROSA", "WESA", "WISN", "BAEA", "SHOW",
"GLGU", "MEGU", "PAJA", "DOCO", "GRSC", "GRTE", "BUFF",
"MADU", "TUSW", "REME", "SEPL", "REPH", "ROSA", "LESA",
"COSN", "BAEA", "ROHA")
length(AK.bird)
[1] 38
Applying unique() we obtain a listing of the 24 unique bird species.
unique(AK.bird)
 [1] "GLGU" "MEGU" "DOCO" "PAJA" "COLO" "BUFF" "COGO" "WHSC" "TUSW" "GRSC"
[11] "GRTE" "REME" "BLOY" "REPH" "SEPL" "LESA" "ROSA" "WESA" "WISN" "BAEA"
[21] "SHOW" "MADU" "COSN" "ROHA"
\(\blacksquare\)
4.2.7 match()
Given two vectors, the function match() returns, for each element of its first argument, the position at which that element first occurs in its second argument, or NA if it does not occur. For instance, for two vectors y and x, m <- match(y, x) gives:
[1] 5 NA 3 4 2 1
The number 2 (the 1st element of y) is the 5th element of x; thus the number 5 is put 1st in the vector m created by match(). The number 1 (the 2nd element of y) does not occur in x, and so is matched with NA. The number 4 is the 3rd element of both y and x; thus, the number 3 is given as the third element of m, and so on.
Example 4.11 \(\text{}\)
The usefulness of match() may seem unclear at first, but consider a scenario in which I want to convert species code identifiers in field data into actual species names. The following dataframe is a species list that matches four letter species codes to scientific names. Note that the list contains more species than the field.data dataset used in Example 4.9.
species.list <- data.frame(code = c("ACMI", "ASFO", "ELSC", "ERRY", "CAEL",
"CAPA", "TACE"), names = c("Achillea millefolium", "Aster foliaceus",
"Elymus scribneri", "Erigeron rydbergii",
"Carex elynoides", "Carex paysonis",
"Taraxacum ceratophorum"))
species.list
  code                  names
1 ACMI Achillea millefolium
2 ASFO Aster foliaceus
3 ELSC Elymus scribneri
4 ERRY Erigeron rydbergii
5 CAEL Carex elynoides
6 CAPA Carex paysonis
7 TACE Taraxacum ceratophorum
Here I add a column in the field.data containing the actual species names using match().
m <- match(field.data$code, species.list$code)
field.data.new <- field.data # make a copy of field data
field.data.new$species.name <- species.list$names[m]
field.data.new
  code site1 site2 site3           species.name
1 ACMI 12 0 20 Achillea millefolium
2 ELSC 13 20 10 Elymus scribneri
3 CAEL 14 4 30 Carex elynoides
4 TACE 11 5 0 Taraxacum ceratophorum
\(\blacksquare\)
4.2.8 which() and %in%
We can use the operator %in% in conjunction with the function which() to achieve, in this case, the same results as match().
m <- which(species.list$code %in% field.data$code)
field.data.new$species.name <- species.list$names[m]
field.data.new
  code site1 site2 site3           species.name
1 ACMI 12 0 20 Achillea millefolium
2 ELSC 13 20 10 Elymus scribneri
3 CAEL 14 4 30 Carex elynoides
4 TACE 11 5 0 Taraxacum ceratophorum
Note that the arrangement of arguments is reversed in match() and which(). In the former we have: match(field.data$code, species.list$code). In the latter we have: which(species.list$code %in% field.data$code).
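A caution: the two approaches agree above only because the codes in field.data happen to occur in the same relative order as in species.list. In general, which(... %in% ...) returns sorted positions, while match() preserves the order of its first argument:

```r
codes  <- c("B", "C", "A")   # desired row order
lookup <- c("A", "B", "C")   # reference table order
match(codes, lookup)         # 2 3 1 -- follows the order of codes
which(lookup %in% codes)     # 1 2 3 -- always in lookup order
```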
4.3 Matching, Querying and Substituting in Strings
R contains a number of useful methods for handling character string data. Strings will have class and base type character.
4.3.1 strtrim() and substr()
The functions strtrim() and substr() are useful for extracting subsets from strings or string vectors.
Example 4.12 \(\text{}\)
For the taxonomic codes in the character vector below, the first capital letter indicates whether a species is a flowering plant (anthophyte) or moss (bryophyte) while the last four letters give species codes (see Example 4.9).
plant <- c("A_CAAT", "B_CASP", "A_SARI")
Assume that I want to distinguish anthophytes from bryophytes by extracting the first letter. This can be done by specifying 1 in the second strtrim() argument, width.
phylum <- strtrim(plant, 1)
phylum
[1] "A" "B" "A"
plant[phylum == "A"]
[1] "A_CAAT" "A_SARI"
The function substr() is useful when one wishes to specify the start and end positions of the substring to be extracted. Here I extract string characters 3-4 (the first two letters of the genus name).
substr(plant, 3, 4)
[1] "CA" "CA" "SA"
\(\blacksquare\)
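substr() also has an assignment form that replaces characters in place; a short sketch:

```r
plant <- c("A_CAAT", "B_CASP", "A_SARI")
substr(plant, 1, 1) <- "X"   # overwrite the first character of each string
plant                        # "X_CAAT" "X_CASP" "X_SARI"
```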
4.3.2 strsplit()
The function strsplit() splits a character string into substrings based on user-defined criteria. It contains two important arguments.
- The first argument, x, specifies the character string to be analyzed.
- The second argument, split, is a character criterion that is used for splitting.
Example 4.13 \(\text{}\)
Below I split the character string ACMI in two, based on the space between the words Achillea and millefolium.
ACMI <- "Achillea millefolium"
strsplit(ACMI, " ")[[1]]
[1] "Achillea" "millefolium"
Note that the result is a list. To get back to a vector (now with two components), I can use the function unlist().
unlist(strsplit(ACMI, " "))
[1] "Achillea"    "millefolium"
Here I split based on the letter "l".
strsplit(ACMI, "l")[[1]]
[1] "Achi" "" "ea mi" "" "efo" "ium"
Interestingly, letting the split criterion equal NULL results in the string being split into its individual characters.
strsplit(ACMI, NULL)[[1]]
[1] "A" "c" "h" "i" "l" "l" "e" "a" " " "m" "i" "l" "l" "e" "f" "o" "l"
[18] "i" "u" "m"
We can use this outcome to reverse the order of characters in a string.
paste(rev(strsplit(ACMI, NULL)[[1]]), collapse = "")
[1] "muilofellim aellihcA"
The function rev() provides a reversed version of its first argument, in this case a result from strsplit(). The function paste() can be used to paste together character strings.
\(\blacksquare\)
Criteria for querying strings can include multiple characters in a particular order, and a particular case:
x <- "R is free software and comes with ABSOLUTELY NO WARRANTY"
strsplit(x, "so")[[1]]
[1] "R is free "
[2] "ftware and comes with ABSOLUTELY NO WARRANTY"
Note that the "SO" in "ABSOLUTELY" is ignored because it is upper case.
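If a case-insensitive split is wanted, one option is the Perl-style inline modifier (?i) together with perl = TRUE (a sketch, anticipating the regular expressions introduced below):

```r
x <- "R is free software and comes with ABSOLUTELY NO WARRANTY"
# Now both "so" and the "SO" in "ABSOLUTELY" act as split points
strsplit(x, "(?i)so", perl = TRUE)[[1]]
```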
4.3.3 Regular Expressions
A number of R functions for managing character strings, including grep(), grepl(), gregexpr(), gsub(), and strsplit(), can incorporate regular expressions. In computer programming, a regular expression (often abbreviated as regex) is a sequence of characters that allows pattern matching in text. Regular expressions have developed within a number of programming frameworks, including the POSIX standard (the Portable Operating System Interface standard), developed by the IEEE, and particularly the language Perl. Regular expressions in R include extended regular expressions (the default for most pattern matching and replacement R functions) and Perl-like regular expressions. Useful regex command guidance for these frameworks can be found in the PCRE documentation (https://www.pcre.org).
4.3.3.1 grep() and grepl()
The functions grep() and grepl() can be used to identify which elements in a character vector have a specified pattern. The functions have the same first two arguments.
- The first argument, pattern, specifies a pattern to be matched. This can be a character string, an object coercible to a character string, or a regular expression.
- The second argument, x, is a character vector where matches are sought.
Example 4.14 \(\text{}\)
The function grep() returns indices identifying which entries in a vector contain a queried pattern. In the character vector below, we see that entries five and six have the same genus, Carex.
names = c("Achillea millefolium", "Aster foliaceus",
"Elymus scribneri", "Erigeron rydbergii",
"Carex elynoides", "Carex paysonis",
"Taraxacum ceratophorum")
grep("Carex", names)
[1] 5 6
The function grepl() does the same thing with Boolean outcomes.
grepl("Carex", names)
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
Of course, we could use this information to subset names.
names[grep("Carex", names)]
[1] "Carex elynoides" "Carex paysonis"
We can also get grep to return the values directly by specifying value = TRUE.
grep("Carex", names, value = TRUE)
[1] "Carex elynoides" "Carex paysonis"
\(\blacksquare\)
4.3.3.2 gsub()
The function gsub() can be used to substitute text that has a specified pattern. Several of its arguments are identical to grep() and grepl():
- As before, the first argument,
pattern, specifies a pattern to be matched. - The second argument,
replacement, specifies a replacement for the matched pattern. - The third argument,
x, is a character vector wherein matches are sought and substitutions are made.
Example 4.15 \(\text{}\)
Here we substitute "C." for occurrences of "Carex" in names.
gsub("Carex", "C.", names)
[1] "Achillea millefolium"   "Aster foliaceus"
[3] "Elymus scribneri" "Erigeron rydbergii"
[5] "C. elynoides" "C. paysonis"
[7] "Taraxacum ceratophorum"
\(\blacksquare\)
4.3.3.3 gregexpr()
The function gregexpr() identifies the start and end of matching sections in a character vector, potentially using regular expressions.
Example 4.16 \(\text{}\)
Here we examine the first two entries in names, looking for the genus Aster.
gregexpr("Aster", names[1:2])
[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
[[2]]
[1] 1
attr(,"match.length")
[1] 5
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
The output list is cryptic and requires some explanation. The first element of each list component gives the position at which the match begins, while the match.length attribute gives the number of characters matched. For the first list component, both are -1 because "Achillea millefolium" does not contain the pattern "Aster". For the second list component, they are 1 and 5 because "Aster" begins at the first character of "Aster foliaceus" and is five characters long.
\(\blacksquare\)
4.3.3.4 Extended Regular Expressions
Default extended regular expressions in R use a POSIX framework for commands, which includes the use of particular metacharacters. These are: \, |, ( ), [ ], ^, $, ., { }, *, +, and ?. These metacharacters vary in meaning depending on whether they occur inside or outside of square brackets, [ and ]. Inside brackets, they are generally treated as part of a character set (see below). Outside of brackets, the metacharacters in the subset below have the following applications (see https://www.pcre.org/original/pcre.txt):
- ^ start of string or line.
- $ end of string or line.
- . match any character except newline.
- | start of alternative branch.
- ( ) start and end of subpattern.
- { } start and end of min/max repetition specification.
Several regular expression metacharacters can be placed at the end of a regular expression to specify types of repetition. For instance, "*" indicates that the preceding pattern should be matched zero or more times, "+" indicates that the preceding pattern should be matched one or more times, "{n}" indicates that the preceding pattern should be matched exactly n times ("Hel{2}o" matches “Hello”), and "{n,}" indicates that the preceding pattern should be matched n or more times. The ? character matches a preceding pattern 0 or 1 times.
Example 4.17 \(\text{}\)
We will use the function regmatches(), which extracts or replaces matched substrings from gregexpr() summaries, to illustrate.
string <- "%aaabaaab"
ID <- gregexpr("a{1}", string)
regmatches(string, ID)[[1]]
[1] "a" "a" "a" "a" "a" "a"
ID <- gregexpr("a?", string)
regmatches(string, ID)[[1]]
[1] "" "a" "a" "a" "" "a" "a" "a" ""
ID <- gregexpr("a{2}", string)
regmatches(string, ID)[[1]]
[1] "aa" "aa"
ID <- gregexpr("a{2,}", string)
regmatches(string, ID)[[1]]
[1] "aaa" "aaa"
\(\blacksquare\)
Example 4.18 \(\text{}\)
Metacharacters can be used together. For instance, the code below demonstrates how one might get rid of unwanted characters and delete one or more extra spaces at the ends of character strings.
string <- c("###Nothing in biology ",
"# makes sense except ",
"#",
" in the light",
"### of evolution (Dobzhansky). ")
out <- gsub(" +$", "", string) # drop extra space(s) at end of strings
out <- gsub("^#*","", out) # drop unwanted pound sign(s)
paste(out, collapse = "")
[1] "Nothing in biology makes sense except in the light of evolution (Dobzhansky)."
\(\blacksquare\)
Example 4.19 \(\text{}\)
Microbial “taxa” identifiers can include cryptic Amplicon Sequence Variant (ASV) codes, followed by a general taxonomic assignment. For example, here is an ASV identifier for a bacterium within the family Comamonadaceae.
asv <- "6abc517aa40e9e7b9c652902fe04bb1a:f__Comamonadaceae"
We can delete the ASV code, which ends in a colon, with:
gsub(".*:", "", asv)
[1] "f__Comamonadaceae"
The regex script in the first argument means: “match any character, occurring zero or more times, up to and including a :”.
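Note that quantifiers are greedy by default, so ".*:" consumes everything up to the last colon in a string; a Perl-style lazy quantifier, .*?, stops at the first instead (shown here with sub(), which replaces only the first match):

```r
gsub(".*:", "", "a:b:c")                # greedy: matches "a:b:", leaving "c"
sub(".*?:", "", "a:b:c", perl = TRUE)   # lazy: matches "a:", leaving "b:c"
```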
\(\blacksquare\)
Marking/grouping sub-expressions
When using gsub(), we can use parentheses ( and ) to mark up to nine string sub-expressions. These components can then be modified individually using numbered back-references, reflecting the order in which the sub-expressions occur in the pattern. In R, the numbering in the back-references requires double backslashes. That is, the back-reference \\1 refers to the first sub-expression.
Example 4.20 \(\text{}\)
Here we create a famous quote by repeating two defined sub-expression from a string.
gsub("(.*) (.*)", "The name is \\2, \\1 \\2.", "James Bond")
[1] "The name is Bond, James Bond."
By specifying "(.*) (.*)" in the first (pattern) argument of gsub(), the first and second sub-expressions are defined to be strings of (essentially) any characters, of any length, separated by a white space. Thus, "James" (from "James Bond") is defined as the first sub-expression, and "Bond" is defined as the second sub-expression. These are manipulated in the second (replacement) argument of gsub() using additional text and numbered back-references.
\(\blacksquare\)
Backslashes and regex
An undesirable side-effect of using regular expressions in R is the need to use double backslashes, \\, or even quadruple backslashes, \\\\, in queries. This is due to the fact that the backslash character is often used programmatically for two (opposite) purposes (Haddock and Dunn 2011).
- First, one can use a backslash to escape a character. That is, to ensure that a program sees the character literally (and not as a metacharacter with a special meaning).
Example 4.21 \(\text{}\)
For instance, to search for the presence of the actual ^ character (which is also a regex metacharacter) in the string "E = mc^2", I would have to do something like:
grepl("\\^", "E = mc^2")
[1] TRUE
The first (inner) backslash escapes the ^ character and the second (outer) backslash escapes the first backslash (which is also a regex metacharacter).
R regex queries for literal backslashes in a string require that those backslashes be doubled in the character vector argument, x, where matches are sought (to escape the escape), and quadrupled in the pattern argument:
gsub("\\\\", " ", "This\\is\\a\\backslash")
[1] "This is a backslash"
\(\blacksquare\)
Example 4.22 \(\text{}\)
As another example, Markdown delimits monospace “code” font using accent grave (backtick) metacharacters, ` `, while the LaTeX language applies this font to text between the expressions \texttt{ and }. Below I convert an R LaTeX-style character vector containing some strings to a Markdown character vector.
char.vec <- c("\texttt{+}", "addition", "$2 + 2$", "\texttt{2 + 2}")
gsub("(\texttt\\{)(.*)(\\})", "`\\2`", char.vec)
[1] "`+`"      "addition" "$2 + 2$"  "`2 + 2`"
In the code above, I separate the strings in char.vec into three potential components using the regex group-marker metacharacters ( and ). First, \texttt\\{ designates the beginning of monospace “code” font in LaTeX. Note that the curly brace metacharacter in the snippet is double escaped. Second, the text to be monospace formatted within \texttt{} is specified, flexibly, to be any character, of any length, using .*. Third, the (double escaped) closing curly brace is given as \\}. The three components of the query, referenced internally as \\1, \\2, and \\3, can be replaced individually using the second (replacement) argument of gsub(). In particular, to replace LaTeX text that has monospace code formatting with equivalent text formatting in Markdown, I specify "`\\2`". This means: take the contents matched by the second component of the first (pattern) argument of gsub() and place them between accent grave metacharacters.
Notably, when defining a regex substitution outside of R, the back-references: \\1, \\2, etc., are 1) generally stated more simply as \1, \2, etc., and 2) generally given as a single statement along with their corresponding sub-expressions.
\(\blacksquare\)
Example 4.23 \(\text{}\)
To reverse this process, i.e., to go from Markdown to LaTeX, I have:
[1] "\texttt{+}" "addition" "$2 + 2$" "\texttt{2 + 2}"
Note that the replacement string is less demanding with respect to escape characters than the pattern argument. Specifically, although double backslashes were required to escape the curly brace metacharacters, { and }, in the pattern argument of the previous example, (\texttt\\{)(.*)(\\}), they were not required to write { and } in the replacement string, \texttt{\\2}, of the current example. Inclusion of extra backslashes generally does not adversely affect regex queries. For instance, I could have used the replacement string \\\texttt\\{\\2\\} above and gotten the same result. Thus, whenever one desires that a character be viewed literally in a regex process, it is not a bad idea to escape it (double escape in R).
\(\blacksquare\)
- Second, one can occasionally use a backslash to impart a special meaning to a character.
For example, some ASCII commands can be initiated by placing a backslash in front of a character. These include \t, which denotes the horizontal tab character (ASCII code 9); \n, the new-line/line-feed character (ASCII code 10); and \e, the ASCII escape character (ASCII code 27) (see Section 12.8). These literals can be called with their single inherent backslash in base R regex procedures (see Example 4.24 below). Additionally, recall (Section 3.9) that UTF-8 characters (which include ASCII characters) can be called in R using a (single) backslash, followed by the character u (upper or lower case), and the Unicode hexadecimal number (see Example 4.25). Unicode characters and hexadecimal encoding are formally considered in Ch 12. In the context of regex, one can define wildcards (special operations that can potentially match a pattern more than once in a string) using a preceding backslash. For example, the regex wildcard operation \d+ would match occurrences of one or more digits in a string (see Section 4.3.3.5 below). In base R regex calls, however, two preceding backslashes are required for wildcards. That is, one would specify \\d+ instead of \d+.
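As a brief sketch of the doubled-backslash wildcard form in base R:

```r
gsub("\\d+", "#", "room 101, floor 25")   # "room #, floor #"
grepl("\\d", c("abc", "a1c"))             # FALSE TRUE
```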
Example 4.24 \(\text{}\)
Here I query tabs in a string.
gsub("\t", "This is a tab. ", "\t\t\t")
[1] "This is a tab. This is a tab. This is a tab. "
\(\blacksquare\)
Example 4.25 \(\text{}\)
Here I print the unicode character for \(\mu\):
cat("The Unicode cipher for the Greek letter \u00B5 is \\u00B5.")
The Unicode cipher for the Greek letter µ is \u00B5.
Note that I doubled the backslashes to print a single backslash using the base “concatenate and print” function cat().
\(\blacksquare\)
Use of sequential backslashes for defining wildcards (and escaping regex metacharacters) will generally be unnecessary when using regular expressions outside of R, for instance, when specifying regex wildcards from a system shell (Section 9.2).
Character set
A regular expression character set is a collection of characters, specifying some query or pattern, placed between square bracket metacharacters, [ and ]. A character set matches any single character from the set in the specified text. The match is negated if the first character inside the brackets is the regular expression caret metacharacter, ^. For example, the expression "[0-9]" matches any single numeric character in a string (the regular expression metacharacter - can be used inside a set to specify a range), whereas "[^abc]" matches anything except the characters "a", "b", or "c".
Example 4.26 \(\text{}\)
Here I ask for a string split based on the appearance of ? (which is a regex metacharacter) and % (which is not), using a character set.
string <- "m?2%b"
strsplit(string, "[\\?%]")[[1]]
[1] "m" "2" "b"
\(\blacksquare\)
Example 4.27 \(\text{}\)
Consider the following examples:
string <- "a1c&m2%b"
strsplit(string, "[0-9]") [[1]]
[1] "a" "c&m" "%b"
strsplit(string, "[^abc]") [[1]]
[1] "a" "c" "" "" "" "b"
\(\blacksquare\)
Example 4.28 \(\text{}\)
This regular expression will match most email addresses:
pattern <- "[-a-z0-9_.%]+\\@[-a-z0-9_.%]+\\.[a-z]+"

The expression literally reads: “1) find one or more occurrences of dashes, characters in a-z, digits in 0-9, underscores, periods, or percent signs, followed by 2) the at sign, @ (matched literally), followed by 3) one or more occurrences of characters from the same set as in 1), followed by 4) a literal period, followed by 5) one or more occurrences of the letters a-z.” Upper case letters are also matched below because we specify ignore.case = TRUE. Here is a string we wish to query:
string <- c("abc_noboby@isu.edu",
"text with no email",
"me@mything.com",
"also",
"you@yourspace.com",
"@you"
)

We confirm that elements 1, 3, and 5 from string are email addresses.
grep(pattern, string, ignore.case = TRUE, value = TRUE)
[1] "abc_noboby@isu.edu" "me@mything.com"     "you@yourspace.com"
\(\blacksquare\)
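As a side note, one can also extract the matched email text itself, rather than the whole matching element, by combining regexpr() with regmatches(). This is my own sketch (the example strings are assumptions, not from the original):

``` r
# regexpr() locates the first match in each element; regmatches() then
# pulls out the matched substrings, silently dropping non-matching elements.
pattern <- "[-a-z0-9_.%]+\\@[-a-z0-9_.%]+\\.[a-z]+"
string  <- c("contact me@mything.com today", "no email here")
m <- regexpr(pattern, string, ignore.case = TRUE)
regmatches(string, m)
# [1] "me@mything.com"
```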
Certain character sets are predefined. These character classes have names that are bounded by two square brackets and colons, and include "[[:lower:]]" and "[[:upper:]]" which identify lower and upper case letters, "[[:punct:]]" which identifies punctuation, [[:alnum:]], which identifies all alphanumeric characters, and "[[:space:]]", which identifies space characters, e.g., tab and newline. Outside of R, one can generally call regex character sets and character classes using the simpler format: [pattern] and [:pattern:], respectively.
grepl("[[:lower:]]", string)
[1] TRUE TRUE FALSE FALSE FALSE
grepl("[[:upper:]]", string)
[1] TRUE FALSE FALSE FALSE FALSE
grepl("[[:punct:]]", string)
[1] FALSE FALSE TRUE TRUE FALSE
grepl("[[:space:]]", string) # item five is a newline
[1] FALSE FALSE FALSE FALSE TRUE
Here I ask R to return elements from string that are three or more characters long.
grep("[[:alnum:]]{3}", string, value = TRUE)
[1] "M2Ab" "def"
Turning off regular expressions
For some pattern matching and replacement jobs it may be best to turn off the default extended regular expressions and use exact matching by specifying fixed = TRUE. For example, R may place periods in the place of spaces in character strings and in the column names of dataframes and arrays.
Example 4.29 \(\text{}\)
Consider the following example:
countries <- c("United.States", "United.Arab.Emirates", "China", "Germany")
gsub(".", " ", countries)
[1] "             "        "                    " "     "
[4] "       "
Note that using gsub(".", " ", countries) results in the replacement of all text with spaces because the period regex metacharacter matches any single character. To get the desired result we could use:
gsub(".", " ", countries, fixed = TRUE)
[1] "United States"        "United Arab Emirates" "China"
[4] "Germany"
Of course we could also (double) escape the period.
gsub("\\.", " ", countries)
[1] "United States"        "United Arab Emirates" "China"
[4] "Germany"
\(\blacksquare\)
4.3.3.5 Perl-like Regular Expressions
The R character string functions grep(), grepl(), regexpr(), gregexpr(), sub(), gsub(), and strsplit() allow Perl-like regular expression pattern matching. This is done by specifying perl = TRUE, which switches regular expression handling to the PCRE (Perl Compatible Regular Expressions) library. Perl allows handling of the POSIX predefined character classes, e.g., "[[:lower:]]", along with a wide variety of other calls which are generally implemented using (double) backslashes, which, when combined with certain characters, create a wildcard command. Here are some examples.
- `\\d` any digit character (equivalent to `[[:digit:]]`).
- `\\D` any character that is not a digit (equivalent to `[^[:digit:]]`).
- `\\h` any horizontal white space character (e.g., tab, space).
- `\\H` any character that is not a horizontal white space character.
- `\\s` any white space character (equivalent to `[[:space:]]`).
- `\\S` any character that is not a white space character (equivalent to `[^[:space:]]`).
- `\\v` any vertical white space character (e.g., newline).
- `\\V` any character that is not a vertical white space character.
- `\\w` any word character (i.e., a-z, A-Z, 0-9, and _). Equivalent to `[[:alnum:]_]`.
- `\\W` any non-word character (equivalent to `[^[:alnum:]_]`).
- `\\b` a word boundary.
- `\\U` upper case character (dependent on context).
- `\\L` lower case character (dependent on context).
Note that reversals in meaning occur for capitalized and uncapitalized commands. Many other Perl-like modifications can be made to R regex functions (see ?regex).
Example 4.30 \(\text{}\)
Here we identify string entries containing numbers.
string <- c("Acidobacteria", "Actinobacteria", "TM7.1", "Gitt-GS-136",
"Chloroflexia", "Bacili")
grep("\\d", string, perl = TRUE)
[1] 3 4
And those containing non-numeric characters (i.e., all of the entries).
grep("\\D", string, perl = TRUE)
[1] 1 2 3 4 5 6
To subset non-numeric entries, one could do something like:
string[-grep("\\d", string, perl = TRUE)]
[1] "Acidobacteria"  "Actinobacteria" "Chloroflexia"   "Bacili"
\(\blacksquare\)
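One caution worth adding (my note, not part of the original example): negative indexing with grep() fails silently when there are no matches, because x[-integer(0)] returns an empty vector rather than the whole vector. Logical negation of grepl() avoids this edge case:

``` r
# Safer subsetting of non-matching elements with a logical index:
# !grepl(...) returns TRUE for every element when nothing matches,
# whereas -grep(...) would drop everything in that case.
string <- c("Acidobacteria", "Actinobacteria", "TM7.1", "Gitt-GS-136",
            "Chloroflexia", "Bacili")
string[!grepl("\\d", string, perl = TRUE)]
# [1] "Acidobacteria"  "Actinobacteria" "Chloroflexia"   "Bacili"
```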
Example 4.31 \(\text{}\)
As a slightly extended example we will count the number of words in the text of the GNU General Public License provided with R (obtained via RShowDoc("COPYING")). Ideas here largely follow from the function DescTools::StrCountW() (Signorell 2023).
Below I use readLines() to read in the GNU copying policy, and print the first six lines.

GNU <- readLines(RShowDoc("COPYING"))
head(GNU)
[1] "\t\t GNU GENERAL PUBLIC LICENSE"
[2] "\t\t Version 2, June 1991"
[3] ""
[4] " Copyright (C) 1989, 1991 Free Software Foundation, Inc."
[5] " 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA"
[6] " Everyone is permitted to copy and distribute verbatim copies"
To search for words in the object GNU, we will actually identify string components that are not words, identified with the Perl regex wildcard \\W. We will also identify word boundaries with the wildcard \\b. We will combine these wildcards as: \\b\\W+\\b. The call \\W+ indicates a non-word match occurring one or more times. Here we apply the regular expression to the first element (line) of GNU.
GNU[1]
[1] "\t\t    GNU GENERAL PUBLIC LICENSE"
gregexpr("\\b\\W+\\b", GNU[1], perl = TRUE)[[1]]
[1] 10 18 25
attr(,"match.length")
[1] 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
Recall that \t in the output above represents the ASCII control character for a tab.
Matches to our query occur at three locations, 10, 18, and 25 in line 1 of GNU. These separate the four words GNU GENERAL PUBLIC LICENSE. Thus, to analyze the entire document we could use:
[1] 3048
There are 3048 total words in the license description.
\(\blacksquare\)
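The per-line counting logic used above can be written out explicitly. The helper below is my own self-contained sketch (the book's exact one-line call may differ): a line's word count is its number of \\b\\W+\\b separators plus one, or zero for lines with no word characters.

``` r
# Count words per line: the number of non-word separators between word
# boundaries, plus one; lines without any word characters hold zero words.
word.count <- function(line) {
  if (!grepl("\\w", line, perl = TRUE)) return(0L)
  m <- gregexpr("\\b\\W+\\b", line, perl = TRUE)[[1]]
  if (m[1] == -1) 1L else length(m) + 1L
}
lines <- c("\t\t    GNU GENERAL PUBLIC LICENSE", "", "four score and seven")
sum(sapply(lines, word.count))
# [1] 8
```

The first line contributes four words, the empty line zero, and the last line four.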
One can identify substrings by number using a Perl-like approach.
Example 4.32 \(\text{}\)
In this example, I subdivide a string into two components, the first character, i.e., "(\\w)", and the remaining zero or more characters: "(\\w*)". These are referred to in the replacement argument of gsub() as items \\1 and \\2, respectively. Capitalization for these substrings is handled in different ways below.
string <- "achillea"
gsub("(\\w)(\\w*)", "\\U\\1\\U\\2", string, perl = TRUE) # all caps
[1] "ACHILLEA"
gsub("(\\w)(\\w*)", "\\L\\1\\U\\2", string, perl = TRUE) # lower, then upper case
[1] "aCHILLEA"
gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", string, perl = TRUE) # upper, then lower case
[1] "Achillea"
The functions tolower() and toupper() provide simpler approaches to convert letters to lower and upper case, respectively.
toupper(string)
[1] "ACHILLEA"
\(\blacksquare\)
4.4 Date-Time Classes
There are two basic R date-time classes, POSIXlt and POSIXct. Class POSIXct represents the (signed) number of seconds since the beginning of 1970 (in the UTC time zone) as a numeric vector. An object of class POSIXlt is comprised of a list of vectors with the names sec, min, hour, mday (day of month), mon (month), year, wday (day of week), and yday (day of year).
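A brief sketch contrasting the two internal representations (the date here is chosen arbitrarily for illustration):

``` r
# POSIXct stores seconds since 1970-01-01 00:00:00 UTC as a single number.
x <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
unclass(x)    # one day = 86400 seconds (printed with a tzone attribute)

# POSIXlt stores the same instant as a list of calendar components.
y <- as.POSIXlt(x, tz = "UTC")
y$mday        # the day-of-month component
# [1] 2
```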
POSIX naming conventions include:

- `%m` = Month as a decimal number (01–12).
- `%d` = Day of the month as a decimal number (01–31).
- `%Y` = Year. Designations in 0:9999 are accepted.
- `%H` = Hour as a decimal number (00–23).
- `%M` = Minute as a decimal number (00–59).
Example 4.33 \(\text{}\)
As an example, below are twenty dates and corresponding binary water presence measures (0 = water absent, 1 = water present) recorded at 2.5 hour intervals for an intermittent stream site in southwest Idaho (K. Aho, Derryberry, et al. 2023).
dates <- c("08/13/2019 04:00", "08/13/2019 06:30", "08/13/2019 09:00",
"08/13/2019 11:30", "08/13/2019 14:00", "08/13/2019 16:30",
"08/13/2019 19:00", "08/13/2019 21:30", "08/14/2019 00:00",
"08/14/2019 02:30", "08/14/2019 05:00", "08/14/2019 07:30",
"08/14/2019 10:00", "08/14/2019 12:30", "08/14/2019 15:00",
"08/14/2019 17:30", "08/14/2019 20:00", "08/14/2019 22:30",
"08/15/2019 01:00", "08/15/2019 03:30")
pres.abs <- c(1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1)

To convert the character string dates to a date-time object we can use the function strptime(). We have:

dates.ts <- strptime(dates, format = "%m/%d/%Y %H:%M")
class(dates.ts)
[1] "POSIXlt" "POSIXt"
Note that the dates can now be evaluated numerically.
dates.df <- data.frame(dates = dates.ts, pres.abs = pres.abs)
summary(dates.df)
     dates                        pres.abs
Min. :2019-08-13 04:00:00 Min. :0.00
1st Qu.:2019-08-13 15:52:30 1st Qu.:0.75
Median :2019-08-14 03:45:00 Median :1.00
Mean :2019-08-14 03:45:00 Mean :0.75
3rd Qu.:2019-08-14 15:37:30 3rd Qu.:1.00
Max. :2019-08-15 03:30:00 Max. :1.00
I can also easily extract time series components.
dates.ts$mday # day of month
 [1] 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 14 15 15
dates.ts$wday # day of week
 [1] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4
dates.ts$hour # hour
 [1] 4 6 9 11 14 16 19 21 0 2 5 7 10 12 15 17 20 22 1 3
\(\blacksquare\)
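As a brief follow-up sketch (reusing the first two observations from the example above), date-time objects also support arithmetic directly, which is convenient for checking sampling intervals:

``` r
# Differences between date-time values return a difftime object.
d <- strptime(c("08/13/2019 04:00", "08/13/2019 06:30"),
              format = "%m/%d/%Y %H:%M")
diff(d)
# Time difference of 2.5 hours
```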
Exercises

1. Using the `plant` dataset from Question 5 in the Exercises at the end of Chapter 3, perform the following operations.

    (a) Attempt to simultaneously calculate the column means for plant height and soil % N using `FUN = mean` in `apply()`. Was there an issue? Why?
    (b) Eliminate missing rows in `plant` using `na.omit()` and repeat (a). Did this change the mean for plant height? Why?
    (c) Modify the `FUN` argument in `apply()` to be: `FUN = function(x) mean(x, na.rm = TRUE)`. This will eliminate `NA`s on a column by column basis.
    (d) Compare the results in (a), (b), (c). Which is the best approach? Why?
    (e) Find the mean and variance of plant heights for each Management Type in `plant` using `tapply()`. Use the best practice approach for `FUN`, as deduced in (d).

2. For the questions below, use the list object `list.data`.

    (a) Use `sapply(list.data, FUN = length)` to get the number of components in each element of `list.data`.
    (b) Repeat (a) using `lapply()`. How is the output in (b) different from (a)?

3. A frequently used statistical application is the calculation of all possible mean differences. Assume that we have arithmetic means for the treatments `trt1`, `trt2`, `trt3`, `trt4` and `trt5`, given in the object `means` below.

    (a) Calculate all possible mean differences using `means` as the first two arguments in `outer()`, and letting `FUN = "-"`.
    (b) Extract meaningful and non-redundant differences by using `upper.tri()` or `lower.tri()` (Section 3.4.4). There should be \({5 \choose 2} = 10\) meaningful (not simply a mean subtracted from itself) and non-redundant differences.

    ``` r
    means <- c(trt1 = 20.5, trt2 = 15.3, trt3 = 22.1, trt4 = 30.4, trt5 = 28)
    ```

4. Using the `plant` dataset from Question 5 in the Exercises for Chapter 3, perform the following operations.

    (a) Use the function `replace()` to identify samples with soil N less than 13.5% by coding them as `"Npoor"`.
    (b) Use the function `which()` to identify which plant heights are greater than or equal to 33.2 dm.
    (c) Sort plant heights using the function `sort()`.
    (d) Sort the `plant` dataset with respect to ascending values of plant height using the function `order()`.

5. Using `match()` or `which()` and `%in%`, replace the code column names in the dataset `cliff.sp` from the package asbio with the correct scientific names (genus and specific epithet) from the dataframe `sp.list` below.

    ``` r
    sp.list <- data.frame(
      code = c("L_ASCA", "L_CLCI", "L_COSPP", "L_COUN", "L_DEIN", "L_LCAT",
               "L_LCST", "L_LEDI", "M_POSP", "L_STDR", "L_THSP", "L_TOCA",
               "L_XAEL", "M_AMSE", "M_CRFI", "M_DISP", "M_WECO", "P_MIGU",
               "P_POAR", "P_SAOD"),
      sci.name = c("Aspicilia caesiocineria", "Caloplaca citrina",
                   "Collema spp.", "Collema undulatum",
                   "Dermatocarpon intestiniforme", "Lecidea atrobrunnea",
                   "Lecidella stigmatea", "Lecanora dispersa",
                   "Pohlia sp.", "Staurothele drummondii",
                   "Thelidium species", "Toninia candida",
                   "Xanthoria elegans", "Amblystegium serpens",
                   "Cratoneuron filicinum", "Dicranella species",
                   "Weissia controversa", "Mimulus guttatus",
                   "Poa pattersonii", "Saxifraga odontoloma"))
    ```

6. Using the `sp.list` dataframe from the previous question, perform the following operations.

    (a) Apply `strsplit()` to the column `sp.list$sci.name` to create a two column dataframe with genus and corresponding species names.
    (b) A two character prefix in the column `sp.list$code` indicates whether a taxon is a lichen (prefix = `"L_"`), a marchantiophyte (prefix = `"M_"`), or a vascular plant (prefix = `"P_"`). Use `grep()` to identify marchantiophytes.

7. Use the string vector `string` below to answer the following questions.

    (a) Use regular expressions in the pattern argument of `gsub()` to get rid of extra spaces at the start of string elements while preserving spaces between words.
    (b) Use the predefined character class `[[:alnum:]]` and an accompanying quantifier in the pattern argument from `grep()` to count the number of words whose length is greater than or equal to four characters.

    ``` r
    string <- c(" Statistics is ", " a ", " great topic.")
    ```

8. Remove the numbers from the character vector below using `gsub()` and an appropriate Perl-like regular expression.

    ``` r
    x <- c("enzyme1", "enzyme12", "enzyme3", "tRNA1", "tRNA205",
           "mRNA6", "mRNA17", "mRNA8", "mRNA100")
    ```

9. Consider the character vector `times` below, which has the format: `day-month-year hour:minute:second`.

    (a) Convert `times` into an object of class `POSIXlt` called `time.pos` using the function `strptime()`.
    (b) Extract the day of the week from `time.pos`.
    (c) Sort `time.pos` using `sort()` to verify that `time.pos` is quantitative.

    ``` r
    times <- c("12-12-2023 12:12:20", "12-01-2021 01:12:40",
               "15-10-2021 23:10:15", "25-07-2022 13:09:45")
    ```