Lapply-loops
As an example of the apply family, we briefly introduce lapply.
Basic structure
The lapply function can be your workhorse when it comes to loops over data frames, but it can also drive you mad if you do not understand how it handles data.
Remember: the function always returns a list with one element per iteration (the value returned by the applied function), and an ordinary assignment (<-) inside the function can never change a data frame or any other variable defined outside the loop, no matter whether it has the same name or not.
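Both points can be illustrated with a minimal sketch. The names squares and counter are made up for this illustration and are not used in the rest of the tutorial:

```r
counter <- 0                      # hypothetical variable defined outside the loop

squares <- lapply(1:3, function(i) {
  counter <- counter + 1          # changes only a local copy inside the function
  i^2                             # value returned for this iteration
})

is.list(squares)   # TRUE: the result is a list
length(squares)    # 3: one element per iteration
counter            # still 0 outside the loop
```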
Let’s have a look at an example for which we will use the following data frame:
a <- c("A", "B", "C", "A", "B", "A", "A")
b <- c("X", "X", "X", "X", "Y", "Y", "Y")
c <- c(1, 2, 3, 4, 5, 6, 7)
d <- c(10, 20, 30, 40, 50, 60, 70)
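# stringsAsFactors = TRUE keeps Cat1 and Cat2 as factors (the default before R 4.0.0), which the is.factor() checks below rely on
df <- data.frame(Cat1 = a, Cat2 = b, Val1 = c, Val2 = d, stringsAsFactors = TRUE)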
We now want to build a new data frame with the same columns, but the characters of columns Cat1 and Cat2 should be converted to lower case and the values of Val1 and Val2 should be replaced by their square roots. This is the same task as used in the for loop code examples.
To solve this problem with an lapply call, we have to define an appropriate function within the () brackets of the call. Since we want to iterate over all four columns of the data frame, we define the iteration sequence as seq(ncol(df)), which in this specific case is the same as seq(4) since we have four columns:
result <- lapply(seq(ncol(df)), function(x){
  act_column <- df[,x]
  if(is.factor(act_column)){
    return(tolower(act_column))
  } else if(is.numeric(act_column)){
    return(sqrt(act_column))
  }
})
As you can see, the function we want to apply is passed within the outer () brackets and is opened by the { bracket in the first line. The actual body of the function (i.e. what controls what is done within it) starts in the second line and ends in line seven. The function is closed in line eight with the closing } bracket, and the lapply function call is closed directly afterwards with the closing ) bracket.
Let’s have a look at what is returned, i.e. what is stored in the variable called result:
## [[1]]
## [1] "a" "b" "c" "a" "b" "a" "a"
##
## [[2]]
## [1] "x" "x" "x" "x" "y" "y" "y"
##
## [[3]]
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
##
## [[4]]
## [1] 3.162278 4.472136 5.477226 6.324555 7.071068 7.745967 8.366600
str(result)
## List of 4
## $ : chr [1:7] "a" "b" "c" "a" ...
## $ : chr [1:7] "x" "x" "x" "x" ...
## $ : num [1:7] 1 1.41 1.73 2 2.24 ...
## $ : num [1:7] 3.16 4.47 5.48 6.32 7.07 ...
Obviously it is a list with four sub-elements, and each element contains a vector which holds the modified content of one of the data frame columns. Since we originally wanted a data frame, we have to convert this list to a data frame using:
result_df <- data.frame(result)
result_df
## c..a....b....c....a....b....a....a..
## 1 a
## 2 b
## 3 c
## 4 a
## 5 b
## 6 a
## 7 a
## c..x....x....x....x....y....y....y..
## 1 x
## 2 x
## 3 x
## 4 x
## 5 y
## 6 y
## 7 y
## c.1..1.4142135623731..1.73205080756888..2..2.23606797749979..
## 1 1.000000
## 2 1.414214
## 3 1.732051
## 4 2.000000
## 5 2.236068
## 6 2.449490
## 7 2.645751
## c.3.16227766016838..4.47213595499958..5.47722557505166..6.32455532033676..
## 1 3.162278
## 2 4.472136
## 3 5.477226
## 4 6.324555
## 5 7.071068
## 6 7.745967
## 7 8.366600
str(result_df)
## 'data.frame': 7 obs. of 4 variables:
## $ c..a....b....c....a....b....a....a.. : Factor w/ 3 levels "a","b","c": 1 2 3 1 2 1 1
## $ c..x....x....x....x....y....y....y.. : Factor w/ 2 levels "x","y": 1 1 1 1 2 2 2
## $ c.1..1.4142135623731..1.73205080756888..2..2.23606797749979.. : num 1 1.41 1.73 2 2.24 ...
## $ c.3.16227766016838..4.47213595499958..5.47722557505166..6.32455532033676..: num 3.16 4.47 5.48 6.32 7.07 ...
Although it looks weird, it is a data frame. Only the column names are not nice. Let us fix this:
colnames(result_df) <- colnames(df)
str(result_df)
## 'data.frame': 7 obs. of 4 variables:
## $ Cat1: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 1 1
## $ Cat2: Factor w/ 2 levels "x","y": 1 1 1 1 2 2 2
## $ Val1: num 1 1.41 1.73 2 2.24 ...
## $ Val2: num 3.16 4.47 5.48 6.32 7.07 ...
All problems solved.
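As an alternative sketch (not the approach used in this tutorial), lapply can also iterate over the columns of the data frame directly instead of over their index. Because a data frame is a list of named columns, the returned list keeps the column names and the renaming step is not needed. The names result_named and result_named_df are made up for this illustration:

```r
# Same example data as above; stringsAsFactors = TRUE is set explicitly so
# that Cat1 and Cat2 are factors regardless of the R version.
df <- data.frame(Cat1 = c("A", "B", "C", "A", "B", "A", "A"),
                 Cat2 = c("X", "X", "X", "X", "Y", "Y", "Y"),
                 Val1 = c(1, 2, 3, 4, 5, 6, 7),
                 Val2 = c(10, 20, 30, 40, 50, 60, 70),
                 stringsAsFactors = TRUE)

# Iterate over the columns themselves, not over seq(ncol(df)).
result_named <- lapply(df, function(act_column) {
  if (is.factor(act_column)) {
    tolower(act_column)
  } else if (is.numeric(act_column)) {
    sqrt(act_column)
  }
})

names(result_named)                          # "Cat1" "Cat2" "Val1" "Val2"
result_named_df <- as.data.frame(result_named)
colnames(result_named_df)                    # names are already in place
```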
Using variables defined outside the lapply loop
If one additionally wants to use other variables defined outside the loop within the lapply loop, this is no problem. Since this type of (anonymous) function definition cannot be reused outside the loop anyway, one can simply access any variable from outside the loop inside the function. Here is one example where we use a variable called var_outside:
var_outside <- 10
result <- lapply(seq(ncol(df)), function(x){
  act_column <- df[,x]
  if(is.factor(act_column)){
    return(tolower(act_column))
  } else if(is.numeric(act_column)){
    sqrt_mult <- sqrt(act_column) * var_outside
    return(sqrt_mult)
  }
})
result <- as.data.frame(result)
colnames(result) <- colnames(df)
result
## Cat1 Cat2 Val1 Val2
## 1 a x 10.00000 31.62278
## 2 b x 14.14214 44.72136
## 3 c x 17.32051 54.77226
## 4 a x 20.00000 63.24555
## 5 b y 22.36068 70.71068
## 6 a y 24.49490 77.45967
## 7 a y 26.45751 83.66600
Watch out: When using variables defined outside the loop within the loop, one special case must be kept in mind: an ordinary assignment (<-) inside the applied function only creates a local variable, so it cannot change the content of any variable defined outside the loop.
Here is one example where we store the result of the tolower function in a variable called df before returning it. Remember that df is also the name of the data frame we iterate over! We also include another variable called test, which is used as intermediate storage for the result of the sqrt computation:
test <- 1
result <- lapply(seq(ncol(df)), function(x){
  act_column <- df[,x]
  if(is.factor(act_column)){
    df <- tolower(act_column)
    return(df)
  } else if(is.numeric(act_column)){
    test <- sqrt(act_column) * var_outside
    return(test)
  }
})
result <- as.data.frame(result)
colnames(result) <- colnames(df)
result
## Cat1 Cat2 Val1 Val2
## 1 a x 10.00000 31.62278
## 2 b x 14.14214 44.72136
## 3 c x 17.32051 54.77226
## 4 a x 20.00000 63.24555
## 5 b y 22.36068 70.71068
## 6 a y 24.49490 77.45967
## 7 a y 26.45751 83.66600
The result is obviously correct. Now have a look at df and test:
test
## [1] 1
df
## Cat1 Cat2 Val1 Val2
## 1 A X 1 10
## 2 B X 2 20
## 3 C X 3 30
## 4 A X 4 40
## 5 B Y 5 50
## 6 A Y 6 60
## 7 A Y 7 70
Although df is used as the variable name for the lower case conversion and test is used for the numeric computation, the content of the variables outside the loop has not changed.
Hence, if you really want to change the content of a variable defined outside the loop, use a for loop instead of an apply function.
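For comparison, here is a minimal sketch of a for loop that does change a variable defined outside of it. The data frame df_for is made up for this illustration and only holds the two numeric columns of the example data:

```r
df_for <- data.frame(Val1 = c(1, 2, 3, 4, 5, 6, 7),
                     Val2 = c(10, 20, 30, 40, 50, 60, 70))

# A for loop has no environment of its own, so the assignment below
# really overwrites the columns of df_for.
for (x in seq(ncol(df_for))) {
  df_for[, x] <- sqrt(df_for[, x])
}

df_for$Val1[4]  # 2, i.e. sqrt(4): the data frame itself has changed
```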
Returning data frames within the lapply loop
One last but important and heavily used procedure: In the above examples, the function used within lapply returns a vector, which allows an easy conversion of the list returned by lapply into a data frame using the as.data.frame function. In many cases, your function will return a data frame rather than a vector, which makes the conversion a tiny bit more complicated.
Here is an example. In this case we combine columns Cat1 and Cat2 as well as Val1 and Val2 into new columns. It therefore makes sense to change the iteration sequence of the lapply call, too, so that it runs over the individual row numbers (please note that this example can easily be realized without any loop, so take it just as an illustration):
result <- lapply(seq(nrow(df)), function(x){
  new_structure <- data.frame(Col1 = paste(df[x,1], df[x,2]),
                              Col2 = df[x,3] * df[x,4])
  return(new_structure)
})
str(result)
## List of 7
## $ :'data.frame': 1 obs. of 2 variables:
## ..$ Col1: Factor w/ 1 level "A X": 1
## ..$ Col2: num 10
## $ :'data.frame': 1 obs. of 2 variables:
## ..$ Col1: Factor w/ 1 level "B X": 1
## ..$ Col2: num 40
## $ :'data.frame': 1 obs. of 2 variables:
## ..$ Col1: Factor w/ 1 level "C X": 1
## ..$ Col2: num 90
## $ :'data.frame': 1 obs. of 2 variables:
## ..$ Col1: Factor w/ 1 level "A X": 1
## ..$ Col2: num 160
## $ :'data.frame': 1 obs. of 2 variables:
## ..$ Col1: Factor w/ 1 level "B Y": 1
## ..$ Col2: num 250
## $ :'data.frame': 1 obs. of 2 variables:
## ..$ Col1: Factor w/ 1 level "A Y": 1
## ..$ Col2: num 360
## $ :'data.frame': 1 obs. of 2 variables:
## ..$ Col1: Factor w/ 1 level "A Y": 1
## ..$ Col2: num 490
The result is a list with a data frame within each list element. To build an overall data frame, one basically just has to stack the rows of the individual data frames below each other. This is done by the following statement:
result <- do.call("rbind", result)
str(result)
## 'data.frame': 7 obs. of 2 variables:
## $ Col1: Factor w/ 5 levels "A X","B X","C X",..: 1 2 3 1 4 5 5
## $ Col2: num 10 40 90 160 250 360 490
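As noted above, this particular task does not need a loop at all. A minimal sketch of a vectorized version, assuming the same example data (the name result_vec is made up for this illustration):

```r
df <- data.frame(Cat1 = c("A", "B", "C", "A", "B", "A", "A"),
                 Cat2 = c("X", "X", "X", "X", "Y", "Y", "Y"),
                 Val1 = c(1, 2, 3, 4, 5, 6, 7),
                 Val2 = c(10, 20, 30, 40, 50, 60, 70),
                 stringsAsFactors = TRUE)

# paste() and * work on whole columns at once, so no iteration is needed.
result_vec <- data.frame(Col1 = paste(df$Cat1, df$Cat2),
                         Col2 = df$Val1 * df$Val2)

result_vec$Col2  # 10 40 90 160 250 360 490
```

The row-wise lapply version is still a useful pattern when each iteration produces a data frame that cannot be computed column by column.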