Example: Cleaning Columns

Cleaning data frames involves quite different aspects like splitting cell entries, converting data types or the conversion of “wide” to “long” format. In general, the aim is to come up with a data frame, that has Wickham 2014:

a separate column for each variable (which has exactly one kind of information, e.g. not major and minor subject)
a separate row for each observation
a separate data frame for each kind of observation

The following examples are based on a data set showing the percentage fraction of settlement, recreational, agricultural, and forest areas for each district in Germany. The data is provided by the Regional Database Germany and formed also the basis for the visual data exploration example.

Reading data and first look

lu <- read.table("AI001_gebiet_flaeche.txt", skip = 4, header = TRUE, 
                 sep = ";", dec = ",", encoding = "latin1")
head(lu)

##      X   X.1                                            X.2
## 1 1996    DG                                    Deutschland
## 2 1996    01                             Schleswig-Holstein
## 3 1996 01001                    Flensburg, Kreisfreie Stadt
## 4 1996 01002       Kiel, Landeshauptstadt, Kreisfreie Stadt
## 5 1996 01003           Lübeck, Hansestadt, Kreisfreie Stadt
## 6 1996 01004                   Neumünster, Kreisfreie Stadt
##   Anteil.Siedlungs..und.Verkehrsfläche.an.Gesamtfl.
## 1                                              11,8
## 2                                              10,8
## 3                                              47,3
## 4                                              52,1
## 5                                              30,2
## 6                                              46,4
##   Anteil.Erholungsfläche.an.Gesamtfläche
## 1                                    0,7
## 2                                    0,7
## 3                                    5,1
## 4                                    1,3
## 5                                    2,9
## 6                                    4,9
##   Anteil.Landwirtschaftsfläche.an.Gesamtfläche
## 1                                         54,1
## 2                                         73,0
## 3                                         25,1
## 4                                         34,7
## 5                                         39,7
## 6                                         46,9
##   Anteil.Waldfläche.an.Gesamtfläche
## 1                              29,4
## 2                               9,3
## 3                               6,0
## 4                               3,3
## 5                              12,9
## 6                               3,4

Let’s have a closer look at the structure before we go on.

str(lu)

## 'data.frame':    5250 obs. of  7 variables:
##  $ X                                                : int  1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
##  $ X.1                                              : chr  "DG" "01" "01001" "01002" ...
##  $ X.2                                              : chr  "Deutschland" "  Schleswig-Holstein" "      Flensburg, Kreisfreie Stadt" "      Kiel, Landeshauptstadt, Kreisfreie Stadt" ...
##  $ Anteil.Siedlungs..und.Verkehrsfläche.an.Gesamtfl.: chr  "11,8" "10,8" "47,3" "52,1" ...
##  $ Anteil.Erholungsfläche.an.Gesamtfläche           : chr  "0,7" "0,7" "5,1" "1,3" ...
##  $ Anteil.Landwirtschaftsfläche.an.Gesamtfläche     : chr  "54,1" "73,0" "25,1" "34,7" ...
##  $ Anteil.Waldfläche.an.Gesamtfläche                : chr  "29,4" "9,3" "6,0" "3,3" ...

As one can see,

the column names are far from optimal,
the numerical values are stored as characters, and
the “Anteil*” column names are actually values, not variable names.

Please remember: this is just one example out of an almost infinite amount of data sets. The following tasks might not be necessary at all for a specific data set. Of course, the following might also not be enough to clean a given data set.

Renaming columns

Let’s start with renaming the column names:

names(lu) <- c("Year", "ID", "Place", "Settlement", "Recreation", 
               "Agriculture", "Forest")
str(lu)

## 'data.frame':    5250 obs. of  7 variables:
##  $ Year       : int  1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
##  $ ID         : chr  "DG" "01" "01001" "01002" ...
##  $ Place      : chr  "Deutschland" "  Schleswig-Holstein" "      Flensburg, Kreisfreie Stadt" "      Kiel, Landeshauptstadt, Kreisfreie Stadt" ...
##  $ Settlement : chr  "11,8" "10,8" "47,3" "52,1" ...
##  $ Recreation : chr  "0,7" "0,7" "5,1" "1,3" ...
##  $ Agriculture: chr  "54,1" "73,0" "25,1" "34,7" ...
##  $ Forest     : chr  "29,4" "9,3" "6,0" "3,3" ...

Converting data types

for(c in colnames(lu)[4:7]){
  print(head(unique(sort(lu[, c]))))
}

## [1] "."    "10,0" "10,1" "10,2" "10,3" "10,4"
## [1] "."   "0,0" "0,1" "0,2" "0,3" "0,4"
## [1] "."    "10,0" "10,5" "11,0" "11,3" "11,6"
## [1] "."   "0,6" "0,7" "0,8" "0,9" "1,0"

We see that “.” is also present among the numerical values. We will set cells containing just one “.” to NA first and then convert the data type to numeric.

for(c in colnames(lu)[4:7]){
  lu[, c][lu[, c] == "."] <- NA  
  lu[, c] <- as.numeric(sub(",", ".", lu[, c]))
}

str(lu)

## 'data.frame':    5250 obs. of  7 variables:
##  $ Year       : int  1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
##  $ ID         : chr  "DG" "01" "01001" "01002" ...
##  $ Place      : chr  "Deutschland" "  Schleswig-Holstein" "      Flensburg, Kreisfreie Stadt" "      Kiel, Landeshauptstadt, Kreisfreie Stadt" ...
##  $ Settlement : num  11.8 10.8 47.3 52.1 30.2 46.4 10 9.4 9.9 10 ...
##  $ Recreation : num  0.7 0.7 5.1 1.3 2.9 4.9 0.5 0.5 1.1 1 ...
##  $ Agriculture: num  54.1 73 25.1 34.7 39.7 46.9 78.6 61 78.5 74.7 ...
##  $ Forest     : num  29.4 9.3 6 3.3 12.9 3.4 3.1 24.5 3.7 9 ...

Converting wide to long format

Finally, let’s remove any column name which is actually not a variable name, but a value from the header line and include it within the data frame. This is known as converting the data frame from “wide” to “long” format (you have to load the reshape2 library for this example first):

lu_long <- reshape2::melt(lu, id.vars = c("Year", "ID", "Place"))
head(lu_long)

##   Year    ID                                          Place   variable value
## 1 1996    DG                                    Deutschland Settlement  11.8
## 2 1996    01                             Schleswig-Holstein Settlement  10.8
## 3 1996 01001                    Flensburg, Kreisfreie Stadt Settlement  47.3
## 4 1996 01002       Kiel, Landeshauptstadt, Kreisfreie Stadt Settlement  52.1
## 5 1996 01003           Lübeck, Hansestadt, Kreisfreie Stadt Settlement  30.2
## 6 1996 01004                   Neumünster, Kreisfreie Stadt Settlement  46.4

The column (i.e. variable) names passed to the parameter id.vars are those columns which represent actual variables within their names. The columns Settlement, Recreation, Agriculture and Forest can be summarized to one column since they all represent land cover. Their values which were stored in each column have been transferred to a separate column (called value in this case).

Split multiple information within one column

The content of column “Place” could be regarded as both human readable information or additional machine readable data source. If we go for the latter, mixing names (e.g. Flensburg, Kiel) and types (e.g. city, county) is not a good option. It seems that the different kinds of information are comma separated, so let’s split the column entries using “,” as a separator and have a closer look at the results.

place <- strsplit(lu$Place, ",")
head(place)

## [[1]]
## [1] "Deutschland"
## 
## [[2]]
## [1] "  Schleswig-Holstein"
## 
## [[3]]
## [1] "      Flensburg"   " Kreisfreie Stadt"
## 
## [[4]]
## [1] "      Kiel"        " Landeshauptstadt" " Kreisfreie Stadt"
## 
## [[5]]
## [1] "      Lübeck"      " Hansestadt"       " Kreisfreie Stadt"
## 
## [[6]]
## [1] "      Neumünster"  " Kreisfreie Stadt"

max(sapply(place, length))

## [1] 3

As one can see, splitting by comma is a good option to separate the content. However, the original information obviously is comprised of either one, two or three (see result of the max function) information parts. Hence, an appropriate next step would be the compilation of a data frame with three columns and a fill-up strategy which makes sure that if

one information part is stored in the cell, the part is saved in the first column of the target data frame,
two information parts are stored in the cell, they are saved in the first and second column of the target data frame,
three information parts are stored in the cell, they are saved in the first, second and third column of the target data frame.

If the second or third column is not filled by an information part, NA is stored in the respective cells instead. Let’s do it (the gsub function is used to delete leading white spaces):

place_df <- lapply(place, function(i){
  p1 <- sub("^\\s+", "", i[1])  # Trim leading white spaces
  if(length(i) > 2){
    p2 <- sub("^\\s+", "", i[2])
    p3 <- sub("^\\s+", "", i[3])
  } else if (length(i) > 1){
    p2 <- sub("^\\s+", "", i[2])
    p3 <- NA
  } else {
    p2 <- NA
    p3 <- NA
  }
  data.frame(A = p1,
             B = p2,
             C = p3)
})
place_df <- do.call("rbind", place_df)
place_df$ID <- lu$ID 
place_df$Year <- lu$Year
head(place_df)

##                    A                B                C    ID Year
## 1        Deutschland             <NA>             <NA>    DG 1996
## 2 Schleswig-Holstein             <NA>             <NA>    01 1996
## 3          Flensburg Kreisfreie Stadt             <NA> 01001 1996
## 4               Kiel Landeshauptstadt Kreisfreie Stadt 01002 1996
## 5             Lübeck       Hansestadt Kreisfreie Stadt 01003 1996
## 6         Neumünster Kreisfreie Stadt             <NA> 01004 1996

We add the information from column ID and Year since we need it later for merging it with the long version of the land cover data frame.

While column is filled in any case, let’s have a look at the different entries in column B and C:

unique(place_df[, 2])

##  [1] NA                  "Kreisfreie Stadt"  "Landeshauptstadt" 
##  [4] "Hansestadt"        "Landkreis"         "Stat. Region"     
##  [7] "Regierungsbezirk"  "Kreis"             "krfr. Stadt"      
## [10] "Universitätsstadt" "Regionalverband"   "Stadt"            
## [13] "NUTS 2-Region"     "krsfr. Stadt"

unique(place_df[, 3])

## [1] NA                 "Kreisfreie Stadt"

Obviously, the only information stored in the third column is “Kreisfreie Stadt”. What is stored in the second column for these cases?

unique(place_df$B[!is.na(place_df$C)])

## [1] "Landeshauptstadt"  "Hansestadt"        "Universitätsstadt"

OK, if the place is one of “Landeshauptstadt”, “Hansestadt” or “Universitaetsstadt”, then the third column contains the information “Kreisfreie Stadt”. Hence, if we want to group by “Kreisfreie Stadt”, we actually would need the third column.

One strategy could be to make a structure which shows the name in the first column, the general type of the place in the second column and additional (not mandatory) information in the third column. If one follows this strategy, the content of column two and three has to be switched for the rows with a non-NA entry in the third column:

place_df[!is.na(place_df$C), ] <- place_df[!is.na(place_df$C), c(1, 3, 2, 4, 5)]

Although it might look a little weird, it is very easy and straightforward. We subset our data frame to rows which are not NA in the third column (i.e. column C) on both sides of the equal sign and define the order of column entries to 1, 3, 2 and then 4 and 5 for ID and year. This gives us what we want.

head(place_df)

##                    A                B                C    ID Year
## 1        Deutschland             <NA>             <NA>    DG 1996
## 2 Schleswig-Holstein             <NA>             <NA>    01 1996
## 3          Flensburg Kreisfreie Stadt             <NA> 01001 1996
## 4               Kiel Kreisfreie Stadt Landeshauptstadt 01002 1996
## 5             Lübeck Kreisfreie Stadt       Hansestadt 01003 1996
## 6         Neumünster Kreisfreie Stadt             <NA> 01004 1996

Finished? No, there are at least some entries without any kind of type information in column B. In fact, there are quite many of them:

unique(lu$Place[is.na(place_df$B)])

##   [1] "Deutschland"                                      
##   [2] "  Schleswig-Holstein"                             
##   [3] "  Hamburg"                                        
##   [4] "  Niedersachsen"                                  
##   [5] "  Bremen"                                         
##   [6] "  Nordrhein-Westfalen"                            
##   [7] "      Rhein-Kreis Neuss"                          
##   [8] "      Städteregion Aachen (einschl. Stadt Aachen)"
##   [9] "      Rhein-Erft-Kreis"                           
##  [10] "      Oberbergischer Kreis"                       
##  [11] "      Rheinisch-Bergischer Kreis"                 
##  [12] "      Rhein-Sieg-Kreis"                           
##  [13] "      Ennepe-Ruhr-Kreis"                          
##  [14] "      Hochsauerlandkreis"                         
##  [15] "      Märkischer Kreis"                           
##  [16] "  Hessen"                                         
##  [17] "      Hochtaunuskreis"                            
##  [18] "      Main-Kinzig-Kreis"                          
##  [19] "      Main-Taunus-Kreis"                          
##  [20] "      Odenwaldkreis"                              
##  [21] "      Rheingau-Taunus-Kreis"                      
##  [22] "      Wetteraukreis"                              
##  [23] "      Lahn-Dill-Kreis"                            
##  [24] "      Vogelsbergkreis"                            
##  [25] "      Schwalm-Eder-Kreis"                         
##  [26] "      Werra-Meißner-Kreis"                        
##  [27] "  Rheinland-Pfalz"                                
##  [28] "      Rhein-Hunsrück-Kreis"                       
##  [29] "      Rhein-Lahn-Kreis"                           
##  [30] "      Westerwaldkreis"                            
##  [31] "      Eifelkreis Bitburg-Prüm"                    
##  [32] "      Donnersbergkreis"                           
##  [33] "      Rhein-Pfalz-Kreis"                          
##  [34] "  Baden-Württemberg"                              
##  [35] "      Rems-Murr-Kreis"                            
##  [36] "      Hohenlohekreis"                             
##  [37] "      Main-Tauber-Kreis"                          
##  [38] "      Ostalbkreis"                                
##  [39] "      Neckar-Odenwald-Kreis"                      
##  [40] "      Rhein-Neckar-Kreis"                         
##  [41] "      Enzkreis"                                   
##  [42] "      Ortenaukreis"                               
##  [43] "      Schwarzwald-Baar-Kreis"                     
##  [44] "      Zollernalbkreis"                            
##  [45] "      Alb-Donau-Kreis"                            
##  [46] "      Bodenseekreis"                              
##  [47] "  Bayern"                                         
##  [48] "      Ingolstadt"                                 
##  [49] "      Rosenheim"                                  
##  [50] "      Landshut"                                   
##  [51] "      Passau"                                     
##  [52] "      Straubing"                                  
##  [53] "      Amberg"                                     
##  [54] "      Regensburg"                                 
##  [55] "      Weiden i.d.OPf."                            
##  [56] "      Bamberg"                                    
##  [57] "      Bayreuth"                                   
##  [58] "      Coburg"                                     
##  [59] "      Hof"                                        
##  [60] "      Ansbach"                                    
##  [61] "      Erlangen"                                   
##  [62] "      Fürth"                                      
##  [63] "      Nürnberg"                                   
##  [64] "      Schwabach"                                  
##  [65] "      Aschaffenburg"                              
##  [66] "      Schweinfurt"                                
##  [67] "      Würzburg"                                   
##  [68] "      Augsburg"                                   
##  [69] "      Kaufbeuren"                                 
##  [70] "      Kempten (Allgäu)"                           
##  [71] "      Memmingen"                                  
##  [72] "  Saarland"                                       
##  [73] "      Saarpfalz-Kreis"                            
##  [74] "  Berlin"                                         
##  [75] "  Brandenburg"                                    
##  [76] "  Mecklenburg-Vorpommern"                         
##  [77] "      Landkreis Bad Doberan"                      
##  [78] "      Landkreis Demmin"                           
##  [79] "      Landkreis Güstrow"                          
##  [80] "      Landkreis Ludwigslust"                      
##  [81] "      Landkreis Mecklenburg-Strelitz"             
##  [82] "      Landkreis Müritz"                           
##  [83] "      Landkreis Nordvorpommern"                   
##  [84] "      Landkreis Nordwestmecklenburg"              
##  [85] "      Landkreis Ostvorpommern"                    
##  [86] "      Landkreis Parchim"                          
##  [87] "      Landkreis Rügen"                            
##  [88] "      Landkreis Uecker-Randow"                    
##  [89] "      Landkreis Mecklenburgische Seenplatte"      
##  [90] "      Landkreis Rostock"                          
##  [91] "      Landkreis Vorpommern-Rügen"                 
##  [92] "      Landkreis Vorpommern-Greifswald"            
##  [93] "      Landkreis Ludwigslust-Parchim"              
##  [94] "  Freistaat Sachsen"                              
##  [95] "      Vogtlandkreis"                              
##  [96] "      Mittlerer Erzgebirgskreis"                  
##  [97] "      Niederschlesischer Oberlausitzkreis"        
##  [98] "      Weißeritzkreis"                             
##  [99] "      Muldentalkreis"                             
## [100] "      Erzgebirgskreis"                            
## [101] "  Sachsen-Anhalt"                                 
## [102] "      Altmarkkreis Salzwedel"                     
## [103] "      Burgenlandkreis"                            
## [104] "      Saalekreis"                                 
## [105] "      Salzlandkreis"                              
## [106] "      Saalkreis"                                  
## [107] "      Bördekreis"                                 
## [108] "      Ohrekreis"                                  
## [109] "  Thüringen"                                      
## [110] "      Wartburgkreis"                              
## [111] "      Unstrut-Hainich-Kreis"                      
## [112] "      Kyffhäuserkreis"                            
## [113] "      Ilm-Kreis"                                  
## [114] "      Saale-Holzland-Kreis"                       
## [115] "      Saale-Orla-Kreis"

sum(is.na(place_df$B))

## [1] 1180

Let’s take care of all the entries with “*kreis” first:

for(r in seq(nrow(place_df))){
  if(is.na(place_df$B[r]) &
     grepl("kreis", tolower(place_df$A[r]))){
    place_df$B[r] <- "Landkreis"
  }
}

unique(lu$Place[is.na(place_df$B)])

##  [1] "Deutschland"                                      
##  [2] "  Schleswig-Holstein"                             
##  [3] "  Hamburg"                                        
##  [4] "  Niedersachsen"                                  
##  [5] "  Bremen"                                         
##  [6] "  Nordrhein-Westfalen"                            
##  [7] "      Städteregion Aachen (einschl. Stadt Aachen)"
##  [8] "  Hessen"                                         
##  [9] "  Rheinland-Pfalz"                                
## [10] "  Baden-Württemberg"                              
## [11] "  Bayern"                                         
## [12] "      Ingolstadt"                                 
## [13] "      Rosenheim"                                  
## [14] "      Landshut"                                   
## [15] "      Passau"                                     
## [16] "      Straubing"                                  
## [17] "      Amberg"                                     
## [18] "      Regensburg"                                 
## [19] "      Weiden i.d.OPf."                            
## [20] "      Bamberg"                                    
## [21] "      Bayreuth"                                   
## [22] "      Coburg"                                     
## [23] "      Hof"                                        
## [24] "      Ansbach"                                    
## [25] "      Erlangen"                                   
## [26] "      Fürth"                                      
## [27] "      Nürnberg"                                   
## [28] "      Schwabach"                                  
## [29] "      Aschaffenburg"                              
## [30] "      Schweinfurt"                                
## [31] "      Würzburg"                                   
## [32] "      Augsburg"                                   
## [33] "      Kaufbeuren"                                 
## [34] "      Kempten (Allgäu)"                           
## [35] "      Memmingen"                                  
## [36] "  Saarland"                                       
## [37] "  Berlin"                                         
## [38] "  Brandenburg"                                    
## [39] "  Mecklenburg-Vorpommern"                         
## [40] "  Freistaat Sachsen"                              
## [41] "  Sachsen-Anhalt"                                 
## [42] "  Thüringen"

sum(is.na(place_df$B))

## [1] 420

This reduces the amount of missing type information considerably. What is left are federal states and some cities. To disentangle states from cities, we will make use of the ID column which we included in our place information. It looks like that the country has the ID “DG”, federal states have a two-digit ID, and the rest are “Kreisfreie Stadt”.

Let’s add the appropriate information in the second column:

place_df$B[place_df$ID == "DG"] <- "Land"
place_df$B[is.na(place_df$B) & nchar(place_df$ID) == 2] <- "Bundesland"
place_df$B[is.na(place_df$B)] <- "Kreisfreie Stadt"

That’s it, no more missing information in the second column.

sum(is.na(place_df$B))

## [1] 0

As the last step, we have to merge the separated place information back into the original data frame. For this example, we use the one in long format.

lu_long_final <- merge(lu_long, place_df, by = c("ID", "Year"))

For illustration purposes, we just show some rows below:

lu_long_final[c(1,50, 600),]

##        ID Year                             Place    variable value
## 1      01 1996                Schleswig-Holstein  Recreation   0.7
## 50  01001 2004       Flensburg, Kreisfreie Stadt Agriculture  28.0
## 600 01061 2014              Steinburg, Landkreis      Forest   9.4
##                      A                B    C
## 1   Schleswig-Holstein       Bundesland <NA>
## 50           Flensburg Kreisfreie Stadt <NA>
## 600          Steinburg        Landkreis <NA>

Since column “Place” is no longer needed, let’s just delete it:

lu_long_final$Place <- NULL
lu_long_final[c(1,50, 600),]

##        ID Year    variable value                  A                B    C
## 1      01 1996  Recreation   0.7 Schleswig-Holstein       Bundesland <NA>
## 50  01001 2004 Agriculture  28.0          Flensburg Kreisfreie Stadt <NA>
## 600 01061 2014      Forest   9.4          Steinburg        Landkreis <NA>