Easy Way to Cbind Dataframe With Itself to One Column
Data reshaping refers to the tasks of combining (i.e. joining) and/or transposing data sets. First, we will talk about performing joins using base R and the plyr and data table packages. Second, we will discuss data transpose using Hadley Wickham's reshape2 package. We have spoken about plyr and data table packages in previous lessons. Additional information, including a user's reference manual, for reshape2 is available here: https://cran.r-project.org/web/packages/reshape2/index.html.
COMBINING DATA USING BASE R
- cbind(): combining vectors or lists with equal number of rows. cbind() is horizontal combination of data. In the example below, we are creating four numeric vectors and then combining them into two numeric vectors.
> Unemploy_Rate.1 <- c(6.6,6.7,6.6,6.2,6.3,6.1) > PerChange_Urban_CPI.1 <-c(0.2,0.1,0.2,0.2,0.3,0.2) > macroind.1<-cbind(Unemploy_Rate.1,PerChange_Urban_CPI.1) > macroind.1 Unemploy_Rate.1 PerChange_Urban_CPI.1 [1,] 6.6 0.2 [2,] 6.7 0.1 [3,] 6.6 0.2 [4,] 6.2 0.2 [5,] 6.3 0.3 [6,] 6.1 0.2> Unemploy_Rate.2 <-c(6.2,6.1,5.9,5.7,5.8,5.6) > PerChange_Urban_CPI.2 <-c(0.1,-0.1,0.1,0.1,-0.3,-0.3) > macroind.2 <- cbind(Unemploy_Rate.2,PerChange_Urban_CPI.2) > macroind.2 Unemploy_Rate.2 PerChange_Urban_CPI.2 [1,] 6.2 0.1 [2,] 6.1 -0.1 [3,] 5.9 0.1 [4,] 5.7 0.1 [5,] 5.8 -0.3 [6,] 5.6 -0.3
- rbind(): combining vectors or lists with equal number of columns. All columns must be of the same data type. For example, character columns cannot be combined with numeric columns. rbind() is the equivalent of stacking data sets on top of each other. If the columns have different names, R will use the column names of the first vector or list specified. In the example below, we stack the numeric vectors macroind.1 and macroind.2 together.
> macroind <-rbind(macroind.1,macroind.2) > macroind Unemploy_Rate.1 PerChange_Urban_CPI.1 [1,] 6.6 0.2 [2,] 6.7 0.1 [3,] 6.6 0.2 [4,] 6.2 0.2 [5,] 6.3 0.3 [6,] 6.1 0.2 [7,] 6.2 0.1 [8,] 6.1 -0.1 [9,] 5.9 0.1 [10,] 5.7 0.1 [11,] 5.8 -0.3 [12,] 5.6 -0.3
Let's do another example with the rbind() function. In Exercise 1 of the lesson Data Cleaning Exercises, we imported the Indemnities Loss by Month file for crop year 2014 using the read.delim() function. We specified that the separator is "|". Assume for a moment that we did not specify the separator when reading in the data. The data frame would look like this instead:
> colmnt14 <- read.delim("colmnt14.txt", header=FALSE)
> head(colmnt14, 5) V1
1 2014|01|AL|001|Autauga |0021|COTTON |02|RP |H |01|Decline in Price |11| |29157.4800000000
2 2014|01|AL|001|Autauga |0021|COTTON |02|RP |H |01|Decline in Price |12| |5314.3200000000
3 2014|01|AL|001|Autauga |0021|COTTON |03|RPHPE |H |01|Decline in Price |11| |6294.0000000000
4 2014|01|AL|001|Autauga |0021|COTTON |03|RPHPE |H |11|Drought |08|AUG|2098.0000000000
5 2014|01|AL|001|Autauga |0021|COTTON |02|RP |H |11|Drought |08|AUG|348.2000000000
When R reads in the colmnt14.txt file, it considers the entire row as one long string. Consequently, there is only one column: V1. All the leading 0's are preserved. There are 137,108 rows. The data has been converted into a data frame by the read.delim() function.
Second, we are going to split each string into separate columns using the | as the separator. We then stack the output of each string (i.e. 1 row with 15 columns) on top of each other to create the data frame colmnt14.2. Instead of passing each output (1 row with 15 columns) to the rbind() function each time, we use the do.call() function to store the 137,108 row outputs in a list and then pass them to rbind() at one time. This allows for efficiency. Please see this link for more about the do.call() function: http://www.stat.berkeley.edu/~s133/Docall.html.
> colmnt14.2 <- data.frame(do.call('rbind', strsplit(as.character(colmnt14$V1),'|', fixed=TRUE)))
> head(colmnt14.2,5)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15
1 2014 01 AL 001 Autauga 0021 COTTON 02 RP H 01 Decline in Price 11 29157.4800000000
2 2014 01 AL 001 Autauga 0021 COTTON 02 RP H 01 Decline in Price 12 5314.3200000000
3 2014 01 AL 001 Autauga 0021 COTTON 03 RPHPE H 01 Decline in Price 11 6294.0000000000
4 2014 01 AL 001 Autauga 0021 COTTON 03 RPHPE H 11 Drought 08 AUG 2098.0000000000
5 2014 01 AL 001 Autauga 0021 COTTON 02 RP H 11 Drought 08 AUG 348.2000000000
- merge(): The merge() function is part of base R. It allows you to join data frames the same way that SQL does. merge() accommodates the four different types of join (similar to SQL). The merge() function has the generic form
name.of.resulting.data.frame <- merge(x=name.of.data.frame.to.be.joined, y=name.of.data.frame.to.be.joined,
by x=c("variable on which to be joined", "variable on which to be joined", etc.),
by y=c("variable on which to be joined", "variable on which to be joined", etc.))
It is important to note that the variables on which to be joined DO NOT need to have the same names in the two data sets (x and y). They can have different names and still be joined.
To illustrate how merge() handles the four different types of join, we will use the "Loss Indemnities by Month" files for crop years 2013 and 2014. You can download the loss files for all 26 available years from this link: https://drive.google.com/open?id=0B0cwsC-mhE_ESWNUazRxMktMZGs. You will need to manually download and unzip the folder. You will also need to import the data frames into R using the read.csv() function. If you need assistance with this task, see the lessons Government Data Cleaning Exercises and Import & Export Data.
First, let's remind ourselves what the two data frames look like. Note that we named the 2014 loss file as "COLMNT14" and the 2013 file as "COLMNT13." The 2013 file has slightly more rows than the 2014 file. Both data frames have identical column names.
> head(COLMNT14,5) Crop_Year_Identifier State_Code State_Abbreviation County_Code County_Name Crop_Code Crop_Name Insurance_Plan_Code 1 2014 01 AL 001 Autauga 0021 COTTON 02 2 2014 01 AL 001 Autauga 0021 COTTON 02 3 2014 01 AL 001 Autauga 0021 COTTON 03 4 2014 01 AL 001 Autauga 0021 COTTON 02 5 2014 01 AL 001 Autauga 0021 COTTON 03 Insurance_Plan_Name_Abbreviation Stage_Code Cause_of_Loss_Code Cause_of_Loss_Description Month_of_Loss Month_of_Loss_Abbreviation Indemnity_Amount 1 RP H 01 Decline in Price 12 5314.3200000000 2 RP H 01 Decline in Price 11 29157.4800000000 3 RPHPE H 01 Decline in Price 11 6294.0000000000 4 RP H 11 Drought 08 AUG 348.2000000000 5 RPHPE H 11 Drought 08 AUG 2098.0000000000 > dim(COLMNT14) [1] 137344 15> colnames(COLMNT14) [1] "Crop_Year_Identifier" "State_Code" "State_Abbreviation" "County_Code" [5] "County_Name" "Crop_Code" "Crop_Name" "Insurance_Plan_Code" [9] "Insurance_Plan_Name_Abbreviation" "Stage_Code" "Cause_of_Loss_Code" "Cause_of_Loss_Description" [13] "Month_of_Loss" "Month_of_Loss_Abbreviation" "Indemnity_Amount"
> dim(COLMNT13) [1] 137497 15
> colnames(COLMNT13) [1] "Crop_Year_Identifier" "State_Code" "State_Abbreviation" "County_Code" [5] "County_Name" "Crop_Code" "Crop_Name" "Insurance_Plan_Code" [9] "Insurance_Plan_Name_Abbreviation" "Stage_Code" "Cause_of_Loss_Code" "Cause_of_Loss_Description" [13] "Month_of_Loss" "Month_of_Loss_Abbreviation" "Indemnity_Amount"
INNER JOIN
As the illustration above shown, an inner join returns all the matching records that exist in both tables 1 and 2. In the example below, we want to see all the records in COLMNT13 (table 1) and COLMNT14 (table 2) that have matching State_Code, County_Code, Crop_Code, Insurance_Plan_Code, Stage_Code, Cause_of_Loss_Code, and Month_of_Loss. (We are not using the State_Abbreviation, County_Name, Crop_Name, Insurance_Plan_Name_Abbreviation, Cause_of_Loss_Description, and Month_of_Loss_Abbreviation since they contain the same information as the "code" columns.)
> COLMNT1314.INNERJOIN <- merge (x=COLMNT13, y=COLMNT14,
+ by.x=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), + by.y=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"))>
head(COLMNT1314.INNERJOIN,5) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year_Identifier.x State_Abbreviation.x 1 01 001 9999 90 H 42 03 2013 AL 2 01 003 0011 02 H 31 06 2013 AL 3 01 003 0011 02 H 31 07 2013 AL 4 01 003 0020 41 UH 31 07 2013 AL 5 01 003 0021 01 H 31 07 2013 AL County_Name.x Crop_Name.x Insurance_Plan_Name_Abbreviation.x Cause_of_Loss_Description.x Month_of_Loss_Abbreviation.x 1 Autauga All Other Crops APH Freeze MAR 2 Baldwin WHEAT RP Excess Moisture/Precip/Rain JUN 3 Baldwin WHEAT RP Excess Moisture/Precip/Rain JUL 4 Baldwin PECANS PRV Excess Moisture/Precip/Rain JUL 5 Baldwin COTTON YP Excess Moisture/Precip/Rain JUL Indemnity_Amount.x Crop_Year_Identifier.y State_Abbreviation.y County_Name.y Crop_Name.y Insurance_Plan_Name_Abbreviation.y 1 10542.5600000000 2014 AL Autauga All Other Crops APH 2 7718.4000000000 2014 AL Baldwin WHEAT RP 3 9141.0000000000 2014 AL Baldwin WHEAT RP 4 34110.0000000000 2014 AL Baldwin PECANS PRV 5 4389.5000000000 2014 AL Baldwin COTTON YP Cause_of_Loss_Description.y Month_of_Loss_Abbreviation.y Indemnity_Amount.y 1 Freeze MAR 132292.0000000000 2 Excess Moisture/Precip/Rain JUN 47635.6000000000 3 Excess Moisture/Precip/Rain JUL 15626.0000000000 4 Excess Moisture/Precip/Rain JUL 2498.0000000000 5 Excess Moisture/Precip/Rain JUL 5600.0000000000>
dim(COLMNT1314.INNERJOIN) [1] 45400 23
We note several things in the above inner join output:
1. R keeps the names of all the columns in both tables that are not explicitly specified in the "by" statements.
2. There are 45,400 rows that can be matched using the seven keys specified in the "by" statements in the two tables.
The two variables that we are interested in the most from the inner join are Indemnity_Amount.x and Indemnity_Amount.y. We should rename the columns to make them more intuitive. The rename() function in the plyr package is great for this task.
> COLMNT1314.INNERJOIN<-rename(COLMNT1314.INNERJOIN, c("Indemnity_Amount.x" = "Indemnity_Amount.2013", "Indemnity_Amount.y" = "Indemnity_Amount.2014")) > head(COLMNT1314.INNERJOIN,5) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year_Identifier.x State_Abbreviation.x 1 01 001 9999 90 H 42 03 2013 AL 2 01 003 0011 02 H 31 06 2013 AL 3 01 003 0011 02 H 31 07 2013 AL 4 01 003 0020 41 UH 31 07 2013 AL 5 01 003 0021 01 H 31 07 2013 AL County_Name.x Crop_Name.x Insurance_Plan_Name_Abbreviation.x Cause_of_Loss_Description.x Month_of_Loss_Abbreviation.x 1 Autauga All Other Crops APH Freeze MAR 2 Baldwin WHEAT RP Excess Moisture/Precip/Rain JUN 3 Baldwin WHEAT RP Excess Moisture/Precip/Rain JUL 4 Baldwin PECANS PRV Excess Moisture/Precip/Rain JUL 5 Baldwin COTTON YP Excess Moisture/Precip/Rain JUL Indemnity_Amount.2013 Crop_Year_Identifier.y State_Abbreviation.y County_Name.y Crop_Name.y Insurance_Plan_Name_Abbreviation.y 1 10542.5600000000 2014 AL Autauga All Other Crops APH 2 7718.4000000000 2014 AL Baldwin WHEAT RP 3 9141.0000000000 2014 AL Baldwin WHEAT RP 4 34110.0000000000 2014 AL Baldwin PECANS PRV 5 4389.5000000000 2014 AL Baldwin COTTON YP Cause_of_Loss_Description.y Month_of_Loss_Abbreviation.y Indemnity_Amount.2014 1 Freeze MAR 132292.0000000000 2 Excess Moisture/Precip/Rain JUN 47635.6000000000 3 Excess Moisture/Precip/Rain JUL 15626.0000000000 4 Excess Moisture/Precip/Rain JUL 2498.0000000000 5 Excess Moisture/Precip/Rain JUL 5600.0000000000
Let's also say that we are not interested in keeping all the columns. We want to simplify our resulting table by keeping the seven join keys plus the two renamed columns. Here's how we would do it.
> COLMNT1314.INNERJOIN <- COLMNT1314.INNERJOIN[,c(1:7,15,23)] > head(COLMNT1314.INNERJOIN) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount.2013 Indemnity_Amount.2014 1 01 001 9999 90 H 42 03 10542.5600000000 132292.0000000000 2 01 003 0011 02 H 31 06 7718.4000000000 47635.6000000000 3 01 003 0011 02 H 31 07 9141.0000000000 15626.0000000000 4 01 003 0020 41 UH 31 07 34110.0000000000 2498.0000000000 5 01 003 0021 01 H 31 07 4389.5000000000 5600.0000000000 6 01 003 0021 02 H 31 07 153397.5000000000 3713.6000000000
LEFT OUTER JOIN
A left outer join returns all the records from the left table (table 1) plus any matching records from the right table (table 2). Let's repeat what we did above except as a left outer join this time.
Note in the codes below that all.x = TRUE. This tells R to includes all the records from the left table (x; table 1). Also note the output has 137,497 rows--which is the same number of rows as the left table (COLMNT13).
> COLMNT1314.LEFTOUTERJOIN <- merge (x=COLMNT13, y=COLMNT14, + by.x=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), + by.y=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), all.x=TRUE) > head(COLMNT1314.LEFTOUTERJOIN,5) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year_Identifier.x State_Abbreviation.x 1 01 001 0011 01 H 31 05 2013 AL 2 01 001 0041 02 H 11 05 2013 AL 3 01 001 0041 02 H 31 06 2013 AL 4 01 001 0041 02 UH 11 05 2013 AL 5 01 001 0041 02 UH 31 06 2013 AL County_Name.x Crop_Name.x Insurance_Plan_Name_Abbreviation.x Cause_of_Loss_Description.x Month_of_Loss_Abbreviation.x 1 Autauga WHEAT YP Excess Moisture/Precip/Rain MAY 2 Autauga CORN RP Drought MAY 3 Autauga CORN RP Excess Moisture/Precip/Rain JUN 4 Autauga CORN RP Drought MAY 5 Autauga CORN RP Excess Moisture/Precip/Rain JUN Indemnity_Amount.x Crop_Year_Identifier.y State_Abbreviation.y County_Name.y Crop_Name.y Insurance_Plan_Name_Abbreviation.y Cause_of_Loss_Description.y 1 3490.0000000000 <NA> <NA> <NA> <NA> <NA> <NA> 2 1683.5000000000 <NA> <NA> <NA> <NA> <NA> <NA> 3 1683.5000000000 <NA> <NA> <NA> <NA> <NA> <NA> 4 5471.5000000000 <NA> <NA> <NA> <NA> <NA> <NA> 5 5471.5000000000 <NA> <NA> <NA> <NA> <NA> <NA> Month_of_Loss_Abbreviation.y Indemnity_Amount.y 1 <NA> <NA> 2 <NA> <NA> 3 <NA> <NA> 4 <NA> <NA> 5 <NA> <NA> > dim(COLMNT1314.LEFTOUTERJOIN) [1] 137497 23 > dim(COLMNT13) [1] 137497 15 > dim(COLMNT14) [1] 137344 15
Now let's simplify our results:
> COLMNT1314.LEFTOUTERJOIN <- rename(COLMNT1314.LEFTOUTERJOIN, c("Indemnity_Amount.x" = "Indemnity_Amount.2013", "Indemnity_Amount.y" = "Indemnity_Amount.2014")) > COLMNT1314.LEFTOUTERJOIN <- COLMNT1314.LEFTOUTERJOIN[,c(1:7,15,23)] > head(COLMNT1314.LEFTOUTERJOIN) State_Code County_Code Crop_Code Insurance_Plan_Code 1 01 001 0011 01 2 01 001 0041 02 3 01 001 0041 02 4 01 001 0041 02 5 01 001 0041 02 6 01 001 0081 02 Stage_Code Cause_of_Loss_Code Month_of_Loss 1 H 31 05 2 H 11 05 3 H 31 06 4 UH 11 05 5 UH 31 06 6 H 01 07 Indemnity_Amount.2013 Indemnity_Amount.2014 1 3490.0000000000 <NA> 2 1683.5000000000 <NA> 3 1683.5000000000 <NA> 4 5471.5000000000 <NA> 5 5471.5000000000 <NA> 6 -1753.0400000000 <NA>
RIGHT OUTER JOIN
A right outer join returns all the records from the right table (table 2) plus any matching records from the left table (table 1). Let's repeat what we did above except as a right outer join this time.
Note in the codes below that all.y = TRUE. This tells R to includes all the records from the right table (y; table 2). Also note the output has 137,344 rows--which is the same number of rows as the right table (COLMNT14).
> COLMNT1314.RIGHTOUTERJOIN <- merge (x=COLMNT13, y=COLMNT14, + by.x=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), + by.y=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), all.y=TRUE) > dim(COLMNT1314.RIGHTOUTERJOIN) [1] 137344 23 > dim(COLMNT13) [1] 137497 15 > dim(COLMNT14) [1] 137344 15 > COLMNT1314.RIGHTOUTERJOIN <- rename(COLMNT1314.RIGHTOUTERJOIN, c("Indemnity_Amount.x" = "Indemnity_Amount.2013", "Indemnity_Amount.y" = "Indemnity_Amount.2014")) > COLMNT1314.RIGHTOUTERJOIN <- COLMNT1314.RIGHTOUTERJOIN[,c(1:7,15,23)] > head(COLMNT1314.RIGHTOUTERJOIN) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount.2013 Indemnity_Amount.2014 1 01 001 0021 02 H 01 11 <NA> 29157.4800000000 2 01 001 0021 02 H 01 12 <NA> 5314.3200000000 3 01 001 0021 02 H 11 08 <NA> 348.2000000000 4 01 001 0021 03 H 01 11 <NA> 6294.0000000000 5 01 001 0021 03 H 11 08 <NA> 2098.0000000000 6 01 001 0041 02 H 11 07 <NA> 4198.8000000000
A full outer join returns all the records from both tables while joining the records that match from the left to the right table. Note the option is set to all = TRUE. This tells R to includes all the records from both tables.
> COLMNT1314.FULLOUTERJOIN <- merge (x=COLMNT13, y=COLMNT14, + by.x=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), + by.y=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), all=TRUE) > dim(COLMNT1314.FULLOUTERJOIN) [1] 229441 23 > dim(COLMNT13) [1] 137497 15 > dim(COLMNT14) [1] 137344 15
> COLMNT1314.FULLOUTERJOIN <- rename(COLMNT1314.FULLOUTERJOIN, c("Indemnity_Amount.x" = "Indemnity_Amount.2013", "Indemnity_Amount.y" = "Indemnity_Amount.2014")) > COLMNT1314.FULLOUTERJOIN <- COLMNT1314.FULLOUTERJOIN[,c(1:7,15,23)] > head(COLMNT1314.FULLOUTERJOIN) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount.2013 Indemnity_Amount.2014 1 01 001 0011 01 H 31 05 3490.0000000000 <NA> 2 01 001 0021 02 H 01 11 <NA> 29157.4800000000 3 01 001 0021 02 H 01 12 <NA> 5314.3200000000 4 01 001 0021 02 H 11 08 <NA> 348.2000000000 5 01 001 0021 03 H 01 11 <NA> 6294.0000000000 6 01 001 0021 03 H 11 08 <NA> 2098.0000000000
- Reduce(): Joining more than two data frames require using the Reduce() function. The Reduce() function works by adding one additional data frame at a time. In the example below, we are joining three tables. The first two tables--COLMNT12 an COLMNT13--are initially joined together and then the Reduce() function adds in the COLMNT14 table. Here we are doing a full outer join on the three data frames.
To make things easier for ourselves, we will first rename the "Indemnity_Amount" column in the three data frames COLMNT12, COLMNT13, and COLMNT14.
> COLMNT12 <- rename(COLMNT12,c("Indemnity_Amount"="Indemnity_Amount_2012")) > COLMNT13 <- rename(COLMNT13,c("Indemnity_Amount"="Indemnity_Amount_2013")) > COLMNT14 <- rename(COLMNT14,c("Indemnity_Amount"="Indemnity_Amount_2014"))
| |||
| |||
COMBINING DATA USING plyr PACKAGE
The plyr package has a join() function that (usually) allows for faster data combination than the merge() function in base R. The drawback of plyr join() is that the columns in all the data frames must have the same names in order to be joined. An advantage of plyr join is you can simply specify the type of join you want in plain English.
resulting.data.frame <- join(x=first.table, y=second.table, by=c("variable to be joined by", "variable to be joined by", etc), type="type of join")
The type option can have one of four values: inner, left, right, or full.
Below is the inner join we did above using merge() but with plyr join() this time.
> COLMNT1314_PLYR_INNERJOIN <- join(x=COLMNT13, y=COLMNT14, by=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), type="inner") > head(COLMNT1314_PLYR_INNERJOIN) Crop_Year_Identifier State_Code State_Abbreviation County_Code County_Name Crop_Code Crop_Name Insurance_Plan_Code 1 2013 01 AL 001 Autauga 9999 All Other Crops 90 2 2013 01 AL 003 Baldwin 0011 WHEAT 02 3 2013 01 AL 003 Baldwin 0011 WHEAT 02 4 2013 01 AL 003 Baldwin 0020 PECANS 41 5 2013 01 AL 003 Baldwin 0021 COTTON 01 6 2013 01 AL 003 Baldwin 0021 COTTON 02 Insurance_Plan_Name_Abbreviation Stage_Code Cause_of_Loss_Code Cause_of_Loss_Description Month_of_Loss Month_of_Loss_Abbreviation Indemnity_Amount 1 APH H 42 Freeze 03 MAR 10542.5600000000 2 RP H 31 Excess Moisture/Precip/Rain 06 JUN 7718.4000000000 3 RP H 31 Excess Moisture/Precip/Rain 07 JUL 9141.0000000000 4 PRV UH 31 Excess Moisture/Precip/Rain 07 JUL 34110.0000000000 5 YP H 31 Excess Moisture/Precip/Rain 07 JUL 4389.5000000000 6 RP H 31 Excess Moisture/Precip/Rain 07 JUL 153397.5000000000 Crop_Year_Identifier State_Abbreviation County_Name Crop_Name Insurance_Plan_Name_Abbreviation 1 2014 AL Autauga All Other Crops APH 2 2014 AL Baldwin WHEAT RP 3 2014 AL Baldwin WHEAT RP 4 2014 AL Baldwin PECANS PRV 5 2014 AL Baldwin COTTON YP 6 2014 AL Baldwin COTTON RP Cause_of_Loss_Description Month_of_Loss_Abbreviation Indemnity_Amount 1 Freeze MAR 132292.0000000000 2 Excess Moisture/Precip/Rain JUN 47635.6000000000 3 Excess Moisture/Precip/Rain JUL 15626.0000000000 4 Excess Moisture/Precip/Rain JUL 2498.0000000000 5 Excess Moisture/Precip/Rain JUL 5600.0000000000 6 Excess Moisture/Precip/Rain JUL 3713.6000000000>
dim(COLMNT1314_PLYR_INNERJOIN) [1] 45400 23
Joining more than two data frames also requires the Reduce() function. Here we repeat the full outer join of COLMNT12, COLMNT13, and COLMNT14 data frames as was done above using the merge() function. The difference between merge() and plyr join() is that the former preserves all columns that are not named as key join variables whereas the latter does not.
> ThreeYears.plyr.Join <- Reduce(function(...) + { + join(..., by=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), type="full") + + }, list(COLMNT12, COLMNT13, COLMNT14) + ) > |
|
COMBINING DATA USING data.table
Joins can also be done with the data.table package. By converting the data frame into a data table and setting index keys, we allow R to perform a join at a faster speed than plyr join or merge() with data frames.
First, let's convert our data frames into data tables.
> COLMNT13.dt <- data.table(COLMNT13, key=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss")) > COLMNT14.dt <- data.table(COLMNT14, key=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"))
Now let's remind ourselves the dimensions of our data tables.
> dim(COLMNT13.dt) [1] 137497 15 > dim(COLMNT14.dt) [1] 137344 15
An inner join is performed using the merge() function in base R.
> dt.InnerJoin <- merge(COLMNT13.dt,COLMNT14.dt)
> dim(dt.InnerJoin) [1] 45400 23
There are two ways to do a left outer join. The first is to use the merge() function. The second way is by placing the "left table" (i.e. table 1; x) inside square brackets [ ].
> dt.LeftOuterJoin <- merge(COLMNT13.dt,COLMNT14.dt, all.x=TRUE)RIGHT OUTER JOIN
> dim(dt.LeftOuterJoin) [1] 137497 23>
dt.LeftOuterJoin <- COLMNT14.dt[COLMNT13.dt]
> dim(dt.LeftOuterJoin) [1] 137497 23
A right outer join is just the reverse of the left outer join.
> dt.RightOuterJoin <- merge(COLMNT13.dt, COLMNT14.dt, all.y=TRUE)
> dim(dt.RightOuterJoin) [1] 137344 23>
dt.RightOuterJoin <- COLMNT13.dt[COLMNT14.dt] > dim(dt.RightOuterJoin) [1] 137344 23
A full outer join is performed using the merge() function.
> dt.FullOuterJoin <- merge(COLMNT13.dt, COLMNT14.dt, all=TRUE)
> dim(dt.FullOuterJoin) [1] 229441 23
MELTING DATA: COLUMN TO ROW ORIENTATION Melting data is the process of taking column orientation and turning it into row orientation. The reshape2 package by Hadley Wickham will help us in this task. For more information about reshape2 (including a user's reference manual), please see https://cran.r-project.org/web/packages/reshape2/index.html.
In the above example, we created a data frame called "ThreeYears.Merge." Let's remind ourselves what the data frame looks like.
> head(ThreeYears.Merge) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount_2012 Indemnity_Amount_2013 Indemnity_Amount_2014 1 01 001 0011 01 H 31 05 NA 3490 NA 2 01 001 0021 02 H 01 10 952 NA NA 3 01 001 0021 02 H 01 11 NA NA 29157.48 4 01 001 0021 02 H 01 12 NA NA 5314.32 5 01 001 0021 02 H 11 08 NA NA 348.20 6 01 001 0021 03 H 01 11 NA NA 6294.00 > tail(ThreeYears.Merge) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount_2012 Indemnity_Amount_2013 Indemnity_Amount_2014 304293 56 999 9999 14 FL 55 09 NA NA 6601 304294 56 999 9999 14 FL 55 10 NA NA 27274 304295 56 999 9999 14 FL 55 12 NA 45119 14324 304296 56 999 9999 50 42 05 NA NA 20571 304297 56 999 9999 90 H 12 08 NA 6391 NA 304298 56 999 9999 90 UH 12 08 5036 NA NA
> dim(ThreeYears.Merge) [1] 304298 10
Instead of having three columns titled "Indemnity_Amount_2012", "Indemnity_Amount_2013", and "Indemnity_Amount_2014", respectively, we would like to transform the data frame so that there is only one column identifying the Indemnity Amount and another column identifying the crop year. We accomplish this task using the melt() function in reshape2.
resulting.data.frame <- melt(data.frame.to.be.melted, id.vars=c("key variable 1", "key variable 2", etc), variable.name = "identifying variable for the melted variable", value.name = "variable to be melted")
Here is how we would melt the ThreeYears.Merge data frame.
> column.to.row <- melt(ThreeYears.Merge, id.vars=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code",
"Month_of_Loss"),variable.name="Crop_Year", value.name="Indemnity_Amount")>
head(column.to.row,10) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year Indemnity_Amount 1 01 001 0011 01 H 31 05 Indemnity_Amount_2012 NA 2 01 001 0021 02 H 01 10 Indemnity_Amount_2012 952 3 01 001 0021 02 H 01 11 Indemnity_Amount_2012 NA 4 01 001 0021 02 H 01 12 Indemnity_Amount_2012 NA 5 01 001 0021 02 H 11 08 Indemnity_Amount_2012 NA 6 01 001 0021 03 H 01 11 Indemnity_Amount_2012 NA 7 01 001 0021 03 H 11 08 Indemnity_Amount_2012 NA 8 01 001 0041 02 H 11 05 Indemnity_Amount_2012 NA 9 01 001 0041 02 H 11 07 Indemnity_Amount_2012 NA 10 01 001 0041 02 H 11 08 Indemnity_Amount_2012 NA>
tail(column.to.row,10) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year Indemnity_Amount 912885 56 999 9999 14 FL 55 05 Indemnity_Amount_2014 NA 912886 56 999 9999 14 FL 55 06 Indemnity_Amount_2014 NA 912887 56 999 9999 14 FL 55 07 Indemnity_Amount_2014 NA 912888 56 999 9999 14 FL 55 08 Indemnity_Amount_2014 NA 912889 56 999 9999 14 FL 55 09 Indemnity_Amount_2014 6601 912890 56 999 9999 14 FL 55 10 Indemnity_Amount_2014 27274 912891 56 999 9999 14 FL 55 12 Indemnity_Amount_2014 14324 912892 56 999 9999 50 42 05 Indemnity_Amount_2014 20571 912893 56 999 9999 90 H 12 08 Indemnity_Amount_2014 NA 912894 56 999 9999 90 UH 12 08 Indemnity_Amount_2014 NA
We can simplify the resulting data frame further by using the str_sub() function in the stringr package:
> column.to.row$Crop_Year <- str_sub(column.to.row$Crop_Year, start=18, end=21)
> head(column.to.row,10) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year Indemnity_Amount 1 01 001 0011 01 H 31 05 2012 NA 2 01 001 0021 02 H 01 10 2012 952 3 01 001 0021 02 H 01 11 2012 NA 4 01 001 0021 02 H 01 12 2012 NA 5 01 001 0021 02 H 11 08 2012 NA 6 01 001 0021 03 H 01 11 2012 NA 7 01 001 0021 03 H 11 08 2012 NA 8 01 001 0041 02 H 11 05 2012 NA 9 01 001 0041 02 H 11 07 2012 NA 10 01 001 0041 02 H 11 08 2012 NA>
tail(column.to.row,10) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year Indemnity_Amount 912885 56 999 9999 14 FL 55 05 2014 NA 912886 56 999 9999 14 FL 55 06 2014 NA 912887 56 999 9999 14 FL 55 07 2014 NA 912888 56 999 9999 14 FL 55 08 2014 NA 912889 56 999 9999 14 FL 55 09 2014 6601 912890 56 999 9999 14 FL 55 10 2014 27274 912891 56 999 9999 14 FL 55 12 2014 14324 912892 56 999 9999 50 42 05 2014 20571 912893 56 999 9999 90 H 12 08 2014 NA 912894 56 999 9999 90 UH 12 08 2014 NA
CASTING DATA: ROW TO COLUMN ORIENTATION (WIDE FORMAT) Casting data is the process of taking row orientation and turning it into column orientation (also called a "wide format"). The function dcast() achieves this task, and it has the generic structure of
resulting.data.frame <- dcast(data.frame.to.be.casted, column 1 to remain as column + column 2 to remain as column + etc ~ column 1 to become rows + column 2 to become rows + etc, value.var = "column that has values to be populated into new columns")
Here is an example to clarify what we mean. We take the above column.to.row data frame and cast it back into a wide format.
> row.to.column <- dcast(column.to.row, State_Code + County_Code + Crop_Code + Insurance_Plan_Code + Stage_Code +
Cause_of_Loss_Code + Month_of_Loss ~ Crop_Year, value.var = "Indemnity_Amount")
> head(row.to.column,10) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss 2012 2013 2014 1 01 001 0011 01 H 31 05 NA 3490.0 NA 2 01 001 0021 02 H 01 10 952 NA NA 3 01 001 0021 02 H 01 11 NA NA 29157.48 4 01 001 0021 02 H 01 12 NA NA 5314.32 5 01 001 0021 02 H 11 08 NA NA 348.20 6 01 001 0021 03 H 01 11 NA NA 6294.00 7 01 001 0021 03 H 11 08 NA NA 2098.00 8 01 001 0041 02 H 11 05 NA 1683.5 NA 9 01 001 0041 02 H 11 07 NA NA 4198.80 10 01 001 0041 02 H 11 08 NA NA 2592.00>
tail(row.to.column,10) State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss 2012 2013 2014 304289 56 999 9999 14 FL 55 05 NA 8012 NA 304290 56 999 9999 14 FL 55 06 NA 11679 NA 304291 56 999 9999 14 FL 55 07 NA 7729 NA 304292 56 999 9999 14 FL 55 08 NA 5528 NA 304293 56 999 9999 14 FL 55 09 NA NA 6601 304294 56 999 9999 14 FL 55 10 NA NA 27274 304295 56 999 9999 14 FL 55 12 NA 45119 14324 304296 56 999 9999 50 42 05 NA NA 20571 304297 56 999 9999 90 H 12 08 NA 6391 NA 304298 56 999 9999 90 UH 12 08 5036 NA NA
coopersciusurturs.blogspot.com
Source: https://sites.google.com/site/xuanphamru/about/functions/control-statements-loops/data-reshaping
0 Response to "Easy Way to Cbind Dataframe With Itself to One Column"
Post a Comment