Easy Way to Cbind Dataframe With Itself to One Column

Data reshaping refers to the tasks of combining (i.e. joining) and/or transposing data sets. First, we will talk about performing joins using base R and the plyr and data table packages. Second, we will discuss data transpose using Hadley Wickham's reshape2 package. We have spoken about plyr and data table packages in previous lessons. Additional information, including a user's reference manual, for reshape2 is available here: https://cran.r-project.org/web/packages/reshape2/index.html.

COMBINING DATA USING BASE R

cbind(): combining vectors or lists with equal number of rows. cbind() is horizontal combination of data. In the example below, we are creating four numeric vectors and then combining them into two numeric vectors.

            >                        Unemploy_Rate.1 <- c(6.6,6.7,6.6,6.2,6.3,6.1)                        >                        PerChange_Urban_CPI.1 <-c(0.2,0.1,0.2,0.2,0.3,0.2)                        >                        macroind.1<-cbind(Unemploy_Rate.1,PerChange_Urban_CPI.1)                        >                        macroind.1                        Unemploy_Rate.1 PerChange_Urban_CPI.1 [1,]             6.6                   0.2 [2,]             6.7                   0.1 [3,]             6.6                   0.2 [4,]             6.2                   0.2 [5,]             6.3                   0.3 [6,]             6.1                   0.2              >                            Unemploy_Rate.2 <-c(6.2,6.1,5.9,5.7,5.8,5.6)                            >                            PerChange_Urban_CPI.2 <-c(0.1,-0.1,0.1,0.1,-0.3,-0.3)                            >                            macroind.2 <- cbind(Unemploy_Rate.2,PerChange_Urban_CPI.2)                            >                            macroind.2                            Unemploy_Rate.2 PerChange_Urban_CPI.2 [1,]             6.2                   0.1 [2,]             6.1                  -0.1 [3,]             5.9                   0.1 [4,]             5.7                   0.1 [5,]             5.8                  -0.3 [6,]             5.6                  -0.3

rbind(): combining vectors or lists with equal number of columns. All columns must be of the same data type. For example, character columns cannot be combined with numeric columns. rbind() is the equivalent of stacking data sets on top of each other. If the columns have different names, R will use the column names of the first vector or list specified. In the example below, we stack the numeric vectors macroind.1 and macroind.2 together.

            >                        macroind <-rbind(macroind.1,macroind.2)                        >                        macroind                        Unemploy_Rate.1 PerChange_Urban_CPI.1  [1,]             6.6                   0.2  [2,]             6.7                   0.1  [3,]             6.6                   0.2  [4,]             6.2                   0.2  [5,]             6.3                   0.3  [6,]             6.1                   0.2  [7,]             6.2                   0.1  [8,]             6.1                  -0.1  [9,]             5.9                   0.1 [10,]             5.7                   0.1 [11,]             5.8                  -0.3 [12,]             5.6                  -0.3

Let's do another example with the rbind() function. In Exercise 1 of the lesson Data Cleaning Exercises, we imported the Indemnities Loss by Month file for crop year 2014 using the read.delim() function. We specified that the separator is "|". Assume for a moment that we did not specify the separator when reading in the data. The data frame would look like this instead:

            > colmnt14 <- read.delim("colmnt14.txt", header=FALSE)
              > head(colmnt14, 5)                                                                    V1

            1 2014|01|AL|001|Autauga              |0021|COTTON              |02|RP              |H |01|Decline in Price              |11|              |29157.4800000000

            2              2014|01|AL|001|Autauga              |0021|COTTON              |02|RP              |H |01|Decline in Price              |12|              |5314.3200000000

            3              2014|01|AL|001|Autauga              |0021|COTTON              |03|RPHPE              |H |01|Decline in Price              |11|              |6294.0000000000

            4              2014|01|AL|001|Autauga              |0021|COTTON              |03|RPHPE              |H |11|Drought              |08|AUG|2098.0000000000

            5              2014|01|AL|001|Autauga              |0021|COTTON              |02|RP              |H |11|Drought              |08|AUG|348.2000000000

When R reads in the colmnt14.txt file, it considers the entire row as one long string. Consequently, there is only one column: V1. All the leading 0's are preserved. There are 137,108 rows. The data has been converted into a data frame by the read.delim() function.

Second, we are going to split each string into separate columns using the | as the separator. We then stack the output of each string (i.e. 1 row with 15 columns) on top of each other to create the data frame colmnt14.2. Instead of passing each output (1 row with 15 columns) to the rbind() function each time, we use the do.call() function to store the 137,108 row outputs in a list and then pass them to rbind() at one time. This allows for efficiency. Please see this link for more about the do.call() function: http://www.stat.berkeley.edu/~s133/Docall.html.

> colmnt14.2 <- data.frame(do.call('rbind', strsplit(as.character(colmnt14$V1),'|', fixed=TRUE)))

> head(colmnt14.2,5)

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15

1 2014 01 AL 001 Autauga 0021 COTTON 02 RP H 01 Decline in Price 11 29157.4800000000

2 2014 01 AL 001 Autauga 0021 COTTON 02 RP H 01 Decline in Price 12 5314.3200000000

3 2014 01 AL 001 Autauga 0021 COTTON 03 RPHPE H 01 Decline in Price 11 6294.0000000000

4 2014 01 AL 001 Autauga 0021 COTTON 03 RPHPE H 11 Drought 08 AUG 2098.0000000000

5 2014 01 AL 001 Autauga 0021 COTTON 02 RP H 11 Drought 08 AUG 348.2000000000

merge(): The merge() function is part of base R. It allows you to join data frames the same way that SQL does. merge() accommodates the four different types of join (similar to SQL). The merge() function has the generic form

name.of.resulting.data.frame <- merge(x=name.of.data.frame.to.be.joined, y=name.of.data.frame.to.be.joined,

by x=c("variable on which to be joined", "variable on which to be joined", etc.),

by y=c("variable on which to be joined", "variable on which to be joined", etc.))

It is important to note that the variables on which to be joined DO NOT need to have the same names in the two data sets (x and y). They can have different names and still be joined.

To illustrate how merge() handles the four different types of join, we will use the "Loss Indemnities by Month" files for crop years 2013 and 2014. You can download the loss files for all 26 available years from this link: https://drive.google.com/open?id=0B0cwsC-mhE_ESWNUazRxMktMZGs. You will need to manually download and unzip the folder. You will also need to import the data frames into R using the read.csv() function. If you need assistance with this task, see the lessons Government Data Cleaning Exercises and Import & Export Data.

First, let's remind ourselves what the two data frames look like. Note that we named the 2014 loss file as "COLMNT14" and the 2013 file as "COLMNT13." The 2013 file has slightly more rows than the 2014 file. Both data frames have identical column names.

> head(COLMNT14,5) Crop_Year_Identifier State_Code State_Abbreviation County_Code County_Name Crop_Code Crop_Name Insurance_Plan_Code 1 2014 01 AL 001 Autauga 0021 COTTON 02 2 2014 01 AL 001 Autauga 0021 COTTON 02 3 2014 01 AL 001 Autauga 0021 COTTON 03 4 2014 01 AL 001 Autauga 0021 COTTON 02 5 2014 01 AL 001 Autauga 0021 COTTON 03 Insurance_Plan_Name_Abbreviation Stage_Code Cause_of_Loss_Code Cause_of_Loss_Description Month_of_Loss Month_of_Loss_Abbreviation Indemnity_Amount 1 RP H 01 Decline in Price 12 5314.3200000000 2 RP H 01 Decline in Price 11 29157.4800000000 3 RPHPE H 01 Decline in Price 11 6294.0000000000 4 RP H 11 Drought 08 AUG 348.2000000000 5 RPHPE H 11 Drought 08 AUG 2098.0000000000 > dim(COLMNT14) [1] 137344 15

> colnames(COLMNT14) [1] "Crop_Year_Identifier" "State_Code" "State_Abbreviation" "County_Code" [5] "County_Name" "Crop_Code" "Crop_Name" "Insurance_Plan_Code" [9] "Insurance_Plan_Name_Abbreviation" "Stage_Code" "Cause_of_Loss_Code" "Cause_of_Loss_Description" [13] "Month_of_Loss" "Month_of_Loss_Abbreviation" "Indemnity_Amount"
> dim(COLMNT13) [1] 137497 15

            >                        colnames(COLMNT13)                        [1] "Crop_Year_Identifier"             "State_Code"                       "State_Abbreviation"               "County_Code"                       [5] "County_Name"                      "Crop_Code"                        "Crop_Name"                        "Insurance_Plan_Code"               [9] "Insurance_Plan_Name_Abbreviation" "Stage_Code"                       "Cause_of_Loss_Code"               "Cause_of_Loss_Description"        [13] "Month_of_Loss"                    "Month_of_Loss_Abbreviation"       "Indemnity_Amount"

INNER JOIN

As the illustration above shown, an inner join returns all the matching records that exist in both tables 1 and 2. In the example below, we want to see all the records in COLMNT13 (table 1) and COLMNT14 (table 2) that have matching State_Code, County_Code, Crop_Code, Insurance_Plan_Code, Stage_Code, Cause_of_Loss_Code, and Month_of_Loss. (We are not using the State_Abbreviation, County_Name, Crop_Name, Insurance_Plan_Name_Abbreviation, Cause_of_Loss_Description, and Month_of_Loss_Abbreviation since they contain the same information as the "code" columns.)

            >                        COLMNT1314.INNERJOIN <- merge (x=COLMNT13, y=COLMNT14,                                      
+                                      by.x=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"),            +                                      by.y=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"))                                      >              
                        head(COLMNT1314.INNERJOIN,5)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year_Identifier.x State_Abbreviation.x 1         01         001      9999                  90         H                  42            03                   2013                   AL 2         01         003      0011                  02         H                  31            06                   2013                   AL 3         01         003      0011                  02         H                  31            07                   2013                   AL 4         01         003      0020                  41         UH                 31            07                   2013                   AL 5         01         003      0021                  01         H                  31            07                   2013                   AL                    County_Name.x                    Crop_Name.x Insurance_Plan_Name_Abbreviation.x         Cause_of_Loss_Description.x Month_of_Loss_Abbreviation.x 1 Autauga                        All Other Crops                                        APH        Freeze                                                       MAR 2 Baldwin                        WHEAT                                                  RP         Excess Moisture/Precip/Rain                                  JUN 3 Baldwin                        WHEAT                                                  RP         Excess Moisture/Precip/Rain                                  JUL 4 Baldwin                        PECANS                                                 PRV        Excess Moisture/Precip/Rain                                  JUL 5 Baldwin                        COTTON                                                 YP         Excess Moisture/Precip/Rain                                  JUL   Indemnity_Amount.x Crop_Year_Identifier.y State_Abbreviation.y                  County_Name.y                    Crop_Name.y Insurance_Plan_Name_Abbreviation.y 1   10542.5600000000                   2014                   AL Autauga                        All Other Crops                                        APH        2    7718.4000000000                   2014                   AL Baldwin                        WHEAT                                                  RP         3    9141.0000000000                   2014                   AL Baldwin                        WHEAT                                                  RP         4   34110.0000000000                   2014                   AL Baldwin                        PECANS                                                 PRV        5    4389.5000000000                   2014                   AL Baldwin                        COTTON                                                 YP                   Cause_of_Loss_Description.y Month_of_Loss_Abbreviation.y Indemnity_Amount.y 1 Freeze                                                       MAR  132292.0000000000 2 Excess Moisture/Precip/Rain                                  JUN   47635.6000000000 3 Excess Moisture/Precip/Rain                                  JUL   15626.0000000000 4 Excess Moisture/Precip/Rain                                  JUL    2498.0000000000 5 Excess Moisture/Precip/Rain                                  JUL    5600.0000000000                          >              
                        dim(COLMNT1314.INNERJOIN)            [1] 45400    23

We note several things in the above inner join output:

1. R keeps the names of all the columns in both tables that are not explicitly specified in the "by" statements.

2. There are 45,400 rows that can be matched using the seven keys specified in the "by" statements in the two tables.

The two variables that we are interested in the most from the inner join are Indemnity_Amount.x and Indemnity_Amount.y. We should rename the columns to make them more intuitive. The rename() function in the plyr package is great for this task.

            >                        COLMNT1314.INNERJOIN<-rename(COLMNT1314.INNERJOIN, c("Indemnity_Amount.x" = "Indemnity_Amount.2013", "Indemnity_Amount.y" = "Indemnity_Amount.2014"))                        >                        head(COLMNT1314.INNERJOIN,5)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year_Identifier.x State_Abbreviation.x 1         01         001      9999                  90         H                  42            03                   2013                   AL 2         01         003      0011                  02         H                  31            06                   2013                   AL 3         01         003      0011                  02         H                  31            07                   2013                   AL 4         01         003      0020                  41         UH                 31            07                   2013                   AL 5         01         003      0021                  01         H                  31            07                   2013                   AL                    County_Name.x                    Crop_Name.x Insurance_Plan_Name_Abbreviation.x         Cause_of_Loss_Description.x Month_of_Loss_Abbreviation.x 1 Autauga                        All Other Crops                                        APH        Freeze                                                       MAR 2 Baldwin                        WHEAT                                                  RP         Excess Moisture/Precip/Rain                                  JUN 3 Baldwin                        WHEAT                                                  RP         Excess Moisture/Precip/Rain                                  JUL 4 Baldwin                        PECANS                                                 PRV        Excess Moisture/Precip/Rain                                  JUL 5 Baldwin                        COTTON                                                 YP         Excess Moisture/Precip/Rain                                  JUL   Indemnity_Amount.2013 Crop_Year_Identifier.y State_Abbreviation.y                  County_Name.y                    Crop_Name.y Insurance_Plan_Name_Abbreviation.y 1      10542.5600000000                   2014                   AL Autauga                        All Other Crops                                        APH        2       7718.4000000000                   2014                   AL Baldwin                        WHEAT                                                  RP         3       9141.0000000000                   2014                   AL Baldwin                        WHEAT                                                  RP         4      34110.0000000000                   2014                   AL Baldwin                        PECANS                                                 PRV        5       4389.5000000000                   2014                   AL Baldwin                        COTTON                                                 YP                   Cause_of_Loss_Description.y Month_of_Loss_Abbreviation.y Indemnity_Amount.2014 1 Freeze                                                       MAR     132292.0000000000 2 Excess Moisture/Precip/Rain                                  JUN      47635.6000000000 3 Excess Moisture/Precip/Rain                                  JUL      15626.0000000000 4 Excess Moisture/Precip/Rain                                  JUL       2498.0000000000 5 Excess Moisture/Precip/Rain                                  JUL       5600.0000000000

Let's also say that we are not interested in keeping all the columns. We want to simplify our resulting table by keeping the seven join keys plus the two renamed columns. Here's how we would do it.

            >                        COLMNT1314.INNERJOIN <- COLMNT1314.INNERJOIN[,c(1:7,15,23)]                        >                        head(COLMNT1314.INNERJOIN)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount.2013 Indemnity_Amount.2014 1         01         001      9999                  90         H                  42            03      10542.5600000000     132292.0000000000 2         01         003      0011                  02         H                  31            06       7718.4000000000      47635.6000000000 3         01         003      0011                  02         H                  31            07       9141.0000000000      15626.0000000000 4         01         003      0020                  41         UH                 31            07      34110.0000000000       2498.0000000000 5         01         003      0021                  01         H                  31            07       4389.5000000000       5600.0000000000 6         01         003      0021                  02         H                  31            07     153397.5000000000       3713.6000000000

LEFT OUTER JOIN

A left outer join returns all the records from the left table (table 1) plus any matching records from the right table (table 2). Let's repeat what we did above except as a left outer join this time.

Note in the codes below that all.x = TRUE. This tells R to includes all the records from the left table (x; table 1). Also note the output has 137,497 rows--which is the same number of rows as the left table (COLMNT13).

            >                        COLMNT1314.LEFTOUTERJOIN <- merge (x=COLMNT13, y=COLMNT14,                        +                                      by.x=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"),                        +                                      by.y=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"),              all.x=TRUE)                        >                        head(COLMNT1314.LEFTOUTERJOIN,5)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year_Identifier.x State_Abbreviation.x 1         01         001      0011                  01         H                  31            05                   2013                   AL 2         01         001      0041                  02         H                  11            05                   2013                   AL 3         01         001      0041                  02         H                  31            06                   2013                   AL 4         01         001      0041                  02         UH                 11            05                   2013                   AL 5         01         001      0041                  02         UH                 31            06                   2013                   AL                    County_Name.x                    Crop_Name.x Insurance_Plan_Name_Abbreviation.x         Cause_of_Loss_Description.x Month_of_Loss_Abbreviation.x 1 Autauga                        WHEAT                                                  YP         Excess Moisture/Precip/Rain                                  MAY 2 Autauga                        CORN                                                   RP         Drought                                                      MAY 3 Autauga                        CORN                                                   RP         Excess Moisture/Precip/Rain                                  JUN 4 Autauga                        CORN                                                   RP         Drought                                                      MAY 5 Autauga                        CORN                                                   RP         Excess Moisture/Precip/Rain                                  JUN   Indemnity_Amount.x Crop_Year_Identifier.y State_Abbreviation.y County_Name.y Crop_Name.y Insurance_Plan_Name_Abbreviation.y Cause_of_Loss_Description.y 1    3490.0000000000                   <NA>                 <NA>          <NA>        <NA>                               <NA>                        <NA> 2    1683.5000000000                   <NA>                 <NA>          <NA>        <NA>                               <NA>                        <NA> 3    1683.5000000000                   <NA>                 <NA>          <NA>        <NA>                               <NA>                        <NA> 4    5471.5000000000                   <NA>                 <NA>          <NA>        <NA>                               <NA>                        <NA> 5    5471.5000000000                   <NA>                 <NA>          <NA>        <NA>                               <NA>                        <NA>   Month_of_Loss_Abbreviation.y Indemnity_Amount.y 1                         <NA>               <NA> 2                         <NA>               <NA> 3                         <NA>               <NA> 4                         <NA>               <NA> 5                         <NA>               <NA>            >                        dim(COLMNT1314.LEFTOUTERJOIN)            [1] 137497     23            >                        dim(COLMNT13)            [1] 137497     15            >                        dim(COLMNT14)            [1] 137344     15

Now let's simplify our results:

            >                        COLMNT1314.LEFTOUTERJOIN <- rename(COLMNT1314.LEFTOUTERJOIN, c("Indemnity_Amount.x" = "Indemnity_Amount.2013", "Indemnity_Amount.y" = "Indemnity_Amount.2014"))                        >                        COLMNT1314.LEFTOUTERJOIN <- COLMNT1314.LEFTOUTERJOIN[,c(1:7,15,23)]                        >                        head(COLMNT1314.LEFTOUTERJOIN)                        State_Code County_Code Crop_Code Insurance_Plan_Code 1         01         001      0011                  01 2         01         001      0041                  02 3         01         001      0041                  02 4         01         001      0041                  02 5         01         001      0041                  02 6         01         001      0081                  02   Stage_Code Cause_of_Loss_Code Month_of_Loss 1         H                  31            05 2         H                  11            05 3         H                  31            06 4         UH                 11            05 5         UH                 31            06 6         H                  01            07   Indemnity_Amount.2013 Indemnity_Amount.2014 1       3490.0000000000                  <NA> 2       1683.5000000000                  <NA> 3       1683.5000000000                  <NA> 4       5471.5000000000                  <NA> 5       5471.5000000000                  <NA> 6      -1753.0400000000                  <NA>

RIGHT OUTER JOIN

A right outer join returns all the records from the right table (table 2) plus any matching records from the left table (table 1). Let's repeat what we did above except as a right outer join this time.
Note in the codes below that all.y = TRUE. This tells R to includes all the records from the right table (y; table 2). Also note the output has 137,344 rows--which is the same number of rows as the right table (COLMNT14).

            >                        COLMNT1314.RIGHTOUTERJOIN <- merge (x=COLMNT13, y=COLMNT14,                        +                                      by.x=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"),                        +                                      by.y=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"),              all.y=TRUE)                        >                        dim(COLMNT1314.RIGHTOUTERJOIN)            [1] 137344     23            >                        dim(COLMNT13)            [1] 137497     15            >                        dim(COLMNT14)            [1] 137344     15            >                        COLMNT1314.RIGHTOUTERJOIN <- rename(COLMNT1314.RIGHTOUTERJOIN, c("Indemnity_Amount.x" = "Indemnity_Amount.2013", "Indemnity_Amount.y" = "Indemnity_Amount.2014"))                        >                        COLMNT1314.RIGHTOUTERJOIN <- COLMNT1314.RIGHTOUTERJOIN[,c(1:7,15,23)]                        >                        head(COLMNT1314.RIGHTOUTERJOIN)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount.2013 Indemnity_Amount.2014 1         01         001      0021                  02         H                  01            11                  <NA>      29157.4800000000 2         01         001      0021                  02         H                  01            12                  <NA>       5314.3200000000 3         01         001      0021                  02         H                  11            08                  <NA>        348.2000000000 4         01         001      0021                  03         H                  01            11                  <NA>       6294.0000000000 5         01         001      0021                  03         H                  11            08                  <NA>       2098.0000000000 6         01         001      0041                  02         H                  11            07                  <NA>       4198.8000000000

FULL OUTER JOIN

A full outer join returns all the records from both tables while joining the records that match from the left to the right table. Note the option is set to all = TRUE. This tells R to includes all the records from both tables.

            >                        COLMNT1314.FULLOUTERJOIN <- merge (x=COLMNT13, y=COLMNT14,                        +                                      by.x=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"),                        +                                      by.y=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"),              all=TRUE)                        >                        dim(COLMNT1314.FULLOUTERJOIN)            [1] 229441     23            >                        dim(COLMNT13)            [1] 137497     15            >                        dim(COLMNT14)            [1] 137344     15

            >                        COLMNT1314.FULLOUTERJOIN <- rename(COLMNT1314.FULLOUTERJOIN, c("Indemnity_Amount.x" = "Indemnity_Amount.2013", "Indemnity_Amount.y" = "Indemnity_Amount.2014"))                        >                        COLMNT1314.FULLOUTERJOIN <- COLMNT1314.FULLOUTERJOIN[,c(1:7,15,23)]                        >                        head(COLMNT1314.FULLOUTERJOIN)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount.2013 Indemnity_Amount.2014 1         01         001      0011                  01         H                  31            05       3490.0000000000                  <NA> 2         01         001      0021                  02         H                  01            11                  <NA>      29157.4800000000 3         01         001      0021                  02         H                  01            12                  <NA>       5314.3200000000 4         01         001      0021                  02         H                  11            08                  <NA>        348.2000000000 5         01         001      0021                  03         H                  01            11                  <NA>       6294.0000000000 6         01         001      0021                  03         H                  11            08                  <NA>       2098.0000000000

Reduce(): Joining more than two data frames require using the Reduce() function. The Reduce() function works by adding one additional data frame at a time. In the example below, we are joining three tables. The first two tables--COLMNT12 an COLMNT13--are initially joined together and then the Reduce() function adds in the COLMNT14 table. Here we are doing a full outer join on the three data frames.

To make things easier for ourselves, we will first rename the "Indemnity_Amount" column in the three data frames COLMNT12, COLMNT13, and COLMNT14.

            >                        COLMNT12 <- rename(COLMNT12,c("Indemnity_Amount"="Indemnity_Amount_2012"))                        >                        COLMNT13 <- rename(COLMNT13,c("Indemnity_Amount"="Indemnity_Amount_2013"))                        >                        COLMNT14 <- rename(COLMNT14,c("Indemnity_Amount"="Indemnity_Amount_2014"))

                            >                                                        ThreeYears.Merge <- Reduce(function(...)                                                        +                                                        {                                                        +                                                                                      merge(..., by=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), all=TRUE)                                                        +                                                                                                                +                                                        }, list(COLMNT12, COLMNT13, COLMNT14)                                                        +                                                        )                                                        >                                                                                                                >                                                        dim(ThreeYears.Merge)                            [1] 304298     31                            >                                                                                                                >                                                        head(ThreeYears.Merge)                                                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year_Identifier.x State_Abbreviation.x                  County_Name.x                    Crop_Name.x Insurance_Plan_Name_Abbreviation.x 1         01         001      0011                  01         H                  31            05                   <NA>                 <NA>                           <NA>                           <NA>                               <NA> 2         01         001      0021                  02         H                  01            10                   2012                   AL Autauga                        COTTON                                                 RP         3         01         001      0021                  02         H                  01            11                   <NA>                 <NA>                           <NA>                           <NA>                               <NA> 4         01         001      0021                  02         H                  01            12                   <NA>                 <NA>                           <NA>                           <NA>                               <NA> 5         01         001      0021                  02         H                  11            08                   <NA>                 <NA>                           <NA>                           <NA>                               <NA> 6         01         001      0021                  03         H                  01            11                   <NA>                 <NA>                           <NA>                           <NA>                               <NA>           Cause_of_Loss_Description.x Month_of_Loss_Abbreviation.x Indemnity_Amount_2012 Crop_Year_Identifier.y State_Abbreviation.y                  County_Name.y                    Crop_Name.y Insurance_Plan_Name_Abbreviation.y 1                                <NA>                         <NA>                    NA                   2013                   AL Autauga                        WHEAT                                                  YP         2 Decline in Price                                                                   952                   <NA>                 <NA>                           <NA>                           <NA>                               <NA> 3                                <NA>                         <NA>                    NA                   <NA>                 <NA>                           <NA>                           <NA>                               <NA> 4                                <NA>                         <NA>                    NA                   <NA>                 <NA>                           <NA>                           <NA>                               <NA> 5                                <NA>                         <NA>                    NA                   <NA>                 <NA>                           <NA>                           <NA>                               <NA> 6                                <NA>                         <NA>                    NA                   <NA>                 <NA>                           <NA>                           <NA>                               <NA>           Cause_of_Loss_Description.y Month_of_Loss_Abbreviation.y Indemnity_Amount_2013 Crop_Year_Identifier State_Abbreviation                    County_Name                      Crop_Name Insurance_Plan_Name_Abbreviation 1 Excess Moisture/Precip/Rain                                  MAY                  3490                 <NA>               <NA>                           <NA>                           <NA>                             <NA> 2                                <NA>                         <NA>                    NA                 <NA>               <NA>                           <NA>                           <NA>                             <NA> 3                                <NA>                         <NA>                    NA                 2014                 AL Autauga                        COTTON                                               RP         4                                <NA>                         <NA>                    NA                 2014                 AL Autauga                        COTTON                                               RP         5                                <NA>                         <NA>                    NA                 2014                 AL Autauga                        COTTON                                               RP         6                                <NA>                         <NA>                    NA                 2014                 AL Autauga                        COTTON                                               RPHPE                  Cause_of_Loss_Description Month_of_Loss_Abbreviation Indemnity_Amount_2014 1                                <NA>                       <NA>                    NA 2                                <NA>                       <NA>                    NA 3 Decline in Price                                                            29157.48 4 Decline in Price                                                             5314.32 5 Drought                                                    AUG                348.20 6 Decline in Price                                                             6294.00                                                          >                              
                                                        ThreeYears.Merge <- ThreeYears.Merge[,c(1:7,15,23,31)]                                                        >                                                        head(ThreeYears.Merge)                                                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount_2012 Indemnity_Amount_2013 Indemnity_Amount_2014 1         01         001      0011                  01         H                  31            05                    NA                  3490                    NA 2         01         001      0021                  02         H                  01            10                   952                    NA                    NA 3         01         001      0021                  02         H                  01            11                    NA                    NA              29157.48 4         01         001      0021                  02         H                  01            12                    NA                    NA               5314.32 5         01         001      0021                  02         H                  11            08                    NA                    NA                348.20 6         01         001      0021                  03         H                  01            11                    NA                    NA               6294.00

COMBINING DATA USING plyr PACKAGE

The plyr package has a join() function that (usually) allows for faster data combination than the merge() function in base R. The drawback of plyr join() is that the columns in all the data frames must have the same names in order to be joined. An advantage of plyr join is you can simply specify the type of join you want in plain English.

resulting.data.frame <- join(x=first.table, y=second.table, by=c("variable to be joined by", "variable to be joined by", etc), type="type of join")

The type option can have one of four values: inner, left, right, or full.

Below is the inner join we did above using merge() but with plyr join() this time.

            >                        COLMNT1314_PLYR_INNERJOIN <- join(x=COLMNT13, y=COLMNT14, by=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), type="inner")            >                        head(COLMNT1314_PLYR_INNERJOIN)                        Crop_Year_Identifier State_Code State_Abbreviation County_Code                    County_Name Crop_Code                      Crop_Name Insurance_Plan_Code 1                 2013         01                 AL         001 Autauga                             9999 All Other Crops                                 90 2                 2013         01                 AL         003 Baldwin                             0011 WHEAT                                           02 3                 2013         01                 AL         003 Baldwin                             0011 WHEAT                                           02 4                 2013         01                 AL         003 Baldwin                             0020 PECANS                                          41 5                 2013         01                 AL         003 Baldwin                             0021 COTTON                                          01 6                 2013         01                 AL         003 Baldwin                             0021 COTTON                                          02   Insurance_Plan_Name_Abbreviation Stage_Code Cause_of_Loss_Code           Cause_of_Loss_Description Month_of_Loss Month_of_Loss_Abbreviation  Indemnity_Amount 1                       APH                H                  42 Freeze                                         03                        MAR  10542.5600000000 2                       RP                 H                  31 Excess Moisture/Precip/Rain                    06                        JUN   7718.4000000000 3                       RP                 H                  31 Excess Moisture/Precip/Rain                    07                        JUL   9141.0000000000 4                       PRV                UH                 31 Excess Moisture/Precip/Rain                    07                        JUL  34110.0000000000 5                       YP                 H                  31 Excess Moisture/Precip/Rain                    07                        JUL   4389.5000000000 6                       RP                 H                  31 Excess Moisture/Precip/Rain                    07                        JUL 153397.5000000000   Crop_Year_Identifier State_Abbreviation                    County_Name                      Crop_Name Insurance_Plan_Name_Abbreviation 1                 2014                 AL Autauga                        All Other Crops                                      APH        2                 2014                 AL Baldwin                        WHEAT                                                RP         3                 2014                 AL Baldwin                        WHEAT                                                RP         4                 2014                 AL Baldwin                        PECANS                                               PRV        5                 2014                 AL Baldwin                        COTTON                                               YP         6                 2014                 AL Baldwin                        COTTON                                               RP                     Cause_of_Loss_Description Month_of_Loss_Abbreviation  Indemnity_Amount 1 Freeze                                                     MAR 132292.0000000000 2 Excess Moisture/Precip/Rain                                JUN  47635.6000000000 3 Excess Moisture/Precip/Rain                                JUL  15626.0000000000 4 Excess Moisture/Precip/Rain                                JUL   2498.0000000000 5 Excess Moisture/Precip/Rain                                JUL   5600.0000000000 6 Excess Moisture/Precip/Rain                                JUL   3713.6000000000                          >              
                        dim(COLMNT1314_PLYR_INNERJOIN)            [1] 45400    23

Joining more than two data frames also requires the Reduce() function. Here we repeat the full outer join of COLMNT12, COLMNT13, and COLMNT14 data frames as was done above using the merge() function. The difference between merge() and plyr join() is that the former preserves all columns that are not named as key join variables whereas the latter does not.

                    >                                        ThreeYears.plyr.Join <- Reduce(function(...)                                        +                                        {                                        +                                                              join(..., by=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"), type="full")                                        +                                                                                +                                        }, list(COLMNT12, COLMNT13, COLMNT14)                                        +                                        )                                        >                                                                                                      >                      
                                        dim(ThreeYears.plyr.Join)                    [1] 304298     17                    >                                                                                                      >                      
                                        head(ThreeYears.plyr.Join)                                        Crop_Year_Identifier State_Code State_Abbreviation County_Code                    County_Name Crop_Code                      Crop_Name Insurance_Plan_Code Insurance_Plan_Name_Abbreviation Stage_Code Cause_of_Loss_Code 1                 2012         01                 AL         001 Autauga                             0021 COTTON                                          02                       RP                 H                  01 2                 2012         01                 AL         001 Autauga                             0041 CORN                                            02                       RP                 UH                 01 3                 2012         01                 AL         001 Autauga                             0041 CORN                                            02                       RP                 UH                 11 4                 2012         01                 AL         001 Autauga                             0041 CORN                                            02                       RP                 UH                 11 5                 2012         01                 AL         001 Autauga                             0081 SOYBEANS                                        02                       RP                 H                  11 6                 2012         01                 AL         001 Autauga                             0081 SOYBEANS                                        02                       RP                 H                  11             Cause_of_Loss_Description Month_of_Loss Month_of_Loss_Abbreviation Indemnity_Amount_2012 Indemnity_Amount_2013 Indemnity_Amount_2014 1 Decline in Price                               10                                            952.0                    NA                    NA 2 Decline in Price                               07                        JUL                  12.3                    NA                    NA 3 Drought                                        07                        JUL                 602.7                    NA                 518.5 4 Drought                                        06                        JUN                 615.0                    NA                    NA 5 Drought                                        06                        JUN               -3366.0                    NA                    NA 6 Drought                                        07                        JUL               -3267.0                    NA                    NA

COMBINING DATA USING data.table

Joins can also be done with the data.table package. By converting the data frame into a data table and setting index keys, we allow R to perform a join at a faster speed than plyr join or merge() with data frames.

First, let's convert our data frames into data tables.

            >                        COLMNT13.dt <- data.table(COLMNT13, key=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"))                        >                        COLMNT14.dt <- data.table(COLMNT14, key=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code", "Month_of_Loss"))

Now let's remind ourselves the dimensions of our data tables.

            >                        dim(COLMNT13.dt)            [1] 137497     15            >                        dim(COLMNT14.dt)            [1] 137344     15

An inner join is performed using the merge() function in base R.

            >                        dt.InnerJoin <- merge(COLMNT13.dt,COLMNT14.dt)                                      
>                        dim(dt.InnerJoin)            [1] 45400    23

There are two ways to do a left outer join. The first is to use the merge() function. The second way is by placing the "left table" (i.e. table 1; x) inside square brackets [ ].

            >                        dt.LeftOuterJoin <- merge(COLMNT13.dt,COLMNT14.dt, all.x=TRUE)                                      
>                        dim(dt.LeftOuterJoin)            [1] 137497     23                                      >              
                        dt.LeftOuterJoin <- COLMNT14.dt[COLMNT13.dt]                                      
>                        dim(dt.LeftOuterJoin)            [1] 137497     23

RIGHT OUTER JOIN

A right outer join is just the reverse of the left outer join.

            >                        dt.RightOuterJoin <- merge(COLMNT13.dt, COLMNT14.dt, all.y=TRUE)                                      
>                        dim(dt.RightOuterJoin)            [1] 137344     23                          >              
                        dt.RightOuterJoin <- COLMNT13.dt[COLMNT14.dt]                        >                        dim(dt.RightOuterJoin)            [1] 137344     23

A full outer join is performed using the merge() function.

            >                        dt.FullOuterJoin <- merge(COLMNT13.dt, COLMNT14.dt, all=TRUE)                                      
>                        dim(dt.FullOuterJoin)            [1] 229441     23

MELTING DATA: COLUMN TO ROW ORIENTATION Melting data is the process of taking column orientation and turning it into row orientation. The reshape2 package by Hadley Wickham will help us in this task. For more information about reshape2 (including a user's reference manual), please see https://cran.r-project.org/web/packages/reshape2/index.html.

In the above example, we created a data frame called "ThreeYears.Merge." Let's remind ourselves what the data frame looks like.

            >                        head(ThreeYears.Merge)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount_2012 Indemnity_Amount_2013 Indemnity_Amount_2014 1         01         001      0011                  01         H                  31            05                    NA                  3490                    NA 2         01         001      0021                  02         H                  01            10                   952                    NA                    NA 3         01         001      0021                  02         H                  01            11                    NA                    NA              29157.48 4         01         001      0021                  02         H                  01            12                    NA                    NA               5314.32 5         01         001      0021                  02         H                  11            08                    NA                    NA                348.20 6         01         001      0021                  03         H                  01            11                    NA                    NA               6294.00            >                        tail(ThreeYears.Merge)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Indemnity_Amount_2012 Indemnity_Amount_2013 Indemnity_Amount_2014 304293         56         999      9999                  14         FL                 55            09                    NA                    NA                  6601 304294         56         999      9999                  14         FL                 55            10                    NA                    NA                 27274 304295         56         999      9999                  14         FL                 55            12                    NA                 45119                 14324 304296         56         999      9999                  50                            42            05                    NA                    NA                 20571 304297         56         999      9999                  90         H                  12            08                    NA                  6391                    NA 304298         56         999      9999                  90         UH                 12            08                  5036                    NA                    NA

            >                        dim(ThreeYears.Merge)            [1] 304298     10

Instead of having three columns titled "Indemnity_Amount_2012", "Indemnity_Amount_2013", and "Indemnity_Amount_2014", respectively, we would like to transform the data frame so that there is only one column identifying the Indemnity Amount and another column identifying the crop year. We accomplish this task using the melt() function in reshape2.

resulting.data.frame <- melt(data.frame.to.be.melted, id.vars=c("key variable 1", "key variable 2", etc), variable.name = "identifying variable for the melted variable", value.name = "variable to be melted")

Here is how we would melt the ThreeYears.Merge data frame.

            >                        column.to.row <- melt(ThreeYears.Merge, id.vars=c("State_Code", "County_Code", "Crop_Code", "Insurance_Plan_Code", "Stage_Code", "Cause_of_Loss_Code",              
"Month_of_Loss"),variable.name="Crop_Year", value.name="Indemnity_Amount")                                      >              
                        head(column.to.row,10)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss             Crop_Year Indemnity_Amount 1          01         001      0011                  01         H                  31            05 Indemnity_Amount_2012               NA 2          01         001      0021                  02         H                  01            10 Indemnity_Amount_2012              952 3          01         001      0021                  02         H                  01            11 Indemnity_Amount_2012               NA 4          01         001      0021                  02         H                  01            12 Indemnity_Amount_2012               NA 5          01         001      0021                  02         H                  11            08 Indemnity_Amount_2012               NA 6          01         001      0021                  03         H                  01            11 Indemnity_Amount_2012               NA 7          01         001      0021                  03         H                  11            08 Indemnity_Amount_2012               NA 8          01         001      0041                  02         H                  11            05 Indemnity_Amount_2012               NA 9          01         001      0041                  02         H                  11            07 Indemnity_Amount_2012               NA 10         01         001      0041                  02         H                  11            08 Indemnity_Amount_2012               NA                          >              
                        tail(column.to.row,10)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss             Crop_Year Indemnity_Amount 912885         56         999      9999                  14         FL                 55            05 Indemnity_Amount_2014               NA 912886         56         999      9999                  14         FL                 55            06 Indemnity_Amount_2014               NA 912887         56         999      9999                  14         FL                 55            07 Indemnity_Amount_2014               NA 912888         56         999      9999                  14         FL                 55            08 Indemnity_Amount_2014               NA 912889         56         999      9999                  14         FL                 55            09 Indemnity_Amount_2014             6601 912890         56         999      9999                  14         FL                 55            10 Indemnity_Amount_2014            27274 912891         56         999      9999                  14         FL                 55            12 Indemnity_Amount_2014            14324 912892         56         999      9999                  50                            42            05 Indemnity_Amount_2014            20571 912893         56         999      9999                  90         H                  12            08 Indemnity_Amount_2014               NA 912894         56         999      9999                  90         UH                 12            08 Indemnity_Amount_2014               NA

We can simplify the resulting data frame further by using the str_sub() function in the stringr package:

            >                        column.to.row$Crop_Year <- str_sub(column.to.row$Crop_Year, start=18, end=21)

            >                        head(column.to.row,10)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year Indemnity_Amount 1          01         001      0011                  01         H                  31            05      2012               NA 2          01         001      0021                  02         H                  01            10      2012              952 3          01         001      0021                  02         H                  01            11      2012               NA 4          01         001      0021                  02         H                  01            12      2012               NA 5          01         001      0021                  02         H                  11            08      2012               NA 6          01         001      0021                  03         H                  01            11      2012               NA 7          01         001      0021                  03         H                  11            08      2012               NA 8          01         001      0041                  02         H                  11            05      2012               NA 9          01         001      0041                  02         H                  11            07      2012               NA 10         01         001      0041                  02         H                  11            08      2012               NA                          >              
                        tail(column.to.row,10)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss Crop_Year Indemnity_Amount 912885         56         999      9999                  14         FL                 55            05      2014               NA 912886         56         999      9999                  14         FL                 55            06      2014               NA 912887         56         999      9999                  14         FL                 55            07      2014               NA 912888         56         999      9999                  14         FL                 55            08      2014               NA 912889         56         999      9999                  14         FL                 55            09      2014             6601 912890         56         999      9999                  14         FL                 55            10      2014            27274 912891         56         999      9999                  14         FL                 55            12      2014            14324 912892         56         999      9999                  50                            42            05      2014            20571 912893         56         999      9999                  90         H                  12            08      2014               NA 912894         56         999      9999                  90         UH                 12            08      2014               NA

CASTING DATA: ROW TO COLUMN ORIENTATION (WIDE FORMAT) Casting data is the process of taking row orientation and turning it into column orientation (also called a "wide format"). The function dcast() achieves this task, and it has the generic structure of

resulting.data.frame <- dcast(data.frame.to.be.casted, column 1 to remain as column + column 2 to remain as column + etc ~ column 1 to become rows + column 2 to become rows + etc, value.var = "column that has values to be populated into new columns")

Here is an example to clarify what we mean. We take the above column.to.row data frame and cast it back into a wide format.

            >                        row.to.column <- dcast(column.to.row, State_Code + County_Code + Crop_Code + Insurance_Plan_Code + Stage_Code +              
Cause_of_Loss_Code + Month_of_Loss ~ Crop_Year, value.var = "Indemnity_Amount")                                      
>                        head(row.to.column,10)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss 2012   2013     2014 1          01         001      0011                  01         H                  31            05   NA 3490.0       NA 2          01         001      0021                  02         H                  01            10  952     NA       NA 3          01         001      0021                  02         H                  01            11   NA     NA 29157.48 4          01         001      0021                  02         H                  01            12   NA     NA  5314.32 5          01         001      0021                  02         H                  11            08   NA     NA   348.20 6          01         001      0021                  03         H                  01            11   NA     NA  6294.00 7          01         001      0021                  03         H                  11            08   NA     NA  2098.00 8          01         001      0041                  02         H                  11            05   NA 1683.5       NA 9          01         001      0041                  02         H                  11            07   NA     NA  4198.80 10         01         001      0041                  02         H                  11            08   NA     NA  2592.00                          >              
                        tail(row.to.column,10)                        State_Code County_Code Crop_Code Insurance_Plan_Code Stage_Code Cause_of_Loss_Code Month_of_Loss 2012  2013  2014 304289         56         999      9999                  14         FL                 55            05   NA  8012    NA 304290         56         999      9999                  14         FL                 55            06   NA 11679    NA 304291         56         999      9999                  14         FL                 55            07   NA  7729    NA 304292         56         999      9999                  14         FL                 55            08   NA  5528    NA 304293         56         999      9999                  14         FL                 55            09   NA    NA  6601 304294         56         999      9999                  14         FL                 55            10   NA    NA 27274 304295         56         999      9999                  14         FL                 55            12   NA 45119 14324 304296         56         999      9999                  50                            42            05   NA    NA 20571 304297         56         999      9999                  90         H                  12            08   NA  6391    NA 304298         56         999      9999                  90         UH                 12            08 5036    NA    NA

coopersciusurturs.blogspot.com

Source: https://sites.google.com/site/xuanphamru/about/functions/control-statements-loops/data-reshaping

Easy Way to Cbind Dataframe With Itself to One Column

0 Response to "Easy Way to Cbind Dataframe With Itself to One Column"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel