ECON 413
Data types and data objects
Erol Taymaz
Department of Economics
Middle East Technical University
Base R and most R packages are available at cran.r-project.org
There are thousands of R packages
Objects
Everything in R is an object
## [1] 1 2 3 4 5
## [1] 15
## function (..., na.rm = FALSE) .Primitive("sum")
##
## Call:
## lm(formula = a ~ b)
##
## Coefficients:
## (Intercept) b
## 0.01398 0.51570
## List of 12
## $ coefficients : Named num [1:2] 0.014 0.516
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "b"
## $ residuals : Named num [1:100] 0.1403 -1.1579 -1.0822 0.4535 0.0618 ...
## ..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
## $ effects : Named num [1:100] 0.093 7.815 -1.181 0.464 0.071 ...
## ..- attr(*, "names")= chr [1:100] "(Intercept)" "b" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:100] 1.0367 -0.1663 0.682 -0.0744 -0.0659 ...
## ..- attr(*, "names")= chr [1:100] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:100, 1:2] -10 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:100] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "b"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.1 1.03
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 98
## $ xlevels : Named list()
## $ call : language lm(formula = a ~ b)
## $ terms :Classes 'terms', 'formula' language a ~ b
## .. ..- attr(*, "variables")= language list(a, b)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "a" "b"
## .. .. .. ..$ : chr "b"
## .. ..- attr(*, "term.labels")= chr "b"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(a, b)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "a" "b"
## $ model :'data.frame': 100 obs. of 2 variables:
## ..$ a: num [1:100] 1.177 -1.3242 -0.4002 0.3791 -0.0041 ...
## ..$ b: num [1:100] 1.983 -0.35 1.295 -0.171 -0.155 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language a ~ b
## .. .. ..- attr(*, "variables")= language list(a, b)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "a" "b"
## .. .. .. .. ..$ : chr "b"
## .. .. ..- attr(*, "term.labels")= chr "b"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(a, b)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "a" "b"
## - attr(*, "class")= chr "lm"
## (Intercept) b
## 0.01397907 0.51569589
## [1] -2.029626e-18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.004508 -0.649363 -0.063774 -0.009299 0.768086 3.326006
##
## Call:
## lm(formula = a ~ b)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.28397 -0.31069 -0.00959 0.38310 1.96025
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01398 0.06674 0.209 0.835
## b 0.51570 0.04402 11.715 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6671 on 98 degrees of freedom
## Multiple R-squared: 0.5834, Adjusted R-squared: 0.5791
## F-statistic: 137.2 on 1 and 98 DF, p-value: < 2.2e-16
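The code that produced the output above is not shown on the slide. A minimal sketch of a session that yields output of the same shape, assuming simulated data (the seed, the coefficient 0.5, and the names `x`, `total`, `f`, and `model` are illustrative assumptions, so the numbers will differ from those above):

```r
# Simulated data: purely illustrative, not the data behind the slide output.
set.seed(1)
b <- rnorm(100)
a <- 0.5 * b + rnorm(100, sd = 0.7)

x <- 1:5             # a vector is an object
total <- sum(x)      # the value returned by a function call is an object
f <- sum             # the function itself is an object as well
model <- lm(a ~ b)   # a fitted model is a list with class "lm"

x
total
f
model
str(model)                 # the list structure behind the fitted model
coef(model)                # estimated coefficients
mean(residuals(model))     # numerically zero for OLS with an intercept
summary(a)
summary(model)
```

Vectors, functions, and fitted models can all be assigned to a name, printed, and passed to other functions, which is what "everything is an object" means in practice.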
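All objects have a “mode” (the type of information they hold). Atomic “modes” are the basic building blocks for data objects in R. There are 6 atomic modes (the atomic vector types): logical, integer, double (numeric), complex, character, and raw.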
## [1] "numeric"
## [1] "character"
## [1] "logical"
## [1] "numeric"
## [1] "character"
## [1] "logical"
## [1] "character"
## [1] "numeric"
## [1] "function"
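As a hedged illustration of how modes can be inspected, the sketch below applies `mode()` and `typeof()` to one example of each atomic type plus a function. The list `objs` and its element names are assumptions made for this example only.

```r
# One example object per atomic type, plus a function for comparison.
objs <- list(
  double    = 1.5,
  integer   = 2L,
  character = "a",
  logical   = TRUE,
  complex   = 1 + 2i,
  raw       = as.raw(255),
  fun       = sum
)

modes <- sapply(objs, mode)     # integer and double share the mode "numeric"
types <- sapply(objs, typeof)   # typeof() tells integer and double apart

print(data.frame(mode = modes, type = types))
```

Note that `mode()` reports "numeric" for both integers and doubles, while `typeof()` reports the storage type.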
All objects belong to one or more classes. There is no limit on the number of classes.
The class of an object defines how the object will be treated by functions.
## [1] "integer"
## [1] "character"
## [1] "logical"
## [1] "character"
## [1] "numeric"
## [1] "data.frame"
## [1] "integer"
## [1] "numeric"
## [1] "matrix" "array"
## [1] "character"
## [1] "matrix" "array"
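The sketch below illustrates both statements: an object can carry several classes, and the class decides which method a generic function such as `print` uses. The class names `celsius` and `temperature` and the object `temp` are hypothetical, introduced only for this example.

```r
x  <- 1:3
m  <- matrix(1:6, nrow = 2)
df <- data.frame(x = x)

class(x)    # "integer"
class(m)    # "matrix" "array" in R 4.0.0 and later
class(df)   # "data.frame"

# summary() is a generic: it treats a vector and a data frame differently.
summary(x)
summary(df)

# An object with two (hypothetical) classes and a print method for one of them.
temp <- structure(list(value = 21.5), class = c("celsius", "temperature"))
print.temperature <- function(x, ...) {
  cat("Temperature:", x$value, "\n")
  invisible(x)
}

print(temp)                     # dispatches to print.temperature
inherits(temp, "temperature")   # TRUE
```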
## [1] 1 2
## [1] 1 2 3 NA Inf -Inf NaN
c, rep, seq, sample, and runif functions
## [1] "numeric"
## num [1:3] 1 2 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.500 2.000 2.333 3.000 4.000
## [1] 1 2 3 4 5
## [1] 5 4 3 2 1
## [1] 1 2 3 4 5 5 4 3 2 1
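a <- rep(1:2, times = 5)             # repeat the whole vector 5 times
b <- rep(1:2, each = 3)              # repeat each element 3 times
d <- rep(1:2, times = 2, each = 3)   # apply each first, then times
e <- rep(1:2, length.out = 5)        # recycle until the result has 5 elements
a; b; d; e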
## [1] 1 2 1 2 1 2 1 2 1 2
## [1] 1 1 1 2 2 2
## [1] 1 1 1 2 2 2 1 1 1 2 2 2
## [1] 1 2 1 2 1
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
## [1] 1.000000 1.166667 1.333333 1.500000 1.666667 1.833333 2.000000
## [1] 5 5 5 2 2 2 5 4 1 3
## [1] 7 1 9 3 5
## [1] 0.45509973 0.12447201 0.18775116 0.07959337 0.49973874 0.98727218
## [7] 0.52644573 0.76917827 0.68505859 0.13331159
## [1] 0.2875775 0.7883051 0.4089769 0.8830174 0.9404673 0.0455565 0.5281055
## [8] 0.8924190 0.5514350 0.4566147
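The calls behind the `seq`, `sample`, and `runif` output above are not shown. A minimal sketch of calls that produce vectors of the same shape; the seed and the exact arguments are assumptions, so the random draws need not match the output above:

```r
set.seed(123)   # assumed seed, only to make the random draws reproducible

s1 <- seq(1, 2, by = 0.1)                      # 11 values in steps of 0.1
s2 <- seq(1, 2, length.out = 7)                # 7 equally spaced values
d1 <- sample(1:5, size = 10, replace = TRUE)   # draws with replacement
d2 <- sample(1:10, size = 5)                   # draws without replacement
u  <- runif(10)                                # uniform draws between 0 and 1

s1; s2; d1; d2; u
```

`sample` without `replace = TRUE` never repeats a value, so `size` cannot exceed the length of the vector being sampled.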
## [1] 0 0 0 0 0 0 0 0 0 0
## [1] 0 0 0 0 0 0 0 0 0 0
## [1] TRUE
## [1] "" "" "" "" "" "" "" "" "" ""
## [1] NA NA NA NA NA NA NA NA NA NA
## [1] 1
## [1] 1 NA
## [1] 1 40
## [1] 1 2 4 40 NA NA NA NA NA 10
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $b
## [1] "a"
##
## $c
## [1] TRUE FALSE TRUE FALSE
## Length Class Mode
## a 10 -none- numeric
## b 1 -none- character
## c 4 -none- logical
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] "list"
## [1] "integer"
## [1] 3
## [1] TRUE
a <- matrix(1:15, ncol = 3, nrow = 5)
b <- matrix(c("a", "b", "c", "d", "e", "f"), ncol = 3, nrow = 2)
a
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 2 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
## [,1] [,2] [,3]
## [1,] "a" "c" "e"
## [2,] "b" "d" "f"
## [1] "matrix" "array"
## int [1:5, 1:3] 1 2 3 4 5 6 7 8 9 10 ...
## V1 V2 V3
## Min. :1 Min. : 6 Min. :11
## 1st Qu.:2 1st Qu.: 7 1st Qu.:12
## Median :3 Median : 8 Median :13
## Mean :3 Mean : 8 Mean :13
## 3rd Qu.:4 3rd Qu.: 9 3rd Qu.:14
## Max. :5 Max. :10 Max. :15
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 2 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 3 8 13
## [1] 1 15
## [1] 6 7 8 9 10
## [1] 6 8
An array is a multidimensional object. A matrix is a two-dimensional array with n rows and m columns.
## [,1] [,2]
## [1,] 1 7
## [2,] 2 8
## [3,] 3 9
## [4,] 4 10
## [5,] 5 11
## [6,] 6 12
## [1] "matrix" "array"
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 13 17 21
## [2,] 14 18 22
## [3,] 15 19 23
## [4,] 16 20 24
## [1] "array"
## [1] 7
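A short sketch consistent with the output above, assuming the objects were built with `matrix` and `array` (the names `m` and `arr` are illustrative):

```r
m   <- matrix(1:12, ncol = 2)          # 6 x 2 matrix, filled column by column
arr <- array(1:24, dim = c(4, 3, 2))   # two 4 x 3 slices

class(m)        # "matrix" "array" in R 4.0.0 and later
class(arr)      # "array"
dim(arr)        # 4 3 2

arr[3, 2, 1]    # row 3, column 2, slice 1: 7
arr[, , 2]      # the whole second slice, itself a 4 x 3 matrix

is.array(m)     # TRUE: every matrix is an array
is.matrix(arr)  # FALSE: an array with three dimensions is not a matrix
```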
aa <- data.frame(a = 1:4, b = c("a", "b", "c", "d"),
                 z = c(1, 3, 5, NA))
bb <- data.frame(a = 1, b = c("A", "B", "C", "D"), z = "Z",
                 stringsAsFactors = FALSE)
aa
## a b z
## 1 1 a 1
## 2 2 b 3
## 3 3 c 5
## 4 4 d NA
## a b z
## 1 1 A Z
## 2 1 B Z
## 3 1 C Z
## 4 1 D Z
## [1] "data.frame"
## [1] "a" "b" "z"
## 'data.frame': 4 obs. of 3 variables:
## $ a: int 1 2 3 4
## $ b: chr "a" "b" "c" "d"
## $ z: num 1 3 5 NA
## a b z
## Min. :1.00 Length:4 Min. :1
## 1st Qu.:1.75 Class :character 1st Qu.:2
## Median :2.50 Mode :character Median :3
## Mean :2.50 Mean :3
## 3rd Qu.:3.25 3rd Qu.:4
## Max. :4.00 Max. :5
## NA's :1
## a b z a b z
## 1 1 a 1 1 A Z
## 2 2 b 3 1 B Z
## 3 3 c 5 1 C Z
## 4 4 d NA 1 D Z
## a b z
## 1 1 a 1
## 2 2 b 3
## 3 3 c 5
## 4 4 d <NA>
## 5 1 A Z
## 6 1 B Z
## 7 1 C Z
## 8 1 D Z
Use the merge function to merge two data frames
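A minimal sketch of `merge`, using two small hypothetical data frames (`left` and `right`, with a shared key column `id`) that are not part of the original slides:

```r
# Hypothetical example data: ids 2 and 3 appear in both data frames.
left  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), y = c(TRUE, FALSE, TRUE))

inner  <- merge(left, right, by = "id")                 # only matching ids
outer  <- merge(left, right, by = "id", all = TRUE)     # all ids, NA where missing
left_j <- merge(left, right, by = "id", all.x = TRUE)   # keep every row of left

inner
outer
left_j
```

By default `merge` keeps only the rows whose key appears in both data frames; the `all`, `all.x`, and `all.y` arguments control which unmatched rows are retained.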
## [1] 1 2 3 4
## [1] 1 2 3 4
## [1] 1 2 3 4
## a
## 1 1
## 2 2
## 3 3
## 4 4
## [1] 1 2 3 4
## [1] 1 2 3 4
## a
## 1 1
## 2 2
## 3 3
## 4 4
## a b z
## 1 1 a 1
## a b z
## 1 1 a 1
## 2 2 b 3
## a b z
## 1 1 a 1
## 3 3 c 5
## [1] "a" "b"
## [1] "a" "b"
## [1] "a" "b"
## [1] "a" "b"
## a b z x y
## 1 1 a 1 4.0 4.0
## 2 2 b 3 8.0 16.0
## 3 3 c 5 1.5 4.5
## 4 4 d NA 7.0 28.0
## a b z y
## 1 1 a 1 4.0
## 2 2 b 3 16.0
## 3 3 c 5 4.5
## 4 4 d NA 28.0
## a b z y
## 1 1 a 1 2.000000
## 2 2 b 3 4.000000
## 3 3 c 5 2.121320
## 4 4 d NA 5.291503
## a b z y
## 1 1 a 1 10.000000
## 2 2 b 3 4.000000
## 3 3 c 5 2.121320
## 4 4 d NA 5.291503
## a b z y
## 1 1 aaa 1 10.000000
## 2 2 b 3 4.000000
## 3 3 c 5 2.121320
## 4 4 d NA 5.291503
## a b z y
## 3 3 c 5 2.121320
## 4 4 d NA 5.291503
## a b z
## 3 3 c 5
## 4 4 d NA
## a b z
## 1 1 aaa 1
## 2 2 b 3
## a b z y
## 1 1000 1000 1000 10.000000
## 2 2 b 3 4.000000
## 3 3 c 5 2.121320
## 4 4 d NA 5.291503
## a b z y
## 1 1000 1000 1000 10.00000
## 2 2 b 3 4.00000
## 3 3 c 5 2.12132
Be very careful when using na.omit! It drops every row that has an NA in any column, not only in the column you are working with.
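A small sketch of the risk, reusing the `aa` data frame defined earlier in these slides (the name `clean` is illustrative):

```r
aa <- data.frame(a = 1:4, b = c("a", "b", "c", "d"),
                 z = c(1, 3, 5, NA))

clean <- na.omit(aa)   # row 4 is dropped because z is NA there

nrow(aa)        # 4
nrow(clean)     # 3

# Column a has no missing values, yet its mean changes after na.omit.
mean(aa$a)      # 2.5
mean(clean$a)   # 2

# Handling the NA only where it matters keeps all rows for the other columns.
mean(aa$z, na.rm = TRUE)   # 3
```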
GDP growth rate data
Country | 2000 | 2001 | 2002 | 2003 | 2004 |
---|---|---|---|---|---|
Germany | 0.0409 | 0.0013 | 0.0126 | 0.0126 | 0.0001 |
Korea | 0.0262 | 0.0232 | 0.0186 | 0.0244 | 0.0146 |
Turkey | 0.0384 | 0.0011 | 0.0099 | 0.0399 | 0.0316 |
US | 0.0220 | 0.0107 | 0.0143 | 0.0318 | 0.0274 |
How many variables are there in the data?
How many observations?
How many variables? 3 variables (country, year, gdp growth rate)
How many observations? 4x5 = 20 observations
GDP growth rate data - long format
Country | year | gdpgr |
---|---|---|
Germany | 2000 | 0.0409 |
Germany | 2001 | 0.0013 |
Germany | 2002 | 0.0126 |
Germany | 2003 | 0.0116 |
Use the reshape function to convert wide-to-long and long-to-wide
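A hedged sketch of both conversions with `reshape`, using the values from the wide GDP growth table above. The column names `gdpgr.2000` to `gdpgr.2004` and the object names `wide`, `long`, and `wide_again` are assumptions made for this example.

```r
# Wide format: one row per country, one column per year.
wide <- data.frame(
  Country    = c("Germany", "Korea", "Turkey", "US"),
  gdpgr.2000 = c(0.0409, 0.0262, 0.0384, 0.0220),
  gdpgr.2001 = c(0.0013, 0.0232, 0.0011, 0.0107),
  gdpgr.2002 = c(0.0126, 0.0186, 0.0099, 0.0143),
  gdpgr.2003 = c(0.0126, 0.0244, 0.0399, 0.0318),
  gdpgr.2004 = c(0.0001, 0.0146, 0.0316, 0.0274)
)

# Wide to long: stack the five year columns into a single gdpgr column.
long <- reshape(wide, direction = "long",
                varying = names(wide)[-1],
                v.names = "gdpgr", timevar = "year", times = 2000:2004,
                idvar = "Country")
rownames(long) <- NULL
head(long)

# Long to wide: spread gdpgr back out into one column per year.
wide_again <- reshape(long, direction = "wide",
                      idvar = "Country", timevar = "year", v.names = "gdpgr")
wide_again
```

In the long format each row is one country-year observation, which gives the 3 variables and 20 observations counted above.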