The Basics of R and RStudio

Introduction

Data structures in R are tools for storing and organizing multiple values.

They organize stored data so that it can be used more effectively. Data structures differ in the number of dimensions they have and in whether they hold a single data type (homogeneous) or a mix of types (heterogeneous). The primary data structures are:

Vectors
Lists
Data frames
Matrices
Arrays
Factors
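As a rough orientation, the sketch below builds one small object of each kind and checks its dimensions. The object names (v, l, df, m, arr, f) are made up for illustration and are not used elsewhere in this post.

```r
# One small object per structure, built with base R only.
v   <- c(1.5, 2.5, 3.5)                       # vector: 1D, homogeneous
l   <- list(1, "a", TRUE)                     # list: 1D, heterogeneous
df  <- data.frame(x = 1:2, y = c("a", "b"))   # data frame: 2D, types may differ by column
m   <- matrix(1:6, nrow = 2)                  # matrix: 2D, homogeneous
arr <- array(1:24, dim = c(2, 3, 4))          # array: n-dimensional, homogeneous
f   <- factor(c("low", "high", "low"))        # factor: 1D, categorical

dim(v)     # NULL, plain vectors carry no dim attribute
dim(df)    # 2 2
dim(m)     # 2 3
dim(arr)   # 2 3 4
```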

Data structures

1. Vectors

Vectors were discussed in a previous post.

2. Lists

Lists are objects/containers that hold elements of the same or different types. They can contain strings, numbers, vectors, matrices, functions, or other lists. Lists are created with the list() function.

Examples

a. Three element list

list_1 <- list(10, 30, 50)

b. Single element list

list_2 <- list(c(10, 30, 50))

c. Three element list of vectors

list_3 <- list(1:3, c(50,40), 3:-5)

d. List with elements of different types

list_4 <- list(c("a", "b", "c"), 5:-1)

e. List which contains a list

list_5 <- list(c("a", "b", "c"), 5:-1, list_1)

f. Set names for the list elements

names(list_5)

NULL

names(list_5) <- c("character vector", "numeric vector", "list")
names(list_5)

[1] "character vector" "numeric vector"   "list"

g. Access elements

list_5[[1]]

[1] "a" "b" "c"

list_5[["character vector"]]

[1] "a" "b" "c"
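The double brackets used above return the element itself, while single brackets return a shorter list. A minimal sketch of the difference, using a hypothetical list called example_list:

```r
# example_list is an illustrative name, not an object defined earlier in this post.
example_list <- list(letters = c("a", "b", "c"), numbers = 5:-1)

sub_list <- example_list[1]     # single bracket: still a list, of length 1
element  <- example_list[[1]]   # double bracket: the character vector itself

class(sub_list)         # "list"
class(element)          # "character"

example_list$numbers    # $ is shorthand for example_list[["numbers"]]
example_list[[1]][2]    # "b", the second value of the first element
```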

h. Length of list

length(list_1)

[1] 3

length(list_5)

[1] 3

3. Data frames

A data frame is one of the most common data objects used to store tabular data in R. Tabular data has rows representing observations and columns representing variables. A data frame is a list of equal-length vectors. Each column can hold a different type of data, but within each column, the elements must be of the same type. The most common data frame characteristics are listed below:

• Columns should have a name;

• Row names should be unique;

• Various data can be stored (such as numeric, factor, and character);

• The individual columns should contain the same number of data items.
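These characteristics can be checked directly. The sketch below uses a hypothetical data frame called scores, with stringsAsFactors = FALSE set explicitly so the result does not depend on the R version.

```r
# scores is an illustrative name; stringsAsFactors = FALSE is the default from R 4.0 on.
scores <- data.frame(name = c("Ann", "Bob"),
                     score = c(90, 85),
                     stringsAsFactors = FALSE)

is.list(scores)         # TRUE, a data frame is a list of columns
length(scores)          # 2, the number of columns
sapply(scores, class)   # one class per column: "character" and "numeric"

# Columns whose lengths cannot be recycled to a common length raise an error.
result <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) "error"
)
result                  # "error"
```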

Creation of data frames

level <- c("Low", "Mid", "High")
language <- c("R", "RStudio", "Shiny")
age <- c(25, 36, 47)

df_1 <- data.frame(level, language, age)

Functions used to inspect and subset data frames

a. Number of rows

nrow(df_1)

[1] 3

b. Number of columns

ncol(df_1)

[1] 3

c. Dimensions

dim(df_1)

[1] 3 3

d. Class of data frame

class(df_1)

[1] "data.frame"

e. Column names

colnames(df_1)

[1] "level"    "language" "age"

f. Row names

rownames(df_1)

[1] "1" "2" "3"

g. First and last rows

head(df_1, n=2)

  level language age
1   Low        R  25
2   Mid  RStudio  36

tail(df_1, n=2)

  level language age
2   Mid  RStudio  36
3  High    Shiny  47

h. Access columns

df_1$level

[1] "Low"  "Mid"  "High"

i. Access individual elements

df_1[3,2]

[1] "Shiny"

df_1[2, 1:2]

  level language
2   Mid  RStudio

j. Access columns with index

df_1[, 3]

[1] 25 36 47

df_1[, c("language")]

[1] "R"       "RStudio" "Shiny"

k. Access rows with index

df_1[2, ]

  level language age
2   Mid  RStudio  36
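Rows can also be selected with a logical condition instead of a position, and new columns can be added with $. This is a small sketch on a hypothetical data frame called people; the column age_next_year is likewise made up for illustration.

```r
# people is an illustrative data frame with the same kind of columns as above.
people <- data.frame(level = c("Low", "Mid", "High"),
                     age = c(25, 36, 47))

# Keep the rows where the condition is TRUE; the empty column slot keeps all columns.
older <- people[people$age > 30, ]
nrow(older)             # 2

# Assigning to a new name with $ adds a column.
people$age_next_year <- people$age + 1
ncol(people)            # 3
```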

4. Matrices

A matrix is a rectangular two-dimensional (2D) homogeneous data set containing rows and columns. All of its elements must be of the same type, most commonly numeric, and they are arranged in a fixed number of rows and columns. Matrices are generally used for various mathematical and statistical applications.
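Homogeneous means that R converts mixed input to one common type when a matrix is built. A minimal sketch, with illustrative object names:

```r
# Mixing numbers, strings, and logicals coerces everything to character.
mixed <- matrix(c(1, "a", TRUE, 4), nrow = 2)
typeof(mixed)           # "character"
mixed[1, 1]             # "1", no longer a number

# A matrix built from integers stays integer.
ints <- matrix(1:4, nrow = 2)
typeof(ints)            # "integer"
```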

a. Creation of matrices

m1 <- matrix(1:9, nrow = 3, ncol = 3) 
m2 <- matrix(21:29, nrow = 3, ncol = 3) 
m3 <- matrix(1:12, nrow = 2, ncol = 6)

b. Obtain the dimensions of the matrices

# m1
nrow(m1)

[1] 3

ncol(m1)

[1] 3

dim(m1)

[1] 3 3

# m3
nrow(m3)

[1] 2

ncol(m3)

[1] 6

dim(m3)

[1] 2 6

c. Arithmetic with matrices

m1+m2

     [,1] [,2] [,3]
[1,]   22   28   34
[2,]   24   30   36
[3,]   26   32   38

m1-m2

     [,1] [,2] [,3]
[1,]  -20  -20  -20
[2,]  -20  -20  -20
[3,]  -20  -20  -20

m1*m2

     [,1] [,2] [,3]
[1,]   21   96  189
[2,]   44  125  224
[3,]   69  156  261

m1/m2

           [,1]      [,2]      [,3]
[1,] 0.04761905 0.1666667 0.2592593
[2,] 0.09090909 0.2000000 0.2857143
[3,] 0.13043478 0.2307692 0.3103448

m1 == m2

      [,1]  [,2]  [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE

d. Matrix multiplication

m5 <- matrix(1:10, nrow = 5)
m6 <- matrix(43:34, nrow = 5)

m5*m6

     [,1] [,2]
[1,]   43  228
[2,]   84  259
[3,]  123  288
[4,]  160  315
[5,]  195  340

# m5 %*% m6 will not work because the dimensions do not conform:
# both m5 and m6 are 5 x 2, so the matrix m6 needs to be transposed.

# Transpose
m5%*%t(m6)

     [,1] [,2] [,3] [,4] [,5]
[1,]  271  264  257  250  243
[2,]  352  343  334  325  316
[3,]  433  422  411  400  389
[4,]  514  501  488  475  462
[5,]  595  580  565  550  535
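The rule behind this is that the number of columns of the left matrix must equal the number of rows of the right matrix. The sketch below rebuilds the same two matrices under the hypothetical names a and b, so that it runs on its own.

```r
# a and b hold the same values as m5 and m6 above.
a <- matrix(1:10, nrow = 5)    # 5 x 2
b <- matrix(43:34, nrow = 5)   # 5 x 2

ncol(a) == nrow(b)             # FALSE, so a %*% b is not defined

dim(a %*% t(b))                # 5 5, (5 x 2) times (2 x 5)
dim(t(a) %*% b)                # 2 2, (2 x 5) times (5 x 2)

# crossprod(a, b) computes t(a) %*% b directly.
cp <- crossprod(a, b)
```

Which side is transposed changes the shape of the result, so the choice depends on the calculation being performed.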

e. Generate an identity matrix

diag(5)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1

f. Column and row names

colnames(m5)

NULL

rownames(m6)

NULL
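Both functions return NULL because no names have been set yet. Names can be assigned with the same functions and then used for indexing. A short sketch on a hypothetical matrix m, with made-up row and column names:

```r
# m has the same values as m5 above.
m <- matrix(1:10, nrow = 5)

colnames(m) <- c("first", "second")
rownames(m) <- paste0("row", 1:5)   # "row1" ... "row5"

colnames(m)             # "first" "second"
m["row2", "second"]     # 7, the same cell as m[2, 2]
```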

5. Arrays

An array is a multidimensional vector that stores homogeneous data. It can be thought of as a stack of matrices and is typically used to store data in more than 2 dimensions (n-dimensional). The dim argument gives the size of each dimension, in the order rows, columns, layers. Example: an array with dim = c(2,3,3) has 2 rows, 3 columns, and 3 layers (matrices).
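The dim = c(2,3,3) example can be checked as follows. The name arr and the values 1:18 are chosen only for illustration.

```r
# 2 rows x 3 columns x 3 layers needs 2 * 3 * 3 = 18 values.
arr <- array(1:18, dim = c(2, 3, 3))

dim(arr)            # 2 3 3
length(arr)         # 18
dim(arr[, , 2])     # 2 3, each layer is a 2 x 3 matrix
arr[2, 3, 3]        # 18, the last row, last column, last layer
```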

a. Creating arrays

arr_1 <- array(1:12, dim = c(2,3,2))

arr_1

, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

b. Subset array by index

arr_1[1, , ]

     [,1] [,2]
[1,]    1    7
[2,]    3    9
[3,]    5   11

arr_1[1, ,1]

[1] 1 3 5

arr_1[, , 1]

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

6. Factors

Factors store categorical data, given as either integers or strings. Each distinct category is a level, and the values are stored internally as integer codes that point to those levels. This form of data storage is useful for statistical modeling. Examples include TRUE or FALSE and male or female.

gender <- c("Male", "Female")
factor_1 <- factor(gender)
factor_1

[1] Male   Female
Levels: Female Male

factor_2 <- as.factor(gender)
factor_2

[1] Male   Female
Levels: Female Male

as.numeric(factor_2)

[1] 2 1
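The codes are 2 and 1 because factor() sorts the levels alphabetically by default, so "Female" becomes level 1 and "Male" level 2. When a different order is wanted, the levels argument sets it explicitly. A sketch with a hypothetical vector called sizes:

```r
# sizes is an illustrative character vector.
sizes <- c("Mid", "Low", "High", "Low")

default_factor <- factor(sizes)
levels(default_factor)        # "High" "Low" "Mid", alphabetical
as.integer(default_factor)    # 3 2 1 2

# Supplying levels fixes the order, and with it the integer codes.
ordered_factor <- factor(sizes, levels = c("Low", "Mid", "High"))
levels(ordered_factor)        # "Low" "Mid" "High"
as.integer(ordered_factor)    # 2 1 3 1

table(ordered_factor)         # counts per level, in the order given above
```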