# 9 Datasets

## 9.1 Creating data

It is often useful to know how to create a dataset in R. We created one above, and we reproduce its code here. A dataset can be created in R with the function data.frame(), which we fill with the variables we would like to include. We conveniently named our dataset *data*. Note that inside data.frame() we can give each variable its own name and then assign the wanted values to that name.

We use set.seed() because it makes all ‘pseudo-random’ functions in R yield the same results on different computers. Its purpose is reproducibility: without it, each of you would get different pseudo-random values. Cool, right?
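The chunk that builds *data* is not shown here, so the following is a minimal sketch rather than the original code. It assumes the columns seen in the printout below. Because the seed and the random calls used originally are unknown, the values are typed in literally instead of being drawn with pseudo-random functions. The argument stringsAsFactors = TRUE and the explicit level order for Gender are also assumptions, chosen to match the factor columns reported later in this chapter. The last lines illustrate what set.seed() does, using an arbitrary seed of 123.

```r
# Sketch of the dataset: values are copied from the printed output,
# not regenerated with the original (unknown) seed.
data <- data.frame(
  Names   = c("Alan", "Brian", "Carlos", "Dalton", "Ethan",
              "Flora", "Gaia", "Helen", "Ingrid", "Jennifer"),
  Age     = c(23L, 31L, 31L, 25L, 32L, 26L, 35L, 26L, 27L, 20L),
  Height  = c(170.6446, 179.5949, 168.8971, 164.8899, 160.8880,
              161.6283, 194.1584, 171.3409, 165.0931, 165.5945),
  Weight  = c(76.89384, 59.59420, 48.27693, 78.62134, 54.64516,
              69.77293, 55.96077, 86.53446, 62.86610, 59.35840),
  Gender  = factor(rep(c("Male", "Female"), each = 5),
                   levels = c("Male", "Female")),
  Courses = c(1L, 2L, 0L, 2L, 0L, 4L, 0L, 3L, 0L, 2L),
  stringsAsFactors = TRUE  # assumption: stores Names as a factor
)

# set.seed() makes pseudo-random draws repeatable (123 is arbitrary).
set.seed(123)
first_draw <- rnorm(3)
set.seed(123)
second_draw <- rnorm(3)
identical(first_draw, second_draw)  # TRUE: same seed, same numbers
```

Resetting the seed before each draw is what makes the two vectors identical; without the second set.seed() call the numbers would differ.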

Now we print our dataset in the R console with the function print().

```
Names Age Height Weight Gender Courses
1 Alan 23 170.6446 76.89384 Male 1
2 Brian 31 179.5949 59.59420 Male 2
3 Carlos 31 168.8971 48.27693 Male 0
4 Dalton 25 164.8899 78.62134 Male 2
5 Ethan 32 160.8880 54.64516 Male 0
6 Flora 26 161.6283 69.77293 Female 4
7 Gaia 35 194.1584 55.96077 Female 0
8 Helen 26 171.3409 86.53446 Female 3
9 Ingrid 27 165.0931 62.86610 Female 0
10 Jennifer 20 165.5945 59.35840 Female 2
```

We can also print only the first three observations, to inspect big datasets.

```
Names Age Height Weight Gender Courses
1 Alan 23 170.6446 76.89384 Male 1
2 Brian 31 179.5949 59.59420 Male 2
3 Carlos 31 168.8971 48.27693 Male 0
```

Alternatively, we can print the last four rows to check that the data were created correctly.

```
Names Age Height Weight Gender Courses
7 Gaia 35 194.1584 55.96077 Female 0
8 Helen 26 171.3409 86.53446 Female 3
9 Ingrid 27 165.0931 62.86610 Female 0
10 Jennifer 20 165.5945 59.35840 Female 2
```
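The printouts above are what print(), head() and tail() would show. This is a minimal sketch, assuming the dataset is called *data* as in the text. The data frame is rebuilt from the printed values (not from the original random draws) so that the block runs on its own.

```r
# Rebuild the dataset from the printed values (an assumption, see above).
data <- data.frame(
  Names   = c("Alan", "Brian", "Carlos", "Dalton", "Ethan",
              "Flora", "Gaia", "Helen", "Ingrid", "Jennifer"),
  Age     = c(23L, 31L, 31L, 25L, 32L, 26L, 35L, 26L, 27L, 20L),
  Height  = c(170.6446, 179.5949, 168.8971, 164.8899, 160.8880,
              161.6283, 194.1584, 171.3409, 165.0931, 165.5945),
  Weight  = c(76.89384, 59.59420, 48.27693, 78.62134, 54.64516,
              69.77293, 55.96077, 86.53446, 62.86610, 59.35840),
  Gender  = factor(rep(c("Male", "Female"), each = 5),
                   levels = c("Male", "Female")),
  Courses = c(1L, 2L, 0L, 2L, 0L, 4L, 0L, 3L, 0L, 2L),
  stringsAsFactors = TRUE
)

print(data)                      # all ten rows

first_three <- head(data, n = 3) # first three rows
print(first_three)

last_four <- tail(data, n = 4)   # last four rows, original row numbers kept
print(last_four)
```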

Importantly, you can explore your data in a new window with the command View(). Note that there are two buttons on the left hand side: one which will pop-up the data in a new window, and the other that you can use to filter the cases you are interested in inspecting.

Important note: all variables must have the same number of observations, in our case 10. If I supplied one name fewer (say I forgot to include Jennifer), R would give an error and the dataset would not be created. In that case the following error would be returned:

`Error in data.frame(Names = c("Alan", "Brian", "Carlos", "Dalton", "Ethan", : arguments imply differing number of rows: 9, 10`

R tells you which call failed (here, data.frame()) and also why: “*arguments imply differing number of rows: 9, 10*”. In my eyes, it even hints at the solution: one variable has nine values, but we need 10 to make it work.
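The failure can be reproduced safely with a small sketch. The vector names nine_names and ten_ages are made up for this illustration, and tryCatch() is used only so that the error message is captured instead of stopping the script.

```r
# Nine names but ten ages: the column lengths disagree.
nine_names <- c("Alan", "Brian", "Carlos", "Dalton", "Ethan",
                "Flora", "Gaia", "Helen", "Ingrid")
ten_ages <- c(23, 31, 31, 25, 32, 26, 35, 26, 27, 20)

# Comparing lengths first is a quick way to spot the problem.
c(names = length(nine_names), ages = length(ten_ages))

# tryCatch() turns the error into a plain message we can inspect.
msg <- tryCatch(
  data.frame(Names = nine_names, Age = ten_ages),
  error = function(e) conditionMessage(e)
)
msg
```

Checking length() of each vector before calling data.frame() is often the fastest way to find which variable is short.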

### 9.1.1 Indexing [or Data Manipulation]

Indexing means selecting (subsetting) parts of R objects. Let’s use our example data to learn about indexing. There are many ways to do this. We will see several methods here, although the list is not exhaustive.

If we want to select only the number of R courses people have attended in the past, we use the operator ‘$’ between the name of the data.frame and the name of the variable. (Note: for this type of subsetting it is important that the variable’s name contains no spaces. Instead of “Students names”, choose either Students.names or Students_names.)

` [1] 1 2 0 2 0 4 0 3 0 2`

Another way to do the same thing is to refer to the dimensions of the data. With this method, we use square brackets in R: [] or [[]]. We know our data has 10 rows and 6 columns, so its dimensions are 10 x 6. Inside the brackets, rows come first and columns second. So, if I want only Courses, which is the sixth column, I can select it by typing:

` [1] 1 2 0 2 0 4 0 3 0 2`

In the same way, if we are interested in all the data collected for Carlos, we can select it by typing:

```
Names Age Height Weight Gender Courses
3 Carlos 31 168.8971 48.27693 Male 0
```

Consequently, if we want to know how many R courses Carlos has taken in the past, we can select it by typing:

`[1] 0`
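The four selections described above can be sketched as follows. The calls are a reconstruction from the outputs rather than the original code, and *data* is rebuilt from the printed values so that the block runs on its own.

```r
# Rebuild the dataset from the printed values (an assumption).
data <- data.frame(
  Names   = c("Alan", "Brian", "Carlos", "Dalton", "Ethan",
              "Flora", "Gaia", "Helen", "Ingrid", "Jennifer"),
  Age     = c(23L, 31L, 31L, 25L, 32L, 26L, 35L, 26L, 27L, 20L),
  Height  = c(170.6446, 179.5949, 168.8971, 164.8899, 160.8880,
              161.6283, 194.1584, 171.3409, 165.0931, 165.5945),
  Weight  = c(76.89384, 59.59420, 48.27693, 78.62134, 54.64516,
              69.77293, 55.96077, 86.53446, 62.86610, 59.35840),
  Gender  = factor(rep(c("Male", "Female"), each = 5),
                   levels = c("Male", "Female")),
  Courses = c(1L, 2L, 0L, 2L, 0L, 4L, 0L, 3L, 0L, 2L),
  stringsAsFactors = TRUE
)

courses_by_name  <- data$Courses  # column selected by name with $
courses_by_index <- data[, 6]     # all rows, sixth column
carlos           <- data[3, ]     # third row, all columns
carlos_courses   <- data[3, 6]    # third row, sixth column

courses_by_name
courses_by_index
carlos
carlos_courses
```

Leaving the position before or after the comma empty means "all rows" or "all columns" respectively.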

Now let’s expand this a little by using logical conditions to select more complex subsets of data, for example only the females or only those older than 25 years. Let’s start by selecting the observations for females. We need to tell R that we want all rows that fulfill the ‘Gender is Female’ condition, that is, all rows in which the value of Gender is equal to Female. We translate this in the following way: `data$Gender == "Female"`. Note the double ‘==’, which tests for equality; a single ‘=’ would be an assignment.

```
Names Age Height Weight Gender Courses
6 Flora 26 161.6283 69.77293 Female 4
7 Gaia 35 194.1584 55.96077 Female 0
8 Helen 26 171.3409 86.53446 Female 3
9 Ingrid 27 165.0931 62.86610 Female 0
10 Jennifer 20 165.5945 59.35840 Female 2
```

If we want those older than 25 years:

```
Names Age Height Weight Gender Courses
2 Brian 31 179.5949 59.59420 Male 2
3 Carlos 31 168.8971 48.27693 Male 0
5 Ethan 32 160.8880 54.64516 Male 0
6 Flora 26 161.6283 69.77293 Female 4
7 Gaia 35 194.1584 55.96077 Female 0
8 Helen 26 171.3409 86.53446 Female 3
9 Ingrid 27 165.0931 62.86610 Female 0
```

And if we want both conditions at once, female and older than 25 years:

```
Names Age Height Weight Gender Courses
6 Flora 26 161.6283 69.77293 Female 4
7 Gaia 35 194.1584 55.96077 Female 0
8 Helen 26 171.3409 86.53446 Female 3
9 Ingrid 27 165.0931 62.86610 Female 0
```
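The three logical selections above can be sketched like this. The code is reconstructed from the outputs, with *data* rebuilt from the printed values; the object names females, older and older_females are made up for the example.

```r
# Rebuild the dataset from the printed values (an assumption).
data <- data.frame(
  Names   = c("Alan", "Brian", "Carlos", "Dalton", "Ethan",
              "Flora", "Gaia", "Helen", "Ingrid", "Jennifer"),
  Age     = c(23L, 31L, 31L, 25L, 32L, 26L, 35L, 26L, 27L, 20L),
  Height  = c(170.6446, 179.5949, 168.8971, 164.8899, 160.8880,
              161.6283, 194.1584, 171.3409, 165.0931, 165.5945),
  Weight  = c(76.89384, 59.59420, 48.27693, 78.62134, 54.64516,
              69.77293, 55.96077, 86.53446, 62.86610, 59.35840),
  Gender  = factor(rep(c("Male", "Female"), each = 5),
                   levels = c("Male", "Female")),
  Courses = c(1L, 2L, 0L, 2L, 0L, 4L, 0L, 3L, 0L, 2L),
  stringsAsFactors = TRUE
)

# The condition goes in the row position; the empty column position
# after the comma keeps every column.
females       <- data[data$Gender == "Female", ]
older         <- data[data$Age > 25, ]
older_females <- data[data$Gender == "Female" & data$Age > 25, ]

females
older
older_females
```

The ‘&’ operator combines the two conditions element by element, so a row is kept only when both are TRUE.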

Another method is to use the subset() function in R. The subset function is available in base R and can be used to return subsets of a vector, matrix, or data frame which meet a particular condition.

```
Names Age Height Weight Gender Courses
2 Brian 31 179.5949 59.59420 Male 2
4 Dalton 25 164.8899 78.62134 Male 2
6 Flora 26 161.6283 69.77293 Female 4
8 Helen 26 171.3409 86.53446 Female 3
10 Jennifer 20 165.5945 59.35840 Female 2
```

```
Names Age Height Weight Gender Courses
1 Alan 23 170.6446 76.89384 Male 1
4 Dalton 25 164.8899 78.62134 Male 2
6 Flora 26 161.6283 69.77293 Female 4
8 Helen 26 171.3409 86.53446 Female 3
```

```
Names Age Height Weight Gender Courses
2 Brian 31 179.5949 59.59420 Male 2
3 Carlos 31 168.8971 48.27693 Male 0
5 Ethan 32 160.8880 54.64516 Male 0
7 Gaia 35 194.1584 55.96077 Female 0
```

```
Names Age Height Weight Gender Courses
10 Jennifer 20 165.5945 59.3584 Female 2
```

```
Names Age Height Weight Gender Courses
7 Gaia 35 194.1584 55.96077 Female 0
```

```
Age Courses
1 23 1
2 31 2
3 31 0
4 25 2
5 32 0
```

```
Names Age Height Weight Gender Courses
6 Flora 26 161.6283 69.77293 Female 4
8 Helen 26 171.3409 86.53446 Female 3
```
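The subset() calls that produced the seven outputs above are not shown. The sketch below uses conditions that are merely consistent with those outputs: the thresholds are guesses, and other conditions would select the same rows. As before, *data* is rebuilt from the printed values.

```r
# Rebuild the dataset from the printed values (an assumption).
data <- data.frame(
  Names   = c("Alan", "Brian", "Carlos", "Dalton", "Ethan",
              "Flora", "Gaia", "Helen", "Ingrid", "Jennifer"),
  Age     = c(23L, 31L, 31L, 25L, 32L, 26L, 35L, 26L, 27L, 20L),
  Height  = c(170.6446, 179.5949, 168.8971, 164.8899, 160.8880,
              161.6283, 194.1584, 171.3409, 165.0931, 165.5945),
  Weight  = c(76.89384, 59.59420, 48.27693, 78.62134, 54.64516,
              69.77293, 55.96077, 86.53446, 62.86610, 59.35840),
  Gender  = factor(rep(c("Male", "Female"), each = 5),
                   levels = c("Male", "Female")),
  Courses = c(1L, 2L, 0L, 2L, 0L, 4L, 0L, 3L, 0L, 2L),
  stringsAsFactors = TRUE
)

# Guessed conditions, one per output above, in the same order.
s1 <- subset(data, Courses >= 2)                     # rows 2, 4, 6, 8, 10
s2 <- subset(data, Weight > 65)                      # rows 1, 4, 6, 8
s3 <- subset(data, Age > 30)                         # rows 2, 3, 5, 7
s4 <- subset(data, Age == min(Age))                  # youngest: Jennifer
s5 <- subset(data, Age == max(Age))                  # oldest: Gaia
s6 <- subset(data, Gender == "Male",
             select = c(Age, Courses))               # two columns only
s7 <- subset(data, Gender == "Female" & Courses >= 3) # Flora and Helen

s1
s6
```

Inside subset() the column names can be written without the data$ prefix, and the select argument restricts which columns are returned.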

## 9.2 Indexing vectors

Remember that we created three vectors called *n.R.courses*, *height* and *height.female* in lessons 04 and 05? Let’s use these to learn how to index vectors.

If I would like to select my first observation of *n.R.courses*, I make use of square brackets.

`[1] 1`

If I would like to select my first and my second observations of *n.R.courses*, I make use of the ‘:’ operator, which denotes sequence. So, for R, I am effectively telling it I want observations from 1 to 2.

`[1] 1 2`

What if I would like to select my first and my third observations of *n.R.courses*? Since ‘:’ only produces consecutive positions, we combine the wanted positions with the c() function.

`[1] 1 0`

What if I would like to select my first observation together with my third through fifth observations of *n.R.courses*?

`[1] 1 0 2 0`
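The four selections above can be sketched as follows. The vector *n.R.courses* was defined in an earlier lesson that is not shown here; only its first five values can be inferred from the outputs, so a five-element version is assumed.

```r
# Assumed contents: the first five values implied by the outputs above.
n.R.courses <- c(1, 2, 0, 2, 0)

n.R.courses[1]          # first observation
n.R.courses[1:2]        # first and second (1:2 is the sequence 1, 2)
n.R.courses[c(1, 3)]    # first and third, combined with c()
n.R.courses[c(1, 3:5)]  # first, then third through fifth
```

Anything that evaluates to a vector of positions can go inside the brackets, which is why c() and ‘:’ can be mixed freely.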

What if I would like to know which of the observed heights are taller than 170 cm? In this case we can make use of logical operators. As we saw above, logical operations return either TRUE or FALSE, so we can use them to ‘select’ the observations that meet a condition. We first write the condition on its own:

`[1] FALSE TRUE TRUE FALSE TRUE`

Then we put this condition inside the square brackets of the vector, which returns what we want.

`[1] 172.8 180.8 174.3`

For R, this is the same as the following:

`[1] 172.8 180.8 174.3`

Now, what if I would like to know the positions of the observed heights that are taller than 170 cm?

`[1] 2 3 5`
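The logical indexing steps above can be sketched as follows. The vector *height* comes from an earlier lesson that is not shown; the outputs fix only its second, third and fifth values, so the first and fourth values below (165.2 and 168.4) are hypothetical placeholders chosen to be under 170.

```r
# Positions 2, 3 and 5 come from the outputs above;
# positions 1 and 4 are hypothetical placeholders below 170.
height <- c(165.2, 172.8, 180.8, 168.4, 174.3)

taller <- height > 170   # one TRUE/FALSE per observation
taller

height[taller]                               # keep only the TRUE positions
height[c(FALSE, TRUE, TRUE, FALSE, TRUE)]    # the same logical vector, typed out
which(height > 170)                          # positions of the TRUE values
```

which() converts the logical vector into the positions where it is TRUE, which is handy when you need to know where the matching observations are rather than their values.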

## 9.3 Understanding your data

Let’s assume we received this data from someone, and we have to analyze it. One of the first things you need to do is to *understand* the basic characteristics of your data. How many variables/columns does it have? How many observations? What are the characteristics of each variable? And so on.

### 9.3.1 Dimensions

Here you are asking R to return the number of rows and columns.

`[1] 10 6`

### 9.3.2 Variable names

Here you are asking R to return the names of each variable in your dataset.

`[1] "Names" "Age" "Height" "Weight" "Gender" "Courses"`

### 9.3.3 Variable classes

Here you get the class of each variable (column) in your dataset.

```
Names Age Height Weight Gender Courses
"factor" "integer" "numeric" "numeric" "factor" "integer"
```

### 9.3.4 Structure of the data

Here you are asking R to return the characteristics of each variable in your dataset. It shows a compact summary of each variable together with its first observations. For example, it tells us that there are *10 obs. of 6 variables*. It also itemizes each variable by name and displays its class. Gender, for example, is a factor with two levels, one for Male and another for Female. Levels are the different possible categories of a factor. We could turn Height into a factor with levels “tall” and “short”. Names, in turn, has 10 levels because it contains only unique entries (i.e., a different name for each of our hypothetical participants).

```
'data.frame': 10 obs. of 6 variables:
$ Names : Factor w/ 10 levels "Alan","Brian",..: 1 2 3 4 5 6 7 8 9 10
$ Age : int 23 31 31 25 32 26 35 26 27 20
$ Height : num 171 180 169 165 161 ...
$ Weight : num 76.9 59.6 48.3 78.6 54.6 ...
$ Gender : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 2 2 2 2 2
$ Courses: int 1 2 0 2 0 4 0 3 0 2
```
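The outputs in sections 9.3.1 to 9.3.4 are consistent with the base R calls below. This is a reconstruction, not the original code, and *data* is rebuilt from the printed values so that the block runs on its own.

```r
# Rebuild the dataset from the printed values (an assumption).
data <- data.frame(
  Names   = c("Alan", "Brian", "Carlos", "Dalton", "Ethan",
              "Flora", "Gaia", "Helen", "Ingrid", "Jennifer"),
  Age     = c(23L, 31L, 31L, 25L, 32L, 26L, 35L, 26L, 27L, 20L),
  Height  = c(170.6446, 179.5949, 168.8971, 164.8899, 160.8880,
              161.6283, 194.1584, 171.3409, 165.0931, 165.5945),
  Weight  = c(76.89384, 59.59420, 48.27693, 78.62134, 54.64516,
              69.77293, 55.96077, 86.53446, 62.86610, 59.35840),
  Gender  = factor(rep(c("Male", "Female"), each = 5),
                   levels = c("Male", "Female")),
  Courses = c(1L, 2L, 0L, 2L, 0L, 4L, 0L, 3L, 0L, 2L),
  stringsAsFactors = TRUE
)

dims <- dim(data)               # number of rows, then number of columns
dims

names(data)                     # variable names
colnames(data)                  # same result for a data frame

classes <- sapply(data, class)  # class of every column
classes

str(data)                       # compact structure of the whole dataset
```

sapply() applies class() to each column in turn and returns the results as a named vector, which is why the output is labelled with the variable names.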

### 9.3.5 Summaries of your data

```
     Names        Age            Height          Weight         Gender
 Alan   :1   Min.   :20.00   Min.   :160.9   Min.   :48.28   Male  :5
 Brian  :1   1st Qu.:25.25   1st Qu.:164.9   1st Qu.:56.81   Female:5
 Carlos :1   Median :26.50   Median :167.2   Median :61.23
 Dalton :1   Mean   :27.60   Mean   :170.3   Mean   :65.25
 Ethan  :1   3rd Qu.:31.00   3rd Qu.:171.2   3rd Qu.:75.11
 Flora  :1   Max.   :35.00   Max.   :194.2   Max.   :86.53
 (Other):4
    Courses
 Min.   :0.0
 1st Qu.:0.0
 Median :1.5
 Mean   :1.4
 3rd Qu.:2.0
 Max.   :4.0
```

```
vars n mean sd median trimmed mad min max range skew
Age 1 10 27.60 4.58 26.50 27.62 5.93 20.00 35.00 15.00 0.01
Height 2 10 170.27 10.01 167.25 168.46 5.56 160.89 194.16 33.27 1.25
Weight 3 10 65.25 12.23 61.23 64.71 11.21 48.28 86.53 38.26 0.35
Courses 4 10 1.40 1.43 1.50 1.25 2.22 0.00 4.00 4.00 0.39
kurtosis se
Age -1.29 1.45
Height 0.48 3.16
Weight -1.39 3.87
Courses -1.37 0.45
```

```
data[, c("Names", "Gender", "Courses")]
3 Variables 10 Observations
---------------------------------------------------------------------------
Names
n missing distinct
10 0 10
Value Alan Brian Carlos Dalton Ethan Flora Gaia
Frequency 1 1 1 1 1 1 1
Proportion 0.1 0.1 0.1 0.1 0.1 0.1 0.1
Value Helen Ingrid Jennifer
Frequency 1 1 1
Proportion 0.1 0.1 0.1
---------------------------------------------------------------------------
Gender
n missing distinct
10 0 2
Value Male Female
Frequency 5 5
Proportion 0.5 0.5
---------------------------------------------------------------------------
Courses
n missing distinct Info Mean Gmd
10 0 5 0.915 1.4 1.644
Value 0 1 2 3 4
Frequency 4 1 3 1 1
Proportion 0.4 0.1 0.3 0.1 0.1
---------------------------------------------------------------------------
```
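The three summaries above look like the output of base R's summary() and of the describe() functions in the psych and Hmisc packages. That attribution is inferred from the layout of the outputs, so treat the sketch below as an assumption rather than the original code. The two package calls run only if the packages are installed, and *data* is rebuilt from the printed values.

```r
# Rebuild the dataset from the printed values (an assumption).
data <- data.frame(
  Names   = c("Alan", "Brian", "Carlos", "Dalton", "Ethan",
              "Flora", "Gaia", "Helen", "Ingrid", "Jennifer"),
  Age     = c(23L, 31L, 31L, 25L, 32L, 26L, 35L, 26L, 27L, 20L),
  Height  = c(170.6446, 179.5949, 168.8971, 164.8899, 160.8880,
              161.6283, 194.1584, 171.3409, 165.0931, 165.5945),
  Weight  = c(76.89384, 59.59420, 48.27693, 78.62134, 54.64516,
              69.77293, 55.96077, 86.53446, 62.86610, 59.35840),
  Gender  = factor(rep(c("Male", "Female"), each = 5),
                   levels = c("Male", "Female")),
  Courses = c(1L, 2L, 0L, 2L, 0L, 4L, 0L, 3L, 0L, 2L),
  stringsAsFactors = TRUE
)

# Base R: counts for factors, five-number summary plus mean for numbers.
overview <- summary(data)
overview

# Optional: descriptive statistics for the numeric columns (psych package).
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(data[, c("Age", "Height", "Weight", "Courses")]))
}

# Optional: frequency-style description of selected columns (Hmisc package).
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(data[, c("Names", "Gender", "Courses")]))
}
```

Calling the functions with the package prefix (psych:: and Hmisc::) avoids any ambiguity, since both packages export a function named describe().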

### 9.3.6 Contingency tables

```
Gender
Male Female
5 5
```

```
Names
Alan Brian Carlos Dalton Ethan Flora Gaia Helen
1 1 1 1 1 1 1 1
Ingrid Jennifer
1 1
```

```
Courses
0 1 2 3 4
4 1 3 1 1
```

### 9.3.7 Two-way contingency tables for categorical data

```
Gender
Courses Male Female
0 2 2
1 1 0
2 2 1
3 0 1
4 0 1
```
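The one-way tables in section 9.3.6 and the two-way table above can be produced with table(). This is a sketch reconstructed from the outputs; it uses a cut-down copy of *data* holding only the three columns involved, rebuilt from the printed values.

```r
# Cut-down copy of the dataset with only the columns needed here.
data <- data.frame(
  Names   = c("Alan", "Brian", "Carlos", "Dalton", "Ethan",
              "Flora", "Gaia", "Helen", "Ingrid", "Jennifer"),
  Gender  = factor(rep(c("Male", "Female"), each = 5),
                   levels = c("Male", "Female")),
  Courses = c(1L, 2L, 0L, 2L, 0L, 4L, 0L, 3L, 0L, 2L),
  stringsAsFactors = TRUE
)

# One-way tables: how often each value occurs.
gender_counts <- with(data, table(Gender))
names_counts  <- with(data, table(Names))
course_counts <- with(data, table(Courses))
gender_counts
names_counts
course_counts

# Two-way table: Courses in the rows, Gender in the columns.
two_way <- with(data, table(Courses, Gender))
two_way
```

Using with() lets table() see the column names directly, so the printed tables are labelled with the variable names.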

The following table shows, for each class of Gender, the number of instances that belong to it as well as the percentage of the entire dataset that this represents.

```
Frequencies Percentage
Male 5 50
Female 5 50
```
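A table like the one above can be built by binding counts and percentages side by side. This is a minimal sketch under the assumption that the original combined table() and prop.table(); the Gender vector is rebuilt from the printed data, and the object names counts and freq_table are made up for the example.

```r
# Gender column rebuilt from the printed data (five males, five females).
Gender <- factor(rep(c("Male", "Female"), each = 5),
                 levels = c("Male", "Female"))

counts <- table(Gender)  # absolute frequencies per class

# prop.table() turns counts into proportions; times 100 gives percentages.
freq_table <- cbind(
  Frequencies = counts,
  Percentage  = prop.table(counts) * 100
)
freq_table
```

cbind() keeps the class labels as row names, so each row shows one class with its count and its share of the dataset.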