swirl – R Programming – Lesson 5 – Missing Values

> library(swirl)

| Hi! Type swirl() when you are ready to begin.

> swirl()

| Welcome to swirl! Please sign in. If you’ve been here before, use the same name as you did then. If you are
| new, call yourself something unique.

What shall I call you? bernhardhack

| Please choose a course, or type 0 to exit swirl.

1: R Programming
2: Take me to the swirl course repository!

Selection: 1

| Please choose a lesson, or type 0 to return to course menu.

1: Basic Building Blocks 2: Workspace and Files 3: Sequences of Numbers
4: Vectors 5: Missing Values 6: Subsetting Vectors
7: Matrices and Data Frames 8: Logic 9: Functions
10: lapply and sapply 11: vapply and tapply 12: Looking at Data
13: Simulation 14: Dates and Times 15: Base Graphics

Selection: 5
| | 0%

| Missing values play an important role in statistics and data analysis. Often, missing values must not be
| ignored, but rather they should be carefully studied to see if there’s an underlying pattern or cause for
| their missingness.


|===== | 5%
| In R, NA is used to represent any value that is ‘not available’ or ‘missing’ (in the statistical sense). In
| this lesson, we’ll explore missing values further.


|========== | 10%
| Any operation involving NA generally yields NA as the result. To illustrate, let’s create a vector c(44, NA,
| 5, NA) and assign it to a variable x.

> x <- c(44, NA, 5, NA)

| Your dedication is inspiring!
|=============== | 15%
| Now, let's multiply x by 3.

> x*3
[1] 132 NA 15 NA

| Nice work!
|===================== | 20%
| Notice that the elements of the resulting vector that correspond with the NA values in x are also NA.
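
Note (not part of the swirl lesson): the same propagation applies to summary functions, not just to element-wise arithmetic. The sketch below reuses the lesson's vector x; the names total_all, total_seen and mean_seen are only illustrative. It assumes you want to skip the missing entries, which is what the na.rm = TRUE argument of sum() and mean() does.

```r
# Same vector as in the lesson
x <- c(44, NA, 5, NA)

# Element-wise arithmetic propagates NA position by position
tripled <- x * 3                      # 132 NA 15 NA

# Summary functions return NA as soon as one element is missing ...
total_all <- sum(x)                   # NA

# ... unless missing values are dropped explicitly with na.rm = TRUE
total_seen <- sum(x, na.rm = TRUE)    # 49
mean_seen  <- mean(x, na.rm = TRUE)   # 24.5
```

Whether dropping the NAs is appropriate depends on why the values are missing, which is the point the lesson makes at the start.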


|========================== | 25%
| To make things a little more interesting, let's create a vector containing 1000 draws from a standard normal
| distribution with y <- rnorm(1000).

> y <- rnorm(1000)

| You're the best!
|=============================== | 30%
| Next, let's create a vector containing 1000 NAs with z <- rep(NA, 1000).

> z <- rep(NA, 1000)

| Perseverance, that's the answer.
|==================================== | 35%
| Finally, let's select 100 elements at random from these 2000 values (combining y and z) such that we don't
| know how many NAs we'll wind up with or what positions they'll occupy in our final vector -- my_data <-
| sample(c(y, z), 100).

> my_data <- samplec(c(y, z), 100)
Error: could not find function "samplec"
> my_data <- sample(c(y, z), 100)

| Nice work!
|========================================= | 40%
| Let's first ask the question of where our NAs are located in our data. The is.na() function tells us whether
| each element of a vector is NA. Call is.na() on my_data and assign the result to my_na.

> my_ny <- is.na(my_data)

| Not quite! Try again. Or, type info() for more options.

| Assign the result of is.na(my_data) to the variable my_na.

> my_na <- is.na(my_data)

| You are amazing!
|============================================== | 45%
| Now, print my_na to see what you came up with.

> my_na
[1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
[37] TRUE TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
[55] FALSE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
[73] TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE
[91] FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE

| That’s the answer I was looking for.
|==================================================== | 50%
| Everywhere you see a TRUE, you know the corresponding element of my_data is NA. Likewise, everywhere you see a
| FALSE, you know the corresponding element of my_data is one of our random draws from the standard normal
| distribution.
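
Note (not part of the swirl lesson): reading 100 TRUE/FALSE values by eye is tedious, so it can help to ask for the positions of the NAs directly. The sketch below rebuilds the data the same way the lesson does. The call to set.seed() is my own addition so the run is repeatable (the lesson does not set a seed), and na_positions and any_missing are illustrative names.

```r
# Assumption: a fixed seed, only so that repeated runs give the same sample
set.seed(42)

y <- rnorm(1000)                  # 1000 standard normal draws
z <- rep(NA, 1000)                # 1000 missing values
my_data <- sample(c(y, z), 100)   # 100 elements drawn from the combined 2000

my_na <- is.na(my_data)           # logical vector, TRUE where my_data is NA

# which() converts the logical vector into the indices of its TRUE entries
na_positions <- which(my_na)
head(na_positions)

# anyNA() answers the simpler question "is anything missing at all?"
any_missing <- anyNA(my_data)
```

Because the sample is random, the positions you get will differ from the transcript above unless the same seed and R version are used.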


|========================================================= | 55%
| In our previous discussion of logical operators, we introduced the `==` operator as a method of testing for
| equality between two objects. So, you might think the expression my_data == NA yields the same results as
| is.na(). Give it a try.

> my_data == NA
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[37] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[73] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

| Your dedication is inspiring!
|============================================================== | 60%
| The reason you got a vector of all NAs is that NA is not really a value, but just a placeholder for a quantity
| that is not available. Therefore the logical expression is incomplete and R has no choice but to return a
| vector of the same length as my_data that contains all NAs.


|=================================================================== | 65%
| Don’t worry if that’s a little confusing. The key takeaway is to be cautious when using logical expressions
| anytime NAs might creep in, since a single NA value can derail the entire thing.
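
Note (not part of the swirl lesson): a small sketch of the difference between == and is.na(), using the lesson's vector x. The variable names are illustrative. isTRUE() is one common way to guard a comparison that might be NA. The last two lines show why the lesson says "generally": a few logical operations have a result that does not depend on the missing value.

```r
x <- c(44, NA, 5, NA)

# Comparing against NA never gives TRUE or FALSE, only NA
eq_result <- x == NA        # NA NA NA NA

# is.na() is the reliable test for missingness
na_result <- is.na(x)       # FALSE TRUE FALSE TRUE

# A comparison involving an NA element is itself NA; isTRUE() turns
# "unknown" into FALSE so the result is safe to use in an if() condition
first_is_44  <- isTRUE(x[1] == 44)   # TRUE
second_is_44 <- isTRUE(x[2] == 44)   # FALSE, because x[2] == 44 is NA

# Exceptions: the outcome is already determined by the known operand
and_result <- NA & FALSE    # FALSE
or_result  <- NA | TRUE     # TRUE
```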


|======================================================================== | 70%
| So, back to the task at hand. Now that we have a vector, my_na, that has a TRUE for every NA and FALSE for
| every numeric value, we can compute the total number of NAs in our data.


|============================================================================= | 75%
| The trick is to recognize that underneath the surface, R represents TRUE as the number 1 and FALSE as the
| number 0. Therefore, if we take the sum of a bunch of TRUEs and FALSEs, we get the total number of TRUEs.


|================================================================================== | 80%
| Let’s give that a try here. Call the sum() function on my_na to count the total number of TRUEs in my_na, and
| thus the total number of NAs in my_data. Don’t assign the result to a new variable.

> sum(my_na)
[1] 62

| You are doing so well!
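
Note (not part of the swirl lesson): the counting trick works because logical values are coerced to 1 and 0 in arithmetic. A minimal sketch with a short, made-up vector named flags (hypothetical, chosen so the result is easy to check by hand):

```r
# Hypothetical logical vector, short enough to verify by eye
flags <- c(TRUE, FALSE, TRUE, TRUE)

# Coercion makes the underlying 1/0 representation visible
as_numbers <- as.integer(flags)   # 1 0 1 1

# sum() counts the TRUEs; mean() gives the proportion of TRUEs
n_true     <- sum(flags)          # 3
share_true <- mean(flags)         # 0.75
```

Applied to the session above, mean(my_na) would report the share of missing values directly: 62 NAs out of 100 elements is 0.62.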
|======================================================================================== | 85%
| Pretty cool, huh? Finally, let’s take a look at the data to convince ourselves that everything ‘adds up’.
| Print my_data to the console.

> my:data
Error: object ‘my’ not found
> my_data
[1] -0.16629368 -0.42129496 NA NA NA -1.51308034 0.70290182 -0.43921687 NA
[10] 0.82686778 NA NA NA NA NA NA NA NA
[19] 1.00981458 -0.69336608 0.67911988 NA -1.15797177 NA -0.78783302 NA -2.17044504
[28] -0.39462761 NA -0.19135562 NA NA NA -0.80854687 -2.47246396 NA
[37] NA NA NA 0.74824429 0.28179468 -0.26805840 NA NA NA
[46] NA NA 1.02802970 -0.92091188 NA -0.14571645 -0.63697098 NA NA
[55] 0.63796534 NA NA NA NA NA -1.00566234 NA NA
[64] NA 1.03687648 NA NA 2.17104308 NA -0.46387743 NA NA
[73] NA 0.86804538 NA NA NA -0.64092080 -0.03314492 NA NA
[82] NA NA 0.02434736 NA 1.32072814 NA NA -0.36630120 NA
[91] 1.05074646 NA NA -1.92568166 -1.30438332 NA NA -0.49430828 NA
[100] NA

| Excellent job!
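
Note (not part of the swirl lesson): instead of counting NAs in the printed output, the check that everything "adds up" can be done in code. The sketch below uses a short hypothetical vector v rather than the session's random my_data; n_na, n_obs and observed are illustrative names.

```r
# Hypothetical data: three observed values and three missing ones
v <- c(-0.17, NA, 0.70, NA, NA, 1.01)

n_na  <- sum(is.na(v))     # number of missing values: 3
n_obs <- sum(!is.na(v))    # number of observed values: 3

# Missing plus observed must equal the length of the vector
adds_up <- (n_na + n_obs) == length(v)   # TRUE

# Logical indexing keeps only the observed values
observed <- v[!is.na(v)]   # -0.17 0.70 1.01
```

The same pattern, my_data[!is.na(my_data)], would extract the numeric draws from the lesson's vector.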
|============================================================================================= | 90%
| Now that we’ve got NAs down pat, let’s look at a second type of missing value — NaN, which stands for ‘not a
| number’. To generate NaN, try dividing (using a forward slash) 0 by 0 now.

> 0/0
[1] NaN

| You are doing so well!
|================================================================================================== | 95%
| Let’s do one more, just for fun. In R, Inf stands for infinity. What happens if you subtract Inf from Inf?

> Inf-Inf
[1] NaN

| Excellent job!
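
Note (not part of the swirl lesson): NaN and NA are related but not interchangeable. The sketch below shows how the base R predicates is.na(), is.nan() and is.finite() treat them; the variable names are illustrative.

```r
nan_val <- 0 / 0                     # NaN, as in the lesson

# is.na() treats NaN as missing, but is.nan() does not treat NA as NaN
nan_is_na  <- is.na(nan_val)         # TRUE
nan_is_nan <- is.nan(nan_val)        # TRUE
na_is_nan  <- is.nan(NA)             # FALSE

# Division of a nonzero number by zero gives a signed infinity, not NaN
pos_inf <- 1 / 0                     # Inf
neg_inf <- -1 / 0                    # -Inf

# Inf - Inf has no defined value, hence NaN
inf_diff <- Inf - Inf                # NaN

# is.finite() is FALSE for Inf, NA and NaN alike
finite_flags <- is.finite(c(1, Inf, NA, NaN))   # TRUE FALSE FALSE FALSE
```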
|=======================================================================================================| 100%
| Would you like to receive credit for completing this course on Coursera.org?
