Let’s dive into how to handle missing values in DataFrames.jl
. We’ll cover three main approaches for dealing with missing data:
ismissing
and dropmissing
coalesce
skipmissing
First, we need a DataFrame
filled with missing values to showcase these approaches:
df_missing = DataFrame(;
name=[missing, "Sally", "Alice", "Hank"],
age=[17, missing, 20, 19],
grade_2020=[5.0, 1.0, missing, 4.0],
)
name | age | grade_2020 |
---|---|---|
missing | 17 | 5.0 |
Sally | missing | 1.0 |
Alice | 20 | missing |
Hank | 19 | 4.0 |
This is the same DataFrame
from Section 4 but with some missing
values added.
Missing
typeSome languages have several types to represent missing values. One such example is R which uses NA
, NA_integer_
, NA_real_
, NA_character_
, and NA_complex_
. Julia, on the contrary, has only one: Missing
.
typeof(missing)
Missing
As you can see missing
is an instance of the type Missing
.
NOTE: In the Julia Style Guide, there’s a guidance to use camel case for types and modules (see Section 8.2).
The first thing we need to cover for missing
values is that they propagate through several operations. For example, addition, subtraction, multiplication, and division:
missing + 1
missing
missing - 1
missing
missing * 1
missing
missing / 1
missing
They also propagate through equality and comparison operators:
missing == 1
missing
missing == missing
missing
missing > 1
missing
missing > missing
missing
That’s why we need to be very cautious when comparing and testing equalities in the presence of missing
values. For equality testing use the ismissing
function instead.
Most of the time we want to remove missing values from our data.
Removing missing
s can be done in two ways:
dropmissing
function applied to a whole DataFrame
or a subset of its columns.ismissing
function applied to a filtering procedure (see Section 4.3).The dropmissing
function takes as first positional argument a DataFrame
, and as an optional second argument either a single column or a vector of columns by which you’ll want to remove the missing data from.
By default, if you do not specify column(s), as the second positional argument, it will remove any observation (row) having missing
values:
dropmissing(df_missing)
name | age | grade_2020 |
---|---|---|
Hank | 19 | 4.0 |
Since 3 out of 4 rows had at least one missing
value, we get back a DataFrame
with a single row as a result.
However, if we specify column(s) as the second positional argument to dropmissing
, the resulting DataFrame
will only drop rows that have missing
values in the specified column(s).
Here’s a single column with a Symbol
:
dropmissing(df_missing, :name)
name | age | grade_2020 |
---|---|---|
Sally | missing | 1.0 |
Alice | 20 | missing |
Hank | 19 | 4.0 |
And now multiple columns with a vector of Symbol
s:
dropmissing(df_missing, [:name, :age])
name | age | grade_2020 |
---|---|---|
Alice | 20 | missing |
Hank | 19 | 4.0 |
NOTE: You can use any of the column selectors described in Section 4.4 for the second positional argument of
dropmissing
.
The ismissing
function tests if the underlying value is of the Missing
type returning either true
or false
.
You can use negation !
to use it both ways:
ismissing
to just keep the missing
values.!ismissing
to keep anything but the missing
valuesfilter(:name => ismissing, df_missing)
name | age | grade_2020 |
---|---|---|
missing | 17 | 5.0 |
filter(:name => !ismissing, df_missing)
name | age | grade_2020 |
---|---|---|
Sally | missing | 1.0 |
Alice | 20 | missing |
Hank | 19 | 4.0 |
A common data wrangling pattern is to replace or fill missing values.
Like R (and SQL), Julia has the coalesce
function. We often use it in a broadcasted way over an array to fill all missing
values with a specific value.
Here’s an example of a vector containing two missing
values:
coalesce.([missing, "some value", missing], "zero")
["zero", "some value", "zero"]
You can see that coalesce
replaces missing
values with "zero"
.
We can definitely use it in a transform procedure (Section 4.7):
transform(df_missing, :name => ByRow(x -> coalesce(x, "John Doe")); renamecols=false)
name | age | grade_2020 |
---|---|---|
John Doe | 17 | 5.0 |
Sally | missing | 1.0 |
Alice | 20 | missing |
Hank | 19 | 4.0 |
As we saw on Section 4.8, we can use combine
to apply summarizing functions to data. However, as explained, missing
values propagate through most operations in Julia. Suppose you want to calculate the mean of :grade_2020
column in our df_missing
:
combine(df_missing, :grade_2020 => mean)
grade_2020_mean |
---|
missing |
You can skip missing values in any array or summarizing function by passing the skipmissing
function:
combine(df_missing, :grade_2020 => mean ∘ skipmissing )
grade_2020_mean_skipmissing |
---|
3.3333333333333335 |
NOTE: We are using the function composition operator
∘
(which you can type with\circ<TAB>
) to compose two functions into one. It is just like the mathematical operator:$$f \circ g (x) = f(g(x))$$
Hence,
(mean ∘ skipmissing)(x)
becomesmean(skipmissing(x))
.