In Section 4.3.1, we saw that filter
works by taking one or more source columns and filtering it by applying a “filtering” function. To recap, here’s an example of filter using the source => f::Function
syntax: filter(:name => name -> name == "Alice", df)
.
In Section 4.4, we saw that select
can take one or more source columns and put it into one or more target columns source => target
. Also to recap here’s an example: select(df, :name => :people_names)
.
In this section, we discuss how to transform variables, that is, how to modify data. In DataFrames.jl
, the syntax is source => transformation => target
.
Like before, we use the grades_2020
dataset:
function grades_2020()
name = ["Sally", "Bob", "Alice", "Hank"]
grade_2020 = [1, 5, 8.5, 4]
DataFrame(; name, grade_2020)
end
grades_2020()
name | grade_2020 |
---|---|
Sally | 1.0 |
Bob | 5.0 |
Alice | 8.5 |
Hank | 4.0 |
Suppose we want to increase all the grades in grades_2020
by 1. First, we define a function that takes as argument a vector of data and returns all of its elements increased by 1. Then we use the transform
function from DataFrames.jl
that, like all native DataFrames.jl
’s functions, takes a DataFrame
as first argument followed by the transformation syntax:
plus_one(grades) = grades .+ 1
transform(grades_2020(), :grade_2020 => plus_one)
name | grade_2020 | grade_2020_plus_one |
---|---|---|
Sally | 1.0 | 2.0 |
Bob | 5.0 | 6.0 |
Alice | 8.5 | 9.5 |
Hank | 4.0 | 5.0 |
Here, the plus_one
function receives the whole :grade_2020
column. That is the reason why we’ve added the broadcasting “dot” .
before the plus +
operator. For a recap on broadcasting please see Section 3.3.1.
Like we said above, the DataFrames.jl
minilanguage is always source => transformation => target
. So, if we want to keep the naming of the target
column in the output, we can do:
transform(grades_2020(), :grade_2020 => plus_one => :grade_2020)
name | grade_2020 |
---|---|
Sally | 2.0 |
Bob | 6.0 |
Alice | 9.5 |
Hank | 5.0 |
We can also use the keyword argument renamecols=false
:
transform(grades_2020(), :grade_2020 => plus_one; renamecols=false)
name | grade_2020 |
---|---|
Sally | 2.0 |
Bob | 6.0 |
Alice | 9.5 |
Hank | 5.0 |
The same transformation can also be written with select
as follows:
select(grades_2020(), :, :grade_2020 => plus_one => :grade_2020)
name | grade_2020 |
---|---|
Sally | 2.0 |
Bob | 6.0 |
Alice | 9.5 |
Hank | 5.0 |
where the :
means “select all the columns” as described in Section 4.4. Alternatively, you can also use Julia’s broadcasting and modify the column grade_2020
by accessing it with df.grade_2020
:
df = grades_2020()
df.grade_2020 = plus_one.(df.grade_2020)
df
name | grade_2020 |
---|---|
Sally | 2.0 |
Bob | 6.0 |
Alice | 9.5 |
Hank | 5.0 |
But, although the last example is easier since it builds on more native Julia operations, we strongly advise to use the functions provided by DataFrames.jl
in most cases because they are more capable and easier to work with.
To show how to transform two columns at the same time, we use the left joined data from Section 4.6:
leftjoined = leftjoin(grades_2020(), grades_2021(); on=:name)
name | grade_2020 | grade_2021 |
---|---|---|
Sally | 1.0 | 9.5 |
Hank | 4.0 | 6.0 |
Bob | 5.0 | missing |
Alice | 8.5 | missing |
With this, we can add a column saying whether someone was approved by the criterion that one of their grades was above 5.5:
pass(A, B) = [5.5 < a || 5.5 < b for (a, b) in zip(A, B)]
transform(leftjoined, [:grade_2020, :grade_2021] => pass; renamecols=false)
name | grade_2020 | grade_2021 | grade_2020_grade_2021 |
---|---|---|---|
Sally | 1.0 | 9.5 | true |
Hank | 4.0 | 6.0 | true |
Bob | 5.0 | missing | missing |
Alice | 8.5 | missing | true |
We can clean up the outcome and put the logic in a function to get a list of all the approved students:
function only_pass()
leftjoined = leftjoin(grades_2020(), grades_2021(); on=:name)
pass(A, B) = [5.5 < a || 5.5 < b for (a, b) in zip(A, B)]
leftjoined = transform(leftjoined, [:grade_2020, :grade_2021] => pass => :pass)
passed = subset(leftjoined, :pass; skipmissing=true)
return passed.name
end
only_pass()
["Sally", "Hank", "Alice"]