As discussed in Section 4.1, CSV.jl will do its best to guess the type of each column in your data. However, this won’t always work perfectly. In this section, we show why suitable types are important and how to fix wrong data types. To make the types more visible, we show the text output for DataFrames instead of a pretty-formatted table. In this section, we work with the following dataset:
function wrong_types()
    id = 1:4
    date = ["28-01-2018", "03-04-2019", "01-08-2018", "22-11-2020"]
    age = ["adolescent", "adult", "infant", "adult"]
    DataFrame(; id, date, age)
end
wrong_types()
4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      String
─────┼───────────────────────────────
   1 │     1  28-01-2018  adolescent
   2 │     2  03-04-2019  adult
   3 │     3  01-08-2018  infant
   4 │     4  22-11-2020  adult
Because the date column has the wrong type, sorting won’t work correctly:
sort(wrong_types(), :date)
4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      String
─────┼───────────────────────────────
   1 │     3  01-08-2018  infant
   2 │     2  03-04-2019  adult
   3 │     4  22-11-2020  adult
   4 │     1  28-01-2018  adolescent
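The reason is that strings are compared character by character, so the day digits decide the order before the year is ever considered. A minimal sketch of this, using two values from the dataset (the variable names earlier, later and fmt are only for illustration):

```julia
using Dates

# Two dates from the dataset, stored as text in dd-mm-yyyy form.
earlier = "28-01-2018"
later = "01-08-2018"

# String comparison is lexicographic: '2' > '0', so the earlier date is
# not considered smaller.
earlier < later  # false

# Parsed as Date values, the comparison is chronological.
fmt = dateformat"dd-mm-yyyy"
Date(earlier, fmt) < Date(later, fmt)  # true
```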
To fix the sorting, we can use the Dates module from Julia’s standard library as described in Section 3.5.1:
function fix_date_column(df::DataFrame)
    strings2dates(dates::Vector) = Date.(dates, dateformat"dd-mm-yyyy")
    dates = strings2dates(df[!, :date])
    df[!, :date] = dates
    df
end
fix_date_column(wrong_types())
4×3 DataFrame
 Row │ id     date        age
     │ Int64  Date        String
─────┼───────────────────────────────
   1 │     1  2018-01-28  adolescent
   2 │     2  2019-04-03  adult
   3 │     3  2018-08-01  infant
   4 │     4  2020-11-22  adult
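The conversion above assumes that every entry matches the dd-mm-yyyy format; Date throws an error otherwise. For messier data, one possible approach is to use tryparse and store missing for entries that cannot be parsed. The helper parse_or_missing below is hypothetical and not part of the code in this section:

```julia
using Dates

# Hypothetical helper: returns a Date, or `missing` when the text does not
# match the dd-mm-yyyy format.
function parse_or_missing(s::AbstractString)
    d = tryparse(Date, s, dateformat"dd-mm-yyyy")
    return d === nothing ? missing : d
end

raw = ["28-01-2018", "not a date"]
parsed = parse_or_missing.(raw)
```

Here, parsed holds a Date for the first entry and missing for the second, so the problematic rows can be inspected later instead of aborting the whole conversion.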
Now, sorting will work as intended:
df = fix_date_column(wrong_types())
sort(df, :date)
4×3 DataFrame
 Row │ id     date        age
     │ Int64  Date        String
─────┼───────────────────────────────
   1 │     1  2018-01-28  adolescent
   2 │     3  2018-08-01  infant
   3 │     2  2019-04-03  adult
   4 │     4  2020-11-22  adult
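Note that fix_date_column replaces the column in the DataFrame that is passed in. If the original data should stay untouched, a possible alternative is DataFrames.jl’s transform, which returns a new DataFrame. This is only a sketch on a shortened copy of the data; parse_date is a name chosen for illustration:

```julia
using DataFrames, Dates

df = DataFrame(; id=1:2, date=["28-01-2018", "03-04-2019"])

# `transform` returns a new DataFrame and leaves `df` as it was.
parse_date(s) = Date(s, dateformat"dd-mm-yyyy")
fixed = transform(df, :date => ByRow(parse_date) => :date)

eltype(df.date)     # String
eltype(fixed.date)  # Date
```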
For the age column, we have a similar problem:
sort(wrong_types(), :age)
4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      String
─────┼───────────────────────────────
   1 │     1  28-01-2018  adolescent
   2 │     2  03-04-2019  adult
   3 │     4  22-11-2020  adult
   4 │     3  01-08-2018  infant
This isn’t right, because an infant is younger than adults and adolescents. The solution for this issue and any sort of categorical data is to use CategoricalArrays.jl:
using CategoricalArrays
With the CategoricalArrays.jl package, we can add levels that represent the ordering of our categorical variable to our data:
function fix_age_column(df)
    levels = ["infant", "adolescent", "adult"]
    ages = categorical(df[!, :age]; levels, ordered=true)
    df[!, :age] = ages
    df
end
fix_age_column(wrong_types())
4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      Cat…
─────┼───────────────────────────────
   1 │     1  28-01-2018  adolescent
   2 │     2  03-04-2019  adult
   3 │     3  01-08-2018  infant
   4 │     4  22-11-2020  adult
NOTE: Also note that we are passing the argument ordered=true, which tells CategoricalArrays.jl’s categorical function that our categorical data is “ordered”. Without this, comparisons between categories with < or > would not be possible.
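To make the effect of the ordered argument concrete, the following sketch builds the same values once with and once without it. The variable names are illustrative, and the behaviour of < on unordered values (throwing an ArgumentError) is stated here as an assumption about current CategoricalArrays.jl versions:

```julia
using CategoricalArrays

ages = ["adolescent", "adult", "infant", "adult"]
age_levels = ["infant", "adolescent", "adult"]

ordered_ages = categorical(ages; levels=age_levels, ordered=true)
unordered_ages = categorical(ages; levels=age_levels)  # `ordered` defaults to false

isordered(ordered_ages)    # true
isordered(unordered_ages)  # false

# With ordered levels, `<` follows the level order: infant < adolescent.
ordered_ages[3] < ordered_ages[1]  # true

# Assumption: on unordered values, `<` throws an ArgumentError.
comparison_failed = try
    unordered_ages[3] < unordered_ages[1]
    false
catch err
    err isa ArgumentError
end
```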
Now, we can sort the data correctly on the age column:
df = fix_age_column(wrong_types())
sort(df, :age)
4×3 DataFrame
 Row │ id     date        age
     │ Int64  String      Cat…
─────┼───────────────────────────────
   1 │     3  01-08-2018  infant
   2 │     1  28-01-2018  adolescent
   3 │     2  03-04-2019  adult
   4 │     4  22-11-2020  adult
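The sort order comes from the levels, not from the spelling of the values. As a small self-contained sketch (with a reduced copy of the data, and oldest_first as an illustrative name), we can inspect the levels and also sort from oldest to youngest:

```julia
using CategoricalArrays, DataFrames

age = categorical(["adolescent", "adult", "infant", "adult"];
    levels=["infant", "adolescent", "adult"], ordered=true)
df = DataFrame(; id=1:4, age)

# The levels are stored in the order that we passed in.
levels(df.age)  # ["infant", "adolescent", "adult"]

# `rev=true` reverses the order, so adults come first and the infant last.
oldest_first = sort(df, :age; rev=true)
```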
Because we have defined convenient functions, we can now define our fixed data by just performing the function calls:
function correct_types()
    df = wrong_types()
    df = fix_date_column(df)
    df = fix_age_column(df)
end
correct_types()
4×3 DataFrame
 Row │ id     date        age
     │ Int64  Date        Cat…
─────┼───────────────────────────────
   1 │     1  2018-01-28  adolescent
   2 │     2  2019-04-03  adult
   3 │     3  2018-08-01  infant
   4 │     4  2020-11-22  adult
Since age in our data is ordinal (ordered=true), we can properly compare categories of age:
df = correct_types()
a = df[1, :age]
b = df[2, :age]
a < b
true
By contrast, the comparison would give the wrong answer if the element type were strings:
"infant" < "adult"
false
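The difference can be traced back to what is actually being compared. Strings are compared alphabetically, whereas ordered categorical values are compared by the position of their level. The sketch below illustrates this with levelcode from CategoricalArrays.jl; the small ages vector is made up for the example:

```julia
using CategoricalArrays

# Plain strings compare alphabetically, character by character.
'i' > 'a'           # true, so "infant" sorts after "adult"
"infant" < "adult"  # false

# Ordered categorical values compare by the position of their level instead.
ages = categorical(["infant", "adult"];
    levels=["infant", "adolescent", "adult"], ordered=true)
levelcode(ages[1])  # 1 (infant is the first level)
levelcode(ages[2])  # 3 (adult is the third level)
ages[1] < ages[2]   # true
```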