Load and Save Files - Julia Data Science

4.1 Load and Save Files

Having only data inside Julia programs and not being able to load or save it would be very limiting. Therefore, we start by mentioning how to store files to and load files from disk. We focus on CSV, see Section 4.1.1, and Excel, see Section 4.1.2, file formats since those are the most common data storage formats for tabular data.

4.1.1 CSV

Comma-separated values (CSV) files are are very effective way to store tables. CSV files have two advantages over other data storage files. First, it does exactly what the name indicates it does, namely storing values by separating them using commas ,. This acronym is also used as the file extension. So, be sure that you save your files using the “.csv” extension such as “myfile.csv”. To demonstrate how a CSV file looks, we can add the CSV.jl package using the Pkg REPL mode (Section 3.5.4.2):

julia> ]

pkg> add CSV

and load it via:

using CSV

We can now use our previous data:

grades_2020()

name	grade_2020
Sally	1.0
Bob	5.0
Alice	8.5
Hank	4.0

and read it from a file after writing it:

function write_grades_csv()
    path = "grades.csv"
    CSV.write(path, grades_2020())
end

path = write_grades_csv()
read(path, String)

name,grade_2020
Sally,1.0
Bob,5.0
Alice,8.5
Hank,4.0

Here, we also see the second benefit of CSV data format: the data can be read by using a simple text editor. This differs from many alternative data formats which require proprietary software, e.g. Excel.

This works wonders, but what if our data contains commas , as values? If we were to naively write data with commas, it would make the files very hard to convert back to a table. Luckily, CSV.jl handles this for us automatically. Consider the following data with commas ,:

function grades_with_commas()
    df = grades_2020()
    df[3, :name] = "Alice,"
    df
end
grades_with_commas()

Table 5: Grades with commas.
name	grade_2020
Sally	1.0
Bob	5.0
Alice,	8.5
Hank	4.0

If we write this, we get:

function write_comma_csv()
    path = "grades-commas.csv"
    CSV.write(path, grades_with_commas())
end
path = write_comma_csv()
read(path, String)

name,grade_2020
Sally,1.0
Bob,5.0
"Alice,",8.5
Hank,4.0

So, CSV.jl adds quotation marks " around the comma-containing values. Another common way to solve this problem is to write the data to a tab-separated values (TSV) file format. This assumes that the data doesn’t contain tabs, which holds in most cases.

Also, note that TSV files can also be read using a simple text editor, and these files use the “.tsv” extension.

function write_comma_tsv()
    path = "grades-comma.tsv"
    CSV.write(path, grades_with_commas(); delim='\t')
end
read(write_comma_tsv(), String)

name    grade_2020
Sally   1.0
Bob 5.0
Alice,  8.5
Hank    4.0

Text file formats like CSV and TSV files can also be found that use other delimiters, such as semicolons “;”, spaces “ ”, or even something as unusual as “π”.

function write_space_separated()
    path = "grades-space-separated.csv"
    CSV.write(path, grades_2020(); delim=' ')
end
read(write_space_separated(), String)

name grade_2020
Sally 1.0
Bob 5.0
Alice 8.5
Hank 4.0

By convention, it’s still best to give files with special delimiters, such as “;”, the “.csv” extension.

Loading CSV files using CSV.jl is done in a similar way. You can use CSV.read and specify in what kind of format you want the output. We specify a DataFrame.

path = write_grades_csv()
CSV.read(path, DataFrame)

name	grade_2020
Sally	1.0
Bob	5.0
Alice	8.5
Hank	4.0

Conveniently, CSV.jl will automatically infer column types for us:

path = write_grades_csv()
df = CSV.read(path, DataFrame)

4×2 DataFrame
 Row │ name     grade_2020
     │ String7  Float64
─────┼─────────────────────
   1 │ Sally           1.0
   2 │ Bob             5.0
   3 │ Alice           8.5
   4 │ Hank            4.0

It works even for far more complex data:

my_data = """
    a,b,c,d,e
    Kim,2018-02-03,3,4.0,2018-02-03T10:00
    """
path = "my_data.csv"
write(path, my_data)
df = CSV.read(path, DataFrame)

1×5 DataFrame
 Row │ a        b           c      d        e
     │ String3  Date        Int64  Float64  DateTime
─────┼──────────────────────────────────────────────────────────
   1 │ Kim      2018-02-03      3      4.0  2018-02-03T10:00:00

These CSV basics should cover most use cases. For more information, see the CSV.jl documentation and especially the CSV.File constructor docstring.

4.1.2 Excel

There are multiple Julia packages to read Excel files. In this book, we will only look at XLSX.jl, because it is the most actively maintained package in the Julia ecosystem that deals with Excel data. As a second benefit, XLSX.jl is written in pure Julia, which makes it easy for us to inspect and understand what’s going on under the hood.

Load XLSX.jl via

using XLSX:
    eachtablerow,
    readxlsx,
    writetable

To write files, we define a little helper function for data and column names:

function write_xlsx(name, df::DataFrame)
    path = "$name.xlsx"
    data = collect(eachcol(df))
    cols = names(df)
    writetable(path, data, cols)
end

Now, we can easily write the grades to an Excel file:

function write_grades_xlsx()
    path = "grades"
    write_xlsx(path, grades_2020())
    "$path.xlsx"
end

When reading it back, we will see that XLSX.jl puts the data in a XLSXFile type and we can access the desired sheet much like a Dict:

path = write_grades_xlsx()
xf = readxlsx(path)

XLSXFile("grades.xlsx") containing 1 Worksheet
            sheetname size          range        
-------------------------------------------------
               Sheet1 5x2           A1:B5

xf = readxlsx(write_grades_xlsx())
sheet = xf["Sheet1"]
eachtablerow(sheet) |> DataFrame

name	grade_2020
Sally	1.0
Bob	5.0
Alice	8.5
Hank	4.0

Notice that we cover just the basics of XLSX.jl but more powerful usage and customizations are available. For more information and options, see the XLSX.jl documentation.

4 DataFrames.jl ← → 4.2 Index and Summarize

Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso