Whereas filter
removes rows, select
removes columns. However, select
is much more versatile than just removing columns, as we will discuss in this section. First, let’s create a dataset with multiple columns:
function responses()
id = [1, 2]
q1 = [28, 61]
q2 = [:us, :fr]
q3 = ["F", "B"]
q4 = ["B", "C"]
q5 = ["A", "E"]
DataFrame(; id, q1, q2, q3, q4, q5)
end
responses()
id | q1 | q2 | q3 | q4 | q5 |
---|---|---|---|---|---|
1 | 28 | us | F | B | A |
2 | 61 | fr | B | C | E |
Here, the data represents answers for five questions (q1
, q2
, …, q5
) in a given questionnaire. We will start by “selecting” a few columns from this dataset. As usual, we use symbols to specify columns:
select(responses(), :id, :q1)
id | q1 |
---|---|
1 | 28 |
2 | 61 |
We can also use strings if we want:
select(responses(), "id", "q1", "q2")
id | q1 | q2 |
---|---|---|
1 | 28 | us |
2 | 61 | fr |
Additionally, we can use Regular Expressions with Julia’s regex string literal. A string literal in Julia is a prefix that you use while constructing a String
. For example, the regex string literal can be created with r"..."
where ...
is the Regular Expression. For example, suppose you only want to select the columns that start with q
:
select(responses(), r"^q")
q1 | q2 | q3 | q4 | q5 |
---|---|---|---|---|
28 | us | F | B | A |
61 | fr | B | C | E |
NOTE: We won’t cover regular expressions in this book, but you are encouraged to learn about them. To build and test regular expressions interactively, we advice to use online tools for them such as https://regex101.com/.
To select everything except one or more columns, use Not
with either a single column:
select(responses(), Not(:q5))
id | q1 | q2 | q3 | q4 |
---|---|---|---|---|
1 | 28 | us | F | B |
2 | 61 | fr | B | C |
Or, with multiple columns:
select(responses(), Not([:q4, :q5]))
id | q1 | q2 | q3 |
---|---|---|---|
1 | 28 | us | F |
2 | 61 | fr | B |
It’s also fine to mix and match columns that we want to preserve with columns that we do Not
want to select:
select(responses(), :q5, Not(:q5))
q5 | id | q1 | q2 | q3 | q4 |
---|---|---|---|---|---|
A | 1 | 28 | us | F | B |
E | 2 | 61 | fr | B | C |
Note how q5
is now the first column in the DataFrame
returned by select
. There is a more clever way to achieve the same using :
. The colon :
can be thought of as “all the columns that we didn’t include yet”. For example:
select(responses(), :q5, :)
q5 | id | q1 | q2 | q3 | q4 |
---|---|---|---|---|---|
A | 1 | 28 | us | F | B |
E | 2 | 61 | fr | B | C |
Or, to put q5
at the second position16:
select(responses(), 1, :q5, :)
id | q5 | q1 | q2 | q3 | q4 |
---|---|---|---|---|---|
1 | A | 28 | us | F | B |
2 | E | 61 | fr | B | C |
NOTE: As you might have observed there are several ways to select a column. These are known as column selectors.
We can use:
Symbol
:select(df, :col)
String
:select(df, "col")
Integer
:select(df, 1)
RegEx
:select(df, r"RegEx")
Even renaming columns is possible via select
using the source => target
pair syntax:
select(responses(), 1 => "participant", :q1 => "age", :q2 => "nationality")
participant | age | nationality |
---|---|---|
1 | 28 | us |
2 | 61 | fr |
Additionally, thanks to the “splat” operator ...
(see Section 3.3.11), we can also write:
renames = (1 => "participant", :q1 => "age", :q2 => "nationality")
select(responses(), renames...)
participant | age | nationality |
---|---|---|
1 | 28 | us |
2 | 61 | fr |