Data Filtering In Julia: Everything You Need To Know | by Emma Boudreau | Jul, 2023


One of the great things about Julia is the language's extensibility. In Julia, any module can extend the functions provided by the Base module by adding new methods for its own types. In other words, a module's types can blend seamlessly with Base and often be treated like Base types. This means that if we learn how these functions work in Base, we will probably be able to carry a lot of that knowledge with us into other modules. Today we will demonstrate this by starting with Base and then moving on to filtering a structure from a dependency: a DataFrame from DataFrames.
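
To make the idea of extending Base concrete, here is a minimal sketch of a type that adds its own method to Base.filter. The Readings type and its values field are hypothetical names invented for this illustration; they do not come from any package.

```julia
# Hypothetical container type, invented for this sketch.
struct Readings
    values::Vector{Float64}
end

# Add a new method to the existing Base.filter function.
# It filters the wrapped Vector and re-wraps the result.
Base.filter(f, r::Readings) = Readings(filter(f, r.values))

r = Readings([1.5, 20.0, 3.2])

# The familiar Base call now works on our own type.
small = filter(v -> v < 10, r)
```

After this definition, code that already knows how to call filter needs nothing new to work with Readings, which is the same mechanism a package like DataFrames relies on.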

Filtering base types

There are a few different techniques that can be used to filter a simple Vector. One is to provide a conditional mask as the index, which is often called logical indexing. I am not sure how long this has been included with Base, but it is certainly an awesome feature, as I love conditional masks. To create a conditional mask, we need a BitArray, which we get by broadcasting a comparison operator over the Vector. Here we will filter any value above 14 out of x:

x = [5, 10, 15, 20]
xmask = x .< 14
x[xmask]
2-element Vector{Int64}:
5
10
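
Masks can also be combined before indexing. The following is a small sketch using only Base broadcasting; the variable names mask, selected, and rejected are just illustrative.

```julia
x = [5, 10, 15, 20]

# Each broadcast comparison yields a BitVector; .& combines them element-wise.
mask = (x .> 5) .& (x .< 20)

# Indexing with the mask keeps only the positions where the mask is true.
selected = x[mask]

# .! inverts the mask, selecting everything the mask rejected.
rejected = x[.!mask]
```

Because the mask is an ordinary value, it can be built once and reused to index any other Vector of the same length.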

Alternatively, we could use the filter functions: filter and filter!. The two do exactly the same thing; the only difference is that filter! mutates the collection it is given. This is precisely what the ! in function names is meant to represent. I find that to be a really cool convention, as it makes it easier to discern when things are being mutated and when they are not. I think that is a great thing to know, especially when it comes to Data Science. filter takes a Function as the first positional argument and our Vector as the second positional argument. What the function receives might change slightly if the collection is not a Vector, so keep that in mind.

filter(x::Int64 -> x < 14, x)

2-element Vector{Int64}:
5
10
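
The difference between the two functions is easiest to see side by side. This is a minimal sketch; kept and len_before are illustrative names.

```julia
x = [5, 10, 15, 20]

# filter returns a new Vector and leaves x alone.
kept = filter(v -> v < 14, x)
len_before = length(x)   # x still has all four elements at this point

# filter! removes the rejected elements from x itself.
filter!(v -> v < 14, x)
```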

Because the call filter(x::Int64 -> x < 14, x) is not mutating, x itself is left unchanged; to keep the result we would need to assign the return value back to x, or call filter! instead. Another thing we can filter using this technique is a dictionary. Here the function does not receive a single element of a Vector; it receives a Pair holding a key and its value, so the argument annotation is a Pair type.

mydict = Dict(:A => [5, 10], :B => [4, 10])

filter(k::Pair{Symbol, Vector{Int64}} -> k[2][1] != 5, mydict)

Dict{Symbol, Vector{Int64}} with 1 entry:
:B => [4, 10]
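
The type annotation on the Pair is optional. As a sketch of two other ways to reach the key and value, the example below uses the first and second fields of a Pair and argument destructuring; the extra :C entry is added only for illustration.

```julia
mydict = Dict(:A => [5, 10], :B => [4, 10], :C => [7])

# Each element passed to the function is a Pair:
# p.first is the key and p.second is the value.
long_entries = filter(p -> length(p.second) > 1, mydict)

# Destructuring in the argument list names the key and value directly.
without_a = filter(((k, v),) -> k != :A, mydict)
```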

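Because the function is the first positional argument, we can also use the do syntax, which passes the block that follows the call as that first argument. This is handy when the condition needs more than one line, so definitely keep it in mind.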

x = [5, 10, nothing, nothing, 40]

filter!(x) do number
    # keep every element that is not nothing
    !isnothing(number)
end

3-element Vector{Union{Nothing, Int64}}:
5
10
40

Filtering dataframes

Another common type of structure that might need to be filtered is the DataFrame. This is a bit different because it comes from DataFrames, an external package that has to be added as a dependency, rather than from Base.

using DataFrames

df = DataFrame(:X => [1, 2, 3, 4], :Y => [1, 2, 3, 4])

When filter is used on a DataFrame, it passes each row to the function as a DataFrameRow. This is a cool type: we can index it by column name pretty easily, and this makes filtering a breeze.

filter!(df) do row
    if row[:X] > 3
        return false
    end
    true
end
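
DataFrames also accepts a column => function pair in place of a row function, so the predicate receives column values directly. The sketch below assumes the DataFrames package is installed; kept and both are illustrative names.

```julia
using DataFrames

df = DataFrame(:X => [1, 2, 3, 4], :Y => [1, 2, 3, 4])

# Column => predicate: the function receives each value of :X, not a whole row.
kept = filter(:X => x -> x <= 3, df)

# Several columns can be named; the predicate then takes one argument per column.
both = filter([:X, :Y] => (x, y) -> x > 1 && y < 4, df)
```

Both calls use the non-mutating filter, so df keeps all four rows; swapping in filter! would change df in place, exactly as with a Vector.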

That really is all there is to it, and with the preexisting knowledge from Base, it might be hard to find things that are not possible to filter with this technique!


