Published on

A SQL to Julia DataFrames Cheatsheet

Authors
  • avatar
    Name
    Nelson Tang
    Twitter

A SQL to Julia DataFrames dictionary using DataFrames and DataFramesMeta. In general, I use DataFramesMeta since it abstracts away some lower level nuance and it makes for a tidier workflow when I'm constructing queries.

Overview

Querying in DataFrames and DataFramesMeta

SQL ClauseDataFrames EquivalentDataFramesMeta Equivalent
SELECTselect@select
WHEREsubset(df, :colnm => ByRow(x -> x>1))@rsubset(df, :colnm > 1)
LIKEsubset(df, :colnm=>ByRow(x -> occursin(r"string", x)))@rsubset(df, occursin(r"string", :colnm))
GROUP BYgd = groupby(df, :colnm)
combine(gd, :colnm=>sum)
gd = groupby(df, :colnm)
@combine(gd, :newcol=sum(:colnm))
or
@by(df, :colnm, :newcol=sum(:colnm))
SUM(colnm) OVER (PARTITION BY other_colnm) AS newcolgd = groupby(df, :colnm)
transform(gd, :colnm=>sum)
gd = groupby(df, :colnm)
@transform(gd, :newcol=sum(:colnm))
ORDER BY colnm ASCsort(df, :colnm)@orderby(df, :colnm)
ORDER BY colnm DESCsort(df, :colnm, rev=true)@orderby(df, sort(:colnm, rev=true)

Joins

SQL ClauseDataFrames Equivalent
INNER JOINinnerjoin(df, df2, on=:colnm)
innerjoin(df, df2, on=[:left=>:right])
LEFT JOINleftjoin(df, df2, on=:colnm)
leftjoin(df, df2, on=[:left=>:right])
RIGHT JOINrightjoin(df, df2, on=:colnm)
rightjoin(df, df2, on=[:left=>:right])
OUTER JOINouterjoin(df, df2, on=:colnm)
outerjoin(df, df2, on=[:left=>:right])
SELECT * FROM table1, table2
aka cartesian product or crossjoin
crossjoin

A Worked Example

The Dataset

We'll manually create a dataset of employees df:

using CSV, DataFramesMeta, Statistics, Dates

# DataFrame(column=data)
df = DataFrame(id=1:8,
               first_name=["Michael", "Dwight", "Angela", "Jim", "Pam", "Oscar", "Meredith", "Creed"],
               last_name=["Scott", "Schrute", "Martin", "Halpert", "Beesly", "Nunez", "Palmer", "Bratton"],
               department=["Management & Admin", "Sales", "Accounting", "Sales", "Management & Admin", "Accounting",
                           "Purchasing", "Purchasing"],
               salary=[5100, 4200, 3750, 4300, 2200, 3400, 3300, 3200])

8×5 DataFrame
 Row │ id     first_name  last_name  department          salary
     │ Int64  String      String     String              Int64
─────┼──────────────────────────────────────────────────────────
   11  Michael     Scott      Management & Admin    5100
   22  Dwight      Schrute    Sales                 4200
   33  Angela      Martin     Accounting            3750
   44  Jim         Halpert    Sales                 4300
   55  Pam         Beesly     Management & Admin    2200
   66  Oscar       Nunez      Accounting            3400
   77  Meredith    Palmer     Purchasing            3300
   88  Creed       Bratton    Purchasing            3200

Let's create a sales database called db_sales with client information (thanks to this site)

# Parse dates as Date objects
dates = ["1-2-2006", "1-29-2006", "2-1-2006", "2-14-2006", "3-1-2006", "3-20-2006"]
dates = parse.(Date, dates, dateformat"m-d-y")

db_sales = DataFrame(id=1:6,
                     transaction_date=dates,
                     employee_id=[4, 2, 4, 2, 4, 2],
                     quantity=[100, 500, 600, 200, 400, 250],
                     customer=["Dunmore High School", "Harper Collins", "Blue Cross of Pennsylvania",
                                "Apex Technology", "Blue Cross of Pennsylvania",
                                "Stone, Cooper, and Grandy: Attorneys at Law"])

6×5 DataFrame
 Row │ id     transaction_date  employee_id  quantity  customer
     │ Int64  Date              Int64        Int64     String
─────┼───────────────────────────────────────────────────────────────────────────────────
   11  2006-01-02                  4       100  Dunmore High School
   22  2006-01-29                  2       500  Harper Collins
   33  2006-02-01                  4       600  Blue Cross of Pennsylvania
   44  2006-02-14                  2       200  Apex Technology
   55  2006-03-01                  4       400  Blue Cross of Pennsylvania
   66  2006-03-20                  2       250  Stone, Cooper, and Grandy: Attor…

Subsetting Rows

Subsetting rows is possible in base DataFrames, but the syntax in DataFramesMeta is easier for beginners to follow. The special @rsubset macro saves on having to write anonymous functions so it's one less syntactical thing to keep typing every time.

#DataFrames
subset(df, :department => ByRow(x -> occursin("Admin", x)))

#DataFramesMeta
@rsubset(df, occursin("Admin", :department))

2×5 DataFrame
 Row │ id     first_name  last_name  department          salary
     │ Int64  String      String     String              Int64
─────┼──────────────────────────────────────────────────────────
   11  Michael     Scott      Management & Admin    5100
   25  Pam         Beesly     Management & Admin    2200

Matching Text

What if you want to do some string matching with wildcards, i.e. SQL WHERE clause with the LIKE or % operator?

We can use the occursin() function and pass it as an argument to @rsubset, like:

#DataFrames
subset(df, :department => ByRow(x -> occursin("Admin", x)))

#DataFramesMeta
@rsubset(df, occursin("Admin", :department))

2×5 DataFrame
 Row │ id     first_name  last_name  department          salary
     │ Int64  String      String     String              Int64
─────┼──────────────────────────────────────────────────────────
   11  Michael     Scott      Management & Admin    5100
   25  Pam         Beesly     Management & Admin    2200

Adding the r in front of the string lets you use regex to use wildcards and more complex string matching criteria.

Aggregation

i.e. GROUP BY Column and SUM(column)

For regular GROUP BY you first use groupby() and then either combine or @combine, or you can use the @by function as shorthand.

Grouping with @by

@by(df, :department,
        :"Average Salary" = mean(:salary),
        :count=length(:salary))

4×2 DataFrame
 Row │ department          Average Salary
     │ String              Float64
─────┼────────────────────────────────────
   1 │ Management & Admin          3650.0
   2 │ Sales                       4250.0
   3 │ Accounting                  3575.0
   4 │ Purchasing                  3250.0

Grouping with combine and @combine

gd = groupby(df, :department)
#DataFrames
combine(gd, :salary => mean => :"Average Salary",
            :department => length => :count)

#DataFramesMeta
@combine(gd, :"Average Salary" = mean(:salary),
             :count = length(:department))

4×2 DataFrame
 Row │ department          Average Salary
     │ String              Float64
─────┼────────────────────────────────────
   1 │ Management & Admin          3650.0
   2 │ Sales                       4250.0
   3 │ Accounting                  3575.0
   4 │ Purchasing                  3250.0

Window Functions with @transform

In SQL:

AVG(colnm) OVER (PARTITION BY other_colnm) AS newcol

i.e. the PARTITION BY clause, similar to groupby but it returns a value for each row in your table after doing the aggregations in each partition.

In Julia, you can use the @transform macro to do this after grouping.

# Example: add a column that
gd = groupby(df, :department)
#DataFrames
transform(gd, :salary => mean => :"avg_dept_salary")

#DataFramesMeta
@transform(gd, :"avg_dept_salary"=mean(:salary))

8×6 DataFrame
 Row │ id     first_name  last_name  department          salary  avg_dept_salary
     │ Int64  String      String     String              Int64   Float64
─────┼──────────────────────────────────────────────────────────────────────────
   11  Michael     Scott      Management & Admin    5100          3650.0
   22  Dwight      Schrute    Sales                 4200          4250.0
   33  Angela      Martin     Accounting            3750          3575.0
   44  Jim         Halpert    Sales                 4300          4250.0
   55  Pam         Beesly     Management & Admin    2200          3650.0
   66  Oscar       Nunez      Accounting            3400          3575.0
   77  Meredith    Palmer     Purchasing            3300          3250.0
   88  Creed       Bratton    Purchasing            3200          3250.0

And we see indeed that we have the same number of rows as the initial dataset.

Putting together a query with @chain

sales = @chain db_sales begin
    groupby(:employee_id)
    @combine(:total_quantity=sum(:quantity),
             :number_of_customers=length(:customer))
end

result = @chain df begin
                @select(:id, :first_name, :last_name, :department)
                @rsubset(:department=="Sales")
                leftjoin(sales, on=[:id=>:employee_id])
                @orderby(sort(:total_quantity, rev=true))  # descending
          end

2×6 DataFrame
 Row │ id     first_name  last_name  department  total_quantity  number_of_customers
     │ Int64  String      String     String      Int64?          Int64?
─────┼───────────────────────────────────────────────────────────────────────────────
   14  Jim         Halpert    Sales                 1100                    3
   22  Dwight      Schrute    Sales                  950                    3

Alternative: Use DuckDB

Another amazing alternative is to use DuckDB to query dataframes, CSV files, parquet files, etc directly with SQL.

Assuming we had the dataframes created above, we'd query it with SQL with a few lines:

using DuckDB

# create a new in-memory dabase
con = DBInterface.connect(DuckDB.DB)

# register it as a view in the database
DuckDB.register_data_frame(con, df, "my_df")

queryStr = """
SELECT
first_name
, last_name
, department
, salary
, AVG(salary) OVER (PARTITION BY department) AS avg_dept_salary
FROM my_df
"""

# run a SQL query over the DataFrame
results = DBInterface.execute(con, queryStr)
print(results)

# 8×5 DataFrame
#  Row │ first_name  last_name  department          salary  avg_dept_salary
#      │ String?     String?    String?             Int64?  Float64?
# ─────┼────────────────────────────────────────────────────────────────────
#    1 │ Angela      Martin     Accounting            3750           3575.0
#    2 │ Oscar       Nunez      Accounting            3400           3575.0
#    3 │ Michael     Scott      Management & Admin    5100           3650.0
#    4 │ Pam         Beesly     Management & Admin    2200           3650.0
#    5 │ Meredith    Palmer     Purchasing            3300           3250.0
#    6 │ Creed       Bratton    Purchasing            3200           3250.0
#    7 │ Dwight      Schrute    Sales                 4200           4250.0
#    8 │ Jim         Halpert    Sales                 4300           4250.0