R - Data Table

Card Puncher Data Processing

About

A data table is an enhanced data.frame.

data.tables (and data.frames) are internally lists as well, but with all its columns of equal length and with a class attribute.

Create

  • data.table function
DT = data.table(
    ID = c("b","b","b","a","a","c"), 
    a = 1:6, 
    b = 7:12, 
    c = 13:18)
  • fread()

Syntax

  • Take DT, filter rows using i, then calculate j, grouped by by.
DT[i, j, (by|keyby)]

    DT[ i,  j,  by ] # + extra arguments
        |   |   |
        |   |    -------> grouped by what?
        |    -------> what to do?
         ---> on which rows?
R SQL
i where (subset, filtering) and order with the order() function
j select columns
* the data.table way: DT[, .(colA, colB)]
* the data.frame way: DT[, c(“colA”, “colB”), with = FALSE]
j compute DT[, .(sumA = sum(colA), meanB = mean(colB))]
by group by
keyby group by ordered

Filter (Subset)

  • Columns can be referred to as if they are variables. In data.frames a comma at the end is necessary but not here
ans <- flights[origin == "JFK" & month == 6L]
  • Indexing. Get the first two rows from flights.
ans <- flights[1:2]

Order

Without group by

Sort flights first by column origin in ascending order, and then by dest in descending order

ans <- flights[order(origin, -dest)]

With group by

  • data.table retains the original order of groups (by design) but we can force it with the keyby keyword.
flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        keyby = .(origin, dest, month)]

keyby() is applied after performing the operation, i.e., on the computed result and sets a key after ordering by setting an attribute called sorted.

Select

  • Select one column
# return it as a vector.
ans <- flights[, arr_delay]
# return as a data.table 
ans <- flights[, list(arr_delay)]
ans <- flights[, .(arr_delay)] # The point is an alias to list.
  • Select two columns
ans <- flights[, .(arr_delay, dep_delay)]
  • Specify alias (ie rename them)
ans <- flights[, .(delay_arr = arr_delay, delay_dep = dep_delay)]

Compute

  • How many trips have had total delay < 0?. Computation gets by default the name Vn (ie V1, V2,…)
ans <- flights[, sum((arr_delay + dep_delay) < 0)]
  • Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June.
ans <- flights[origin == "JFK" & month == 6L,
               .(m_arr = mean(arr_delay), m_dep = mean(dep_delay))]
  • How many trips have been made in 2014 from “JFK” airport in the month of June?
ans <- flights[origin == "JFK" & month == 6L, length(dest)] # length = count (can have any argument)
ans <- flights[origin == "JFK" & month == 6L, .N] #.N is an alias for length

Grouping

  • the number of trips corresponding to each origin airport?
ans <- flights[, .(.N), by = .(origin)]

* the number of trips for each origin airport for carrier code “AA”

ans <- flights[carrier == "AA", .N, by = origin]
  • the total number of trips for each origin, dest pair for carrier code “AA”
ans <- flights[carrier == "AA", .N, by = .(origin,dest)]
  • the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”
ans <- flights[carrier == "AA",
        .(mean(arr_delay), mean(dep_delay)),
        by = .(origin, dest, month)]
  • Expression in the group by statement (output will be two boolean columns)
ans <- flights[, .N, .(dep_delay>0, arr_delay>0)]

Subset of Data (Partition)

.SD for Subset of Data:

  • is a data.table that holds the data for the current group defined using by.
  • contains all the columns except the grouping columns by default.
  • preserves the original order

Example:

  • print
DT[, print(.SD), by = col]
  • compute on (multiple) columns with lapply.
DT[, lapply(.SD, mean), by = ID]
  • specify the SD columns (By default, it contains all the columns other than the grouping variables)
flights[carrier == "AA",                       ## Only on trips with carrier "AA"
        lapply(.SD, mean),                     ## compute the mean
        by = .(origin, dest, month),           ## for every 'origin,dest,month'
        .SDcols = c("arr_delay", "dep_delay")] ## for just those specified in .SDcols
  • the first two rows for each month?
ans <- flights[, head(.SD, 2), by = month]

Chaining

  • Group by + order by
ans <- flights[carrier == "AA", .N, by = .(origin, dest)][order(origin, -dest)]
  • The expression can also be chained vertically.
DT[ ...
  ][ ...
  ][ ...
  ]





Discover More
Model Funny
Function - Cumulative Function

With a where: the first pass calculate the sum of an account for a date and order the set the second pass calculate the cumulatif sum on the partition (.SD) by account
Card Puncher Data Processing
R - List

The list in R may contain elements of the different class (just like a data frame) class a vector (1 dimension) or a matrix (2 dimensions) data.tables and data.frames are internally lists with all...
Card Puncher Data Processing
R - TopN Analysis

One column All column the first two rows for each month with a



Share this page:
Follow us:
Task Runner