Ching-Chuan Chen's Blogger

Statistics, Machine Learning and Programming


R data.table - subsets

data.table is a powerful tool for exploring data. But just how fast is it? Here we provide a performance test for subsetting data.

Code:

library(data.table)
library(dplyr)
library(fastmatch)
library(Rcpp)
library(rbenchmark)

perf_test <- function(N) {
  # stack N copies of iris into one data.frame
  tmp <- list()
  for (i in 1:N) tmp[[i]] <- iris
  m <- do.call(rbind, tmp)
  # data.table copy, keyed on Sepal.Width for binary search
  m2 <- data.table(m)
  setkey(m2, "Sepal.Width")
  # numeric matrix copy (first four columns only)
  m3 <- as.matrix(m[, 1:4])
  benchmark(replications = 100, order = "relative",
    data.frame = m[m$Sepal.Width == 3.5, ],
    subset     = subset(m, Sepal.Width == 3.5),
    dt1        = m2[J(3.5)],                      # keyed binary search
    filter_dt  = filter(m, Sepal.Width == 3.5),   # dplyr::filter on the data.frame
    filter_df  = filter(m2, Sepal.Width == 3.5),  # dplyr::filter on the data.table
    dt2        = m2[list(3.5)],                   # keyed binary search via list()
    fmatch     = m2[fmatch(m2$Sepal.Width, 3.5, nomatch = 0L), ],
    matrix     = m3[m3[, 2] == 3.5, ],
    columns = c("test", "replications", "elapsed", "relative")
  )
}
# size of iris
object.size(iris)
# 7088 bytes
# 200x the data
perf_test(200)
# test replications elapsed relative
# 4 filter_dt 100 0.038 1.000
# 5 filter_df 100 0.088 2.316
# 6 dt2 100 0.119 3.132
# 3 dt1 100 0.131 3.447
# 8 matrix 100 0.134 3.526
# 7 fmatch 100 0.222 5.842
# 1 data.frame 100 0.407 10.711
# 2 subset 100 0.490 12.895

# 500x the data
perf_test(500)
# test replications elapsed relative
# 4 filter_dt 100 0.083 1.000
# 5 filter_df 100 0.119 1.434
# 6 dt2 100 0.126 1.518
# 3 dt1 100 0.127 1.530
# 8 matrix 100 0.371 4.470
# 7 fmatch 100 0.517 6.229
# 1 data.frame 100 1.056 12.723
# 2 subset 100 1.224 14.747

# 1000x the data
perf_test(1000)
# test replications elapsed relative
# 3 dt1 100 0.136 1.000
# 6 dt2 100 0.139 1.022
# 4 filter_dt 100 0.159 1.169
# 5 filter_df 100 0.194 1.426
# 8 matrix 100 0.809 5.949
# 7 fmatch 100 1.128 8.294
# 1 data.frame 100 2.157 15.860
# 2 subset 100 2.541 18.684

# 1500x the data
perf_test(1500)
# test replications elapsed relative
# 3 dt1 100 0.144 1.000
# 6 dt2 100 0.148 1.028
# 4 filter_dt 100 0.259 1.799
# 5 filter_df 100 0.287 1.993
# 8 matrix 100 1.204 8.361
# 7 fmatch 100 1.543 10.715
# 1 data.frame 100 3.242 22.514
# 2 subset 100 3.729 25.896

# 3000x the data
perf_test(3000)
# test replications elapsed relative
# 3 dt1 100 0.174 1.000
# 6 dt2 100 0.174 1.000
# 5 filter_df 100 0.405 2.328
# 4 filter_dt 100 0.509 2.925
# 8 matrix 100 2.441 14.029
# 7 fmatch 100 2.993 17.201
# 1 data.frame 100 6.428 36.943
# 2 subset 100 7.458 42.862

# 5000x the data
perf_test(5000)
# test replications elapsed relative
# 6 dt2 100 0.224 1.000
# 3 dt1 100 0.225 1.004
# 5 filter_df 100 0.632 2.821
# 4 filter_dt 100 0.869 3.879
# 8 matrix 100 4.027 17.978
# 7 fmatch 100 4.797 21.415
# 1 data.frame 100 10.578 47.223
# 2 subset 100 12.177 54.362

From the benchmarks above, we can see that dplyr's filter is fast when the data is small (under roughly 10 MB), but data.table's keyed search becomes faster as the data grows. fastmatch is not fast here. HAHA!! Even plain data.frame subsetting is slower than matrix subsetting. data.table is well worth learning!
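For readers who have not used keys before, here is a minimal sketch of what the dt1/dt2 cases are doing (the toy table and the lookup value 3.5 are purely illustrative): setkey() physically sorts the table by the chosen column, and a join-style subset such as DT[J(3.5)] is then answered with a binary search on the sorted key instead of a scan of the whole column.

library(data.table)

# toy table built from iris; the columns and the value 3.5 are illustrative
DT <- data.table(Sepal.Width = rep(iris$Sepal.Width, 100),
                 Petal.Width = rep(iris$Petal.Width, 100))

# unkeyed subset: a vector scan over the whole Sepal.Width column
res_scan <- DT[Sepal.Width == 3.5]

# keyed subset: setkey() sorts DT by Sepal.Width once,
# then J()/list() triggers a binary search on the key
setkey(DT, Sepal.Width)
res_key <- DT[J(3.5)]

# both approaches return the same rows, just found differently
all.equal(sort(res_scan$Petal.Width), sort(res_key$Petal.Width))

Note that setkey() sorts the table once up front (in the benchmark above this cost is paid outside the timed section), so keyed lookups mainly pay off when the same keyed table is queried many times.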

My environment is Ubuntu 14.04 with R 3.1.1 compiled by the Intel C++ and Fortran compilers and linked against MKL. My CPU is an Intel Core i7-3770K @ 4.3 GHz.