About
Once a connection is made, Spark tables (or data frames) are manipulated with dplyr. See R - Dplyr (Data Frame Operations).
Management
Initialize
Local (Load)
- Load the iris data set into Spark. The new object will be temporary, limited to the current connection to the source.
iris_tbl <- dplyr::copy_to(sc, iris)
flights_tbl <- dplyr::copy_to(sc, nycflights13::flights, "flights")
Remote
flights_tbl <- tbl(sc, from="flights")
List
- You can see them in the Spark view
- List the tables
dplyr::src_tbls(sc)
[1] "iris"
Head
sample_tbl <- dplyr::tbl(sc, from="hivesampletable")
head(sample_tbl)
# Source: lazy query [?? x 11]
# Database: spark_connection
clientid querytime market deviceplatform devicemake devicemodel state country querydwelltime sessionid
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 8 18:54:20 en-US Android Samsung SCH-i500 Calif~ United~ 13.9 0
2 23 19:19:44 en-US Android HTC Incredible Penns~ United~ NaN 0
3 23 19:19:46 en-US Android HTC Incredible Penns~ United~ 1.48 0
4 23 19:19:47 en-US Android HTC Incredible Penns~ United~ 0.246 0
5 28 01:37:50 en-US Android Motorola Droid X Color~ United~ 20.3 1
6 28 00:53:31 en-US Android Motorola Droid X Color~ United~ 16.3 0
# ... with 1 more variable: sessionpagevieworder <dbl>
Query
sample_tbl %>%
  group_by(market) %>%
  summarise(count = n(), queryDwellTime = mean(querydwelltime)) %>%
  filter(count > 20, queryDwellTime > 30) %>%
  collect()
# A tibble: 11 x 3
market count queryDwellTime
<chr> <dbl> <dbl>
1 es-ES 30 110.
2 en-CA 71 60.1
3 en-IN 37 80.7
4 it-IT 33 80.8
5 fr-FR 55 45.9
6 zh-CN 101 45.8
7 en-GB 1817 82.5
8 de-DE 52 63.9
9 da-DK 31 51.8
10 en-AU 53 76.3
11 en-US 57303 27791.
Warning message:
Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
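The warning above appears because SQL aggregates drop missing values, whereas R's mean() returns NA unless na.rm = TRUE is given. The following is a minimal sketch on a local data frame, not on Spark: sample_df and its values are invented stand-ins for hivesampletable, and the count threshold is lowered to fit the tiny data. It shows the same pipeline with na.rm = TRUE made explicit, which is what the warning asks for.

```r
library(dplyr)

# Hypothetical stand-in for hivesampletable (values invented for illustration).
sample_df <- data.frame(
  market = c("en-US", "en-US", "en-US", "en-GB", "en-GB", "fr-FR"),
  querydwelltime = c(40, NA, 50, 10, 20, 90)
)

# Without na.rm, a local mean over a column containing NA is NA.
naive_mean <- mean(sample_df$querydwelltime)

# Same shape as the Spark query, with missing values removed explicitly.
result <- sample_df %>%
  group_by(market) %>%
  summarise(
    count = n(),
    queryDwellTime = mean(querydwelltime, na.rm = TRUE)
  ) %>%
  filter(count > 1, queryDwellTime > 30)

print(result)
```

Writing mean(querydwelltime, na.rm = TRUE) in the Spark pipeline should give the same aggregate as before, since the missing values were already being dropped, while making the behaviour explicit.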
Support
Livy - Message (54876255 bytes) exceeds maximum allowed size (52428800 bytes)
When trying to load a big data frame such as the flights data, you may get the following error:
Error in livy_validate_http_response("Failed to invoke livy statement", :
Failed to invoke livy statement (Server error: (500) Internal Server Error): "java.util.concurrent.ExecutionException: io.netty.handler.codec.EncoderException: java.lang.IllegalArgumentException: Message (54876255 bytes) exceeds maximum allowed size (52428800 bytes)."
This maximum size is specified by the livy.rsc.rpc.max.size parameter. See Configuring the rpc.max.size setting.
It must be set at both the system and the session scope. Unfortunately, it seems that you can't set it with sparklyr: there is no conf parameter in the livy_config function to set it at the session level.
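If the limit cannot be raised, one possible workaround is to check the size of a data frame before loading it and to split it into smaller pieces. The sketch below uses base R only. The helpers serialized_size and split_rows are hypothetical names introduced here, and the R-serialized size is only a rough proxy for the size of the message that Livy actually receives.

```r
# Limit quoted in the error message (52428800 bytes = 50 MiB).
livy_max_bytes <- 52428800

# Hypothetical helper: size in bytes of an object once serialized by R.
# This is only a rough proxy for the size of the message Livy receives.
serialized_size <- function(x) {
  length(serialize(x, connection = NULL))
}

# Hypothetical helper: split a data frame into n_chunks consecutive row blocks.
split_rows <- function(df, n_chunks) {
  block <- ceiling(seq_len(nrow(df)) * n_chunks / nrow(df))
  split(df, block)
}

# Demo on the built-in iris data set, which is far below the limit.
size <- serialized_size(iris)
needs_split <- size > livy_max_bytes
pieces <- split_rows(iris, 4)

print(size)
print(vapply(pieces, nrow, integer(1)))
```

Each piece could then be passed to copy_to separately and the results combined on the Spark side. Whether this stays under the Livy limit in practice is an assumption that has not been verified here.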