Sparklyr - Table

1 - About

Once a connection is made, a Spark table (or data frame) is manipulated with dplyr. See R - Dplyr (Data Frame Operations).

3 - Management

3.1 - Initialize

3.1.1 - Local (Load)

  • Load a local data set such as iris into Spark. The new table is temporary and exists only for the current connection to the source.

iris_tbl <- dplyr::copy_to(sc, iris)
flights_tbl <- dplyr::copy_to(sc, nycflights13::flights, "flights")

3.1.2 - Remote

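  • Reference a table that already exists in Spark. This creates a lazy reference and does not copy the data into R.

flights_tbl <- dplyr::tbl(sc, from = "flights")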

3.2 - List

  • You can see the tables in the Spark view of the IDE

  • List the tables

dplyr::src_tbls(sc)


[1] "iris"


  • Reference one of the tables and preview its first rows

sample_tbl <- dplyr::tbl(sc, from = "hivesampletable")
head(sample_tbl)


# Source:   lazy query [?? x 11]
# Database: spark_connection
  clientid querytime market deviceplatform devicemake devicemodel state  country querydwelltime sessionid
  <chr>    <chr>     <chr>  <chr>          <chr>      <chr>       <chr>  <chr>            <dbl>     <dbl>
1 8        18:54:20  en-US  Android        Samsung    SCH-i500    Calif~ United~         13.9           0
2 23       19:19:44  en-US  Android        HTC        Incredible  Penns~ United~        NaN             0
3 23       19:19:46  en-US  Android        HTC        Incredible  Penns~ United~          1.48          0
4 23       19:19:47  en-US  Android        HTC        Incredible  Penns~ United~          0.246         0
5 28       01:37:50  en-US  Android        Motorola   Droid X     Color~ United~         20.3           1
6 28       00:53:31  en-US  Android        Motorola   Droid X     Color~ United~         16.3           0
# ... with 1 more variable: sessionpagevieworder <dbl>

3.4 - Query


sample_tbl %>%
  group_by(market) %>%
  summarise(count = n(), queryDwellTime = mean(querydwelltime)) %>%
  filter(count > 20, queryDwellTime > 30) %>%
  collect()


# A tibble: 11 x 3
   market count queryDwellTime
   <chr>  <dbl>          <dbl>
 1 es-ES     30          110. 
 2 en-CA     71           60.1
 3 en-IN     37           80.7
 4 it-IT     33           80.8
 5 fr-FR     55           45.9
 6 zh-CN    101           45.8
 7 en-GB   1817           82.5
 8 de-DE     52           63.9
 9 da-DK     31           51.8
10 en-AU     53           76.3
11 en-US  57303        27791. 
Warning message:
Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning 
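The warning above comes from the translation of mean() to SQL. A minimal sketch of how it can be silenced, and of how the generated SQL can be inspected before anything runs, is given below. It assumes a local Spark installation reachable with master = "local", and it uses a small hypothetical stand-in table (sample_demo) instead of hivesampletable.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Hypothetical stand-in for hivesampletable, with only the two columns used here.
sample_df <- data.frame(
  market = c("en-US", "en-US", "en-GB", "en-GB"),
  querydwelltime = c(13.9, 1.48, 20.3, 16.3)
)
sample_tbl <- copy_to(sc, sample_df, "sample_demo", overwrite = TRUE)

# na.rm = TRUE states explicitly what SQL does anyway, which silences the warning.
by_market <- sample_tbl %>%
  group_by(market) %>%
  summarise(count = n(), queryDwellTime = mean(querydwelltime, na.rm = TRUE))

# The query is still lazy: show_query() prints the SQL without running it.
show_query(by_market)

# collect() runs the query in Spark and returns a local tibble.
result <- collect(by_market)

spark_disconnect(sc)
```

Nothing is computed until collect() is called, which is why the earlier preview is reported as a lazy query with an unknown number of rows.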

4 - Support

4.1 - Livy - Message (54876255 bytes) exceeds maximum allowed size (52428800 bytes)

When trying to load a big data frame such as the flights data, you may get the following error:


Error in livy_validate_http_response("Failed to invoke livy statement",  : 
  Failed to invoke livy statement (Server error: (500) Internal Server Error): "java.util.concurrent.ExecutionException: io.netty.handler.codec.EncoderException: java.lang.IllegalArgumentException: Message (54876255 bytes) exceeds maximum allowed size (52428800 bytes)."

This maximum size is specified in the parameter livy.rsc.rpc.max.size. See Configuring the rpc.max.size setting.

It must be set at both the system and the session scope. Unfortunately, it seems that you can't set it with sparklyr: the livy_config function has no conf parameter to set it at the session level.
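One way to anticipate this error is to estimate how large a data frame is before copying it. The sketch below is plain R and only an approximation: it assumes that the size of R's own serialization is a usable proxy for the size of the Livy message, which has not been verified here. The helper names estimate_payload_bytes and split_under_budget are hypothetical and are not part of sparklyr.

```r
# Limit quoted in the error message above (52428800 bytes).
livy_rpc_max_size <- 52428800

# Rough size of a data frame once serialized by R.
# Assumption: this is only a proxy for what is actually sent through Livy.
estimate_payload_bytes <- function(df) {
  length(serialize(df, connection = NULL))
}

# Split a data frame into row chunks whose estimated size stays under a budget.
split_under_budget <- function(df, budget_bytes = livy_rpc_max_size / 2) {
  n_chunks <- max(1, ceiling(estimate_payload_bytes(df) / budget_bytes))
  rows_per_chunk <- ceiling(nrow(df) / n_chunks)
  chunk_id <- ceiling(seq_len(nrow(df)) / rows_per_chunk)
  split(df, chunk_id)
}

# Small demonstration with iris and an artificially low budget.
estimate_payload_bytes(iris) < livy_rpc_max_size
chunks <- split_under_budget(iris, budget_bytes = 2000)
length(chunks)
```

Each chunk could then be copied separately with dplyr::copy_to and combined on the Spark side. Whether this avoids the limit in a given Livy setup is an assumption to check, not a confirmed workaround.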

