Sparklyr Issue with changing a csv file into a table format

I am using sparklyr to generate results with the Louvain algorithm. I have two CSV files, one called nodes (13 GB) and one called edges (209 GB). I am trying to read both files into memory in R using spark_read_csv, and then using tbl_df to convert them to a table format.

I am currently using the client approach of sparklyr, with the master set as “local” and the number of cores set to 15.

I have split the 13 GB CSV file into two files of 6.2 GB and 6.3 GB, because I was receiving recurring errors when trying to read the entire data set in.

The current configurations I have set for the sparklyr connections are given below:

conf$`sparklyr.cores.local` <- 15
conf$`sparklyr.shell.driver-memory` <- "240G"
conf$spark.executor.memory <- "240G"
conf$spark.memory.fraction <- 0.8
conf$spark.driver.cores <- 15
conf$spark.driver.maxResultSize <- "24G"

I currently have 515 GB of memory free on the server that I am using. I have set sparklyr.shell.driver-memory to "240g", which is the memory available on the machine minus the amount needed for other operations. I have set spark.executor.memory to 24g, to run over 10 cores. The heartbeat timeout interval has been set to 10000000 to avoid any timeout issues. I am receiving a Java heap size error when trying to use the collect command on the 6.2 GB CSV file.

Both CSV files together add up to 222 GB, so I have set the executor memory to a little more than this. I have also set the driver memory to 250g to allow for any work the driver may have to do. The maximum Java heap size has also been set to 24g; however, the error I receive when trying to convert my CSV file into a table data frame is a Java heap size error.
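As a sanity check on the memory budget described above, the figures can be written out as plain arithmetic. This is only a sketch using the numbers quoted in the text (515 GB free, 250g driver, 240g executor, 13 GB and 209 GB files); the variable names are made up for illustration. It also rests on the assumption that, with the master set to "local", Spark runs the executor inside the driver JVM, so the driver memory is the figure that actually bounds the heap.

```r
# Figures quoted in the post, in GB. Variable names are illustrative only.
free_gb     <- 515        # memory free on the server
driver_gb   <- 250        # spark.driver.memory
executor_gb <- 240        # spark.executor.memory
files_gb    <- 13 + 209   # nodes.csv + edges.csv on disk

# Total memory requested if driver and executor were separate JVMs.
requested_gb <- driver_gb + executor_gb
headroom_gb  <- free_gb - requested_gb

cat("CSV data on disk:  ", files_gb, "GB\n")
cat("Memory requested:  ", requested_gb, "GB\n")
cat("Headroom on server:", headroom_gb, "GB\n")

# Assumption: in local mode only the driver JVM exists, so the question is
# whether the data fits in driver_gb, not in driver_gb + executor_gb.
cat("Driver heap covers the raw CSV size:", driver_gb > files_gb, "\n")
```

Note that the on-disk CSV size is a lower bound: the same data usually takes more room once it is parsed into R or JVM objects, so a comparison against 222 GB is optimistic.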

library(dplyr)
library(igraph)
library(sparklyr)
library(data.table)
library(tidyverse)

#Install sparklyr version 2.0.0
spark_install(version = "2.0.0")

#Disconnect any previous sparklyr sessions
spark_disconnect_all()

#Create an empty frame for sc config
conf <- spark_config()

#The configurations for sparklyr
conf$spark.driver.cores <- 15
conf$spark.executor.memory <- '240g'
conf$spark.driver.memory <- '250g'
#conf$`spark.yarn.executor.memoryOverhead` <- "10g"
#conf$spark.driver.extraJavaOptions="-Xmx9g"
conf[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=-XX:MaxHeapSize=24G"

#Creates the sc connections
sc <- spark_connect(master = "local[15]", version = "2.1.0", config = conf)

#Reads in the nodes csv file
nodes <- spark_read_csv(sc, name = "nodes", path = "/etl/louvain/R/cat/node_R.csv", header = TRUE)

#Turns the above into a table format
#This is where the error is produced
#Error: org.apache.spark.SparkException: Job 6 cancelled because SparkContext was shut down
nodes1 <- tbl_df(nodes)
nodes_tbl <- copy_to(sc, nodes)

#Counts the node as below
nodes %>% count

#Creates four subsets of the nodes file
nodes2 <- nodes1 %>% slice(1:88805991)
nodes3 <- nodes1 %>% slice(88805991:177611982)
nodes4 <- nodes1 %>% slice(177611982:266417973)
nodes5 <- nodes1 %>% slice(266417973:355223963)

#Uses each of the above data frames and collects them to use as a R frame
collect(nodes2)
collect(nodes3)
collect(nodes4)
collect(nodes5)

Nodes <- rbind(nodes2, nodes3, nodes4, nodes5, by = 'id')
nodesNew <- h[!duplicated(h$id),]
write.csv(nodesNew, file = "/etl/louvain/R/cat/nodesNew.csv", row.names=FALSE)

edges <- spark_read_csv(sc, name = "edges", path = "/etl/louvain/R/awk/newEdge.csv", header = TRUE)
edges1 <- tbl_df(edges)
#nodes <- distinct(nodes, "id")

el1 <- collect(edges)
el=as.matrix(el1)
el[,1]=as.character(el[,1])
el[,2]=as.character(el[,2])

install.packages("graphframes")
library(graphframes)
library(igraph)

clustergraph1 <- graph_from_data_frame(el, directed = FALSE, vertices = n2)

#Assigns the louvain algorithm to the above graph
Community200k <- cluster_louvain(clustergraph1)

#Prints the values of each communty
print(Community200k)
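For comparison, the node clean-up step above could in principle stay inside Spark, so that no collect() or tbl_df() call pulls the 13 GB table into the R session. The following is a minimal sketch and not a tested fix for the error: dedupe_nodes_in_spark is a hypothetical helper name, it drops fully duplicated rows with distinct() as a simplification of the !duplicated(h$id) step, and the output path in the usage comment is made up.

```r
# Hypothetical helper (not from the original post): remove duplicate rows
# from the nodes CSV inside Spark and write the result back to disk,
# without collecting the data into R.
dedupe_nodes_in_spark <- function(sc, in_path, out_path) {
  # memory = FALSE asks sparklyr not to cache the whole table up front.
  nodes <- sparklyr::spark_read_csv(
    sc,
    name   = "nodes",
    path   = in_path,
    header = TRUE,
    memory = FALSE
  )

  # Simplification: drops rows that are identical in every column,
  # rather than keeping the first row per id as the post does.
  deduped <- dplyr::distinct(nodes)

  # Spark writes a directory of part files at out_path, not a single CSV.
  sparklyr::spark_write_csv(deduped, path = out_path, header = TRUE)

  invisible(out_path)
}

# Usage, assuming an open connection `sc` as created in the post; the
# output path is an assumption made up for this example:
# dedupe_nodes_in_spark(sc,
#                       in_path  = "/etl/louvain/R/cat/node_R.csv",
#                       out_path = "/etl/louvain/R/cat/nodes_dedup")
```

The design choice here is simply to keep every step lazy and let Spark do the work. cluster_louvain from igraph still needs the graph in R memory, so this only covers the preprocessing part of the script.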
