Sparklyr Issue with changing a csv file into a table format
I am using sparklyr to generate results with the Louvain algorithm. I have two CSV files, one called nodes (13 GB) and one called edges (209 GB). I am trying to read both files in from R using spark_read_csv, and then convert them to a table format using tbl_df.
I am currently using the client approach of sparklyr, with the master set as “local” and the number of cores set to 15.
I have split the 13 GB CSV file into two files of 6.2 GB and 6.3 GB, because I was receiving recurring errors when trying to read the entire data set in.
The current configurations I have set for the sparklyr connections are given below:
conf$`sparklyr.cores.local` <- 15
conf$`sparklyr.shell.driver-memory` <- "240G"
conf$spark.executor.memory <- "240G"
conf$spark.memory.fraction <- 0.8
conf$spark.driver.cores <- 15
conf$spark.driver.maxResultSize <- "24G"
I currently have 515 GB of memory free on the server that I am using. I have set sparklyr.shell.driver-memory to “240G”, which is the memory available on the machine minus the amount needed for other operations. I have set spark.executor.memory to 24g, to run over 10 cores. The heartbeat timeout interval has been set to 10000000 to avoid any timeout issues. I receive a Java heap size error when I call collect on the 6.2 GB CSV file.
Both CSV files together add up to 222 GB, so I have set the executor memory to a little more than this. I have also set the driver memory to 250g to leave headroom for any work the driver has to do. The Java max heap size option has also been set to 24g; however, the error I receive when converting the CSV data into a table data frame is a Java heap size error.
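To make the sizing reasoning above concrete, here is a rough back-of-envelope check in base R. It only uses the figures quoted in this question (515 GB free, 13 GB and 209 GB of CSV, a 250 GB driver heap). The variable names are illustrative, and treating on-disk CSV size as a stand-in for in-memory size is an assumption: the same data held in an R data frame can be considerably larger. The point of the sketch is that collect() copies data into the R session, which lives outside the JVM heap configured for Spark.

```r
# All figures are in GB and come from the question text.
free_gb        <- 515                          # free memory on the server
csv_gb         <- c(nodes = 13, edges = 209)   # on-disk CSV sizes
driver_heap_gb <- 250                          # JVM heap requested for the driver

# Total on-disk size of the data (assumed lower bound for in-memory size).
total_csv_gb <- sum(csv_gb)

# Memory left for the R process once the JVM heap is fully used.
left_for_r_gb <- free_gb - driver_heap_gb

# Would the collected data fit in R if it were no larger than on disk?
fits <- total_csv_gb <= left_for_r_gb

cat("CSV total:", total_csv_gb, "GB\n")
cat("Left for R:", left_for_r_gb, "GB\n")
cat("Fits under the on-disk-size assumption:", fits, "\n")
```

Under these assumptions the margin is small, so any in-memory expansion of the data would push the collected result past what is left for R.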
library(dplyr)
library(igraph)
library(sparklyr)
library(data.table)
library(tidyverse)
#Install Spark version 2.0.0 (note: spark_connect() below requests version 2.1.0)
spark_install(version = "2.0.0")
#Disconnect any previous sparklyr sessions
spark_disconnect_all()
#Create an empty frame for sc config
conf <- spark_config()
#The configurations for sparklyr
conf$spark.driver.cores <- 15
conf$spark.executor.memory <- '240g'
conf$spark.driver.memory <- '250g'
#conf$`spark.yarn.executor.memoryOverhead` <- "10g"
#conf$spark.driver.extraJavaOptions="-Xmx9g"
conf[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=-XX:MaxHeapSize=24G"
#Creates the sc connections
sc <- spark_connect(master = "local[15]",
version = "2.1.0",
config = conf)
#Reads in the nodes csv file
nodes <- spark_read_csv(sc, name = "nodes", path =
"/etl/louvain/R/cat/node_R.csv", header = TRUE)
#Turns the above into a table format
#This is where the error is produced
#Error: org.apache.spark.SparkException: Job 6 cancelled because
#SparkContext was shut down
nodes1 <- tbl_df(nodes)
nodes_tbl <- copy_to(sc, nodes)
#Counts the node as below
nodes %>% count
#Creates four subsets of the nodes file
nodes2 <- nodes1 %>% slice(1:88805991)
nodes3 <- nodes1 %>% slice(88805991:177611982)
nodes4 <- nodes1 %>% slice(177611982:266417973)
nodes5 <- nodes1 %>% slice(266417973:355223963)
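#Collects each of the above data frames to use as an R data frame
nodes2 <- collect(nodes2)
nodes3 <- collect(nodes3)
nodes4 <- collect(nodes4)
nodes5 <- collect(nodes5)
#Stacks the four subsets back into one data frame
Nodes <- rbind(nodes2, nodes3, nodes4, nodes5)
#Keeps only the first row for each id
nodesNew <- Nodes[!duplicated(Nodes$id),]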
write.csv(nodesNew, file = "/etl/louvain/R/cat/nodesNew.csv",
row.names=FALSE)
edges <- spark_read_csv(sc, name = "edges", path =
"/etl/louvain/R/awk/newEdge.csv", header = TRUE)
edges1 <- tbl_df(edges)
#nodes <- distinct(nodes, "id")
el1 <- collect(edges)
el=as.matrix(el1)
el[,1]=as.character(el[,1])
el[,2]=as.character(el[,2])
install.packages("graphframes")
library(graphframes)
library(igraph)
clustergraph1 <- graph_from_data_frame(el, directed = FALSE,
vertices = n2)
#Assigns the louvain algorithm to the above graph
Community200k <- cluster_louvain(clustergraph1)
#Prints the values of each community
print(Community200k)
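
The four slice() ranges in the script above reuse their boundary rows (row 88805991, for example, falls in both nodes2 and nodes3), which would produce duplicate rows before the !duplicated() step. A minimal sketch in base R of computing non-overlapping ranges follows. chunk_ranges is a hypothetical helper, not part of dplyr or sparklyr, and the row count is taken from the upper bound of the last slice() call.

```r
# Hypothetical helper: split rows 1..n_rows into at most n_chunks
# contiguous, non-overlapping ranges of (nearly) equal size.
chunk_ranges <- function(n_rows, n_chunks) {
  size   <- ceiling(n_rows / n_chunks)      # rows per chunk, rounded up
  starts <- seq(1, n_rows, by = size)       # first row of each chunk
  ends   <- pmin(starts + size - 1, n_rows) # last row, capped at n_rows
  data.frame(start = starts, end = ends)
}

# Row count assumed from the last slice() call in the script.
ranges <- chunk_ranges(355223963, 4)
print(ranges)

# Each chunk starts exactly one row after the previous chunk ends.
stopifnot(all(ranges$start[-1] == ranges$end[-nrow(ranges)] + 1))
```

Each row of ranges could then be passed to slice(), for example slice(ranges$start[2]:ranges$end[2]), so that no row is selected twice.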