Sparklyr Issue with changing a csv file into a table format

I am using sparklyr to generate results with the Louvain algorithm. I have two CSV files: one called nodes (13 GB) and one called edges (209 GB). I am trying to read both files into R using spark_read_csv and then convert them to a table format using tbl_df.



I am currently using sparklyr in client mode, with the master set to "local" and the number of cores set to 15.



I have split the 13 GB CSV file into two files of 6.2 GB and 6.3 GB, because I kept receiving errors when trying to read the entire data set in.



The current configurations I have set for the sparklyr connections are given below:


conf$`sparklyr.cores.local` <- 15
conf$`sparklyr.shell.driver-memory` <- "240G"
conf$spark.executor.memory <- "240G"
conf$spark.memory.fraction <- 0.8
conf$spark.driver.cores <- 15
conf$spark.driver.maxResultSize <- "24G"



I currently have 515 GB of memory free on the server that I am using. I have set sparklyr.shell.driver-memory to "240G", which is the memory available on the machine minus the amount needed for other operations. I have set spark.executor.memory to 24g, running over 10 cores. The heartbeat timeout interval has been set to 10000000 to avoid timeout issues. I am receiving a Java heap size error when I use the collect command on the 6.2 GB CSV file.
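For comparison, the settings described above could be gathered into a single spark_config() object, as in the sketch below. It assumes local mode, where the executors run inside the driver JVM, so the driver memory (passed through sparklyr.shell.driver-memory before the JVM starts) is the setting that bounds the heap. The values are simply the ones quoted in this question, not recommendations.

```r
library(sparklyr)

# Start from the default sparklyr configuration.
conf <- spark_config()

# Assumption: local mode, so driver and executors share one JVM and the
# heap is sized by the driver memory setting.
conf$`sparklyr.shell.driver-memory` <- "240G"
conf$`sparklyr.cores.local` <- 15
conf$spark.memory.fraction <- 0.8
conf$spark.driver.maxResultSize <- "24G"

# Connecting is left commented out because it needs a local Spark install.
# sc <- spark_connect(master = "local[15]", config = conf)
```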



Both CSV files together add up to 222 GB, so I have set the executor memory to be a little greater than this. I have also set the driver memory to 250g to allow for any work the driver may have to do. The maximum Java heap size has also been set to 24g; however, the error I receive when trying to convert my CSV file into a table data frame is a Java heap size error.
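Since the heap error appears when the data is pulled into the R session, one possible way around it is to keep the heavy steps inside Spark and bring back only small results. The sketch below is an assumption-laden illustration, not the method used in this question: dedupe_in_spark is a hypothetical helper, and distinct() here removes fully duplicated rows, which is not the same as deduplicating on the id column alone.

```r
library(sparklyr)
library(dplyr)

# Hypothetical helper (not part of the original script): removes fully
# duplicated rows inside Spark and writes the result back to disk, so the
# full table never has to be collected into R.
dedupe_in_spark <- function(sc, in_path, out_path) {
  # memory = FALSE avoids caching the whole table when it is read.
  nodes <- spark_read_csv(sc, name = "nodes", path = in_path,
                          header = TRUE, memory = FALSE)
  nodes_distinct <- distinct(nodes)
  spark_write_csv(nodes_distinct, path = out_path, header = TRUE,
                  mode = "overwrite")
  # Only a single row count comes back to the R session.
  sdf_nrow(nodes_distinct)
}
```

Note that spark_write_csv writes a directory of part files rather than one CSV file, so out_path would name a directory.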


library(dplyr)
library(igraph)
library(sparklyr)
library(data.table)
library(tidyverse)

#Install Spark version 2.0.0
spark_install(version = "2.0.0")

#Disconnect any previous sparklyr sessions
spark_disconnect_all()

#Create an empty frame for sc config
conf <- spark_config()

#The configurations for sparklyr
conf$spark.driver.cores <- 15
conf$spark.executor.memory <- '240g'
conf$spark.driver.memory <- '250g'
#conf$`spark.yarn.executor.memoryOverhead` <- "10g"
#conf$spark.driver.extraJavaOptions="-Xmx9g"
conf[["sparklyr.shell.conf"]] <- "spark.driver.extraJavaOptions=-XX:MaxHeapSize=24G"

#Creates the sc connections
sc <- spark_connect(master = "local[15]",
                    version = "2.1.0",
                    config = conf)

#Reads in the nodes csv file
nodes <- spark_read_csv(sc, name = "nodes",
                        path = "/etl/louvain/R/cat/node_R.csv", header = TRUE)

#Turns the above into a table format
#This is where the error is produced
#Error: org.apache.spark.SparkException: Job 6 cancelled because
#SparkContext was shut down

nodes1 <- tbl_df(nodes)
nodes_tbl <- copy_to(sc, nodes)

#Counts the node as below
nodes %>% count

#Creates four subsets of the nodes file
nodes2 <- nodes1 %>% slice(1:88805991)
nodes3 <- nodes1 %>% slice(88805991:177611982)
nodes4 <- nodes1 %>% slice(177611982:266417973)
nodes5 <- nodes1 %>% slice(266417973:355223963)

#Collects each of the above data frames to use as an R data frame

nodes2 <- collect(nodes2)
nodes3 <- collect(nodes3)
nodes4 <- collect(nodes4)
nodes5 <- collect(nodes5)

#Combines the four subsets and drops rows with a duplicated id
Nodes <- rbind(nodes2, nodes3, nodes4, nodes5)
nodesNew <- Nodes[!duplicated(Nodes$id),]

write.csv(nodesNew, file = "/etl/louvain/R/cat/nodesNew.csv",
          row.names = FALSE)

edges <- spark_read_csv(sc, name = "edges",
                        path = "/etl/louvain/R/awk/newEdge.csv", header = TRUE)
edges1 <- tbl_df(edges)


#nodes <- distinct(nodes, "id")

el1 <- collect(edges)

el=as.matrix(el1)
el[,1]=as.character(el[,1])
el[,2]=as.character(el[,2])

install.packages("graphframes")
library(graphframes)
library(igraph)
clustergraph1 <- graph_from_data_frame(el, directed = FALSE,
vertices = n2)
#Assigns the louvain algorithm to the above graph
Community200k <- cluster_louvain(clustergraph1)
#Prints the values of each community
print(Community200k)
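
To make the final step easier to check in isolation, the graph construction and Louvain call can be tried on a tiny edge list first. This is a minimal sketch using only igraph; the toy edges and vertices data frames are made up for illustration and stand in for the collected edge and node tables.

```r
library(igraph)

# Toy edge list (made up for illustration): two triangles joined by one edge.
edges <- data.frame(
  from = c("a", "b", "c", "d", "e", "f", "c"),
  to   = c("b", "c", "a", "e", "f", "d", "d"),
  stringsAsFactors = FALSE
)
vertices <- data.frame(id = c("a", "b", "c", "d", "e", "f"),
                       stringsAsFactors = FALSE)

# Build an undirected graph; the first column of vertices supplies the names.
g <- graph_from_data_frame(edges, directed = FALSE, vertices = vertices)

# Run Louvain community detection and look up each vertex's community.
communities <- cluster_louvain(g)
node_community <- membership(communities)
print(node_community)
print(modularity(communities))
```

If this works on a small sample, the same calls can be pointed at a subset of the real data before attempting the full files.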





