Issue with HBase in spark streaming
I have issue with the performance when reading data from HBase in spark streaming. It is taking more than 5 mins just to read data from HBase for 3 records. Below is the logic that I used in mapPartitions.
val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
messages.mapPartitions(iter =>
val context = TaskContext.get
logger.info((s"Process for partition: $context.partitionId "))
val hbaseConf = HBaseConfiguration.create()
//hbaseConf.addResource(new File("/etc/hbase/conf/hbase-site.xml").toURI.toURL)
//val connection: Connection = hbaseConnection.getOrCreateConnection(hbaseConf)
val connection = ConnectionFactory.createConnection(hbaseConf)
val hbaseTable = connection.getTable(TableName.valueOf("prod:CustomerData"))
.......
)
I have used BulkGet. It is taking around 5 seconds to process 90K messages(may be because the API is using HBaseContext and we dont have create any HBaseConnection). But I cannot use this as the output of BulkGet is RDD, and I have to do leftouterjoin to join the RDD of BulkGet with the actual RDD from Kafka. I assume this is not correct approach as it involves the below. Moreover I have to process all the 90K messages in 1 second.
Fetch distinct Cusotmer Id from the RDD read from Kafka before passing it to BulkGet
Also, it involves shuffling as I have to do leftOuterJoin the main RDD (from Kafka) with the BulkGet RDD (I only see the option of join as the BulkGet output is an RDD)
Can anyone please help me what is the issue with performance when I try to create HBaseConnection in mapPartitions. I have also tried setting driver-class-path.
Thanks
apache-spark hbase streaming
add a comment |
I have issue with the performance when reading data from HBase in spark streaming. It is taking more than 5 mins just to read data from HBase for 3 records. Below is the logic that I used in mapPartitions.
val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
messages.mapPartitions(iter =>
val context = TaskContext.get
logger.info((s"Process for partition: $context.partitionId "))
val hbaseConf = HBaseConfiguration.create()
//hbaseConf.addResource(new File("/etc/hbase/conf/hbase-site.xml").toURI.toURL)
//val connection: Connection = hbaseConnection.getOrCreateConnection(hbaseConf)
val connection = ConnectionFactory.createConnection(hbaseConf)
val hbaseTable = connection.getTable(TableName.valueOf("prod:CustomerData"))
.......
)
I have used BulkGet. It is taking around 5 seconds to process 90K messages(may be because the API is using HBaseContext and we dont have create any HBaseConnection). But I cannot use this as the output of BulkGet is RDD, and I have to do leftouterjoin to join the RDD of BulkGet with the actual RDD from Kafka. I assume this is not correct approach as it involves the below. Moreover I have to process all the 90K messages in 1 second.
Fetch distinct Cusotmer Id from the RDD read from Kafka before passing it to BulkGet
Also, it involves shuffling as I have to do leftOuterJoin the main RDD (from Kafka) with the BulkGet RDD (I only see the option of join as the BulkGet output is an RDD)
Can anyone please help me what is the issue with performance when I try to create HBaseConnection in mapPartitions. I have also tried setting driver-class-path.
Thanks
apache-spark hbase streaming
add a comment |
I have issue with the performance when reading data from HBase in spark streaming. It is taking more than 5 mins just to read data from HBase for 3 records. Below is the logic that I used in mapPartitions.
val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
messages.mapPartitions(iter =>
val context = TaskContext.get
logger.info((s"Process for partition: $context.partitionId "))
val hbaseConf = HBaseConfiguration.create()
//hbaseConf.addResource(new File("/etc/hbase/conf/hbase-site.xml").toURI.toURL)
//val connection: Connection = hbaseConnection.getOrCreateConnection(hbaseConf)
val connection = ConnectionFactory.createConnection(hbaseConf)
val hbaseTable = connection.getTable(TableName.valueOf("prod:CustomerData"))
.......
)
I have used BulkGet. It is taking around 5 seconds to process 90K messages(may be because the API is using HBaseContext and we dont have create any HBaseConnection). But I cannot use this as the output of BulkGet is RDD, and I have to do leftouterjoin to join the RDD of BulkGet with the actual RDD from Kafka. I assume this is not correct approach as it involves the below. Moreover I have to process all the 90K messages in 1 second.
Fetch distinct Cusotmer Id from the RDD read from Kafka before passing it to BulkGet
Also, it involves shuffling as I have to do leftOuterJoin the main RDD (from Kafka) with the BulkGet RDD (I only see the option of join as the BulkGet output is an RDD)
Can anyone please help me what is the issue with performance when I try to create HBaseConnection in mapPartitions. I have also tried setting driver-class-path.
Thanks
apache-spark hbase streaming
I have issue with the performance when reading data from HBase in spark streaming. It is taking more than 5 mins just to read data from HBase for 3 records. Below is the logic that I used in mapPartitions.
val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
messages.mapPartitions(iter =>
val context = TaskContext.get
logger.info((s"Process for partition: $context.partitionId "))
val hbaseConf = HBaseConfiguration.create()
//hbaseConf.addResource(new File("/etc/hbase/conf/hbase-site.xml").toURI.toURL)
//val connection: Connection = hbaseConnection.getOrCreateConnection(hbaseConf)
val connection = ConnectionFactory.createConnection(hbaseConf)
val hbaseTable = connection.getTable(TableName.valueOf("prod:CustomerData"))
.......
)
I have used BulkGet. It is taking around 5 seconds to process 90K messages(may be because the API is using HBaseContext and we dont have create any HBaseConnection). But I cannot use this as the output of BulkGet is RDD, and I have to do leftouterjoin to join the RDD of BulkGet with the actual RDD from Kafka. I assume this is not correct approach as it involves the below. Moreover I have to process all the 90K messages in 1 second.
Fetch distinct Cusotmer Id from the RDD read from Kafka before passing it to BulkGet
Also, it involves shuffling as I have to do leftOuterJoin the main RDD (from Kafka) with the BulkGet RDD (I only see the option of join as the BulkGet output is an RDD)
Can anyone please help me what is the issue with performance when I try to create HBaseConnection in mapPartitions. I have also tried setting driver-class-path.
Thanks
apache-spark hbase streaming
apache-spark hbase streaming
edited Nov 12 '18 at 3:16
Indira
asked Nov 11 '18 at 21:23
IndiraIndira
415
415
add a comment |
add a comment |
0
active
oldest
votes
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53253365%2fissue-with-hbase-in-spark-streaming%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53253365%2fissue-with-hbase-in-spark-streaming%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown