Performance issue with HBase in Spark Streaming

I have a performance problem when reading data from HBase in Spark Streaming: it takes more than 5 minutes just to read 3 records from HBase. Below is the logic I use inside mapPartitions.

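import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.KafkaUtils

val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topicSet)
messages.mapPartitions { iter =>

 val context = TaskContext.get
 // Braces are needed to interpolate a member access such as partitionId
 logger.info(s"Process for partition: ${context.partitionId}")

 val hbaseConf = HBaseConfiguration.create()
 //hbaseConf.addResource(new File("/etc/hbase/conf/hbase-site.xml").toURI.toURL) 
 //val connection: Connection = hbaseConnection.getOrCreateConnection(hbaseConf)
 // A new connection is created here for every partition of every batch
 val connection = ConnectionFactory.createConnection(hbaseConf)
 val hbaseTable = connection.getTable(TableName.valueOf("prod:CustomerData"))

 .......
 }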
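
For illustration, a get-or-create helper in the spirit of the commented-out `hbaseConnection.getOrCreateConnection` line might look roughly like the sketch below. It is only a sketch under assumptions: `LazyHolder` and `LazyHolderDemo` are hypothetical names, and a plain `String` stands in for the HBase `Connection` so that the snippet runs without a cluster. In the real job the factory would presumably be `() => ConnectionFactory.createConnection(hbaseConf)`.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: runs the factory at most once and reuses the result afterwards.
final class LazyHolder[T](factory: () => T) {
  @volatile private var instance: Option[T] = None

  def getOrCreate(): T = instance match {
    case Some(existing) => existing
    case None =>
      synchronized {
        // Re-check inside the lock so that concurrent callers create only one instance.
        instance match {
          case Some(existing) => existing
          case None =>
            val created = factory()
            instance = Some(created)
            created
        }
      }
  }
}

// Hypothetical demo: a String stands in for the expensive HBase Connection.
object LazyHolderDemo {
  // Counts how often the stand-in factory actually runs.
  val created = new AtomicInteger(0)

  val holder = new LazyHolder[String](() => {
    created.incrementAndGet()
    "stand-in for an HBase Connection"
  })
}
```

If such a holder lives in a Scala `object`, it is initialized once per executor JVM, so calling `getOrCreate()` inside mapPartitions would reuse one connection instead of opening a new one for each partition of each batch. Whether that accounts for the delay described here is not something this sketch establishes.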

I have also tried BulkGet. It takes around 5 seconds to process 90K messages (maybe because the API uses HBaseContext, so we don't have to create any HBase connection ourselves). However, I cannot use it as is: the output of BulkGet is an RDD, so I have to do a leftOuterJoin between the BulkGet RDD and the actual RDD from Kafka. I assume this is not the correct approach, because it involves the steps below. Moreover, I have to process all 90K messages within 1 second.

Fetching the distinct customer IDs from the RDD read from Kafka before passing them to BulkGet.

A shuffle, because I have to leftOuterJoin the main RDD (from Kafka) with the BulkGet RDD (a join is the only option I see, since the BulkGet output is an RDD).
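
The two steps above can be sketched as follows. This is only an illustration under assumptions: plain Scala collections stand in for the RDDs, `JoinSketch` and its methods are hypothetical names, and `bulkGet` here is a simple map lookup rather than the real BulkGet API. The result shape mirrors what Spark's `leftOuterJoin` returns, `(key, (left, Option[right]))`.

```scala
// Hypothetical sketch: in-memory collections stand in for the Kafka RDD and the BulkGet RDD.
object JoinSketch {
  // A Kafka-side record as (customerId, payload).
  type Message = (String, String)

  // Step 1: collect the distinct customer ids that need to be looked up.
  def distinctIds(messages: Seq[Message]): Seq[String] =
    messages.map(_._1).distinct

  // Stand-in for BulkGet: returns data only for the ids that exist in the store.
  def bulkGet(ids: Seq[String], store: Map[String, String]): Map[String, String] =
    ids.flatMap(id => store.get(id).map(data => id -> data)).toMap

  // Step 2: left outer join. Every message is kept; customer data is None when missing.
  def leftOuterJoin(
      messages: Seq[Message],
      fetched: Map[String, String]
  ): Seq[(String, (String, Option[String]))] =
    messages.map { case (id, payload) => (id, (payload, fetched.get(id))) }
}
```

With real RDDs, the distinct and join steps each repartition data by key, which is the shuffle cost mentioned above.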

Can anyone please help me understand why performance is so poor when I create the HBase connection inside mapPartitions? I have also tried setting driver-class-path.

Thanks

edited Nov 12 '18 at 3:16

asked Nov 11 '18 at 21:23

Indira

415

apache-spark hbase streaming

