Loading XML file to Hive tables
up vote
0
down vote
favorite
We are working on loading a semi-structured XML file in Hive tables. These are retail purchase data.
I have attached a sample XML file to understand how the data looks and also the Hive table definition which i am using to read that XML file.
The file has multiple Payment and Amount(Amt) sections under "Tender" for each Basket in the XML. For example a customer can pay by Cash,EFTPOS,Credit card,Loyalty card or sometimes use a combination of these.
While reading the data from the attached XML , the results show the Payment column merged with entries in the XML and for 'Amt' it shows NULL.
I understood that the table which I am using to read the data is NOT 100% correct.
Please let me know how i can create multiple records for same element names "Payment" and "Amt"using the xpath functions
I tried to find some documentation in internet for the same and couldn't find anything which is similar to my scenario.
Sample data:
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1747</BsktNo>
<DateTime>2017-10-31T10:51:25.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2679911927</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1748</BsktNo>
<DateTime>2017-10-31T10:53:11.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2619183833</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>28.10</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>8263734549</CardNo>
<Program>SmartRewards</Program>
<Earn>28.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
Hive table:
CREATE EXTERNAL TABLE BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` string,
`Amt` float,
`CardNo` string,
`Program` string,
`Earn` float,
`Burn` float
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/Paytype/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<hdfs file location>'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);
Hive query output:
select * from BASKET_TENDER
xml parsing hadoop xpath hive
add a comment |
up vote
0
down vote
favorite
We are working on loading a semi-structured XML file in Hive tables. These are retail purchase data.
I have attached a sample XML file to understand how the data looks and also the Hive table definition which i am using to read that XML file.
The file has multiple Payment and Amount(Amt) sections under "Tender" for each Basket in the XML. For example a customer can pay by Cash,EFTPOS,Credit card,Loyalty card or sometimes use a combination of these.
While reading the data from the attached XML , the results show the Payment column merged with entries in the XML and for 'Amt' it shows NULL.
I understood that the table which I am using to read the data is NOT 100% correct.
Please let me know how i can create multiple records for same element names "Payment" and "Amt"using the xpath functions
I tried to find some documentation in internet for the same and couldn't find anything which is similar to my scenario.
Sample data:
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1747</BsktNo>
<DateTime>2017-10-31T10:51:25.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2679911927</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1748</BsktNo>
<DateTime>2017-10-31T10:53:11.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2619183833</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>28.10</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>8263734549</CardNo>
<Program>SmartRewards</Program>
<Earn>28.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
Hive table:
CREATE EXTERNAL TABLE BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` string,
`Amt` float,
`CardNo` string,
`Program` string,
`Earn` float,
`Burn` float
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/Paytype/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<hdfs file location>'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);
Hive query output:
select * from BASKET_TENDER
xml parsing hadoop xpath hive
Typo on/Paytype/text()
...
– cricket_007
Nov 9 at 1:00
So, you have two possible/Bskt/Tender/
paths for every<Bskt>
. How are you letting Hive know which one it needs to parse?
– cricket_007
Nov 9 at 1:03
Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
– Nandakumar_bigdata
Nov 9 at 4:09
I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
– Nandakumar_bigdata
Nov 9 at 4:13
add a comment |
up vote
0
down vote
favorite
up vote
0
down vote
favorite
We are working on loading a semi-structured XML file in Hive tables. These are retail purchase data.
I have attached a sample XML file to understand how the data looks and also the Hive table definition which i am using to read that XML file.
The file has multiple Payment and Amount(Amt) sections under "Tender" for each Basket in the XML. For example a customer can pay by Cash,EFTPOS,Credit card,Loyalty card or sometimes use a combination of these.
While reading the data from the attached XML , the results show the Payment column merged with entries in the XML and for 'Amt' it shows NULL.
I understood that the table which I am using to read the data is NOT 100% correct.
Please let me know how i can create multiple records for same element names "Payment" and "Amt"using the xpath functions
I tried to find some documentation in internet for the same and couldn't find anything which is similar to my scenario.
Sample data:
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1747</BsktNo>
<DateTime>2017-10-31T10:51:25.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2679911927</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1748</BsktNo>
<DateTime>2017-10-31T10:53:11.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2619183833</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>28.10</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>8263734549</CardNo>
<Program>SmartRewards</Program>
<Earn>28.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
Hive table:
CREATE EXTERNAL TABLE BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` string,
`Amt` float,
`CardNo` string,
`Program` string,
`Earn` float,
`Burn` float
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/Paytype/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<hdfs file location>'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);
Hive query output:
select * from BASKET_TENDER
xml parsing hadoop xpath hive
We are working on loading a semi-structured XML file in Hive tables. These are retail purchase data.
I have attached a sample XML file to understand how the data looks and also the Hive table definition which i am using to read that XML file.
The file has multiple Payment and Amount(Amt) sections under "Tender" for each Basket in the XML. For example a customer can pay by Cash,EFTPOS,Credit card,Loyalty card or sometimes use a combination of these.
While reading the data from the attached XML , the results show the Payment column merged with entries in the XML and for 'Amt' it shows NULL.
I understood that the table which I am using to read the data is NOT 100% correct.
Please let me know how i can create multiple records for same element names "Payment" and "Amt"using the xpath functions
I tried to find some documentation in internet for the same and couldn't find anything which is similar to my scenario.
Sample data:
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1747</BsktNo>
<DateTime>2017-10-31T10:51:25.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2679911927</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1748</BsktNo>
<DateTime>2017-10-31T10:53:11.000+11:00</DateTime>
<OpID>10115</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>46.75</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>2619183833</CardNo>
<Program>SmartRewards</Program>
<Earn>46.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Tender>
<PayType>CSH</PayType>
<Amt>28.10</Amt>
</Tender>
<Tender>
<PayType>ITMLOY</PayType>
<Amt>0</Amt>
<CardNo>8263734549</CardNo>
<Program>SmartRewards</Program>
<Earn>28.00</Earn>
<Burn>0.00</Burn>
</Tender>
</Bskt>
Hive table:
CREATE EXTERNAL TABLE BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` string,
`Amt` float,
`CardNo` string,
`Program` string,
`Earn` float,
`Burn` float
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/Paytype/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '<hdfs file location>'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);
Hive query output:
select * from BASKET_TENDER
xml parsing hadoop xpath hive
xml parsing hadoop xpath hive
edited Nov 9 at 0:59
cricket_007
77.2k1142106
77.2k1142106
asked Nov 8 at 23:31
Nandakumar_bigdata
53
53
Typo on/Paytype/text()
...
– cricket_007
Nov 9 at 1:00
So, you have two possible/Bskt/Tender/
paths for every<Bskt>
. How are you letting Hive know which one it needs to parse?
– cricket_007
Nov 9 at 1:03
Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
– Nandakumar_bigdata
Nov 9 at 4:09
I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
– Nandakumar_bigdata
Nov 9 at 4:13
add a comment |
Typo on/Paytype/text()
...
– cricket_007
Nov 9 at 1:00
So, you have two possible/Bskt/Tender/
paths for every<Bskt>
. How are you letting Hive know which one it needs to parse?
– cricket_007
Nov 9 at 1:03
Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
– Nandakumar_bigdata
Nov 9 at 4:09
I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
– Nandakumar_bigdata
Nov 9 at 4:13
Typo on
/Paytype/text()
...– cricket_007
Nov 9 at 1:00
Typo on
/Paytype/text()
...– cricket_007
Nov 9 at 1:00
So, you have two possible
/Bskt/Tender/
paths for every <Bskt>
. How are you letting Hive know which one it needs to parse?– cricket_007
Nov 9 at 1:03
So, you have two possible
/Bskt/Tender/
paths for every <Bskt>
. How are you letting Hive know which one it needs to parse?– cricket_007
Nov 9 at 1:03
Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
– Nandakumar_bigdata
Nov 9 at 4:09
Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
– Nandakumar_bigdata
Nov 9 at 4:09
I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
– Nandakumar_bigdata
Nov 9 at 4:13
I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
– Nandakumar_bigdata
Nov 9 at 4:13
add a comment |
1 Answer
1
active
oldest
votes
up vote
1
down vote
accepted
GET only one Field
You can use subscript operator () to parse xml. However, please make sure index starts with 1 and not 0.
In your case, I assume you want second instance of tender. Then just use following xml path.
"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",
This will give you following values.
Get Both the fields
if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below
drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/tmp/BASKET_TENDER'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);
select * from temp.BASKET_TENDER;
The output would be as mentioned blow
Would it be possible to get XPATH to fill a Hive array field?
– cricket_007
Nov 9 at 16:09
yes, it's possible in that case you need to have fields as array
– Gaurang Shah
Nov 9 at 17:42
Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
– Nandakumar_bigdata
Nov 11 at 22:18
Typo.. Both the options provided works fine :)
– Nandakumar_bigdata
Nov 11 at 23:19
HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
– Nandakumar_bigdata
Nov 20 at 3:05
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
GET only one Field
You can use subscript operator () to parse xml. However, please make sure index starts with 1 and not 0.
In your case, I assume you want second instance of tender. Then just use following xml path.
"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",
This will give you following values.
Get Both the fields
if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below
drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/tmp/BASKET_TENDER'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);
select * from temp.BASKET_TENDER;
The output would be as mentioned blow
Would it be possible to get XPATH to fill a Hive array field?
– cricket_007
Nov 9 at 16:09
yes, it's possible in that case you need to have fields as array
– Gaurang Shah
Nov 9 at 17:42
Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
– Nandakumar_bigdata
Nov 11 at 22:18
Typo.. Both the options provided works fine :)
– Nandakumar_bigdata
Nov 11 at 23:19
HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
– Nandakumar_bigdata
Nov 20 at 3:05
add a comment |
up vote
1
down vote
accepted
GET only one Field
You can use subscript operator () to parse xml. However, please make sure index starts with 1 and not 0.
In your case, I assume you want second instance of tender. Then just use following xml path.
"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",
This will give you following values.
Get Both the fields
if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below
drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/tmp/BASKET_TENDER'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);
select * from temp.BASKET_TENDER;
The output would be as mentioned blow
Would it be possible to get XPATH to fill a Hive array field?
– cricket_007
Nov 9 at 16:09
yes, it's possible in that case you need to have fields as array
– Gaurang Shah
Nov 9 at 17:42
Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
– Nandakumar_bigdata
Nov 11 at 22:18
Typo.. Both the options provided works fine :)
– Nandakumar_bigdata
Nov 11 at 23:19
HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
– Nandakumar_bigdata
Nov 20 at 3:05
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
GET only one Field
You can use subscript operator () to parse xml. However, please make sure index starts with 1 and not 0.
In your case, I assume you want second instance of tender. Then just use following xml path.
"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",
This will give you following values.
Get Both the fields
if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below
drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/tmp/BASKET_TENDER'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);
select * from temp.BASKET_TENDER;
The output would be as mentioned blow
GET only one Field
You can use subscript operator () to parse xml. However, please make sure index starts with 1 and not 0.
In your case, I assume you want second instance of tender. Then just use following xml path.
"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",
This will give you following values.
Get Both the fields
if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below
drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/tmp/BASKET_TENDER'
TBLPROPERTIES (
"xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);
select * from temp.BASKET_TENDER;
The output would be as mentioned blow
edited Nov 9 at 17:46
answered Nov 9 at 14:30
Gaurang Shah
2,67311030
2,67311030
Would it be possible to get XPATH to fill a Hive array field?
– cricket_007
Nov 9 at 16:09
yes, it's possible in that case you need to have fields as array
– Gaurang Shah
Nov 9 at 17:42
Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
– Nandakumar_bigdata
Nov 11 at 22:18
Typo.. Both the options provided works fine :)
– Nandakumar_bigdata
Nov 11 at 23:19
HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
– Nandakumar_bigdata
Nov 20 at 3:05
add a comment |
Would it be possible to get XPATH to fill a Hive array field?
– cricket_007
Nov 9 at 16:09
yes, it's possible in that case you need to have fields as array
– Gaurang Shah
Nov 9 at 17:42
Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
– Nandakumar_bigdata
Nov 11 at 22:18
Typo.. Both the options provided works fine :)
– Nandakumar_bigdata
Nov 11 at 23:19
HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
– Nandakumar_bigdata
Nov 20 at 3:05
Would it be possible to get XPATH to fill a Hive array field?
– cricket_007
Nov 9 at 16:09
Would it be possible to get XPATH to fill a Hive array field?
– cricket_007
Nov 9 at 16:09
yes, it's possible in that case you need to have fields as array
– Gaurang Shah
Nov 9 at 17:42
yes, it's possible in that case you need to have fields as array
– Gaurang Shah
Nov 9 at 17:42
Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
– Nandakumar_bigdata
Nov 11 at 22:18
Thank you so much Gaurang.. It solved my problem. But the options provided above works fine :)
– Nandakumar_bigdata
Nov 11 at 22:18
Typo.. Both the options provided works fine :)
– Nandakumar_bigdata
Nov 11 at 23:19
Typo.. Both the options provided works fine :)
– Nandakumar_bigdata
Nov 11 at 23:19
HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
– Nandakumar_bigdata
Nov 20 at 3:05
HI Gaurang.. Could you please help me with a solution for my problem here:stackoverflow.com/questions/53384728/…
– Nandakumar_bigdata
Nov 20 at 3:05
add a comment |
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53217712%2floading-xml-file-to-hive-tables%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Typo on
/Paytype/text()
...– cricket_007
Nov 9 at 1:00
So, you have two possible
/Bskt/Tender/
paths for every<Bskt>
. How are you letting Hive know which one it needs to parse?– cricket_007
Nov 9 at 1:03
Typo Corrected. Even my confusion is same.How can i let Hive know which one it needs to parse in my case where each 'Bskt' has two 'Tenders' with same names for elements "Payment" and "Amt".
– Nandakumar_bigdata
Nov 9 at 4:09
I need to achieve either of these: 1) Create another row for 2nd tender with same Basket details. OR 2) Create another column called "Payment_Loyalty" and "Amt_Loyalty" and load the second tender details there. I'm wondering how i can achieve either of these solutions.
– Nandakumar_bigdata
Nov 9 at 4:13