Parse Json Object with an Array and Map to Multiple Pairs with Apache Spark in Java
I've googled it all day long and couldn't find a straight answer, so I ended up posting a question here.
I have a file containing line-delimited json objects:
"device_id": "103b", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]
"device_id": "103b", "timestamp": 1435677490, "rooms": ["Office", "Lab"]
"device_id": "103b", "timestamp": 1436673850, "rooms": ["Office", "Foyer"]
My goal is to parse this file with Apache Spark in Java. I referenced How to Parsing CSV or JSON File with Apache Spark, and so far I've been able to parse each line of JSON into a JavaRDD&lt;JsonObject&gt; using Gson:
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> data = sc.textFile("fileName");
JavaRDD<JsonObject> records = data.map(new Function<String, JsonObject>() {
    public JsonObject call(String line) throws Exception {
        Gson gson = new Gson();
        JsonObject json = gson.fromJson(line, JsonObject.class);
        return json;
    }
});
Where I'm really stuck is deserializing the "rooms" array so that each element fits into my class Event.
public class Event implements Serializable {
    public static final long serialVersionUID = 42L;
    private String deviceId;
    private int timestamp;
    private String room;
    // constructor, getters and setters
}
In other words, from this line:
"device_id": "103b", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]
I want to create two Event objects in Spark:
obj1: deviceId = "103b", timestamp = 1436941050, room = "Office"
obj2: deviceId = "103b", timestamp = 1436941050, room = "Foyer"
I did a little searching and tried flatMapValue, but no luck... It threw me an error...
JavaRDD<Event> events = records.flatMapValue(new Function<JsonObject, Iterable<Event>>() {
    public Iterable<Event> call(JsonObject json) throws Exception {
        JsonArray rooms = json.get("rooms").getAsJsonArray();
        List<Event> data = new LinkedList<Event>();
        for (JsonElement room : rooms) {
            data.add(new Event(json.get("device_id").getAsString(), json.get("timestamp").getAsInt(), room.toString()));
        }
        return data;
    }
});
I'm very new to Spark and Map/Reduce. I would be grateful if you can help me out. Thanks in advance!
2 Answers
If you load the json data into a DataFrame:
DataFrame df = sqlContext.read().json("/path/to/json");
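Here sqlContext is the Spark 1.x SQL entry point. A minimal setup sketch, assuming the Spark 1.x API implied by the question's JavaSparkContext (the app name is a placeholder):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

SparkConf conf = new SparkConf().setAppName("events"); // placeholder app name
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);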
You can then easily do this with explode:
df.select(
    df.col("device_id"),
    df.col("timestamp"),
    // explode() produces one output row per element of the rooms array
    org.apache.spark.sql.functions.explode(df.col("rooms")).as("room")
);
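(Calling .show() on the resulting DataFrame prints the table shown below.)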
For input:
"device_id": "1", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]
"device_id": "2", "timestamp": 1435677490, "rooms": ["Office", "Lab"]
"device_id": "3", "timestamp": 1436673850, "rooms": ["Office", "Foyer"]
You will get:
+---------+------+----------+
|device_id| room| timestamp|
+---------+------+----------+
| 1|Office|1436941050|
| 1| Foyer|1436941050|
| 2|Office|1435677490|
| 2| Lab|1435677490|
| 3|Office|1436673850|
| 3| Foyer|1436673850|
+---------+------+----------+
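If you still want Event objects rather than rows, the exploded DataFrame can be mapped back to the question's class. A hedged sketch, assuming Spark 1.x, where the JSON reader infers whole numbers as LongType (hence the cast to int):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Row;

JavaRDD<Event> events = df.select(
        df.col("device_id"),
        df.col("timestamp"),
        org.apache.spark.sql.functions.explode(df.col("rooms")).as("room"))
    .javaRDD() // DataFrame -> JavaRDD<Row>
    .map(new Function<Row, Event>() {
        public Event call(Row row) throws Exception {
            // column order follows the select: 0 = device_id, 1 = timestamp, 2 = room
            return new Event(row.getString(0), (int) row.getLong(1), row.getString(2));
        }
    });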
Thanks for letting me know about this useful feature. I didn't know Spark supported Hive-like UDFs. It's very helpful!
– gyoho
Jul 13 '16 at 16:57
Spark is fully compatible with Hive (*≧▽≦)
– Yuan JI
Jul 13 '16 at 17:00
val formatrecord = records.map(fromJson[mapClass](_))
mapClass should be a case class whose fields match the JSON objects inside records.
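For completeness, the RDD-only route the question was attempting also works in Java. The error likely comes from flatMapValue, which doesn't exist on JavaRDD (JavaPairRDD has flatMapValues). A minimal sketch using plain flatMap instead, assuming the Spark 1.x Java API (in Spark 2.x, FlatMapFunction.call returns an Iterator rather than an Iterable):

import java.util.LinkedList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;

JavaRDD<Event> events = records.flatMap(new FlatMapFunction<JsonObject, Event>() {
    public Iterable<Event> call(JsonObject json) throws Exception {
        List<Event> out = new LinkedList<Event>();
        for (JsonElement room : json.get("rooms").getAsJsonArray()) {
            // getAsString() avoids the surrounding quotes that toString() would keep
            out.add(new Event(json.get("device_id").getAsString(),
                              json.get("timestamp").getAsInt(),
                              room.getAsString()));
        }
        return out;
    }
});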
Please post your error. Edit your post and add the stack trace.
– Vladimir Vagaytsev
Jul 13 '16 at 7:53