Tuesday, November 1, 2016

Spark - Broadcast Joins

Continuing from the previous post and using the same stations and trips example, first collect the stations into a map keyed by station id and broadcast that map:

scala> val bcStations = sc.broadcast(station.keyBy(_.id).collectAsMap)
bcStations: org.apache.spark.broadcast.Broadcast[scala.collection.Map[Int,Station]] = Broadcast(19)

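Broadcasting the RDD itself, without collectAsMap, is also accepted here, but the broadcast value is then an RDD reference rather than a local map, so it cannot serve as a lookup table inside tasks:

scala> val bcStations = sc.broadcast(station.keyBy(_.id))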
 WARN MemoryStore: Not enough space to store block broadcast_20 in memory! Free memory is 252747804 bytes.
 WARN MemoryStore: Persisting block broadcast_20 to disk instead.
bcStations: org.apache.spark.broadcast.Broadcast[org.apache.spark.rdd.RDD[(Int, Station)]] = Broadcast(20)

scala> bcStations.value.take(3).foreach(println)
(83,Station(83,Mezes Park,37.491269,-122.236234,15,Redwood City,2/20/2014))
(77,Station(77,Market at Sansome,37.789625,-122.400811,27,San Francisco,8/25/2013))
(23,Station(23,San Mateo County Center,37.488501,-122.231061,15,Redwood City,8/15/2013))
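
The broadcast map is only half of a broadcast join; the other half is looking each trip up in that map on the executors instead of shuffling both datasets. Here is a minimal sketch of that step, not the code from the previous post. The Station and Trip case classes below are hypothetical, trimmed-down versions with made-up trip values, and the names BroadcastJoinSketch and joinedTrips are mine.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical, trimmed-down record types; the Station class used above has
// more fields, and the Trip fields here are assumed for illustration only.
case class Station(id: Int, name: String, city: String)
case class Trip(id: Int, startStationId: Int, durationSec: Int)

object BroadcastJoinSketch {
  // Inner-joins trips to stations on the map side via a broadcast lookup table.
  def joinedTrips(sc: SparkContext): Array[(Int, String, Int)] = {
    val station = sc.parallelize(Seq(
      Station(83, "Mezes Park", "Redwood City"),
      Station(77, "Market at Sansome", "San Francisco")))
    val trips = sc.parallelize(Seq(
      Trip(1, 77, 600), Trip(2, 83, 300), Trip(3, 99, 120)))

    // Collect the small side to the driver as a map, then ship it once per executor.
    val bcStations = sc.broadcast(station.keyBy(_.id).collectAsMap)

    // Look each trip up locally; trips with no matching station are dropped.
    trips.flatMap { t =>
      bcStations.value.get(t.startStationId).map(s => (t.id, s.name, t.durationSec))
    }.collect().sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-join-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try joinedTrips(sc).foreach(println)
    finally sc.stop()
  }
}
```

Trip 3 points at station 99, which is not in the map, so it is dropped, just as it would be in an inner join. This approach only makes sense when the station side is small enough to fit comfortably in memory on the driver and on every executor.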
