spark-broadcast

@(spark)[broadcast]
Spark’s broadcast variables are used to distribute immutable, read-only datasets to all nodes.

Broadcast

/**                                                                                                                                                                     
 * A broadcast variable. Broadcast variables allow the programmer to keep a read-only variable                                                                          
 * cached on each machine rather than shipping a copy of it with tasks. They can be used, for                                                                           
 * example, to give every node a copy of a large input dataset in an efficient manner. Spark also                                                                       
 * attempts to distribute broadcast variables using efficient broadcast algorithms to reduce                                                                            
 * communication cost.                                                                                                                                                  
 *                                                                                                                                                                      
 * Broadcast variables are created from a variable `v` by calling                                                                                                       
 * [[org.apache.spark.SparkContext#broadcast]].                                                                                                                         
 * The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the                                                                         
 * `value` method. The interpreter session below shows this:                                                                                                            
 *                                                                                                                                                                      
 * {{{                                                                                                                                                                  
 * scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))                                                                                                               
 * broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)                                                                                        
 *                                                                                                                                                                      
 * scala> broadcastVar.value                                                                                                                                            
 * res0: Array[Int] = Array(1, 2, 3)                                                                                                                                    
 * }}}                                                                                                                                                                  
 *                                                                                                                                                                      
 * After the broadcast variable is created, it should be used instead of the value `v` in any                                                                           
 * functions run on the cluster so that `v` is not shipped to the nodes more than once.                                                                                 
 * In addition, the object `v` should not be modified after it is broadcast in order to ensure                                                                          
 * that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped                                                                         
 * to a new node later).                                                                                                                                                
 *                                                                                                                                                                      
 * @param id A unique identifier for the broadcast variable.                                                                                                            
 * @tparam T Type of the data contained in the broadcast variable.                                                                                                      
 */                                                                                                                                                                     
abstract class Broadcast[T: ClassTag](val id: Long) extends Serializable with Logging {    



/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * An interface for all the broadcast implementations in Spark (to allow                                                                                                
 * multiple broadcast implementations). SparkContext uses a user-specified                                                                                              
 * BroadcastFactory implementation to instantiate a particular broadcast for the                                                                                        
 * entire Spark job.                                                                                                                                                    
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
trait BroadcastFactory {   

There are currently two implementations, HttpBroadcast and TorrentBroadcast; the latter is the default.

HttpBroadcast

/**                                                                                                                                                                     
 * A [[org.apache.spark.broadcast.Broadcast]] implementation that uses HTTP server                                                                                      
 * as a broadcast mechanism. The first time a HTTP broadcast variable (sent as part of a                                                                                
 * task) is deserialized in the executor, the broadcasted data is fetched from the driver                                                                               
 * (through a HTTP server running at the driver) and stored in the BlockManager of the                                                                                  
 * executor to speed up future accesses.                                                                                                                                
 */                                                                                                                                                                     
private[spark] class HttpBroadcast[T: ClassTag](     
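
To make the "fetch on first use, then cache locally" behaviour described in the comment above concrete, here is a minimal plain-Scala sketch. It does not use Spark at all: `DriverServer`, `Executor`, and the `requests` counter are hypothetical stand-ins for the driver's HTTP server and the executor's BlockManager, invented only for illustration.

```scala
import scala.collection.mutable

object HttpStyleSketch {
  // Hypothetical stand-in for the HTTP server on the driver:
  // broadcast id -> serialized bytes.
  final class DriverServer {
    private val files = mutable.Map.empty[Long, Array[Byte]]
    var requests = 0 // how many times the driver had to send the data

    def publish(id: Long, bytes: Array[Byte]): Unit = { files(id) = bytes }

    def get(id: Long): Array[Byte] = {
      requests += 1
      files(id)
    }
  }

  // Hypothetical executor; `cache` plays the role of its BlockManager.
  final class Executor(driver: DriverServer) {
    private val cache = mutable.Map.empty[Long, Array[Byte]]

    // The first access goes to the driver; later accesses hit the local cache.
    def value(id: Long): Array[Byte] =
      cache.getOrElseUpdate(id, driver.get(id))
  }

  def main(args: Array[String]): Unit = {
    val driver = new DriverServer
    driver.publish(0L, Array[Byte](1, 2, 3))

    val executors = Seq.fill(3)(new Executor(driver))
    // Each executor reads the value twice, as two tasks on the same node would.
    executors.foreach { e => e.value(0L); e.value(0L) }

    // One request per executor, no matter how many tasks read the value.
    println(s"driver requests: ${driver.requests}") // prints "driver requests: 3"
  }
}
```

In this sketch every executor still pulls its copy from the driver, so the driver's outgoing traffic grows linearly with the number of executors.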

TorrentBroadcast

/**                                                                                                                                                                     
 * A BitTorrent-like implementation of [[org.apache.spark.broadcast.Broadcast]].                                                                                        
 *                                                                                                                                                                      
 * The mechanism is as follows:                                                                                                                                         
 *                                                                                                                                                                      
 * The driver divides the serialized object into small chunks and                                                                                                       
 * stores those chunks in the BlockManager of the driver.                                                                                                               
 *                                                                                                                                                                      
 * On each executor, the executor first attempts to fetch the object from its BlockManager. If                                                                          
 * it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or                                                                      
 * other executors if available. Once it gets the chunks, it puts the chunks in its own                                                                                 
 * BlockManager, ready for other executors to fetch from.                                                                                                               
 *                                                                                                                                                                      
 * This prevents the driver from being the bottleneck in sending out multiple copies of the                                                                             
 * broadcast data (one per executor) as done by the [[org.apache.spark.broadcast.HttpBroadcast]].                                                                       
 *                                                                                                                                                                      
 * When initialized, TorrentBroadcast objects read SparkEnv.get.conf.                                                                                                   
 *                                                                                                                                                                      
 * @param obj object to broadcast                                                                                                                                       
 * @param id A unique identifier for the broadcast variable.                                                                                                            
 */                                                                                                                                                                     
private[spark] class TorrentBroadcast[T: ClassTag](obj: T, id: Long)                                                                                                    
  extends Broadcast[T](id) with Logging with Serializable {

Picking a random remote node to fetch from is handled by the BlockManager.
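
The mechanism in the TorrentBroadcast comment (chunk on the driver, fetch missing chunks from any holder, keep them for others) can be sketched in plain Scala. This is only a simulation under stated assumptions, not Spark's code: `Node`, `blockify`, `fetch`, and the `served` counter are hypothetical names, and the random choice of source here merely imitates what the BlockManager does in the real implementation.

```scala
import scala.collection.mutable
import scala.util.Random

object TorrentSketch {
  // Hypothetical stand-in for one node's BlockManager: chunk index -> bytes.
  final class Node(val name: String) {
    val blocks = mutable.Map.empty[Int, Array[Byte]]
    var served = 0 // number of chunks this node handed to other nodes
  }

  // Driver side: cut the serialized payload into fixed-size chunks.
  def blockify(payload: Array[Byte], chunkSize: Int): Array[Array[Byte]] =
    payload.grouped(chunkSize).toArray

  // Executor side: for every chunk not held locally, pick a random node
  // that already holds it, copy the chunk, and keep it for later fetchers.
  def fetch(self: Node, numChunks: Int, cluster: Seq[Node], rnd: Random): Array[Byte] = {
    for (i <- 0 until numChunks if !self.blocks.contains(i)) {
      val holders = cluster.filter(n => (n ne self) && n.blocks.contains(i))
      val source = holders(rnd.nextInt(holders.size))
      self.blocks(i) = source.blocks(i)
      source.served += 1
    }
    // Reassemble the original payload from the local chunks, in order.
    Array.concat((0 until numChunks).map(i => self.blocks(i)): _*)
  }

  def main(args: Array[String]): Unit = {
    val rnd = new Random(42)
    val payload = Array.tabulate[Byte](10000)(i => (i % 127).toByte)

    val driver = new Node("driver")
    val chunks = blockify(payload, 1024)
    chunks.zipWithIndex.foreach { case (c, i) => driver.blocks(i) = c }

    val executors = (1 to 4).map(i => new Node(s"executor-$i"))
    val cluster = driver +: executors

    executors.foreach { e =>
      val rebuilt = fetch(e, chunks.length, cluster, rnd)
      assert(rebuilt.sameElements(payload))
    }

    val total = chunks.length * executors.size
    println(s"chunks served by the driver: ${driver.served} of $total")
  }
}
```

Only the first executor is forced to take every chunk from the driver; later executors can be served by their peers, which is why the driver stops being the single source of all copies.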
