1. Allocate more resources

bin/spark-submit \
--class cn.spark.sparktest.core.WordCountCluster \
--driver-memory 100m \
--num-executors 3 \
--executor-memory 100m \
--executor-cores 3 \
/usr/local/SparkTest-0.0.1-SNAPSHOT-jar-with-dependencies.jar

(A trailing comment after a `\` breaks the line continuation in the shell, so the explanations are listed separately.)
--driver-memory: memory for the driver (usually has little impact)
--num-executors: number of executors
--executor-memory: memory per executor
--executor-cores: number of CPU cores per executor
2. Set the parallelism of the Spark application

SparkConf conf = new SparkConf().set("spark.default.parallelism", "500");
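Rather than a fixed number, Spark's tuning guide recommends roughly 2-3 tasks per CPU core in the cluster. A minimal sketch, reusing the 3 executors x 3 cores from the spark-submit example above (the variable names are illustrative):

```java
import org.apache.spark.SparkConf;

// Numbers taken from the spark-submit example above; the rule of thumb
// of 2-3 tasks per core comes from Spark's tuning guide.
int numExecutors = 3;
int executorCores = 3;
int parallelism = numExecutors * executorCores * 3;  // ~2-3x total cores

SparkConf conf = new SparkConf()
        .set("spark.default.parallelism", String.valueOf(parallelism));
```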
3. Refactor and optimize the RDD architecture
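This section has no code in the notes; a common concrete form of the optimization is to build one shared RDD instead of constructing the same lineage twice, and to persist any RDD that multiple actions reuse. A minimal sketch in local mode (the data and RDD names are hypothetical):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

SparkConf conf = new SparkConf().setAppName("RddReuse").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);

// One shared RDD instead of building the same lineage twice.
JavaRDD<String> lines = sc.parallelize(Arrays.asList("a b", "b c"));

// Persist it, because the two actions below would otherwise each
// recompute the whole lineage from scratch.
lines.persist(StorageLevel.MEMORY_ONLY_SER());

long total = lines.count();
long distinct = lines.distinct().count();
sc.close();
```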
4. Broadcast large variables

final Broadcast<Map<String, Map<String, List<Integer>>>> dateHourExtractMapBroadcast = sc.broadcast(dateHourExtractMap);
Map<String, Map<String, List<Integer>>> dateHourExtractMap = dateHourExtractMapBroadcast.value();
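The value() call is typically made inside task code running on the executors; that way each executor fetches one copy of the broadcast value instead of having the large map serialized into every task closure. A minimal sketch, assuming a hypothetical sessionidRDD of String elements:

```java
// sessionidRDD is a hypothetical RDD for illustration;
// dateHourExtractMapBroadcast is the broadcast variable created above.
JavaRDD<String> filtered = sessionidRDD.filter(sessionid -> {
    // Executor-side read of the broadcast value: one copy per executor.
    Map<String, Map<String, List<Integer>>> dateHourExtractMap =
            dateHourExtractMapBroadcast.value();
    // Placeholder predicate; the real filter logic would look up
    // the session's date/hour in the map.
    return !dateHourExtractMap.isEmpty();
});
```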
5. Use Kryo serialization in the project

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
6. Use the fastutil framework in the project

import it.unimi.dsi.fastutil.ints.IntArrayList;
import it.unimi.dsi.fastutil.ints.IntList;

Map<String, Map<String, IntList>> fastutilDateHourExtractMap = new HashMap<String, Map<String, IntList>>();

for (Map.Entry<String, Map<String, List<Integer>>> dateHourExtractEntry : dateHourExtractMap.entrySet()) {
    String date = dateHourExtractEntry.getKey();
    Map<String, List<Integer>> hourExtractMap = dateHourExtractEntry.getValue();

    Map<String, IntList> fastutilHourExtractMap = new HashMap<String, IntList>();

    for (Map.Entry<String, List<Integer>> hourExtractEntry : hourExtractMap.entrySet()) {
        String hour = hourExtractEntry.getKey();
        List<Integer> extractList = hourExtractEntry.getValue();

        // Copy the boxed List<Integer> into a primitive-backed IntList,
        // which stores ints directly and uses far less memory.
        IntList fastutilExtractList = new IntArrayList();
        for (int i = 0; i < extractList.size(); i++) {
            fastutilExtractList.add(extractList.get(i));
        }

        fastutilHourExtractMap.put(hour, fastutilExtractList);
    }

    fastutilDateHourExtractMap.put(date, fastutilHourExtractMap);
}
7. Tune the data locality wait time

SparkConf conf = new SparkConf()
    .setAppName(Constants.SPARK_APP_NAME_SESSION)
    .setMaster("local")
    .set("spark.default.parallelism", "500")
    .set("spark.locality.wait", "10")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");