Region-Specific Population Change Model (Scala + Hive)
While writing a new model today I ran into a problem: given this data source, how do I split one day into time segments and aggregate each segment, all in SQL? I have looked for time-handling utility classes in Scala and Java before, but none are as convenient as SQL, so I am recording this here.
Requirement: per region, split one day into one-hour segments and count the total population in that region during each hour.
Without further ado, straight to the code.
Oracle implementation
val oracleDriverUrl = "jdbc:oracle:thin:@192.168.37.120:1521/test"
// The dbtable subquery is a row generator: CONNECT BY LEVEL yields 48
// half-hour timepoints, counting back from midnight of the following day.
val jdbcMap1 = Map(
  "url" -> oracleDriverUrl,
  "user" -> "n620",
  "password" -> "ndsc",
  "driver" -> "oracle.jdbc.driver.OracleDriver",
  "dbtable" -> "(select to_date('20660114','YYYYMMDD') + 1 - (level - 1) / 48 timepoint from dual connect by level <= 48) datetime"
)
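For reference, the half-hour sequence that the CONNECT BY LEVEL subquery produces can be reproduced in plain Scala with java.time; the object and method names below are my own, not from the original code:

```scala
import java.time.LocalDateTime

object TimepointGen {
  // Mirrors the Oracle generator to_date(...) + 1 - (level - 1) / 48:
  // n timepoints counting back in half-hour steps from midnight of the
  // day after `day`.
  def halfHourPoints(day: LocalDateTime, n: Int = 48): Seq[LocalDateTime] =
    (0 until n).map(i => day.plusDays(1).minusMinutes(30L * i))
}
```

For 2066-01-14 this yields 2066-01-15 00:00:00, 2066-01-14 23:30:00, and so on, down to 2066-01-14 00:30:00 as the 48th point, matching the rows the JDBC source returns.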
spark.read.options(jdbcMap1).format("jdbc").load().createOrReplaceTempView("datetime")
val date = spark.sql("select * from datetime")
spark.sql("use n620")
val xdrhive: DataFrame = spark.sql("select * from xdr20660114")
xdrhive.createOrReplaceTempView("xdr20660114")
val elasticsearchDF = spark.sqlContext.read.format("org.elasticsearch.spark.sql").option("inferSchema","true").load("station_locationdis/g4bs-location")
elasticsearchDF.createOrReplaceTempView("STATIONLOCATIONDIS")
val listdata = spark.sql("SELECT X.GUTI,X.ACCOUNT,X.IMSI,X.TIMESTAMP,T.* FROM STATIONLOCATIONDIS T INNER JOIN XDR20660114 X ON T.g4bs_id = X.G4BSID " +
  "WHERE X.GUTI IS NOT NULL AND X.ACCOUNT IS NOT NULL AND X.ACCOUNT != '0' ORDER BY X.TIMESTAMP")
listdata.createOrReplaceTempView("STATIONLOCATIONRESULTDIS")
listdata.show()
val datatime: Array[Row] = date.collect()
// Row.toString renders the row as "[value]", so strip the surrounding
// brackets; datatime(i).getString(0) would be the cleaner, typed way to
// read the value.
val starttime = datatime(21).toString
val endtime = datatime(20).toString
val start = starttime.substring(1, starttime.length - 1)
val end = endtime.substring(1, endtime.length - 1)
// datediff() works at day granularity, so it cannot bound an interval to
// the half-hour; compare epoch seconds via unix_timestamp() instead.
val certaindata = spark.sql("select t.province_region,t.city_region,'" + end + "' as data_time,count(*) from STATIONLOCATIONRESULTDIS t " +
  "where unix_timestamp(t.timestamp,'yyyy-MM-dd HH:mm:ss') >= unix_timestamp('" + start + "','yyyy-MM-dd HH:mm:ss') " +
  "and unix_timestamp(t.timestamp,'yyyy-MM-dd HH:mm:ss') <= unix_timestamp('" + end + "','yyyy-MM-dd HH:mm:ss') " +
  "and t.city_region is not null and t.province_region is not null group by t.province_region,t.city_region")
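The query above filters records to a single half-hour window between two adjacent timepoints. The inclusive bounds check it needs, at second granularity, can be sketched in plain Scala as follows (WindowFilter and inWindow are illustrative names, not from the post):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object WindowFilter {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Inclusive [start, end] membership test at second granularity.
  def inWindow(ts: String, start: String, end: String): Boolean = {
    val t = LocalDateTime.parse(ts, fmt)
    !t.isBefore(LocalDateTime.parse(start, fmt)) &&
    !t.isAfter(LocalDateTime.parse(end, fmt))
  }
}
```

A timestamp equal to either bound counts as inside the window, matching the >= / <= comparisons in the query.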
Hive implementation
val xdrhive: DataFrame = spark.sql("select * from xdr")
xdrhive.createOrReplaceTempView("xdr20660114")
val elasticsearchDF = spark.sqlContext.read.format("org.elasticsearch.spark.sql").option("inferSchema","true").load("lac/lac")
elasticsearchDF.createOrReplaceTempView("STATIONLOCATIONDIS")
// Column names that start with a digit must be backtick-quoted in Spark SQL.
val listdata = spark.sql("SELECT X.GUTI,X.ACCOUNT,X.IMSI,X.datetime,T.* FROM STATIONLOCATIONDIS T INNER JOIN XDR20660114 X ON T.`4gbasenum` = X.G4BSID WHERE X.GUTI IS NOT NULL AND X.ACCOUNT IS NOT NULL AND X.ACCOUNT != '0' ORDER BY X.datetime")
listdata.createOrReplaceTempView("STATIONLOCATIONRESULTDIS")
// Floor each timestamp to the hour and carry the region columns along, so
// the hourly counts can be grouped directly. (Joining a timestamp-only view
// back to STATIONLOCATIONRESULTDIS on imsi alone would pair every record of
// an imsi with every hourly bucket of that imsi and inflate the counts.)
spark.sql("SELECT FROM_UNIXTIME(60*60*CAST(UNIX_TIMESTAMP(datetime)/(60*60) AS BIGINT), 'yyyy-MM-dd HH:mm:ss') as data_time, imsi, pro_name, city_name FROM STATIONLOCATIONRESULTDIS").createOrReplaceTempView("tempdate")
val certaindata = spark.sql("select data_time, pro_name, city_name, count(*) from tempdate " +
  "where data_time is not null group by data_time, pro_name, city_name")
certaindata.show()
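The FROM_UNIXTIME(60*60*CAST(UNIX_TIMESTAMP(datetime)/(60*60) AS BIGINT), ...) expression is the hour-bucketing trick: integer-divide the epoch seconds by 3600, multiply back, and reformat. The same arithmetic in plain Scala (HourBucket is an illustrative name):

```scala
object HourBucket {
  // Floor an epoch-seconds timestamp to the top of its hour, the same
  // arithmetic as 60*60*CAST(UNIX_TIMESTAMP(datetime)/(60*60) AS BIGINT).
  def floorToHour(epochSeconds: Long): Long = (epochSeconds / 3600L) * 3600L
}
```

For example, 2020-02-19 08:05:17 UTC (epoch 1582099517) maps to 2020-02-19 08:00:00 UTC (epoch 1582099200).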
Ah, an entire afternoon of fiddling. Here are the final results!
+-------------------+--------+---------+--------+
|          data_time|pro_name|city_name|count(1)|
+-------------------+--------+---------+--------+
|2020-02-19 01:00:00| 江蘇省| 南通市| 1|
|2020-02-19 11:00:00| 湖北省| 黃岡市| 1|
|2020-02-19 13:00:00| 遼寧省| 錦州市| 1|
|2020-02-19 15:00:00| 上海市| 黃浦區| 1|
|2020-02-19 05:00:00| 上海市| 浦東新區| 3|
|2020-02-19 13:00:00| 上海市| 寶山區| 1|
|2020-02-19 19:00:00| 河南省| 濮陽市| 1|
|2020-02-19 08:00:00| 安徽省| 馬鞍山市| 1|
|2020-02-19 21:00:00| 上海市| 青浦區| 2|
|2020-02-19 10:00:00| 江西省| 撫州市| 1|
|2020-02-19 20:00:00| 上海市| 閘北區| 2|
|2020-02-19 15:00:00| 上海市| 寶山區| 2|
|2020-02-19 00:00:00| 遼寧省| 大連市| 1|
|2020-02-19 05:00:00| 上海市| 閘北區| 1|
|2020-02-19 02:00:00| 上海市| 松江區| 1|
|2020-02-19 23:00:00| 山東省| 青島市| 1|
|2020-02-19 17:00:00| 浙江省| 台州市| 1|
|2020-02-19 14:00:00| 山西省| 運城市| 1|
|2020-02-19 23:00:00| 浙江省| 杭州市| 1|
|2020-02-19 21:00:00| 甘肅省| 天水市| 2|
+-------------------+--------+---------+--------+
Thanks for reading, and feel free to follow!