Hive--Written Test 05_1--Finding the Top N

Here is an interview question.

 

Example scenario

Analysis of student scores in Beijing

 

Score data format

Each line of exercise5_1.txt holds one student's score record. Fields are separated by a comma (",").

year,school,grade,name,course,score


Sample data

2013,北大,1,黃渤,語文,97
2013,北大,1,徐崢,語文,52
2013,北大,1,劉德華,語文,85
2012,清華,0,馬雲,英語,61
2015,北理工,3,李彥宏,物理,81
2016,北科,4,馬化騰,化學,92
2014,北航,2,劉強東,數學,70
2012,清華,0,劉詩詩,英語,59
2014,北航,2,劉亦菲,數學,49
2014,北航,2,劉嘉玲,數學,77

 

Create the table and load the data

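-- create a dedicated database and switch to it
create database if not exists exercise;
use exercise;
-- recreate the table so the script can be rerun without loading the file twice
drop table if exists exercise5_1;
create table exercise5_1(year int, school string, grade int, name string, course string, score int) row format delimited fields terminated by ',';
-- load the comma-separated file from the local filesystem
load data local inpath "/home/hadoop/exercise5_1.txt" into table exercise5_1;
-- check the loaded rows and the table schema
select * from exercise5_1;
desc exercise5_1;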

 

 

Requirements

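1. Grouped Top N: for this year, select the three highest-scoring courses for each school and grade

select t.* 
from 
(select school, grade, course, score,
-- number the rows within each (school, grade) from the highest score down
row_number() over (partition by school, grade order by score desc) rank_code 
from exercise5_1 
-- year is an int column, so compare it with a number rather than a string
where year = 2017
) t
where t.rank_code <= 3;

"This year" is 2017 in these queries. The sample rows above only cover 2012 to 2016, so the filter returns nothing on them; substitute a year that is present when testing.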

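Explanation:
  row_number function: row_number() numbers the rows within each partition in the order given by order by, starting from 1 and increasing by 1 for each row. Rows with equal scores still get different numbers; use rank() or dense_rank() if ties should share a number.
  over clause: over is not a function in its own right. It turns the function in front of it into a window function and defines the window that function is computed over.
  over (order by score) sorts by score and accumulates. When order by is given without an explicit frame, the default frame is range between unbounded preceding and current row, so an aggregate such as sum becomes a running total.
  over (partition by grade) partitions the rows by grade.
  over (partition by grade order by score) partitions the rows by grade and sorts them by score within each partition.
  over (order by score range between 2 preceding and 2 following) the window holds every row whose score lies between the current row's score minus 2 and plus 2; combined with sum, it adds up those scores.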
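
To make the points above concrete, here is a minimal sketch that compares the three numbering functions and the default frame on four made-up scores. The inline values built with stack() and the aliases rn, rk, drk and running_total are assumptions for illustration only; they are not part of the exercise data.

```sql
-- four hypothetical scores, two of them tied at 85
select score,
       row_number() over (order by score desc) as rn,
       rank()       over (order by score desc) as rk,
       dense_rank() over (order by score desc) as drk,
       sum(score)   over (order by score desc) as running_total
from (select stack(4, 90, 85, 85, 70) as score) t;

-- values per row (the order of the output rows may vary):
-- score  rn  rk  drk  running_total
-- 90     1   1   1    90
-- 85     2   2   2    260
-- 85     3   2   2    260
-- 70     4   4   3    330
```

The two tied rows share the running total 260 because the default frame is a range frame, which always includes every row with the same score as the current one.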

 

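2. For this year at 北航: the score of each grade in each course, plus the sum of the scores within 2 points above or below it

select school, grade, course, score,
-- no partition by: the window spans all matching rows, across grades and courses; add partition by grade, course to keep the sum within one grade and course
sum(score) over (order by score range between 2 preceding and 2 following) sscore
from exercise5_1 
where year = 2017 and school="北航";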
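
A range frame is measured in score values, not in rows, which is easy to confuse with a rows frame. The sketch below runs both on five made-up scores; the inline values and the aliases range_sum and rows_sum are assumptions for illustration.

```sql
-- five hypothetical scores: three close together, two far apart
select score,
       sum(score) over (order by score range between 2 preceding and 2 following) as range_sum,
       sum(score) over (order by score rows  between 2 preceding and 2 following) as rows_sum
from (select stack(5, 70, 71, 72, 80, 90) as score) t;

-- values per row (the order of the output rows may vary):
-- score  range_sum  rows_sum
-- 70     213        213
-- 71     213        293
-- 72     213        383
-- 80     80         313
-- 90     90         242
```

With range, 80 and 90 have no other score within 2 points, so each sums only itself. With rows, the frame always takes up to two neighbouring rows on each side, however far apart their scores are.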

 

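3. where and having: for this year, grade 1 at 清華, the students whose total score exceeds 200 and the number of such students

select school, grade, name, sum(score) as total_score,
-- the window function runs after group by and having, so nct counts the students that remain
count(1) over (partition by school, grade) nct
from exercise5_1
where year = 2017 and school="清華" and grade = 1
group by school, grade, name
-- repeat the aggregate instead of its alias, which not every SQL engine accepts in having
having sum(score) > 200;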

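having is the filter applied after grouping (group by): it tests each group as a whole, usually on an aggregate value, and discards the groups that fail.
where filters individual rows before grouping and aggregation, so it takes effect before the group by and having clauses.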
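
The order of the two filters can be sketched on the sample table. The pass mark of 60, the threshold of 2 and the alias passed are assumptions chosen for illustration.

```sql
-- where runs first: only rows with score >= 60 reach the grouping step
-- having runs after group by: only schools with at least 2 such rows survive
select school, count(1) as passed
from exercise5_1
where score >= 60
group by school
having count(1) >= 2;

-- equivalent form: having behaves like a where on the grouped result
select t.school, t.passed
from (select school, count(1) as passed
      from exercise5_1
      where score >= 60
      group by school) t
where t.passed >= 2;
```

On the ten sample rows above, both queries return 北大 with 2 and 北航 with 2. The other schools have only one score of 60 or more each, so their groups are dropped by having.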

 

 
