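Pig's built-in DISTINCT only removes records that are identical in every field; it does not support DISTINCT on a subset of fields.
The Pig documentation says: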
You cannot use DISTINCT on a subset of fields. To do this, use FOREACH…GENERATE to select the fields, and then use DISTINCT (see Example: Nested Block).
In the example that follows in the documentation, a FILTER is applied first, and DISTINCT is then run over the entire resulting relation.
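The approach the documentation recommends can be sketched as follows. This is a minimal illustration, assuming a comma-separated input file named 'foo' with the four-column schema used later in this post; the aliases foo_proj and foo_uniq are made up for the example.

```pig
-- Load the raw records (file name and schema are assumptions for this sketch).
foo = LOAD 'foo' USING PigStorage(',') AS (id:int, field1:chararray, field2:chararray, field3:chararray);

-- Project only the fields that should be de-duplicated.
foo_proj = FOREACH foo GENERATE id, field1;

-- DISTINCT now compares whole (id, field1) tuples.
foo_uniq = DISTINCT foo_proj;

DUMP foo_uniq;
```

Note the limitation: the output contains only the projected fields (here id and field1), so field2 and field3 are gone from the result.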
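In practice, though, poor schema design and redundant data often mean you need DISTINCT on only some fields while still keeping the other fields, whose data is partly useful.
This can be done by combining GROUP, FOREACH, and LIMIT.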
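Take the data foo(id, field1, field2, field3) as an example:
when id=1, only field1 is meaningful, and its value is always the same;
when id=2, field1 and field2 are meaningful, and their values are always the same;
when id=3, field1, field2, and field3 are all meaningful, and their values are always the same.
(PS: a table designed like this violates the database normal forms.)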
1,value1,other1_1,other1_2
2,value2_1,value2_2,other2_1
3,value3_1,value3_2,value3_3
1,value1,other1_3,other1_4
1,value1,other1_5,other1_6
2,value2_1,value2_2,other2_2
4,value4_1,value4_2,
Pig code that applies DISTINCT to id only:
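-- Load the comma-separated records with an explicit schema.
foo = LOAD 'foo' USING PigStorage(',') AS (id:int, field1:chararray, field2:chararray, field3:chararray);
-- Collect all records that share the same id into one bag.
foo_group = GROUP foo BY id;
-- Keep a single record from each bag, then flatten it back into a plain tuple.
result = FOREACH foo_group {
    foo_one = LIMIT foo 1;
    GENERATE FLATTEN(foo_one);
}
DUMP result;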
Result:
(1,value1,other1_1,other1_2)
(2,value2_1,value2_2,other2_1)
(3,value3_1,value3_2,value3_3)
(4,value4_1,value4_2,)
The code above ran successfully on Pig 0.9.2.
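One caveat: bags in Pig are unordered, so LIMIT 1 on its own keeps an arbitrary record from each group. If it matters which record survives, a nested ORDER before the LIMIT makes the choice deterministic. The sketch below assumes the same 'foo' input and schema as above and that the record with the smallest field2 should be kept; that sort key is chosen only for illustration.

```pig
-- Same input and schema as above (assumed for this sketch).
foo = LOAD 'foo' USING PigStorage(',') AS (id:int, field1:chararray, field2:chararray, field3:chararray);
foo_group = GROUP foo BY id;

result = FOREACH foo_group {
    -- Sort the records of each group; field2 is an illustrative sort key.
    foo_sorted = ORDER foo BY field2 ASC;
    -- Keep the first record of the sorted bag.
    foo_one = LIMIT foo_sorted 1;
    GENERATE FLATTEN(foo_one);
}

DUMP result;
```

The nested ORDER adds a per-group sort, so it is only worth including when the non-key fields differ within a group and a specific record is wanted.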