Using distinct in Pig

Distinct

distinct operates only on whole records of a relation. It cannot be applied to an expression or to a subset of fields.

--distinct.pig

daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);

uniq = distinct daily;

uniq is a relation (similar to a table; it is a data flow object), not an expression.
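Because distinct only works on whole records, one way to deduplicate on a subset of fields is to project those fields into a new relation first. The following is a minimal sketch under that assumption; the aliases syms and uniq_syms are illustrative names, not from the original scripts.

```pig
--distinct_projection.pig (hypothetical file name)
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);

-- Project the single field of interest into its own relation.
syms = foreach daily generate symbol;

-- distinct now sees records made up of only the symbol field.
uniq_syms = distinct syms;
```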

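"distinct forces a reduce phase. It does make use of the combiner to remove any duplicate records it can delete in the map phase." In other words, distinct always triggers a reduce phase (many statements can be completed with map alone and need no reduce). Other operators that force a reduce phase include order, join, group, limit, cogroup, and cross.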

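distinct is nevertheless fast because it runs a combiner in the map phase, which removes duplicates early and reduces the data sent to the reducers.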

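The parallel clause controls only the reduce side, so setting parallel in a script really sets the number of reducers. During the reduce phase, records are automatically hashed to the corresponding reducer.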
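As an illustration of the parallel clause, the sketch below attaches it to the distinct statement. The value 10 is an arbitrary example, and the right number depends on the data volume and the cluster.

```pig
--distinct_parallel.pig (hypothetical file name)
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);

-- Request 10 reducers for the reduce phase that distinct forces.
-- parallel has no effect on the number of map tasks.
uniq = distinct daily parallel 10;
```

Pig also accepts a script-wide default, `set default_parallel 10;`, which applies to every reduce-side operator that has no parallel clause of its own.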

--distinct_symbols.pig

daily = load 'NYSE_daily' as (exchange, symbol); -- not interested in other fields

grpd = group daily by exchange;

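uniqcnt = foreach grpd {
    -- project the symbol field out of each group's bag
    sym = daily.symbol;
    -- sym is a relation inside the nested block, so distinct can be applied
    uniq_sym = distinct sym;
    generate group, COUNT(uniq_sym);
};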

Note again that distinct can only process a relation, not an expression.

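Here daily.symbol is an expression, so distinct cannot be applied to it directly. It must first be assigned to sym, which makes it a relation, and only then can distinct be applied.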
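For comparison, a similar per-exchange count can be sketched without a nested foreach by running distinct on the whole (exchange, symbol) records before grouping. This is an alternative formulation, not the approach from the script above, and the alias uniq_pairs is an illustrative name. Results may differ from the nested version when symbol contains nulls, because COUNT skips tuples whose first field is null.

```pig
--distinct_then_group.pig (hypothetical file name)
daily = load 'NYSE_daily' as (exchange, symbol);

-- Deduplicate whole (exchange, symbol) records first.
uniq_pairs = distinct daily;

-- Then count the remaining records per exchange.
grpd = group uniq_pairs by exchange;
uniqcnt = foreach grpd generate group, COUNT(uniq_pairs);
```

This version adds a second reduce phase (one for distinct, one for group), whereas the nested foreach does the deduplication inside the reduce phase of the single group.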