Monthly Archives: August 2012

Pig distinct on large bag

A DISTINCT on a large bag is very likely to cause an OOM (out-of-memory) error. An easy trick is to divide and conquer: instead of doing a GROUP ALL and then a DISTINCT inside the group, do subgroup = group data by SUBSTRING(field_to_be_distinct,0,n); -- use n to control the number and size of subgroups subgroup_cnt = … Continue reading

Posted in pig

sort by tab delimited column

sort -t$'\t' -k17n,17 xxx.txt
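
The $'\t' form is bash ANSI-C quoting, so it may not work in a plain POSIX sh. Below is a minimal sketch of the same idea on a smaller scale; the three-row sample data, the variable names (tab, tmpfile, sorted), and the choice of column 2 are assumptions made for illustration, not part of the original command.

```shell
# Hypothetical sample data: name<TAB>score, three rows in unsorted order.
tmpfile=$(mktemp)
printf 'carol\t30\nalice\t4\nbob\t12\n' > "$tmpfile"

# Build a literal tab portably instead of relying on bash's $'\t'.
tab=$(printf '\t')

# -t sets the field separator to the tab character;
# -k2n,2 restricts the sort key to field 2 and compares it numerically.
sorted=$(sort -t"$tab" -k2n,2 "$tmpfile")
printf '%s\n' "$sorted"

rm -f "$tmpfile"
```

Without the n modifier the key would be compared as text, so 12 would sort before 4.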

Posted in linux