Author Archives: Xiaomeng (Shawn) Wan

mongodb query with cond, group, sort and limit

db.user_data.aggregate([
  {$match: {'start_time': {$gt: ISODate("2013-07-15T00:00:00Z"), $lt: ISODate("2013-08-15T00:00:00Z")}, 'site_id': 403}},
  {$project: {term: '$ga.term'}},
  {$group: {_id: '$term', number: {$sum: 1}}},
  {$sort: {number: -1}},
  {$limit: 10}
]);
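To make the pipeline stages concrete, here is a rough in-memory Python sketch of the same match, project, group, sort and limit steps. It is not how MongoDB executes the query; the function name `top_terms` and the sample documents are made up for illustration, and every document is assumed to carry a `start_time`.

```python
from collections import Counter
from datetime import datetime


def top_terms(sessions, site_id, start, end, limit=10):
    """Count ga.term values for one site with start < start_time < end."""
    counts = Counter(
        s.get("ga", {}).get("term")          # $project: term = ga.term (None if missing)
        for s in sessions
        if s.get("site_id") == site_id       # $match on site_id
        and start < s["start_time"] < end    # $match: $gt and $lt are both exclusive
    )
    # $sort by count descending, then $limit
    return counts.most_common(limit)


# Hypothetical sample documents, for illustration only.
sample_sessions = [
    {"site_id": 403, "start_time": datetime(2013, 7, 20), "ga": {"term": "shoes"}},
    {"site_id": 403, "start_time": datetime(2013, 7, 21), "ga": {"term": "shoes"}},
    {"site_id": 403, "start_time": datetime(2013, 8, 1), "ga": {"term": "hats"}},
    {"site_id": 403, "start_time": datetime(2013, 9, 1), "ga": {"term": "late"}},
    {"site_id": 37, "start_time": datetime(2013, 7, 20), "ga": {"term": "other"}},
]

top = top_terms(sample_sessions, 403, datetime(2013, 7, 15), datetime(2013, 8, 15))
```

With the sample data, only the three in-range sessions for site 403 are counted.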

Posted in mongodb

MongoDB commands

db.getMongo().slaveOk = true

Posted in Uncategorized

ubuntu 11.04 install python and orange

sudo apt-get update
sudo apt-get upgrade
sudo apt-get install gcc
sudo apt-get install build-essential
sudo apt-get install python-pkg-resources
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:fkrull/deadsnakes
sudo apt-get install python2.7
sudo apt-get install python-dev
sudo apt-get install unzip
sudo apt-get install …

Posted in Uncategorized

mr = db.runCommand({
  "mapreduce": "user_data",
  "map": function() { for (var key in this) { emit(key, null); } },
  "reduce": function(key, stuff) { return null; },
  "out": "user_data" + "_keys"
})
db[mr.result].distinct("_id")

http://stackoverflow.com/questions/2298870/mongodb-get-names-of-all-keys-in-collection
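The map step emits every top-level key of every document, and the final distinct collapses them into one list. A small Python sketch of the same idea over plain dicts (the helper name `collection_keys` and the sample documents are hypothetical):

```python
def collection_keys(documents):
    """Return the sorted set of top-level keys seen across all documents."""
    keys = set()
    for doc in documents:
        keys.update(doc.keys())   # same role as emit(key, null) in the map function
    return sorted(keys)


# Hypothetical documents, for illustration only.
docs = [
    {"_id": 1, "site_id": 37, "cart": {}},
    {"_id": 2, "site_id": 403, "order": {}},
]
all_keys = collection_keys(docs)
```

Like the map-reduce version, this only looks at top-level keys, not nested ones.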

Posted by Xiaomeng (Shawn) Wan

hadoop-lzo setup on cloudera

1. install lzo on all nodes
sudo apt-get install liblzo2-dev

2. build hadoop-lzo
git clone git://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo
ant compile-native tar

3. copy jar and libraries into cluster on all nodes
cp build/hadoop-lzo-*/hadoop-lzo-*.jar /usr/lib/hadoop-0.20/lib/
cp build/native/Linux-amd64-64/lib/libgplcompression.* /usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/

4. add to core-site.xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
…

Posted in Uncategorized

mongoexport query timestamp

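mongoexport … --query '{"start_time":{"$gte":new Date(1351641600000),"$lt":new Date(1351728000000)}}' --out 20121031.json

* new Date() takes milliseconds since the Unix epoch, so the last three digits are the millisecond part of the timestamp. The two values above are 2012-10-31 00:00:00 UTC and 2012-11-01 00:00:00 UTC.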
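The two numbers in the query are epoch timestamps in milliseconds for the start and end of one UTC day. A short Python sketch for computing them (the helper name `day_range_ms` is made up for this example):

```python
from datetime import datetime, timedelta, timezone


def day_range_ms(day):
    """Return (start_ms, end_ms) in epoch milliseconds for a UTC day given as 'YYYYMMDD'."""
    start = datetime.strptime(day, "%Y%m%d").replace(tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    # timestamp() returns seconds; new Date(...) in the query expects milliseconds
    return int(start.timestamp()) * 1000, int(end.timestamp()) * 1000


start_ms, end_ms = day_range_ms("20121031")
```

The pair can then be pasted into the `$gte` / `$lt` bounds of the `--query` argument.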

Posted in mongodb

find query in mongo and pymongo

mongos> db.user_data.find({'site_id':37,'cart':{$exists:true},'order':{$exists:true},'start_time':{$gt:ISODate("2012-09-11T00:00:00.000Z")}})

translated into pymongo

from datetime import datetime
for session in sessions.find({"site_id":37,"cart":{"$exists":True},"order":{"$exists":True},"start_time":{"$gt":datetime.strptime('20120911','%Y%m%d')}}):
    ...

Posted in mongodb

Pig distinct on large bag

A DISTINCT on a large bag is very likely to cause an out-of-memory error. An easy trick is to divide and conquer. Instead of grouping everything with GROUP ALL and then running DISTINCT inside the group, do

subgroup = GROUP data BY SUBSTRING(field_to_be_distinct, 0, n); -- use n to control the number and size of subgroups
subgroup_cnt = …
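The reason the trick is safe: equal values always share the same prefix, so they land in the same subgroup, and the per-subgroup distinct counts can simply be added up. A minimal Python sketch of that reasoning, assuming the goal is a distinct count (the rest of the Pig script is truncated above, so this is a guess at the intent, and `distinct_count_by_prefix` is a made-up name):

```python
from collections import defaultdict


def distinct_count_by_prefix(values, n):
    """Count distinct strings by bucketing on the first n characters."""
    buckets = defaultdict(set)
    for v in values:
        # Equal values share a prefix, so duplicates never cross buckets.
        buckets[v[:n]].add(v)
    # Each bucket is deduplicated on its own; the totals add up.
    return sum(len(bucket) for bucket in buckets.values())


# Hypothetical values, for illustration only.
values = ["apple", "apple", "avocado", "banana", "blueberry", "banana"]
total = distinct_count_by_prefix(values, 1)
```

In Python everything still sits in one process, so this only shows why the result is correct; in Pig the benefit is that each subgroup's bag is much smaller than the single GROUP ALL bag.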

Posted in pig

sort by tab delimited column

sort -t$'\t' -k17n,17 xxx.txt
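For cases where the sort has to happen inside a script, a rough Python equivalent is sketched below. The helper name `sort_by_column` is hypothetical, and unlike `sort -n`, which treats non-numeric fields as 0, this version raises on them.

```python
def sort_by_column(lines, column, sep="\t"):
    """Sort lines numerically by a 1-based column, like sort -t SEP -k COLn,COL."""
    def key(line):
        # Strip the newline, split on the delimiter, pick the 1-based column.
        return float(line.rstrip("\n").split(sep)[column - 1])
    return sorted(lines, key=key)


# Hypothetical three-column rows, sorted on column 2.
rows = ["a\t10\tx", "b\t2\ty", "c\t33\tz"]
ordered = sort_by_column(rows, 2)
```

For the command above the call would use `column=17`.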

Posted in linux

mongoDB query: is not null and compare two field values

db.user.find({campaign:{$exists:true},'order.discount':{$ne:null},$where:function() {return this.order.discount.code != this.campaign.offer.coupon} })
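The query combines two guards (campaign present, order.discount not null) with a field-to-field comparison. A Python sketch of the same predicate over a plain dict, useful for checking the logic on exported documents; `coupon_mismatch` is a made-up name, and as a simplification it treats a null campaign the same as a missing one:

```python
def coupon_mismatch(user):
    """True when the order's discount code differs from the campaign's coupon."""
    campaign = user.get("campaign")
    discount = (user.get("order") or {}).get("discount")
    # Guards: campaign must be present and order.discount must not be null/missing.
    if campaign is None or discount is None:
        return False
    # Field-to-field comparison, the part that needs $where in the query.
    return discount.get("code") != campaign.get("offer", {}).get("coupon")


# Hypothetical documents, for illustration only.
mismatch = {"campaign": {"offer": {"coupon": "A"}}, "order": {"discount": {"code": "B"}}}
match = {"campaign": {"offer": {"coupon": "A"}}, "order": {"discount": {"code": "A"}}}
```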

Posted in mongodb