mongodb query with cond, group, sort and limit

db.user_data.aggregate([{$match:{‘start_time’:{$gt:ISODate(“2013-07-15T00:00:00Z”),$lt:ISODate(“2013-08-15T00:00:00Z”)},’site_id’:403}},{$project:{term:’$ga.term’}},{$group:{ _id : ‘$term’ , number : { $sum : 1 }}},{$sort : { number : -1 }},{$limit : 10}]);

Posted in mongodb | Leave a comment

MongoDB commands

db.getMongo().slaveOk = true

Posted in Uncategorized | Tagged | Leave a comment

ubuntu 11.04 install python and orange

sudo apt-get update

sudo apt-get upgrade

sudo apt-get install gcc

sudo apt-get install build-essential

sudo apt-get install python-pkg-resources

sudo apt-get install python-software-properties

sudo add-apt-repository ppa:fkrull/deadsnakes

sudo apt-get install python2.7

sudo apt-get install python-dev

sudo apt-get install unzip

sudo apt-get install python-numpy

sudo apt-get install python-scipy

wget http://orange.biolab.si/download/orange-source-snapshot-hg-2013-02-20.zip

unzip orange-source-snapshot-hg-2013-02-20.zip

cd Orange-2.6.1/

python setup.py build

sudo python setup.py install

Posted in Uncategorized | Leave a comment
mr = db.runCommand({
  "mapreduce" : "user_data",
  "map" : function() {
    for (var key in this) { emit(key, null); }
  },
  "reduce" : function(key, stuff) { return null; }, 
  "out": "user_data" + "_keys"
})
db[mr.result].distinct("_id")
http://stackoverflow.com/questions/2298870/mongodb-get-names-of-all-keys-in-collection
 
Posted on by Xiaomeng (Shawn) Wan | Leave a comment

hadoop-lzo setup on cloudera

1. install lzo on all nodes

sudo apt-get install liblzo2-dev

2. build hadoop-lzo

git clone git://github.com/kevinweil/hadoop-lzo.git
ant compile-native tar

3. copy jar and libraries into cluster on all nodes

cp build/hadoop-lzo-*/hadoop-lzo-*.jar /usr/lib/hadoop-0.20/lib/
cp build/native/Linux-amd64-64/lib/libgplcompression.* /usr/lib/hadoop-0.20/lib/native/Linux-amd64-64/

4. add to core-site.xml

<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

add to mapred-site.xml

<property>
<name>mapred.child.env</name>
<value>JAVA_LIBRARY_PATH=/usr/lib/hadoop-0.20/lib/native</value>
</property>
<property>
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

5. stop and restart tasktrackers and jobtracker

6. test script

register elephant-bird/core/target/elephant-bird-core-3.0.5-SNAPSHOT.jar
register elephant-bird/pig/target/elephant-bird-pig-3.0.5-SNAPSHOT.jar
register google-collections-1.0-rc1.jar
register json-simple-1.1.1.jar
raw_data = LOAD 'test.json.lzo' USING com.twitter.elephantbird.pig.load.LzoJsonLoader() as (json: map[]);
a = limit raw_data 10;
store a into 'lzo' USING com.twitter.elephantbird.pig.store.LzoJsonStorage();
Posted in Uncategorized | Leave a comment

mongoexport query timestamp

mongoexport ... --query '{"start_time":{"$gte":new Date(1351641600000),"$lt":new Date(1351728000000)}}' --out 20121031.json

*the last three digits is #operations within a given second.

Posted in mongodb | Tagged | Leave a comment

find query in mongo and pymongo

mongos> db.user_data.find({'site_id':37,'cart':{$exists:true},'order':{$exists:true},'start_time':{$gt:ISODate("2012-09-11T00:00:00.000Z")}})

translated into pymongo

for session in sessions.find({"site_id":37,"cart":{"$exists":True},"order":{"$exists":True},"start_time":{"$gt":datetime.strptime('20120911','%Y%m%d')}}):
Posted in mongodb | Tagged , | Leave a comment

Pig distinct on large bag

very likely cause OOM, an easy trick is to divide and conquer. Instead of group all and then distinct in group, do

subgroup = group data by (SUBSTRING(field_to_be_distinct,0,n); #use n to control the number and size of subgroups
subgroup_cnt = foreach subgroup { x = distinct field_to_be_distinct; generate COUNT(x) as cnt; }
alltogether = group subgroup_cnt all;
total = foreach alltogether generate SUM(subgroup_cnt.cnt);
Posted in pig | Tagged , | Leave a comment

sort by tab delimited column

sort -t$’\t’ -k17n,17 xxx.txt

Posted in linux | Tagged | Leave a comment

mongoDB query: is not null and compare two field values

db.user.find({campaign:{$exists:true},’order.discount’:{$ne:null},$where:function() {return this.order.discount.code != this.campaign.offer.coupon} })

Posted in mongodb | Leave a comment