Author Archives: Xiaomeng (Shawn) Wan

Tensorflow load images for training

High level (with Estimator & input_fn) and low level (with feed_dict):

def input_fn():
    image_list = []
    label_list = []
    for f_name in glob('/Users/shawn/Documents/*.png'):
        image_list.append(f_name)
        # the dot before "png" is escaped so it only matches a literal "."
        label = int(re.match(r'.*_(\d+)\.png', f_name).group(1))
        label_list.append(label)
    imagest = tf.convert_to_tensor(image_list, dtype=tf.string)
    labelst = tf.convert_to_tensor(label_list, dtype=tf.int32)
    input_queue = tf.train.slice_input_producer([imagest, … Continue reading
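The label-parsing step in the excerpt above can be tried on its own, without TensorFlow. This is a minimal sketch, assuming file names end in `_<digits>.png`; the helper name `label_from_filename` and the sample paths are hypothetical and not part of the original post.

```python
import re

# The escaped dot keeps a name such as "cat_3xpng" from matching.
LABEL_PATTERN = re.compile(r".*_(\d+)\.png$")


def label_from_filename(f_name):
    """Return the integer label encoded before the .png suffix, or None."""
    match = LABEL_PATTERN.match(f_name)
    return int(match.group(1)) if match else None


# Hypothetical file names, used only for illustration.
names = ["/tmp/cat_3.png", "/tmp/dog_12.png", "/tmp/readme.txt"]
labels = [label_from_filename(n) for n in names]
print(labels)
```

Returning None for a non-matching name, rather than raising, makes it easier to skip stray files before building the label tensor.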

Posted in tensorflow

bash: /usr/bin/[ls,find,mv]: Argument list too long

for f in folder*/file*; do mv -- "$f" .; done
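The same idea, moving files one at a time so that no single command receives the whole expanded list, can be sketched in Python. The function name `move_matches` is hypothetical; `glob.glob`, `os.path` and `shutil.move` are standard-library calls.

```python
import glob
import os
import shutil


def move_matches(pattern, dest):
    """Move every file matching pattern into dest; return the moved names."""
    moved = []
    for path in glob.glob(pattern):
        if os.path.isfile(path):
            name = os.path.basename(path)
            shutil.move(path, os.path.join(dest, name))
            moved.append(name)
    return sorted(moved)
```

Calling `move_matches("folder*/file*", ".")` would mirror the loop above. The pattern is expanded inside the Python process, so the operating system's argument-length limit does not apply.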

Posted in linux

Spark pipeline get best model

val lr = new LinearRegression()
val pipeline = new Pipeline().setStages(Array(lr))
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0, 0.5, 1.0))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)
val cvModel = cv.fit(data)
val model = cvModel.bestModel.asInstanceOf[PipelineModel]
val lrModel = model.stages(0).asInstanceOf[LinearRegressionModel]

Posted in spark

Spark dataframe stats mean

df.describe().rdd
  .map { case r: Row => (r.getAs[String]("summary"), r) }
  .filter(_._1 == "mean")
  .map(_._2)
  .first()
  .toSeq
  .drop(1)
  .map(x => x.toString().toDouble)

Posted in spark

configure https for single instance elastic beanstalk running tomcat

Add configuration files to the src/main/ebextensions folder as shown in the following doc: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/https-singleinstance-tomcat.html

Add the following plugin to your pom file to ensure the extensions folder ends up in the root directory of the war file:

<plugin>
  <artifactId>maven-war-plugin</artifactId>
  <configuration>
    <webResources>
      <resource> … Continue reading

Posted in aws, elastic beanstalk, ssl, tomcat

s3 rename in batch

# rename.sh
# This example moves all files in folder1 up to the root directory. Modify the bucket name and regex to rename the files.
for f in $(aws s3 ls --recursive s3://bucket1/folder1/ | awk -F' ' '{print $4}'); do … Continue reading

Posted in hadoop, linux

pig DBStorage into mysql on EMR

sudo apt-get install libmysql-java

Pig script:

register /usr/share/java/mysql.jar
STORE results INTO 'test' using org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 'jdbc:mysql://host_ip/database_name', 'username', 'password', 'INSERT INTO test (a,b,c,d) VALUES(?,?,?,?)');

MySQL:

/etc/mysql/my.cnf (change bind-address to 0.0.0.0)
bind-address           = 0.0.0.0

sudo /etc/init.d/mysql restart
mysql -u root
INSERT INTO user … Continue reading

Posted in pig

python histogram with arbitrary sized bins

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')
# np.linspace(0, 1, 5) gives the quantiles [0, .25, .5, .75, 1];
# np.arange(0, 1, 0.25) stops at 0.75 and would leave the top quarter of the data unbinned.
# Or replace it with any array of quantiles, e.g. [0, .1, .25, .7, .99].
factors, edges = pd.qcut(data.iloc[:, 3], np.linspace(0, 1, 5), retbins=True)
plt.hist(data.iloc[:, 3], edges)
plt.show()
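To see roughly what kind of edges a quantile list produces, the quantile-to-edge step can be sketched in plain Python. This is an illustration that interpolates linearly between sorted values; it is not pandas' implementation and may differ from `pd.qcut` in detail. The names `quantile_edges` and `bin_counts` and the sample data are hypothetical.

```python
def quantile_edges(values, quantiles):
    """Return one bin edge per quantile, interpolating between sorted values."""
    xs = sorted(values)
    edges = []
    for q in quantiles:
        pos = q * (len(xs) - 1)          # fractional index into the sorted data
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        edges.append(xs[lo] + (xs[hi] - xs[lo]) * (pos - lo))
    return edges


def bin_counts(values, edges):
    """Count values per bin; bins are half-open except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


data = list(range(101))                  # hypothetical sample: 0..100
edges = quantile_edges(data, [0, 0.25, 0.5, 0.75, 1])
print(edges)
print(bin_counts(data, edges))
```

With quartile edges the bins hold nearly equal counts, which is the point of choosing bin widths from quantiles rather than from a fixed step.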

Posted in python

linux remove tab, space, return and newline

tr -d '\040\011\012\015'
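The octal codes in the tr command stand for space (040), tab (011), newline (012) and carriage return (015). A minimal Python sketch of the same deletion, using `str.translate`; the function name `strip_whitespace` is hypothetical.

```python
# Map each character to delete onto None: space, tab, newline, carriage return.
DELETE_TABLE = str.maketrans("", "", " \t\n\r")


def strip_whitespace(text):
    """Delete the same four characters that the tr command removes."""
    return text.translate(DELETE_TABLE)


print(strip_whitespace("a b\tc\r\nd"))
```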

Posted in Uncategorized

mongodb query array nth element

'pageviews.0.page_type': 'home' checks whether the 'page_type' of the first element in the pageviews array is 'home'.
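How such a dotted path is read can be illustrated in plain Python on an ordinary dict. This sketch only shows the path-walking idea and is not MongoDB's matching logic; `resolve_path` and the sample document are hypothetical.

```python
def resolve_path(doc, dotted_path):
    """Walk a dotted path; numeric segments index into lists."""
    current = doc
    for part in dotted_path.split("."):
        if isinstance(current, list):
            # A list can only be entered through a valid numeric segment.
            if not part.isdigit() or int(part) >= len(current):
                return None
            current = current[int(part)]
        elif isinstance(current, dict) and part in current:
            current = current[part]
        else:
            return None
    return current


# Hypothetical document shaped like the one the query targets.
doc = {"pageviews": [{"page_type": "home"}, {"page_type": "search"}]}
print(resolve_path(doc, "pageviews.0.page_type") == "home")
```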

Posted in Uncategorized