Tensorflow load images for training

High level (with Estimator & input_fn) and low level (with feed_dict):

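from glob import glob
import re

import tensorflow as tf  # TensorFlow 1.x queue-based input API


def input_fn():
    # Collect file paths; the label is the number after the last underscore
    # in the file name, e.g. cat_3.png -> 3.
    image_list = []
    label_list = []
    for f_name in glob('/Users/shawn/Documents/*.png'):
        image_list.append(f_name)
        label = int(re.match(r'.*_(\d+)\.png', f_name).group(1))
        label_list.append(label)
    imagest = tf.convert_to_tensor(image_list, dtype=tf.string)
    labelst = tf.convert_to_tensor(label_list, dtype=tf.int32)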

    input_queue = tf.train.slice_input_producer([imagest, labelst],
                                                num_epochs=1,
                                                shuffle=True)

    filenamesq = tf.convert_to_tensor(input_queue[0], dtype=tf.string)
    file_content = tf.read_file(filenamesq)
    images = tf.image.decode_png(file_content, channels=3)
    images = tf.cast(images, tf.float32)
    images = tf.image.rgb_to_grayscale(images)
    resized_images = tf.image.resize_images(images, [80, 60])

    # Batch the per-example file name from the queue, not the full list of paths.
    dataset_dict = dict(images=resized_images, labels=input_queue[1], files=input_queue[0])
    batch_dict = tf.train.batch(dataset_dict, 100,
                                num_threads=1, capacity=100 * 2,
                                enqueue_many=False, shapes=None, dynamic_pad=False,
                                allow_smaller_final_batch=False,
                                shared_name=None, name=None)

    batch_labels = batch_dict.pop('labels')
    batch_images = batch_dict.pop('images')
    return batch_images, batch_labels

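def main(unused_argv):
    # classifier and logging_hook are assumed to be defined elsewhere;
    # fit(..., monitors=...) is the tf.contrib.learn Estimator API.
    classifier.fit(
        input_fn=input_fn,
        steps=100,
        monitors=[logging_hook])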

# Low level: build the same queue by hand and feed batches through feed_dict.
image_paths = []
labels = []
for f_name in glob('/Users/shawn/Documents/*.png'):
    image_paths.append(f_name)
    label = int(re.match(r'.*_(\d+)\.png', f_name).group(1))
    labels.append(label)

image_paths_tf = tf.convert_to_tensor(image_paths, dtype=tf.string, name="image_paths_tf")
labels_tf = tf.convert_to_tensor(labels, dtype=tf.int32, name="labels_tf")

image_path_tf, label_tf = tf.train.slice_input_producer([image_paths_tf, labels_tf], shuffle=False)

image_buffer_tf = tf.read_file(image_path_tf, name="image_buffer")
image_tf = tf.image.decode_png(image_buffer_tf, channels=3, name="image")  # the glob above matches *.png
image_tf = preprocess_image_tensor(image_tf)  # see the processing in input_fn above

# creating a batch of images and labels
batch_size = 100
num_threads = 4
images_batch_tf, labels_batch_tf = tf.train.batch([image_tf, label_tf], batch_size=batch_size,
                                                  num_threads=num_threads)
# define X, Y, loss, train_step and init (e.g. tf.global_variables_initializer()) here

with tf.Session() as sess:
    sess.run(init)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(20):
        images, labels = sess.run([images_batch_tf, labels_batch_tf])
        _, loss_val = sess.run([train_step, loss], feed_dict={X: images, Y: labels})

    coord.request_stop() 
    coord.join(threads)
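
Both snippets derive the label from the file name with essentially the same regular expression (here with the dot escaped). A minimal sketch for checking that parsing on its own, without TensorFlow; `label_from_path` is a hypothetical helper name and the example paths are made up:

```python
import re

# Label is the run of digits between the last underscore and ".png".
LABEL_RE = re.compile(r'.*_(\d+)\.png$')


def label_from_path(path):
    """Return the integer label encoded in a file name such as 'cat_3.png'."""
    match = LABEL_RE.match(path)
    if match is None:
        # Fail loudly instead of raising AttributeError on None.group(1).
        raise ValueError('no label found in %r' % path)
    return int(match.group(1))


if __name__ == '__main__':
    for p in ['/tmp/cat_3.png', '/tmp/dog_v2_17.png']:
        print(p, label_from_path(p))
```

Raising a ValueError for files that do not follow the naming scheme makes a stray file in the folder easier to spot than the AttributeError the inline version would produce.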
Posted in tensorflow

bash: /usr/bin/[ls,find,mv]: Argument list too long

for f in folder*/file*; do mv -- "$f" .; done
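
The same move can be done from Python, where the matched paths never pass through an exec argument list. This is a minimal sketch, not a drop-in replacement; `move_matches` is a hypothetical helper name:

```python
import glob
import os
import shutil


def move_matches(pattern, dest):
    """Move every path matching `pattern` into directory `dest`; return the count."""
    moved = 0
    # The paths stay inside the Python process, so the kernel's limit on
    # exec argument size never applies, however many files match.
    for path in sorted(glob.glob(pattern)):
        shutil.move(path, os.path.join(dest, os.path.basename(path)))
        moved += 1
    return moved


if __name__ == '__main__':
    print(move_matches('folder*/file*', '.'))
```

Files with the same base name in different folders would overwrite or collide with each other in the destination, just as with the shell loop.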

Posted in linux

Spark pipeline get best model

 val lr = new LinearRegression()

 val pipeline = new Pipeline().setStages(Array(lr))

 val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0, 0.5, 1.0)).build()

 val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new RegressionEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)

 val cvModel = cv.fit(data)

 val model = cvModel.bestModel.asInstanceOf[PipelineModel]

 val lrModel = model.stages(0).asInstanceOf[LinearRegressionModel]

Posted in spark

Spark dataframe stats mean

df.describe().rdd.map{ case r : Row => (r.getAs[String]("summary"),r) }.filter(_._1 == "mean").map(_._2).first().toSeq.drop(1).map(x => x.toString().toDouble)

Posted in spark

configure https for single instance elastic beanstalk running tomcat

  1. Add configuration files to the src/main/ebextensions folder as shown in the following doc: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/https-singleinstance-tomcat.html
  2. Add the following plugin to your pom file to ensure the extensions folder ends up in the root directory of the WAR file:
<plugin>
  <artifactId>maven-war-plugin</artifactId>
  <configuration>
    <webResources>
      <resource>
        <directory>src/main/ebextensions</directory>
        <targetPath>.ebextensions</targetPath>
        <filtering>true</filtering>
      </resource>
    </webResources>
  </configuration>
</plugin>

  3. Add an A record in Route 53 to map your domain to the Elastic Beanstalk target xxx.us-west-2.elasticbeanstalk.com

  4. (Optional) SSH (eb ssh) to the EC2 instance to make sure the configuration/key/crt files are created. For some reason, /etc/httpd/conf.d/ssl.conf wasn’t created in my case; I had to add it manually and then restart Apache.

Posted in aws, elastic beanstalk, ssl, tomcat

s3 rename in batch

# rename.sh: this example moves all files in folder1 up to the bucket root; modify the bucket name and the pattern to rename files differently

for f in $(aws s3 ls --recursive s3://bucket1/folder1/ | awk -F' ' '{print $4}'); do
  # ${f##*/} strips everything up to the last slash (bash patterns are globs, not regexes)
  aws s3 mv "s3://bucket1/$f" "s3://bucket1/${f##*/}"
done
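
The loop splits on whitespace, so it breaks on keys that contain spaces. A minimal Python sketch of the planning step that keeps such keys intact; it only computes the source and destination URLs from `aws s3 ls --recursive` output and does not call AWS. `plan_moves` is a hypothetical helper name and the sample lines in the usage are made up:

```python
import posixpath


def plan_moves(ls_lines, bucket):
    """Turn `aws s3 ls --recursive` output lines into (source, destination) URL pairs."""
    moves = []
    for line in ls_lines:
        # Columns are date, time, size, key; split at most 3 times so that
        # keys containing spaces stay in one piece.
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        key = parts[3].rstrip('\r\n')
        if key.endswith('/'):
            continue  # "folder" placeholder objects have no file name to keep
        new_key = posixpath.basename(key)  # drop every leading path component
        moves.append(('s3://%s/%s' % (bucket, key),
                      's3://%s/%s' % (bucket, new_key)))
    return moves


if __name__ == '__main__':
    sample = ['2013-09-02 21:37:53         10 folder1/a.txt',
              '2013-09-02 21:37:54       2048 folder1/sub/my file.txt']
    for src, dst in plan_moves(sample, 'bucket1'):
        print('aws s3 mv "%s" "%s"' % (src, dst))
```

Printing the commands first gives a dry run that can be reviewed before anything is moved.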

Posted in hadoop, linux

pig DBStorage into mysql on EMR

sudo apt-get install libmysql-java

Pig script:

register /usr/share/java/mysql.jar

STORE results INTO 'test' using org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 'jdbc:mysql://host_ip/database_name', 'username', 'password', 'INSERT INTO test (a,b,c,d) VALUES(?,?,?,?)');

MySQL:

/etc/mysql/my.cnf (change bind-address to 0.0.0.0)

bind-address           = 0.0.0.0

sudo /etc/init.d/mysql restart

mysql -u root

INSERT INTO user (Host,User,Password) VALUES('%','username',PASSWORD('password'));

GRANT ALL PRIVILEGES ON database_name.* TO 'username'@'%' IDENTIFIED BY 'password';

Posted in pig

python histogram with arbitrary sized bins

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

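data = pd.read_csv('data.csv')

# Quantile edges from 0 to 1 inclusive (np.arange(0, 1, 0.25) would stop at 0.75 and drop the top quartile);
# replace np.linspace(0, 1, 5) with any array of quantiles, e.g. [0, .1, .25, .7, .99]
factors, edges = pd.qcut(data.iloc[:, 3], np.linspace(0, 1, 5), retbins=True)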

plt.hist(data.iloc[:,3],edges)

plt.show()
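
What the quantile binning does can be sketched without pandas: take the values at the chosen quantiles as bin edges, then count the values between consecutive edges. This is a stdlib-only illustration; `quantile_edges` and `histogram` are hypothetical helper names, and the linear interpolation between order statistics is an assumption that may differ slightly from what pandas does on some inputs:

```python
from bisect import bisect_right


def quantile_edges(values, quantiles):
    """Return the value at each quantile, interpolating linearly between sorted values."""
    xs = sorted(values)
    edges = []
    for q in quantiles:
        pos = q * (len(xs) - 1)          # fractional index into the sorted data
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        edges.append(xs[lo] + (xs[hi] - xs[lo]) * (pos - lo))
    return edges


def histogram(values, edges):
    """Count values per bin; bins are left-inclusive, the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # outside the covered range
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts


if __name__ == '__main__':
    data = list(range(1, 101))
    edges = quantile_edges(data, [0, 0.25, 0.5, 0.75, 1])
    print(edges, histogram(data, edges))
```

With quantile edges the bins have unequal widths but roughly equal counts, which is the point of choosing the edges this way.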

Posted in python

linux remove tab, space, return and newline

tr -d '\040\011\012\015'
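
The octal codes stand for space (040), tab (011), newline (012) and carriage return (015). A minimal Python sketch that deletes the same four characters; `strip_whitespace` is a hypothetical helper name:

```python
# Same four characters as the octal codes given to tr:
# \040 space, \011 tab, \012 newline, \015 carriage return.
_DELETE = str.maketrans('', '', ' \t\n\r')


def strip_whitespace(text):
    """Return `text` with every space, tab, newline and carriage return removed."""
    return text.translate(_DELETE)


if __name__ == '__main__':
    print(repr(strip_whitespace('a b\tc\r\nd')))
```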


mongodb query array nth element

'pageviews.0.page_type': 'home' matches documents in which the first element of the pageviews array has a page_type of 'home'.
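
How the numeric part of the path picks an array element can be illustrated in plain Python. This sketch only mimics positional lookup on dicts and lists; it is not how MongoDB is implemented, and it ignores other matching rules such as array traversal without an index. `resolve` and `matches` are hypothetical helper names:

```python
def resolve(doc, dotted_path):
    """Follow a dotted path through nested dicts and lists; return None if it is missing."""
    current = doc
    for part in dotted_path.split('.'):
        if isinstance(current, list):
            # A numeric part indexes into the array (0 is the first element).
            if not part.isdigit() or int(part) >= len(current):
                return None
            current = current[int(part)]
        elif isinstance(current, dict):
            if part not in current:
                return None
            current = current[part]
        else:
            return None
    return current


def matches(doc, query):
    """True if every dotted path in `query` resolves to the expected value."""
    return all(resolve(doc, path) == expected for path, expected in query.items())


if __name__ == '__main__':
    doc = {'pageviews': [{'page_type': 'home'}, {'page_type': 'cart'}]}
    print(matches(doc, {'pageviews.0.page_type': 'home'}))
    print(matches(doc, {'pageviews.1.page_type': 'home'}))
```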
