Downloading all files from an Amazon S3 bucket

I was trying to download all files from an Amazon S3 bucket and did not feel like clicking through all of them. Here is the little Python 2 script I came up with:

import os
import urllib2
import xml.etree.ElementTree

print("Retrieving file list ...")
# note: S3 normally caps a single listing at 1000 keys; if files are
# missing, check the <IsTruncated> element and page through the results
url = urllib2.urlopen('https://s3.amazonaws.com/tripdata?max-keys=9999999')
data = url.read()
url.close()

print("Parsing file list ...")
e = xml.etree.ElementTree.fromstring(data)

ns = '{http://s3.amazonaws.com/doc/2006-03-01/}'
keys = e.findall(ns + 'Contents/' + ns + 'Key')
keys = [k.text for k in keys if ".zip" in k.text]

print("Found files: " + str(len(keys)))

print("Start downloading ...")
os.mkdir("data")
for k in keys:
    print("Downloading: " + k)
    with open("data/" + k, "wb") as f:
        # the with block closes the file; no explicit close needed
        f.write(urllib2.urlopen("https://s3.amazonaws.com/tripdata/" + k).read())
 
print("Done downloading.")

If anybody knows an easier way, please let me know.
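For what it's worth, the AWS CLI can do the same in one line, assuming it is installed (the --no-sign-request flag allows anonymous access to public buckets like this one):

# mirror the whole bucket into ./data
aws s3 sync s3://tripdata data --no-sign-request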

Sources:

  • http://dabase.com/e/14003/
  • http://stackoverflow.com/questions/4028697/how-do-i-download-a-zip-file-in-python-using-urllib2

NFS Automount / Autofs Timeout

Recently, we had an issue where, on one machine, a single NFS mountpoint was not mountable (in our case via automount/autofs). Other mountpoints from the same server worked, and the same mountpoint worked on other machines, so a very mysterious issue indeed. It turned out to be caused by a problem on the server providing the NFS mountpoint, related to portmap. We found a solution here: http://serverfault.com/questions/482479/cant-mount-nfs-volume-time-out:

/etc/init.d/nfsserver stop   # might also be called nfs-kernel-server
/etc/init.d/portmap stop
 
/etc/init.d/portmap start    # start the portmapper first, so NFS can register with it
/etc/init.d/nfsserver start
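
To verify that the server has registered its services with the portmapper again, rpcinfo can be used from a client (the hostname below is a placeholder):

# list the RPC services registered on the NFS server;
# portmapper, mountd and nfs should all show up
rpcinfo -p nfs-server-hostname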

Backing up a MySQL Database via LVM Snapshot

I have recently come across a nice solution for backing up "large" MySQL databases without having to maintain a dedicated dump slave. That is, if a short outage (stopping the slave for a few seconds) is fine. We maintain a hot-swap slave (read-only) which will jump in for the master if it fails, so such a short outage is fine on the slave.

The solution is as follows:

  • stop MySQL slave
  • perform a filesystem sync
  • perform an LVM snapshot
  • mount the snapshot somewhere
  • start another MySQL instance based on this snapshot for a happy backup

I thought that this method was pretty nice. I have not worked out the details yet, but it seems to work according to a colleague.
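A minimal sketch of these steps, assuming the data directory lives on a logical volume vg0/mysql and standard init scripts; all names, sizes and paths here are assumptions to adjust:

#!/bin/bash
set -e

/etc/init.d/mysql stop                    # short outage on the slave begins
sync                                      # flush filesystem buffers to disk
lvcreate --snapshot --size 10G --name mysql-snap /dev/vg0/mysql
/etc/init.d/mysql start                   # slave is back after a few seconds

mkdir -p /mnt/mysql-snap
mount /dev/vg0/mysql-snap /mnt/mysql-snap

# run a throwaway MySQL instance on the snapshot and back up from there
mysqld_safe --no-defaults --datadir=/mnt/mysql-snap \
            --socket=/tmp/backup.sock --skip-networking &
sleep 10
mysqldump --socket=/tmp/backup.sock --all-databases > backup.sql
mysqladmin --socket=/tmp/backup.sock shutdown

umount /mnt/mysql-snap
lvremove -f /dev/vg0/mysql-snap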


Cloudera, Spark and MySQL

I am using a Cloudera cluster (CDH-5.4.2-1.cdh5.4.2.p0.2) to run Spark (1.3.0). I wanted to access data in a MySQL database:

val photos = sqlContext.load(
    "jdbc", 
    Map(
        "driver" -> "com.mysql.jdbc.Driver", 
        "url" -> "jdbc:mysql://testserver:3306/test?user=tester&password=testing", 
        "dbtable" -> "photo"))
photos.count

Unfortunately, this does not work right out of the box. The first thing I got was a ClassNotFoundException:

java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
...

Now, the Spark team tells you to add the MySQL driver JAR to your classpath, usually by adding it to compute-classpath.sh on your driver (or bundling it into your application JAR) as well as on all your workers. This did not work for me.
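Concretely, the usual suggestion amounts to something like this spark-shell invocation (a sketch; the JAR path is an assumption):

# put the driver JAR on the driver classpath and ship it to the executors
spark-shell --driver-class-path /path/to/mysql-connector-java-5.1.23.jar \
            --jars /path/to/mysql-connector-java-5.1.23.jar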

After a while of trying things, I noticed a file called classpath.txt in /etc/spark/conf on the Cloudera master (and only on the master) listing a bunch of JARs. To this file I added the path to my MySQL driver JAR:

/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/jars/mysql-connector-java-5.1.23.jar

Having done this and added the JAR to all workers at the same path finally made the first exception go away. However, the next one was waiting:

java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)
        at org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:229)
        at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:225)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.<init>(HiveContext.scala:373)
...
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.derby.jdbc.EmbeddedDriver
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
...

This is due to Spark providing a Hive context as the default sqlContext. You can either replace the current sqlContext (as I usually do) or create a separate plain SQLContext:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)      // replaces the default Hive context
// or, to keep the Hive context around, bind a second context instead:
// val sqlContext2 = new org.apache.spark.sql.SQLContext(sc)

After these steps I was able to access MySQL via Spark.


Linux: Splitting files in two

Here are two scripts that split the lines of a file into two files based on a given ratio: the first splits deterministically (first part vs. rest), the second distributes lines randomly.

#!/bin/bash
 
# This script writes the first part of the lines from the given input file into one output file and the rest of the lines into another output file.
# The first output file (with the postfix ".ratio") will contain a number of lines corresponding to the given ratio.
# The second output file (with the postfix ".rest") will contain the remaining lines.
# Both output files will have the given prefix.
 
# Arguments:
# 1: file name
# 2: split ratio (0;1)
# 3: output prefix
 
lines=$(wc -l < "$1")
 
split=$(echo "$lines * $2" | bc -l)
split=${split%.*} # floor the number of lines for the first file
 
echo "Total lines: $lines"
echo "Lines in first file: $split"
 
awk "{ if (NR <= $split) print \$0 > \"$3.ratio\"; else print \$0 > \"$3.rest\"}" "$1"
#!/bin/bash
 
# Based on the given ratio, this script will randomly distribute the lines of the given input file into two output files.
# The first output file (with the postfix ".ratio") will contain a number of lines roughly corresponding to the given ratio.
# The second output file (with the postfix ".rest") will contain the remaining lines.
# Both output files will have the given prefix.
 
# Arguments:
# 1: file name
# 2: split ratio (0;1)
# 3: output prefix
 
# note: the !/^$/ pattern skips empty lines
awk "BEGIN {srand()} !/^$/ { if (rand() <= $2) print \$0 > \"$3.ratio\"; else print \$0 > \"$3.rest\"}" "$1"

Spring MVC: Properties in the Application Context vs. in the Servlet Context

I was deploying a web app based on Spring MVC (3.2.6.RELEASE). In this web app, I was trying to use properties in the application context as well as in the servlet context. I determined experimentally (I would have to check the code to be absolutely sure) that

referencing a missing property in the application context raises an exception, while in the servlet context it does not.


MySQL: Install locally (not as root) from binaries

Under Ubuntu, I tried to set up MySQL from binaries for a local user (not root), with another MySQL instance already running. This works, but it is documented rather vaguely (I did not find anything documenting the whole process), so I will sum up the solution I came up with here:

  • download a MySQL archive and extract it to /mysql/install
  • create the directory /mysql/data
  • create the directory /mysql/run
  • create a custom /mysql/my.cnf with the following content:

    [mysqld]
     
    basedir=/mysql/install
    datadir=/mysql/data
     
    socket=/mysql/run/socket
    pid-file=/mysql/run/pid
     
    port=33060
     
     
    [client]
    socket=/mysql/run/socket
    port=33060
     
     
    [mysqld_safe]
  • set environment variables for ease of use

    export MYSQL=/mysql
    export MYSQL_INSTALL=$MYSQL/install
  • to install MySQL (prior to MySQL 5.7.7) run

    $MYSQL_INSTALL/scripts/mysql_install_db --defaults-file=$MYSQL/my.cnf

    for later versions run

    $MYSQL_INSTALL/bin/mysqld --defaults-file=$MYSQL/my.cnf --initialize-insecure
  • now you can start the server using

    $MYSQL_INSTALL/bin/mysqld_safe --defaults-file=$MYSQL/my.cnf &
  • to stop the server use

    $MYSQL_INSTALL/bin/mysqladmin --defaults-file=$MYSQL/my.cnf --port 33060 -u root -p shutdown # password is empty at the moment
  • to set the admin password (after starting the MySQL server) run

    $MYSQL_INSTALL/bin/mysqladmin --defaults-file=$MYSQL/my.cnf -u root password
  • to run the suggested mysql_secure_installation, you have to modify the bin/mysql_secure_installation script (source)

    make_config() {
        echo "# mysql_secure_installation config file" >$config
        echo "[mysql]" >>$config
        echo "user=root" >>$config
        esc_pass=`basic_single_escape "$rootpass"`
        echo "password='$esc_pass'" >>$config
        #sed 's,^,> ,' < $config  # Debugging
     
        # ADD THIS LINE
        echo "socket=/mysql/run/socket" >> $config
    }

    then you can simply run

    $MYSQL_INSTALL/bin/mysql_secure_installation
  • to connect to the MySQL server as root, run the following

    $MYSQL_INSTALL/bin/mysql --defaults-file=$MYSQL/my.cnf -u root -p

    or, if a MySQL client is already installed system-wide, with a little less overhead:

    mysql --defaults-file=$MYSQL/my.cnf -u root -p

Now, the reason for using the --defaults-file option lies in the order in which MySQL reads options files (e.g., /etc/my.cnf) with regard to the --user option. In contrast to other options, MySQL will always use the --user option it reads first instead of the one it reads last. Thus, if a global MySQL instance is installed that sets the --user option in an options file located in a default look-up directory (again, e.g., /etc/my.cnf), then the local installation will fail, since switching users from the one specified by the global instance to the current user is blocked by MySQL (source). With --defaults-file, MySQL only reads the given options file and skips all other look-up directories, which solves the problem.
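To see which options files a given mysqld would read, and in which order, mysqld prints the search order as part of its help output:

# print the option-file search order of this mysqld build
$MYSQL_INSTALL/bin/mysqld --verbose --help 2>/dev/null | grep -A 2 "Default options"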


MySQL 5.5 vs 5.6: ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

Using MySQL 5.5, I was building an index that was too long. MySQL returned a warning:

WARNING 1071 (42000): Specified key was too long; max key length is 767 bytes

The index was silently truncated and, at least on the surface, everything worked.

Then I had someone else run my script, but on MySQL 5.6, which returned an error:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

Turns out, this is intended behavior on MySQL 5.6.
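For illustration, the limit is easy to hit with a multi-byte character set: utf8 reserves up to 3 bytes per character, so an index over a VARCHAR(300) column needs 900 bytes, exceeding 767 (table and column names below are made up):

# warning 1071 plus a truncated index on 5.5; ERROR 1071 on 5.6
mysql -u root -p -e "
  CREATE TABLE key_length_test (s VARCHAR(300)) CHARACTER SET utf8;
  CREATE INDEX idx_s ON key_length_test (s);  -- 300 * 3 = 900 bytes > 767
"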


Tomcat, web.xml and "[xX][mM][lL]" is not allowed

Problem

So, I had the following issue when trying to run my web application from Eclipse, deploying to a local Tomcat:

The processing instruction target matching "[xX][mM][lL]" is not allowed.

Solution

It turned out that I had an empty line at the beginning of my web.xml! The XML declaration (<?xml ... ?>) must be the very first content of the file; anything before it, even a blank line, triggers this error.
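A quick way to check is to look at the very first line of the file (the path depends on your project layout):

# the XML declaration must be the very first thing in the file
head -1 WEB-INF/web.xml   # expected: <?xml version="1.0" encoding="UTF-8"?>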


Incremental Backups on Ubuntu using Duplicity with Samba

Based on Ubuntu 12.04, a Samba client [1], and Duplicity [2].

  • Mount samba (see [1])
    • install cifs-utils
      sudo apt-get install cifs-utils
    • create the file /home/your_username/Documents/.backup-samba-credentials with the following content:

      username=your_samba_username 
      password=***
    • create mount directory

      mkdir /home/your_username/backup
    • edit /etc/fstab

      # samba - backup - kali
      //samba_server_ip/your_samba_username       /home/your_username/backup     cifs    credentials=/home/your_username/Documents/.backup-samba-credentials,uid=1000,gid=1000,noauto,users     0       0

      You can get the uid and the gid via

      id -u your_username # the uid
      id -g your_username # the gid
    • Now you can mount and unmount:
      mount /home/your_username/backup # mount
      umount /home/your_username/backup # unmount
  • Run the following script, which uses Duplicity [2]; a restore example follows after the list:
    #!/bin/bash
     
    BACKUP_FOLDER=/home/your_username/backup
    SOURCE_FOLDER=/media/data/content
    FULL_BACKUP_INTERVAL=2W   # full backup every two weeks, incremental in between
     
    echo "Mounting backup folder: $BACKUP_FOLDER"
    mount $BACKUP_FOLDER
     
    echo "Sleeping for a few seconds."
    sleep 3
     
    echo "Backing up: $SOURCE_FOLDER"
    duplicity --full-if-older-than $FULL_BACKUP_INTERVAL $SOURCE_FOLDER file://$BACKUP_FOLDER
     
    echo "Sleeping for a few seconds."
    sleep 3
     
    echo "Unmounting backup folder: $BACKUP_FOLDER"
    umount $BACKUP_FOLDER
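
To get files back out, Duplicity's restore command can be used (a sketch; the target directory is an assumption and should not exist yet):

# restore the latest backup into /tmp/restore
mount /home/your_username/backup
duplicity restore file:///home/your_username/backup /tmp/restore
umount /home/your_username/backup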