Cloudera, Spark and MySQL

I am using a Cloudera cluster (CDH-5.4.2-1.cdh5.4.2.p0.2) to run Spark (1.3.0). I wanted to access data from a MySQL database:

val photos = sqlContext.load(
    "jdbc", 
    Map(
        "driver" -> "com.mysql.jdbc.Driver", 
        "url" -> "jdbc:mysql://testserver:3306/test?user=tester&password=testing", 
        "dbtable" -> "photo"))
photos.count

Unfortunately, this does not work right out of the box. The first thing I got was a ClassNotFoundException:

java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
...

Now, the Spark team tells you to add the MySQL driver JAR to your classpath, usually by adding it to compute-classpath.sh on the driver (or bundling it into your application JAR) and doing the same on all workers. This did not work for me.
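For reference, the usual advice boils down to something like the following (the JAR path is an example, not the location on my cluster):

```shell
# Pass the driver JAR when starting the shell, so both the driver
# and the executors can see it (path is an example):
spark-shell --driver-class-path /path/to/mysql-connector-java-5.1.23.jar \
            --jars /path/to/mysql-connector-java-5.1.23.jar
```

On a plain Spark installation this is often enough; on my CDH cluster it was not.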

After a while of trying things I noticed a file called classpath.txt in /etc/spark/conf on the Cloudera master (and only on the master) listing a bunch of JARs. To this file I added the path to my MySQL driver JAR:

/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/jars/mysql-connector-java-5.1.23.jar
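Adding the entry and pushing the JAR out to the workers can be sketched like this (the workers.txt host list is hypothetical; the paths match my cluster above):

```shell
# Path to the MySQL driver JAR (as on my CDH cluster)
JAR=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/jars/mysql-connector-java-5.1.23.jar

# Append the entry to classpath.txt on the master, once only
grep -qxF "$JAR" /etc/spark/conf/classpath.txt || \
    echo "$JAR" | sudo tee -a /etc/spark/conf/classpath.txt

# Copy the JAR to every worker at the same path
# (workers.txt, one hostname per line, is a hypothetical helper file)
while read host; do
    scp "$JAR" "$host:$JAR"
done < workers.txt
```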

Having done this, and having copied the JAR to all workers at the same path, the first exception finally went away. However, the next one was waiting:

java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
        at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:472)
        at org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:229)
        at org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:225)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.<init>(HiveContext.scala:373)
...
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.derby.jdbc.EmbeddedDriver
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
...

This is due to Spark providing a Hive context by default: the sqlContext predefined in the Spark shell is actually a HiveContext, which tries to spin up a Derby-backed metastore. You can either replace the current sqlContext (as I usually do) or create a separate one under a different name:

// shadow the predefined Hive-backed sqlContext with a plain SQLContext ...
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// ... or create a separately named one instead:
// val sqlContext2 = new org.apache.spark.sql.SQLContext(sc)

After these steps I was able to access MySQL via Spark.


2 Responses to “Cloudera, Spark and MySQL”

  1. Martin Becker says:

    Hi Thomas, it's an interesting detail that the original SQLContext is specific to Hive. I added this to the post. Thanks.


  2. Thomas says:

    It is actually not necessary to overwrite the sqlContext, since it's just another variable. The sqlContext predefined in the Spark shell is a connector to Hive. I mostly ignore that one and use another, differently named variable.

