Using S3DistCp to Merge Many Small S3 Files

At some point, many companies face the reality that using S3 as input for their MapReduce or Spark jobs can be terribly slow. One of the primary reasons is having too many small files as input, which may also be uncompressed. If this describes your predicament, you can use S3DistCp to merge and compress your files for better MapReduce/Spark performance and lower storage costs.

Unfortunately, the documentation around using S3DistCp is not great, so I wanted to document some tips here.

There are several ways to find and download S3DistCp; I used s3cmd:

s3cmd get s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar
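
If you haven’t configured s3cmd with your AWS keys yet, do that first; running the following prompts for them interactively:

s3cmd --configure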

S3DistCp relies on many dependencies, which are included in the AWS SDK for Java (aws-java-sdk):

wget http://sdk-for-java.amazonwebservices.com/latest/aws-java-sdk.zip
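
The classpath examples below assume you extracted the zip under /tmp (the SDK version was 1.9.40 at the time; adjust the paths if yours differs):

unzip aws-java-sdk.zip -d /tmp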

The goal is to run a command that looks a little something like the following, which means we need to get our classpath and S3 credentials set up first:

hadoop jar /tmp/s3distcp.jar \
  -D mapred.child.java.opts=-Xmx1024m \
  -D s3DistCp.copyfiles.mapper.numWorkers=1 \
  -libjars ${LIBJARS} \
  --src s3n://data/2015/* \
  --dest hdfs:///data/compressed/ \
  --groupBy '.*(1).*' \
  --targetSize 256 \
  --outputCodec snappy

To configure the classpath, you’ll want to export two environment variables, LIBJARS and HADOOP_CLASSPATH, which include jars from the unzipped aws-java-sdk. Assuming you unzipped it in /tmp as above, setting your classpath would look something like this:

export LIBJARS=/tmp/aws-java-sdk-1.9.40/third-party/jackson-core-2.3.2/jackson-core-2.3.2.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/aspectj-1.6/aspectjweaver.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/aspectj-1.6/aspectjrt.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/freemarker-2.3.18/freemarker-2.3.18.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/joda-time-2.2/joda-time-2.2.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/javax-mail-1.4.6/javax.mail-api-1.4.6.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/jackson-databind-2.3.2/jackson-databind-2.3.2.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/httpcomponents-client-4.3/httpclient-4.3.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/httpcomponents-client-4.3/httpcore-4.3.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/jackson-annotations-2.3.0/jackson-annotations-2.3.0.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/commons-logging-1.1.3/commons-logging-1.1.3.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/spring-3.0/spring-core-3.0.7.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/spring-3.0/spring-context-3.0.7.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/spring-3.0/spring-beans-3.0.7.jar,\
/tmp/aws-java-sdk-1.9.40/third-party/commons-codec-1.6/commons-codec-1.6.jar,\
/tmp/aws-java-sdk-1.9.40/lib/aws-java-sdk-1.9.40.jar
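
If you’d rather not maintain that list by hand, a rough shortcut is to glob every jar in the SDK directory. Note this pulls in more jars than the curated list above, so treat it as a convenience sketch rather than a guarantee:

export LIBJARS=$(find /tmp/aws-java-sdk-1.9.40 -name '*.jar' | paste -sd, -)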

Do the same for HADOOP_CLASSPATH, with one catch: LIBJARS is comma-separated, while HADOOP_CLASSPATH uses colons.
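
Rather than retyping the whole list, you can derive one variable from the other:

export HADOOP_CLASSPATH=$(echo ${LIBJARS} | tr ',' ':')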

Credentials for S3 go in your core-site.xml file. The keys are fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey.
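
For example (the placeholder values are mine; if your paths use the s3n:// scheme, as in the command above, you may also need the equivalent fs.s3n.* keys):

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>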

With the classpath and credentials in place, you should be all set!
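
Once the job finishes, a quick listing of the destination directory (from the example command above) is an easy sanity check that the merged, compressed files landed where you expected:

hadoop fs -ls hdfs:///data/compressed/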

For more information about available S3DistCp options, see: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

For more tips on optimizing Hadoop and Spark jobs against S3 data, check out this great post from the Mortar blog: http://blog.mortardata.com/post/58920122308/s3-hadoop-performance