Thursday, February 18, 2016

Designing MR Jobs to Improve Debugging

When dealing with tons of jobs each day, processing terabytes or petabytes of data in XMLs and other files, designing our MR jobs efficiently becomes crucial to handling Big Data.

I will discuss certain ways that can help us design and develop an efficient MR job.

1. Implement Exception Handling

The most basic practice is to implement exception handling, catching all possible exceptions in order from the most specific to the most generic, and never leaving a catch block empty. Write proper handling code there; it will help in debugging issues. A few helpful System.err.println statements can be added, to be viewed in the YARN logs (How to view sysouts in yarn logs) for debugging.

If you want the job to run successfully despite bad records, then do not throw the exception. Rather, catch it.
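For example, a minimal sketch of such a catch chain inside the map method (processXmlRecord is a hypothetical placeholder for your real parsing code):

try {
    processXmlRecord(value.toString());  // placeholder for real XML handling
} catch (XMLStreamException e) {
    // Most specific first: a malformed XML record.
    System.err.println("Invalid XML at offset " + key + ": " + e.getMessage());
} catch (IOException e) {
    // Then more general I/O failures.
    System.err.println("I/O error while processing record: " + e.getMessage());
} catch (Exception e) {
    // Most generic last; never leave it empty.
    System.err.println("Unexpected error: " + e.getMessage());
    e.printStackTrace(System.err);
}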

2. Use Counters

Consider a real-world scenario where we have thousands of XMLs each day, and due to 1 or 2 invalid XMLs, the complete job fails.
A solution for such a problem is using Counters. They can record how many XMLs failed and on which line number, while the job still continues successfully. The invalid XMLs can be processed later, after correcting them if needed.

Steps:
1. Create an enum in your mapper class (outside the map method)

enum InvalidXML {
     Counter
}

2. Write your XML processing code, and in the catch block:

catch (XMLStreamException e) {
      context.setStatus("Detected Invalid Record in Xml. Check Logs");
      context.getCounter(InvalidXML.Counter).increment(1);
}

We can also print the input file name using
      String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
      context.setStatus("Detected Invalid Record in Xml " + fileName + ". Check Logs");
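Putting steps 1 and 2 together, a minimal self-contained mapper sketch could look like this (parseAndEmit is a hypothetical placeholder for the real parsing logic, and the input/output types are assumptions):

import java.io.IOException;
import javax.xml.stream.XMLStreamException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class XmlMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Incremented once per invalid XML record.
    enum InvalidXML {
        Counter
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            parseAndEmit(value.toString(), context);  // placeholder for real XML handling
        } catch (XMLStreamException e) {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            context.setStatus("Detected Invalid Record in Xml " + fileName + ". Check Logs");
            context.getCounter(InvalidXML.Counter).increment(1);
            System.err.println("Invalid XML in " + fileName + " near offset " + key
                    + ": " + e.getMessage());
        }
    }

    private void parseAndEmit(String xml, Context context)
            throws XMLStreamException, IOException, InterruptedException {
        // ... real XML parsing and context.write(...) calls go here ...
    }
}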

3. When this job runs, it will complete successfully, but in the job logs it will display the status we set against the task, and on clicking the logs we can see the InvalidXML counter along with the built-in job counters.

[Screenshots of the job status and the counter in the logs are not reproduced here.]
Another use of Counters is displaying the total number of files or records processed. For that, the counters need to be incremented in the try block, as in the sketch below.
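For instance (the Processed enum and its counters are illustrative names):

enum Processed {
    Files,
    Records
}

try {
    // ... process the record ...
    context.getCounter(Processed.Records).increment(1);
} catch (XMLStreamException e) {
    context.getCounter(InvalidXML.Counter).increment(1);
}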

Counters display the files and issues for the current run (how many records failed, etc.), but sometimes we need to debug issues from an older date. For that, we can do the following:

3. Use MultiOutputFormat in the catch block to write errors to separate text files, named with the date etc.

The MultiOutputFormat class simplifies writing output data to multiple outputs:

      write(String alias, K key, V value, org.apache.hadoop.mapreduce.TaskInputOutputContext context)

In the catch block, we can write something like below:

catch (SomeException e) {
     MultiOutputFormat.write("XMLErrorLogs" + date, storeId, businessDate + e.getMessage(), context);
}

The name of the text files can be something like ErrorLogs along with the date.
The key can be the ID on which you would like to search in Hive.
The value can be the content you wish to search, along with the error message.
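If that helper class is not available in your Hadoop distribution, the stock org.apache.hadoop.mapreduce.lib.output.MultipleOutputs achieves the same effect. A minimal sketch, assuming Text keys/values and illustrative names (errorlogs, storeId, businessDate, date):

// In the driver, declare the named output once:
//   MultipleOutputs.addNamedOutput(job, "errorlogs",
//       TextOutputFormat.class, Text.class, Text.class);

private MultipleOutputs<Text, Text> mos;

@Override
protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
}

// Inside map(), in the catch block: key = storeId, value = businessDate plus
// the error message, written under a date-based path such as
// errorlogs-2016-02-18/part-m-00000.
catch (XMLStreamException e) {
    mos.write("errorlogs", new Text(storeId),
              new Text(businessDate + "|" + e.getMessage()),
              "errorlogs-" + date + "/part");
}

@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
}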

Once these logs are loaded into Hive, we can easily query to see which files from which stores are giving the most issues, or what errors we got on an earlier business date, in which files and for which stores.
A lot of relevant information can be stored and queried for valuable analysis. This can really help in debugging and supporting a huge Big Data application.

Hope this article will help many.
