So Keys as well as Values in Hadoop seem a little tricky.

  • Values need to implement the Writable interface
  • Keys need to implement the ComparableWritable not just the Writable interface, otherwise a ClassCastException is thrown
  • Each Key needs to specify the Comparable.compare(...) method carefully, as this is used by Hadoop to group Key-Value-Pairs for the Reducer
  • When reading Keys/Values using the Writable interface, Hadoop keeps one instance of the Key/Value class and reads values from the Writable interface for each Key-Value-Pair from the Mapper; so be careful to reset/clear members that are not completely overwritten in the Writable.readFields(DataInput) method (i.e. List fields,  when adding values to that list sequentially)
  • This is obvious, still: the reduce(...) method of a Reducer gets an Iterator<Value>, not an Iterable<Value>. This means the Iterator needs to be reset or multiple iterations will yield no values.

When setting reducers (Job.setNumReduceTasks(int)) it is important to distribute workload evenly to those reducers. Keys are distributed to partitions. Each partition gets assigned to a specific Reducer. The partitioning is done by the Partitioner. The default Partitioner is the HashPartitioner. This Partitioner does a simple modulo distribution by HashCode to each partition.  As a result evenly distributed HashCodes are important. If you cache HashCode, remember to reset the HashCode for each Key in the Writable.readFields(DataInput) method.

VN:F [1.9.17_1161]
Rating: 0.0/10 (0 votes cast)
VN:F [1.9.17_1161]
Rating: 0 (from 0 votes)