When to use side data distribution in Hadoop

This forum is for the Hadoop ecosystem (HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Sqoop2, Avro, Solr, HCatalog, Impala, Oozie, ZooKeeper) and Hadoop distributions such as Cloudera and Hortonworks.

When to use side data distribution in Hadoop

Postby mohit123 » Sun Sep 21, 2014 1:55 am

When should I use side data distribution in Hadoop? When can I use side data distribution via the job configuration, and when should I avoid it?



Re: When to use side data distribution in Hadoop

Postby Guest » Mon Sep 22, 2014 12:46 am

Side data refers to the extra, static, small data that a MapReduce job needs in order to process its main dataset. In other words, it is read-only data required by the job in addition to its primary input. The challenge is to make this side data available to all of the map or reduce tasks in a convenient and efficient way.

The job configuration is used to pass small pieces of metadata to the map/reduce tasks in Hadoop MapReduce. It should not be used once your metadata grows beyond a few kilobytes, because the configuration is read in full by every task (and by the framework daemons), so large values put unnecessary memory pressure on the whole system.
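A minimal sketch of this approach (the property name side.data.country, the filtering logic, and the class names are illustrative assumptions, not from the original post): the driver sets a small value in the Configuration, and each task reads it back in setup().

Code: Select all

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataConfigExample {

    public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String countryFilter;

        @Override
        protected void setup(Context context) {
            // Read the small side-data value back from the job configuration in every task.
            countryFilter = context.getConfiguration().get("side.data.country", "US");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit only the records that match the configured country (illustrative logic).
            if (value.toString().contains(countryFilter)) {
                context.write(new Text(countryFilter), value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Small, read-only metadata stuffed into the job configuration by the driver.
        conf.set("side.data.country", "IN");
        Job job = Job.getInstance(conf, "side data via configuration");
        job.setJarByClass(SideDataConfigExample.class);
        job.setMapperClass(FilterMapper.class);
        // ... set input/output formats and paths as usual, then submit the job.
    }
}

This keeps the side data tiny (one string), which is exactly the case where the job configuration is appropriate.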


Re: When to use side data distribution in Hadoop

Postby Guest » Mon Sep 22, 2014 1:01 am

The extra read-only data needed by a MapReduce job to process the main dataset in its map or reduce tasks is called side data.

There are two ways to make side data available to the map or reduce tasks in Hadoop MapReduce:

a) Distributed cache (a short sketch follows below)
b) Job configuration
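As noted above, the job configuration only suits tiny metadata; for larger read-only data the distributed cache ships whole files to every node. Below is a minimal sketch (the HDFS path /user/hadoop/meta/stations.txt, the tab-separated lookup format, and the class names are assumptions for illustration): the driver registers the file with job.addCacheFile(), and each mapper loads it from its local working directory in setup().

Code: Select all

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataCacheExample {

    public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> lookup = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The cached file is available in the task's working directory under
            // its fragment name ("stations.txt" here); load it once per task.
            try (BufferedReader reader = new BufferedReader(new FileReader("stations.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Join each input record against the in-memory lookup table (illustrative logic).
            String stationId = value.toString().split("\t")[0];
            String stationName = lookup.getOrDefault(stationId, "UNKNOWN");
            context.write(new Text(stationName), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(SideDataCacheExample.class);
        job.setMapperClass(LookupMapper.class);
        // Ship the lookup file to every node; the #stations.txt fragment becomes
        // the local name each task uses to open the file.
        job.addCacheFile(new URI("/user/hadoop/meta/stations.txt#stations.txt"));
        // ... set input/output formats and paths as usual, then submit the job.
    }
}

Because the file is copied to each node once per job rather than being serialized into the configuration, this approach scales to lookup files of many megabytes.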


