How is data integrity maintained in Hadoop HDFS?

This is for the Hadoop ecosystem — HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Sqoop2, Avro, Solr, HCatalog, Impala, Oozie, ZooKeeper — and Hadoop distributions like Cloudera, Hortonworks, etc.
Posts: 81
Joined: Thu Jul 17, 2014 4:58 pm

How is data integrity maintained in Hadoop HDFS?

Postby alpeshviranik » Wed Jul 23, 2014 2:01 am

How does a file remain intact in HDFS (Hadoop Distributed File System)? How does HDFS check that a file has not been modified or corrupted? Are datanodes responsible for verifying the data they receive before storing it?


Re: How is data integrity maintained in Hadoop HDFS?

Postby Guest » Wed Jul 30, 2014 10:59 pm

Hadoop takes care that data is not lost or corrupted during any stage of processing in the framework.
HDFS computes a checksum for all data written and verifies those checksums when the data is read back. A separate checksum is created for every io.bytes.per.checksum bytes of data; the default value of this property is 512 bytes. Each checksum is a 4-byte CRC, so the storage overhead is less than 1%.
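The chunked-checksum scheme above can be sketched in plain Java. This is a hypothetical stand-in, not HDFS's actual code: `BYTES_PER_CHECKSUM` mirrors the 512-byte default of io.bytes.per.checksum, and the JDK's CRC32 class plays the role of HDFS's CRC checksum.

```java
import java.util.zip.CRC32;
import java.util.Arrays;

public class ChunkChecksums {
    // Stand-in for io.bytes.per.checksum (default 512 bytes in HDFS)
    static final int BYTES_PER_CHECKSUM = 512;

    // Compute one 4-byte CRC-32 checksum per 512-byte chunk of data,
    // analogous to what HDFS does for every run of bytes it writes.
    static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int from = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - from);
            CRC32 crc = new CRC32();
            crc.update(data, from, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[100_000];
        Arrays.fill(data, (byte) 42);
        long[] sums = checksums(data);
        // 4 bytes of checksum per 512 bytes of data keeps overhead under 1%
        double overhead = (sums.length * 4.0) / data.length;
        System.out.println("chunks=" + sums.length + " overhead=" + overhead);
    }
}
```

Note how the 4-bytes-per-512-bytes ratio works out to roughly 0.78% overhead, matching the "less than 1%" figure.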
All datanodes are responsible for verifying the checksums of the data they receive before storing it. When clients read data from a datanode, they also verify the checksums. In addition, each datanode periodically runs a DataBlockScanner to verify the blocks it stores. If a corrupt block is found, HDFS fetches a healthy replica of that block and replaces the corrupt one.
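The detection step the block scanner performs can be illustrated with a minimal sketch: recompute each chunk's CRC and compare it against the stored checksum. The class and method names here (`BlockVerifier`, `firstCorruptChunk`) are invented for illustration and are not part of the Hadoop API.

```java
import java.util.zip.CRC32;

public class BlockVerifier {
    static final int BYTES_PER_CHECKSUM = 512;

    static long crcOf(byte[] data, int from, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, from, len);
        return crc.getValue();
    }

    // Recompute each chunk's CRC-32 and compare it with the stored checksum.
    // Returns the index of the first corrupt chunk, or -1 if the block is clean.
    // This mirrors the check a datanode's periodic block scanner performs.
    static int firstCorruptChunk(byte[] data, long[] storedSums) {
        for (int i = 0; i < storedSums.length; i++) {
            int from = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - from);
            if (crcOf(data, from, len) != storedSums[i]) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] block = new byte[2048];            // 4 chunks of 512 bytes
        long[] sums = new long[4];
        for (int i = 0; i < 4; i++) sums[i] = crcOf(block, i * 512, 512);

        System.out.println(firstCorruptChunk(block, sums));   // -1: block intact
        block[600] ^= 1;                          // flip one bit inside chunk 1
        System.out.println(firstCorruptChunk(block, sums));   // 1: corruption detected
    }
}
```

In real HDFS, once a scanner flags a corrupt block, the namenode schedules re-replication from a good replica, as described above.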
Currently, every datanode in the DFS write pipeline verifies the checksum. Since the protocol already includes acks from the datanodes, an ack from the last node in the pipeline could also serve as verification that the checksum is OK; in that sense, only the last datanode really needs to verify it.
You can disable checksum verification by calling setVerifyChecksum(false) on the FileSystem object before you open() the file. Similarly, you can pass the -ignoreCrc option to the -get or -copyToLocal shell commands.
