This is for the Hadoop ecosystem, such as HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Sqoop2, Avro, Solr, HCatalog, Impala, Oozie and ZooKeeper, and for Hadoop distributions such as Cloudera, Hortonworks, etc.
Parquet is an efficient columnar storage format. I would say that both of these formats have their own specific advantages. Parquet might be better if you have highly nested data, because it stores its elements as a tree, as Google Dremel does.
Apache ORC might be better if your file structure is flatter. An ORC file breaks rows into groups (called stripes) and applies columnar compression and indexing within these groups.
Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might be the reason for the better query speed, especially when it comes to sum operations.
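To make the index point a bit more concrete, here is a rough Python sketch of how per-group min/max statistics plus a Bloom filter let a reader skip whole groups of rows. This is only a toy model, not ORC's actual on-disk layout or Hive's code; the names `BloomFilter`, `build_row_groups` and `groups_to_scan`, the group size, and the hash choice are all made up for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_row_groups(values, group_size):
    """Split a column into groups and keep min, max and a Bloom filter for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        bloom = BloomFilter()
        for v in chunk:
            bloom.add(v)
        groups.append({"rows": chunk, "min": min(chunk), "max": max(chunk), "bloom": bloom})
    return groups


def groups_to_scan(groups, wanted):
    """Indexes of the groups that cannot be ruled out for `column = wanted`."""
    return [
        i for i, g in enumerate(groups)
        if g["min"] <= wanted <= g["max"] and g["bloom"].might_contain(wanted)
    ]


# Hypothetical column: the even numbers 0..998, stored in groups of 100 rows.
groups = build_row_groups(list(range(0, 1000, 2)), group_size=100)
print(groups_to_scan(groups, 250))   # only the group covering 200..398 remains
print(groups_to_scan(groups, 5000))  # outside every min/max range, nothing to read
```

The min/max check alone rules out groups whose range does not cover the value; the Bloom filter additionally helps when the value falls inside the range but is not actually present, at the cost of occasional false positives.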
The Parquet default compression is SNAPPY. Do tables A, B, C and D hold the same dataset? If so, something looks off if it only compresses to 1.9 GB.
The Parquet file comes out 62% smaller and the ORC file 78% smaller.
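If those percentages are read as size reductions relative to the uncompressed data (that reading is an assumption), the arithmetic looks like the sketch below. The 10 GB raw size is a made-up figure used only to show the calculation, not a number from this thread, and the helper names are hypothetical.

```python
def percent_smaller(original_size, final_size):
    """Size reduction as a percentage of the original size."""
    return 100.0 * (original_size - final_size) / original_size


def size_after(original_size, percent):
    """Size that remains after shrinking by `percent` percent."""
    return original_size * (1 - percent / 100.0)


raw_gb = 10.0  # hypothetical uncompressed size, for illustration only
parquet_gb = size_after(raw_gb, 62)
orc_gb = size_after(raw_gb, 78)

print(f"Parquet: {parquet_gb:.1f} GB, ORC: {orc_gb:.1f} GB")
print(f"ORC is {percent_smaller(parquet_gb, orc_gb):.0f}% smaller than Parquet")
```

So a 16-point gap in the reduction figures is a much larger relative difference between the two resulting files.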