Parquet vs ORC

This forum covers the Hadoop ecosystem (HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Sqoop2, Avro, Solr, HCatalog, Impala, Oozie, ZooKeeper) and Hadoop distributions such as Cloudera and Hortonworks.

Parquet vs ORC

Post by forum_admin » Sat Sep 24, 2016 8:59 pm

Parquet vs ORC file format for Hive tables


Re: Parquet vs ORC

Post by Guest » Sat Sep 24, 2016 10:16 pm

Parquet is an efficient columnar storage format. I would say that both of these formats have their own specific advantages. Parquet might be better if you have highly nested data, because it stores its elements as a tree, the way Google Dremel does.

Apache ORC might be better if your file structure is flatter. An ORC file breaks rows into row groups and applies columnar compression and indexing within these row groups.
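To make the row-group idea concrete, here is a minimal Python sketch. It is not ORC's actual implementation, only an illustration under simple assumptions: rows are split into fixed-size groups, each column is stored separately inside a group, and a min/max entry per column acts as a lightweight index so a reader can skip groups that cannot match a filter. The names `build_row_groups` and `scan_equals` and the group size of 25 are made up for this example.

```python
def build_row_groups(rows, group_size):
    """Split rows (a list of dicts) into groups, store each column
    separately, and record a per-column (min, max) index entry."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        # Columnar layout: one list of values per column name.
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        # Lightweight index: smallest and largest value of each column.
        index = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "index": index})
    return groups


def scan_equals(groups, column, value):
    """Count rows where column == value.
    Returns (match count, number of groups that had to be read)."""
    matches = 0
    groups_read = 0
    for group in groups:
        low, high = group["index"][column]
        if value < low or value > high:
            continue  # the index shows this group cannot contain the value
        groups_read += 1
        matches += group["columns"][column].count(value)
    return matches, groups_read


# Hypothetical data: 100 rows with an increasing id.
rows = [{"id": i, "amount": i * 10} for i in range(100)]
groups = build_row_groups(rows, group_size=25)
matches, groups_read = scan_equals(groups, "id", 42)
print(matches, groups_read, len(groups))
```

With sorted or clustered data, as in this toy example, most groups are skipped without reading their column values. On randomly ordered data the min/max ranges overlap and the index helps much less.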

Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might be the reason for the better query speed, especially when it comes to sum operations.
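For readers who have not met Bloom filters before, the following is a rough Python sketch of the general technique, not the code ORC or Hive uses. The class name `BloomFilter`, the bit-array size, the number of hash functions and the SHA-256 based hashing are all assumptions chosen for illustration. The point is that the filter can answer "definitely not present" without reading the data, at the cost of occasional false positives.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: a bit array plus several hash positions per value."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False means the value was never added.
        # True means "maybe": it can be a false positive.
        return all(self.bits[pos] for pos in self._positions(value))


# Hypothetical column values stored in one block of a file.
bloom = BloomFilter()
for customer_id in ["c-17", "c-42", "c-99"]:
    bloom.add(customer_id)

if bloom.might_contain("c-42"):
    print("block may contain c-42, so it has to be read")
```

A reader would check the filter first for an equality predicate and skip the block entirely whenever `might_contain` returns False.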

The Parquet default compression is SNAPPY. Are tables A, B, C and D holding the same dataset? If yes, it looks like there is something shady about it when it only compresses to 1.9 GB.

The Parquet file comes out about 62% smaller and the ORC file about 78% smaller.
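As a quick sanity check on what those percentages mean, here is a small Python calculation. It assumes both figures are reductions relative to the same uncompressed input, and the 10 GB starting size is purely hypothetical, not a number from this thread.

```python
def size_after_reduction(original_size, percent_smaller):
    """Size that remains after shrinking original_size by percent_smaller."""
    return original_size * (1 - percent_smaller / 100)


original_gb = 10.0  # hypothetical raw table size, for illustration only
parquet_gb = size_after_reduction(original_gb, 62)
orc_gb = size_after_reduction(original_gb, 78)

print(f"Parquet: {parquet_gb:.1f} GB, ORC: {orc_gb:.1f} GB")
print(f"ORC is {100 * (1 - orc_gb / parquet_gb):.0f}% smaller than Parquet")
```

Under these assumptions the gap between the two formats is larger than the 62% and 78% figures suggest at first glance, because the remaining sizes (38% and 22% of the input) are what actually get stored and scanned.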
