Hive performance tuning

This forum is for the Hadoop ecosystem, like HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Sqoop2, Avro, Solr, HCatalog, Impala, Oozie, ZooKeeper, and Hadoop distributions like Cloudera, Hortonworks, etc.

Hive performance tuning

Postby forum_admin » Sat Sep 24, 2016 9:03 pm

Hive performance tuning example with detailed steps



Re: Hive performance tuning

Postby Guest » Sat Sep 24, 2016 10:09 pm

These configuration parameters must be set appropriately to turn on transaction support in Hive:
hive.support.concurrency – true
hive.enforce.bucketing – true (Not required as of Hive 2.0)
hive.exec.dynamic.partition.mode – nonstrict
hive.txn.manager – org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on – true (for exactly one instance of the Thrift metastore service)
hive.compactor.worker.threads – a positive number on at least one instance of the Thrift metastore service
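
A minimal sketch of applying these, assuming a Hive CLI/Beeline session. The session-level SET commands cover the client side; the compactor settings belong in hive-site.xml on exactly one metastore instance, shown here as comments for completeness:

set hive.support.concurrency=true;
set hive.enforce.bucketing=true;  -- not required as of Hive 2.0
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- in hive-site.xml on exactly one metastore instance, not per session:
-- hive.compactor.initiator.on=true
-- hive.compactor.worker.threads=1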


For big table to big table joins and big table to small table joins: 1) convert the join to a map join where possible, 2) optimize GROUP BY and ORDER BY operations (a sketch follows below).
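
A minimal sketch of the GROUP BY / ORDER BY side, using Hive's standard knobs: hive.map.aggr enables map-side partial aggregation, and hive.groupby.skewindata adds an extra stage that spreads a skewed key across reducers. For ordering, SORT BY plus DISTRIBUTE BY avoids the single-reducer total sort of ORDER BY when per-reducer ordering is enough. Table and column names below are hypothetical:

set hive.map.aggr=true;            -- partial aggregation in the mapper
set hive.groupby.skewindata=true;  -- two-stage plan for skewed GROUP BY keys

select product_id, count(*) from sales group by product_id;
-- per-reducer ordering instead of a single-reducer ORDER BY:
select * from sales distribute by product_id sort by product_id;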


Re: Hive performance tuning

Postby Guest » Sat Sep 24, 2016 10:12 pm

1) Partition and bucket joins, and the sort-merge bucket (SMB) join; a sketch follows below. Related settings:
set hive.execution.engine=tez;
mapreduce.job.reduces
mapreduce.input.fileinputformat.split.maxsize
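
A minimal sketch of an SMB join, assuming both tables are bucketed and sorted on the join key with the same bucket count (table and column names are hypothetical):

-- both tables clustered and sorted on the join key, same bucket count:
create table orders (id int, customer_id int)
clustered by (customer_id) sorted by (customer_id) into 32 buckets;
create table customers (customer_id int, name string)
clustered by (customer_id) sorted by (customer_id) into 32 buckets;

set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

select o.*, c.name from orders o join customers c on o.customer_id = c.customer_id;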

2) The number of reducers can decide job performance.
set mapreduce.job.reduces shows how many reducers Hive will use. Its default is -1, meaning Hive decides the reducer count itself. But the automatic estimate has limitations: it looks only at the size of the current input dataset and can pick a poor number of reducers (a sketch follows below).
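
A minimal sketch of steering the reducer count with Hive's real knobs (the values are illustrative, not recommendations):

-- let Hive estimate from input size, but tune the estimate:
set hive.exec.reducers.bytes.per.reducer=268435456;  -- ~256 MB of input per reducer
set hive.exec.reducers.max=128;                      -- cap on the estimate
-- or override the estimate entirely for this query:
set mapreduce.job.reduces=32;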

3) The too-many-small-files issue can be solved by merging them. Many small files create overhead by launching one map task per file, and they also put memory pressure on the NameNode, which tracks every file and block.
Solve it by creating a staging table and inserting the data (or each partition) with a select from the staging table; a sketch follows below.
Also set hive.input.format to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat (the default in recent versions) so small files are combined into fewer splits.
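
A minimal sketch of the staging-table compaction, with hypothetical table names; the hive.merge.* settings make Hive add a merge step that coalesces small output files:

set hive.merge.mapfiles=true;                 -- merge outputs of map-only jobs
set hive.merge.mapredfiles=true;              -- merge outputs of map-reduce jobs
set hive.merge.smallfiles.avgsize=134217728;  -- trigger merge when avg file < ~128 MB
set hive.merge.size.per.task=268435456;       -- target merged file size ~256 MB

-- rewrite the small files into compacted ones:
insert overwrite table events select * from events_stage;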

4) Compression: hadoop checknative -a lists the compression libraries available on the node. The classic (pre-YARN) property names are:
mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec (deflate)
mapred.output.compression.type=RECORD or BLOCK (for SequenceFile output)
mapred.compress.map.output=true
mapred.map.output.compression.codec=<compression codec class>
hive.exec.compress.output=true
A sketch with the current mapreduce.* property names follows below.
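
A minimal sketch of a common setup, assuming the Snappy native library is present (verify with hadoop checknative -a): Snappy for intermediate map output, where speed matters and splittability does not, plus compressed final output:

set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;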


5) Partitioning: choose partition granularity carefully. Avoid partitions that are too coarse (yearly partitions when queries fetch one day of data, so every query scans a whole year) and partitions that are too fine (daily partitions over 5 years of data means ~1800 partitions, which burdens the metastore). A sketch follows below.
It is better to partition all the larger tables on the same key, so joins can prune the same partitions on both sides.
It is often said that compression creates CPU-bound overhead, but that is not true in all cases; with fast codecs the I/O saved usually outweighs the CPU spent.
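
A minimal sketch of a daily-partitioned load via dynamic partitioning (table and column names are hypothetical):

create table sales_part (id int, amount double)
partitioned by (sale_date string);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table sales_part partition (sale_date)
select id, amount, sale_date from sales_stage;

-- queries filtering on the partition key only read the matching directories:
select sum(amount) from sales_part where sale_date = '2016-09-24';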

6) Map-side join: when one table is small and the other is large, set the parameters below.
hive.auto.convert.join=true
hive.mapjoin.smalltable.filesize=400000000 (~400 MB; if a table is smaller than this, Hive uses a map-side join; the default is 25000000, i.e. 25 MB)

The conversion can also be forced with a query hint:
SELECT /*+ MAPJOIN(user) */ product.*, user.* FROM product JOIN user ON product.a = user.a;


