Hive SQL 优化案例详解

HiveSQL 经典优化案例一：
1.1 将要执行的查询(执行了 1 个多小时才出结果)：


SELECT dt as DATA_DATE,STRATEGY,AB_GROUP,SOURCE,
    count(distinct case when lower(event) not like '%push%' and event!='corner_mark_show' then udid else null end) as DAU,
    count(case when event='client_show' then 1 else null end) as TOTAL_VSHOW,
    count(distinct case when event='client_show' then vid else null end) as TOTAL_VIDEO_VSHOW,
    count(case when event='video_play' then 1 else null end) as TOTAL_VV_VP,
    count(distinct case when event='video_play' then udid else null end) as TOTAL_USERS_VP,
    count(case when event='effective_play' then 1 else null end) as TOTAL_VV_EP,
    count(distinct case when event='effective_play' then udid else null end) as TOTAL_USERS_EP,
    sum(case when event='video_over' then duration else 0 end) as TOTAL_DURATION,
    count(case when event='video_over' then 1 else null end) as TOTAL_VOVER,
    sum(case when event='video_over' then play_cnts else 0 end) as TOTAL_VOVER_PCNTS,
    count(case when event='push_video_clk' then 1 else null end) as TOTAL_PUSH_VC,
    count(distinct case when event='app_start' and body_source = 'push' then udid else null end) as TOTAL_PUSH_START,
    count(case when event='post_comment' then 1 else null end) as TOTAL_REPLY,
    count(distinct case when event='post_comment' then udid else null end) as TOTAL_USERS_REPLY
    FROM dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl
group by dt,strategy,ab_group,source;复制代码

1.2 查询语句涉及到的表有 7.7 亿 + 数据。（查询如下)


jdbc:hive2://ks-hdp-master-01.dns.rightpad (default)> select count(*) from dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl;
复制代码

1.3 优化思路：既然将要执行的查询是按照 dt, strategy, ab_group, source 这 4 个字段分组，那么在建表的时候，就按这四个字段中的 N 个（1 或 2 或 3 或 4）个字段组合分区，直接让 count(distinct xx) 之类的查询定位到“更少的数据子集”，其执行效率就应该更高了（不需要每个子任务均从 7.7 亿 + 的数据中(去重)统计)。

1.4 先看每个字段将会有多少分区（因为 Hive 表分区也不宜过多，一般一个查询语句涉及到的 hive 分区应该控制在 2K 内)


jdbc:hive2://ks-hdp-master-01.dns.rightpad (default)> 
select count(distinct dt) as dis_dt, count(distinct strategy) as dis_strategy, 
count(distinct ab_group) as dis_ab_group, 
count(distinct source) as dis_source
from dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl;复制代码


[hue@ks-hdp-client-v02 10:55:08 /usr/local/hue]$ python
Python 2.7.12 (default, Dec  4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 2*14*72
2016
-- 2016 个分区还可以接受。复制代码

1.5 根据原表，新建分区表，并将原表数据插入新表：


show create table dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl;
复制代码


jdbc:hive2://ks-hdp-master-01.dns.rightpad (default)> show create table dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl;
复制代码

创建新表：按 dt,source,stragegy,ab_group 分区(注意先后顺序，一般习惯分区数越少的越靠前，根据 1.5 的查询可知：dt=1,source=2,strategy=14,ab_group=72)


create external table `dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_lym`(
  event string,
  udid string,
  vid string,
  duration string,
  body_source string,
  play_cnts string
)PARTITIONED BY (
  dt string,
  source string,
  strategy string,
  ab_group string
);复制代码

将原表数据插入新表：


insert into `dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_lym` partition(dt,source,strategy,ab_group)
select event,udid,vid,duration,body_source,play_cnts,dt,source,strategy,ab_group
from `dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl`;复制代码

核对两表的数据是否一致：

1.6 基于新表执行查询(执行 5 分钟出结果)：

HiveSQL 经典优化案例二：

问题描述：一个复杂的 SQL，查询执行一段时间后报错：基本上是查不出来;

分析函数对于大表来说不是 hive 的强项，这个时候我们将其分解成很多子集，并且合理利用 hive 分区表的优势，然后去 join 。

2.1 将要执行的查询


create table bi_tmp.aloha_UserLoyalty_190301_190303 as 
    select aid, imei, idfa, udid, event, duration, dt, time_local, hour, source, 
        first_value(time_local) over(partition by udid, event order by time_local) as first_time,
        last_value(time_local) over(partition by udid, event order by time_local) as last_time,
        count(time_local) over(partition by udid, event, dt) as event_count_per_day,
        sum(duration) over(partition by udid, event, dt) as event_duration_each_day
    from dwb_v8sp.event_column_info_new_hour
    where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment', 
        'comment_success')
        and dt >= '2019-03-01' and dt <= '2019-03-03';
select count(*) 
from dwb_v8sp.event_column_info_new_hour
where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment', 'comment_success')
and dt >= '2019-03-01' and dt <= '2019-03-03';
复制代码


select count(distinct event) as dis_event
from dwb_v8sp.event_column_info_new_hour
where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment', 'comment_success')
and dt >= '2019-03-01' and dt <= '2019-03-03';复制代码

分解成三个子集，并保存到三张表： bi_tmp.zyt1, bi_tmp.zyt2, bi_tmp.zyt3


-- drop table if exists bi_tmp.zyt1;
create table bi_tmp.zyt1 partitioned by(event)
as
select udid, 
       min(time_local) as first_time,
       max(time_local) as last_time,
       event
from dwb_v8sp.event_column_info_new_hour
where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment', 'comment_success')
and dt >= '2019-03-01' and dt <= '2019-03-03'
group by udid, event;
-- drop table if exists bi_tmp.zyt2 purge;
create table bi_tmp.zyt2 partitioned by(dt,event)
as
select udid, 
       count(time_local) as event_count_per_day,
       sum(duration) as event_duration_each_day,
       dt,       event
from dwb_v8sp.event_column_info_new_hour
where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment', 'comment_success')
and dt >= '2019-03-01' and dt <= '2019-03-03'
group by udid, dt, event;
create table bi_tmp.zyt3 partitioned by(dt,event)
as select aid, imei, idfa, udid, duration, time_local, hour, source, dt, event
from dwb_v8sp.event_column_info_new_hour t3
    where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment', 
        'comment_success')
        and dt >= '2019-03-01' and dt <= '2019-03-03';
-- 插入目标表：
create table bi_tmp.aloha_UserLoyalty_190301_190303 as 
    select t3.aid, t3.imei, t3.idfa, t3.udid, t3.event, t3.duration, t3.dt, t3.time_local, t3.hour, t3.source, 
        t1.first_time,        t1.last_time,        t2.event_count_per_day,        t2.event_duration_each_day    from bi_tmp.zyt1 t1 join bi_tmp.zyt2 t2 on t1.event=t2.event and t1.udid=t2.udid
    join bi_tmp.zyt3 t3 on t2.dt=t3.dt and t2.event= t3.event and t2.udid=t3.udid;
-- 验证数据：（与上面的查询记录行数对的上)复制代码

HiveSQL 经典优化案例三：

如下 SQL，用到了 PERCENTILE_APPROX 函数，问题描述：如下 SQL，用到了 PERCENTILE_APPROX 函数，个人初步分析认为：由于用到该函数的次数太多，导致性能严重下降。

我仔细查了一下该函数，发现：它是支持“数组传参”的，那么就不难找到优化该 SQL 的方法了。

3.1 原 SQL 性能测试：

3.2 优化后的 SQL，性能测试：

优化后的 SQL，性能提升了 4 倍多。

【bigdata】4.hive 安装

hive的全部安装过程都是在master节点安装 hive 1.上传并解压 tar -zxvf apache-hive-1.2.2-bin.tar.gz -C /hive安装目录 2.配置环境 2.1 配置 hive-env.sh # 跳转到hive配置文件目录 cd /hive安装目录/conf # 修改名称 mv ..

【CDH6】安装 Hive

安装 Hive 选择集群，添加 Hive 服务[图片][图片] 添加服务向导选择依赖（只有一项可供选择时则默认跳过），点击“继续”，选择默认角色配置即可：[图片] 点击“继续”之后，需要配置 Hive 依赖的 mysql 数据库，需要在 cm-s1 节点上连接 mysql，执行创建数据库及分配权限语句： [root@ ..

(运维篇)- 使用 docker 搭建 hadoop-hive-spark 集群 (一)

[图片] 一、安装 docker 1.1 使用 brew cask 安装由于是 Mac 系统，直接使用 brew 最为方便 brew cask install docker 1.2 镜像替换为加速镜像 [图片] 如上图得两个镜像地址，填入后，点击 apply&restart 重启 docker https:/ ..

欢迎来到这里！

我们正在构建一个小众社区，大家在这里相互信任，以平等 • 自由 • 奔放的价值观进行分享交流。最终，希望大家能够找到与自己志同道合的伙伴，共同成长。

关于

Hive SQL 优化案例详解

相关帖子

【bigdata】4.hive 安装

【CDH6】安装 Hive

Hive 配置高可用 hiveserver2

CentOS7 安装 Hive 2.3.4 （远程 Metastore Server 模式）

CentOS7 安装 Hive 2.3.4 （远程 Metastore 模式）

CentOS7 安装 Hive-0.13.0

(运维篇)- 使用 docker 搭建 hadoop-hive-spark 集群 (一)

欢迎来到这里！

近期热议

推荐标签标签

最新标签

Hive SQL 优化案例详解

相关帖子

【bigdata】4.hive 安装

【CDH6】安装 Hive

Hive 配置高可用 hiveserver2

CentOS7 安装 Hive 2.3.4 （远程 Metastore Server 模式）

CentOS7 安装 Hive 2.3.4 （远程 Metastore 模式）

CentOS7 安装 Hive-0.13.0

(运维篇)- 使用 docker 搭建 hadoop-hive-spark 集群 (一)

欢迎来到这里！

近期热议

推荐标签 标签

最新标签

推荐标签标签