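Slow rename phase when Spark SQL writes to Hive
Overview
Hive version: 1.2.1; Spark version: 2.3.0
Deduplicating 200 million rows took the Spark job 12.5 h: 4 h for the deduplication itself, 2.5 h in which it was unclear what Spark was doing (no logs on the driver or on the executors), and 6 h for the rename operation.
Part of the rename log: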
2019-09-19 22:34:22,097 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00002-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00002-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,111 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00003-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00003-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,128 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00004-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00004-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,143 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00005-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00005-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,160 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00006-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00006-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,175 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00007-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00007-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,192 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00008-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00008-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,207 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00009-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00009-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,223 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00010-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00010-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,238 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00011-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00011-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,253 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00012-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00012-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,267 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00013-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00013-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,281 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00014-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00014-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,296 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00015-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00015-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,315 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00016-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00016-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,331 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00017-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00017-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,345 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00018-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00018-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
2019-09-19 22:34:22,361 [Driver] INFO hive.ql.metadata.Hive - Renaming src: hdfs://cluster/apps/hive/warehouse/partitioned/.hive-staging_hive_2019-09-19_19-06-31_561_3933642985231072924-1/-ext-10000/date=2018-02-24/part-00019-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, dest: hdfs://cluster/apps/hive/warehouse/partitioned/date=2018-02-24/part-00019-50d4ca62-4853-46d4-a1c9-2c15544290f4.c000, Status:true
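The timestamps in this excerpt give a rough per-file cost. The sketch below is only an estimate: it uses the 18 timestamps shown above and assumes that their rate is representative of the whole 6 h rename phase, which the log excerpt alone cannot confirm. The helper names `mean_gap_ms` and `renames_in` are made up for illustration.

```python
from datetime import datetime

# Millisecond parts of the 18 log timestamps above (all within 22:34:22).
PREFIX = "2019-09-19 22:34:22,"
MILLIS = ["097", "111", "128", "143", "160", "175", "192", "207", "223",
          "238", "253", "267", "281", "296", "315", "331", "345", "361"]
STAMPS = [PREFIX + ms for ms in MILLIS]


def mean_gap_ms(stamps):
    """Average time between consecutive log lines, in milliseconds."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f") for s in stamps]
    gaps = [(b - a).total_seconds() * 1000 for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


def renames_in(hours, gap_ms):
    """How many sequential renames fit into `hours` at `gap_ms` each."""
    return int(hours * 3600 * 1000 / gap_ms)


gap = mean_gap_ms(STAMPS)
estimate = renames_in(6, gap)
print(f"mean gap: {gap:.1f} ms, renames in 6 h: about {estimate:,}")
```

Each rename takes roughly 15.5 ms here, so a 6 h phase at this rate would correspond to about 1.4 million files renamed one after another on the driver. The individual call is cheap; the cost comes from the number of files and the strictly sequential loop.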
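How Spark SQL runs a Hive SQL job
- It first creates a temporary directory whose name starts with .hive-staging (from version 1.21 onward the default location is the target table's directory) and writes the results there.
- When execution finishes, the contents of the temporary directory are moved into the target table's directory.
The code shows two strategies. If the source and destination directories share the same root directory, every file under the source directory is processed individually with a copy operation. Otherwise a single rename is performed, which only touches NameNode metadata and moves no data.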
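The difference between the two strategies can be modeled on a local filesystem. This is a toy sketch, not Hive's source code: `is_subdir` and `move_results` are hypothetical names, `os.rename` stands in for the HDFS call, and the per-file step is modeled as a move because the log lines report "Renaming". It only counts how many filesystem calls each branch needs.

```python
import os
import tempfile
from pathlib import Path


def is_subdir(src: Path, dest: Path) -> bool:
    """True when src lives somewhere underneath dest."""
    return dest.resolve() in src.resolve().parents


def move_results(src: Path, dest: Path) -> int:
    """Move staged output into dest and return the number of move calls."""
    if is_subdir(src, dest):
        # Staging directory is inside the table directory:
        # one call per file, executed one after another.
        calls = 0
        for f in sorted(src.iterdir()):
            if f.is_file():
                os.rename(f, dest / f.name)
                calls += 1
        return calls
    # Staging directory is elsewhere: one directory-level rename.
    # This toy version assumes dest does not exist yet.
    os.rename(src, dest)
    return 1


def make_staging(path: Path, n_files: int) -> Path:
    """Create a staging directory holding n_files small part files."""
    path.mkdir(parents=True)
    for i in range(n_files):
        (path / f"part-{i:05d}").write_text("x")
    return path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Case 1: staging under the table directory.
    table = root / "table"
    inside_calls = move_results(make_staging(table / ".hive-staging", 5), table)
    # Case 2: staging outside the table directory.
    outside_calls = move_results(
        make_staging(root / "tmp" / ".hive-staging", 5), root / "table2")

print(inside_calls, outside_calls)
```

With five files, the first case needs five calls and the second needs one. The first count grows with the number of output files, which is why a job that writes very many small files spends so long in this phase.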
Solution
Change the following parameters in hive-site.xml:
<property>
<name>hive.exec.stagingdir</name>
<value>/tmp/hive/.hive-staging</value>
<description>Location of the temporary staging directory created by Hive jobs</description>
</property>
<property>
<name>hive.insert.into.multilevel.dirs</name>
<value>true</value>
<description>When hive.insert.into.multilevel.dirs is false, the parent directory of the insert target must already exist; when true, it may be missing</description>
</property>
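A quick way to sanity-check such a setting is to parse the file and see whether the staging directory would still land under the table location. The sketch below is an assumption-laden check, not a Hive tool: it wraps the two properties in a `<configuration>` root, takes the table location from the log paths above, and assumes that a relative `hive.exec.stagingdir` resolves under the table directory while an absolute one is used as given. `load_properties` and `staging_inside_table` are hypothetical helper names.

```python
import posixpath
import xml.etree.ElementTree as ET

# The two properties from above, wrapped in a root element for parsing.
HIVE_SITE_FRAGMENT = """
<configuration>
  <property>
    <name>hive.exec.stagingdir</name>
    <value>/tmp/hive/.hive-staging</value>
  </property>
  <property>
    <name>hive.insert.into.multilevel.dirs</name>
    <value>true</value>
  </property>
</configuration>
"""

# Table location taken from the paths in the rename log.
TABLE_LOCATION = "/apps/hive/warehouse/partitioned"


def load_properties(xml_text):
    """Return {name: value} for every <property> element."""
    root = ET.fromstring(xml_text.strip())
    return {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")}


def staging_inside_table(staging_dir, table_location):
    """True if the staging dir resolves to a path under the table location.

    Assumption: a relative staging dir is joined onto the table location,
    an absolute one replaces it (this is how posixpath.join behaves).
    """
    table = posixpath.normpath(table_location)
    resolved = posixpath.normpath(posixpath.join(table, staging_dir))
    return resolved == table or resolved.startswith(table + "/")


props = load_properties(HIVE_SITE_FRAGMENT)
new_inside = staging_inside_table(props["hive.exec.stagingdir"], TABLE_LOCATION)
default_inside = staging_inside_table(".hive-staging", TABLE_LOCATION)
print(new_inside, default_inside)
```

Under these assumptions the default `.hive-staging` resolves inside the table directory, while `/tmp/hive/.hive-staging` does not, so the final move can take the single-rename path.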