An Introduction to the Apache ORC Format and Reading/Writing It with the Tools

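Background

In the earlier post on learning columnar storage formats, we covered the basics of the various columnar storage formats.

Of those formats, ORC had the lowest storage-space cost.

This post takes a closer look at the ORC format specifically.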

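Features

The ORC format was introduced in January 2013 with the goal of improving the compute performance and storage efficiency of Hadoop Hive.

As a storage format, it is a self-describing columnar format, designed and optimized specifically for large streaming reads. The advantage of a self-describing columnar format is that a different compression algorithm can be chosen for each column type.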
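To make the "different algorithm per column type" idea concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not ORC's actual encodings: the two encoders (`run_length_encode`, `dictionary_encode`), the `ENCODERS` table, and the sample rows are all hypothetical names and data chosen for illustration.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def dictionary_encode(values):
    """Replace each string by its index in a sorted list of distinct values."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


# Hypothetical mapping from column type to encoder; real ORC writers choose
# among their own encodings, this only illustrates the principle.
ENCODERS = {"int": run_length_encode, "string": dictionary_encode}


def encode_columns(schema, rows):
    """Pivot row tuples into columns, then encode each column by its type."""
    encoded = {}
    for position, (name, type_name) in enumerate(schema):
        column = [row[position] for row in rows]
        encoded[name] = ENCODERS[type_name](column)
    return encoded


schema = [("int1", "int"), ("string1", "string")]
rows = [(65536, "hi"), (65536, "bye"), (65536, "hi")]
encoded = encode_columns(schema, rows)
print(encoded["int1"])     # [(65536, 3)]
print(encoded["string1"])  # (['bye', 'hi'], [1, 0, 1])
```

Because the schema travels with the data, the reader can pick the matching decoder for each column without any outside information, which is what "self-describing" buys here.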

The main features are:

Typical use cases

The user cases listed on the official site are fairly old, but they are still a useful reference.

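Usage examples

The official site provides some quick-start usage examples.

It also provides a number of tools.

This post focuses only on the last item, the Java example.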

Installing on macOS

The C++ version is used as the example here.

Download

wget http://mirror.bit.edu.cn/apache/orc/orc-1.5.1/orc-1.5.1.tar.gz

Build

% mkdir build
% cd build
% cmake ..
% make package test-out

Result

CPack: Create package
CPack: - package: /Users/note/orc/orc-1.5.1/build/ORC-1.5.1-Darwin.tar.gz generated.
Test project /Users/note/orc/orc-1.5.1/build
    Start 1: orc-test
1/3 Test #1: orc-test ......................... Passed 1.62 sec

Using orc-contents

Displays the contents of an ORC file in JSON format.

Syntax

% orc-contents [--columns=1,2,...] <filename>

Example

$ orc-contents examples/TestOrcFile.test1.orc
{"boolean1": false, "byte1": 1, "short1": 1024, "int1": 65536, "long1": 9223372036854775807, "float1": 1, "double1": -15, "bytes1": [0, 1, 2, 3, 4], "string1": "hi", "middle": {"list": [{"int1": 1, "string1": "bye"}, {"int1": 2, "string1": "sigh"}]}, "list": [{"int1": 3, "string1": "good"}, {"int1": 4, "string1": "bad"}], "map": []}
{"boolean1": true, "byte1": 100, "short1": 2048, "int1": 65536, "long1": 9223372036854775807, "float1": 2, "double1": -5, "bytes1": [], "string1": "bye", "middle": {"list": [{"int1": 1, "string1": "bye"}, {"int1": 2, "string1": "sigh"}]}, "list": [{"int1": 100000000, "string1": "cat"}, {"int1": -100000, "string1": "in"}, {"int1": 1234, "string1": "hat"}], "map": [{"key": "chani", "value": {"int1": 5, "string1": "chani"}}, {"key": "mauddib", "value": {"int1": 1, "string1": "mauddib"}}]}
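Since each row comes out as one JSON object, the output is easy to post-process. The Python sketch below assumes the tool prints exactly one object per line; the sample text is the two rows above trimmed to a few fields, and `parse_rows` is a hypothetical helper name.

```python
import json

# The two rows from the example above, trimmed to a few fields for brevity.
sample_output = """\
{"boolean1": false, "byte1": 1, "string1": "hi", "map": []}
{"boolean1": true, "byte1": 100, "string1": "bye", "map": [{"key": "chani", "value": {"int1": 5, "string1": "chani"}}]}
"""


def parse_rows(text):
    """Parse one JSON object per non-empty line into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


rows = parse_rows(sample_output)

# Pull a single column out of the row-oriented output.
string1_values = [row["string1"] for row in rows]
# Map columns are rendered as a list of {"key": ..., "value": ...} entries.
map_keys = [entry["key"] for row in rows for entry in row["map"]]

print(string1_values)  # ['hi', 'bye']
print(map_keys)        # ['chani']
```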

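Using orc-metadata

Displays the metadata of an ORC file in JSON format.

-v (verbose) also shows the ORC file layout information
--raw prints the protocol buffers content directly instead of interpreting it

Syntax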

% orc-metadata [-v] [--raw] <filename>

Example

$ orc-metadata examples/TestOrcFile.test1.orc
{ "name": "examples/TestOrcFile.test1.orc",
  "type": "struct<boolean1:boolean,byte1:tinyint,short1:smallint,int1:int,long1:bigint,float1:float,double1:double,bytes1:binary,string1:string,middle:struct<list:array<struct<int1:int,string1:string>>>,list:array<struct<int1:int,string1:string>>,map:map<string,struct<int1:int,string1:string>>>",
  "rows": 2,
  "stripe count": 1,
  "format": "0.12", "writer version": "HIVE-8732",
  "compression": "zlib", "compression block": 10000,
  "file length": 1711,
  "content": 1015, "stripe stats": 250, "footer": 421, "postscript": 24,
  "row index stride": 10000,
  "user metadata": {
  },
  "stripes": [
    { "stripe": 0, "rows": 2,
      "offset": 3, "length": 1012,
      "index": 570, "data": 243, "footer": 199
    }
  ]
}
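The size fields in this output can be cross-checked against the ORC file layout. The Python sketch below uses only the numbers from the example; it relies on two facts from the ORC specification as I understand it: a file begins with the 3-byte magic "ORC", and its very last byte stores the postscript length. The constant names are mine, not the tool's.

```python
import json

# Size-related fields copied from the orc-metadata example above.
metadata = json.loads("""
{
  "file length": 1711,
  "content": 1015, "stripe stats": 250, "footer": 421, "postscript": 24,
  "stripes": [
    {"stripe": 0, "rows": 2, "offset": 3, "length": 1012,
     "index": 570, "data": 243, "footer": 199}
  ]
}
""")

HEADER_BYTES = len(b"ORC")    # magic bytes at the start of the file
POSTSCRIPT_LENGTH_BYTE = 1    # final byte holding the postscript size

stripe = metadata["stripes"][0]

# A stripe is made of its index, data, and footer sections.
stripe_sections = stripe["index"] + stripe["data"] + stripe["footer"]

# The only stripe starts right after the header and ends where "content" ends.
stripe_end = stripe["offset"] + stripe["length"]

# Everything after the content: stripe statistics, file footer, postscript,
# plus the one-byte postscript length.
tail = (metadata["stripe stats"] + metadata["footer"]
        + metadata["postscript"] + POSTSCRIPT_LENGTH_BYTE)

print(stripe_sections)            # 1012
print(stripe_end)                 # 1015
print(metadata["content"] + tail) # 1711
```

So the 1711 bytes break down as 3 (header) + 1012 (stripe) + 250 + 421 + 24 + 1.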

The orc-statistics tool

Displays file-level and stripe-level column statistics.

--withIndex can be set to also include the column statistics of each row group.

Syntax

% orc-statistics [--withIndex] <filename>

Example

$ orc-statistics examples/TestOrcFile.test1.orc
File examples/TestOrcFile.test1.orc has 24 columns
*** Column 0 ***
Column has 2 values and has null value: no
*** Column 1 ***
Data type: Boolean
Values: 2
Has null: no
(true: 1; false: 1)
*** Column 2 ***
Data type: Integer
Values: 2
Has null: no
Minimum: 1
Maximum: 100
Sum: 101
*** Column 3 ***
Data type: Integer
Values: 2
Has null: no
Minimum: 1024
Maximum: 2048
Sum: 3072
*** Column 4 ***
Data type: Integer
Values: 2
Has null: no
Minimum: 65536
Maximum: 65536
Sum: 131072
*** Column 5 ***
Data type: Integer
Values: 2
Has null: no
Minimum: 9223372036854775807
Maximum: 9223372036854775807
Sum: not defined
*** Column 6 ***
Data type: Double
Values: 2
Has null: no
Minimum: 1
Maximum: 2
Sum: 3
*** Column 7 ***
Data type: Double
Values: 2
Has null: no
Minimum: -15
Maximum: -5
Sum: -20
*** Column 8 ***
Data type: Binary
Values: 2
Has null: no
Total length: 5
*** Column 9 ***
Data type: String
Values: 2
Has null: no
Minimum: bye
Maximum: hi
Total length: 5
*** Column 10 ***
Column has 2 values and has null value: no
*** Column 11 ***
Column has 2 values and has null value: no
*** Column 12 ***
Column has 4 values and has null value: no
*** Column 13 ***
Data type: Integer
Values: 4
Has null: no
Minimum: 1
Maximum: 2
Sum: 6
*** Column 14 ***
Data type: String
Values: 4
Has null: no
Minimum: bye
Maximum: sigh
Total length: 14
*** Column 15 ***
Column has 2 values and has null value: no
*** Column 16 ***
Column has 5 values and has null value: no
*** Column 17 ***
Data type: Integer
Values: 5
Has null: no
Minimum: -100000
Maximum: 100000000
Sum: 99901241
*** Column 18 ***
Data type: String
Values: 5
Has null: no
Minimum: bad
Maximum: in
Total length: 15
*** Column 19 ***
Column has 2 values and has null value: no
*** Column 20 ***
Data type: String
Values: 2
Has null: no
Minimum: chani
Maximum: mauddib
Total length: 12
*** Column 21 ***
Column has 2 values and has null value: no
*** Column 22 ***
Data type: Integer
Values: 2
Has null: no
Minimum: 1
Maximum: 5
Sum: 6
*** Column 23 ***
Data type: String
Values: 2
Has null: no
Minimum: chani
Maximum: mauddib
Total length: 12
File examples/TestOrcFile.test1.orc has 1 stripes
*** Stripe 0 ***
--- Column 0 ---
Column has 2 values and has null value: no
--- Column 1 ---
Data type: Boolean
Values: 2
Has null: no
(true: 1; false: 1)
--- Column 2 ---
Data type: Integer
Values: 2
Has null: no
Minimum: 1
Maximum: 100
Sum: 101
--- Column 3 ---
Data type: Integer
Values: 2
Has null: no
Minimum: 1024
Maximum: 2048
Sum: 3072
--- Column 4 ---
Data type: Integer
Values: 2
Has null: no
Minimum: 65536
Maximum: 65536
Sum: 131072
--- Column 5 ---
Data type: Integer
Values: 2
Has null: no
Minimum: 9223372036854775807
Maximum: 9223372036854775807
Sum: not defined
--- Column 6 ---
Data type: Double
Values: 2
Has null: no
Minimum: 1
Maximum: 2
Sum: 3
--- Column 7 ---
Data type: Double
Values: 2
Has null: no
Minimum: -15
Maximum: -5
Sum: -20
--- Column 8 ---
Data type: Binary
Values: 2
Has null: no
Total length: 5
--- Column 9 ---
Data type: String
Values: 2
Has null: no
Minimum: bye
Maximum: hi
Total length: 5
--- Column 10 ---
Column has 2 values and has null value: no
--- Column 11 ---
Column has 2 values and has null value: no
--- Column 12 ---
Column has 4 values and has null value: no
--- Column 13 ---
Data type: Integer
Values: 4
Has null: no
Minimum: 1
Maximum: 2
Sum: 6
--- Column 14 ---
Data type: String
Values: 4
Has null: no
Minimum: bye
Maximum: sigh
Total length: 14
--- Column 15 ---
Column has 2 values and has null value: no
--- Column 16 ---
Column has 5 values and has null value: no
--- Column 17 ---
Data type: Integer
Values: 5
Has null: no
Minimum: -100000
Maximum: 100000000
Sum: 99901241
--- Column 18 ---
Data type: String
Values: 5
Has null: no
Minimum: bad
Maximum: in
Total length: 15
--- Column 19 ---
Column has 2 values and has null value: no
--- Column 20 ---
Data type: String
Values: 2
Has null: no
Minimum: chani
Maximum: mauddib
Total length: 12
--- Column 21 ---
Column has 2 values and has null value: no
--- Column 22 ---
Data type: Integer
Values: 2
Has null: no
Minimum: 1
Maximum: 5
Sum: 6
--- Column 23 ---
Data type: String
Values: 2
Has null: no
Minimum: chani
Maximum: mauddib
Total length: 12
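The file has only 12 top-level fields, yet the tool reports 24 columns. That is because ORC flattens the nested type tree with a pre-order traversal and gives every node, including structs, arrays, and maps, its own column id. The Python sketch below reproduces the numbering from the type string printed by orc-metadata. `assign_column_ids` is a hypothetical helper, and the parser is deliberately minimal: it assumes a well-formed type string and does not handle parameterized types such as decimal(10,2) or empty structs.

```python
def assign_column_ids(type_string):
    """Return (column_id, "label: kind") pairs in pre-order for a type string."""
    pos = 0
    columns = []

    def parse_type(label):
        nonlocal pos
        start = pos
        # Read the type keyword, e.g. "struct", "array", "int".
        while pos < len(type_string) and type_string[pos].isalnum():
            pos += 1
        kind = type_string[start:pos]
        # Pre-order: the parent gets its id before any of its children.
        columns.append((len(columns), f"{label}: {kind}"))
        if pos < len(type_string) and type_string[pos] == "<":
            pos += 1  # consume '<'
            child = 0
            while True:
                if kind == "struct":
                    # Struct children are written as name:type.
                    name_start = pos
                    while type_string[pos] != ":":
                        pos += 1
                    child_label = type_string[name_start:pos]
                    pos += 1  # consume ':'
                else:
                    # Array elements and map key/value have no names.
                    child_label = f"{label}[{child}]"
                parse_type(child_label)
                child += 1
                if type_string[pos] == ",":
                    pos += 1  # more children follow
                else:
                    break
            pos += 1  # consume '>'

    parse_type("root")
    return columns


# Type string copied from the orc-metadata example.
schema = (
    "struct<boolean1:boolean,byte1:tinyint,short1:smallint,int1:int,"
    "long1:bigint,float1:float,double1:double,bytes1:binary,string1:string,"
    "middle:struct<list:array<struct<int1:int,string1:string>>>,"
    "list:array<struct<int1:int,string1:string>>,"
    "map:map<string,struct<int1:int,string1:string>>>"
)

columns = assign_column_ids(schema)
for column_id, description in columns:
    print(column_id, description)
print(len(columns))  # 24
```

Column 0 is the root struct, columns 1 to 9 are the primitive fields, and columns 10 to 23 come from the nested `middle`, `list`, and `map` fields. This matches the statistics: column 20 (the map key) is a String with values chani and mauddib, and column 22 is the Integer `int1` inside the map value.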

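The csv-import tool

Imports the contents of a CSV file into an ORC file. The tool does not currently support compound types.

--delimiter sets the delimiter character; the default is ,
--stripe sets the stripe size; the default is 128MB
--block sets the compression block size; the default is 64KB
--batch sets the batch size; the default is 1024 rows

Syntax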

% csv-import [--delimiter=<character>] [--stripe=<size>] [--block=<size>] [--batch=<size>] <schema> <inputCSVFile> <outputORCFile>

Example

$ csv-import "struct<a:bigint,b:string,c:double>" examples/TestCSVFileImport.test10rows.csv /tmp/test.orc
[2018-06-21 23:42:26] Start importing Orc file...
[2018-06-21 23:42:26] Finish importing Orc file.
[2018-06-21 23:42:26] Total writer elasped time: 0.001386s.
[2018-06-21 23:42:26] Total writer CPU time: 0.001374s.
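When the import is driven from a script, the schema string needs care because `<` and `>` are shell metacharacters. One way around quoting issues is to build the argument list directly, as in this Python sketch. The helper names `build_csv_import_command` and `run_csv_import` are hypothetical; only the flags shown in the syntax line above are used, and the tool is assumed to be on `PATH` under the name `csv-import`.

```python
import shutil
import subprocess


def build_csv_import_command(schema, csv_path, orc_path,
                             delimiter=None, stripe=None,
                             block=None, batch=None):
    """Build the csv-import argument list; unset options keep tool defaults."""
    command = ["csv-import"]
    options = {"delimiter": delimiter, "stripe": stripe,
               "block": block, "batch": batch}
    for name, value in options.items():
        if value is not None:
            command.append(f"--{name}={value}")
    command += [schema, csv_path, orc_path]
    return command


def run_csv_import(command):
    """Run the command if the tool is installed; return None otherwise."""
    if shutil.which(command[0]) is None:
        return None
    # Passing a list avoids the shell, so '<' and '>' need no quoting.
    return subprocess.run(command, check=True)


command = build_csv_import_command(
    "struct<a:bigint,b:string,c:double>",
    "examples/TestCSVFileImport.test10rows.csv",
    "/tmp/test.orc",
    batch=1024,
)
print(command)
```

Calling `run_csv_import(command)` would then perform the same import as the example above, with the batch size stated explicitly.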

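The orc-scan tool

Scans an ORC file and reports its number of rows, which makes it useful for checking whether a file is corrupted. The default value of batch is 1024.

Syntax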

% orc-scan [--batch=<size>] <filename>

Example

$ orc-scan examples/TestOrcFile.test1.orc
Rows: 2
Batches: 1
