Big Data has a plethora of data file formats - it's important to understand their strengths and weaknesses. Most explorers start out with some NoSQL-exported JSON data, but specialized data structures are required, because putting each blob of binary data into its own file just doesn't scale across a distributed filesystem.
Row-oriented File Formats (SequenceFile, Avro)
– best when a large number of the columns of a single row are needed for processing at the same time.
General-purpose, splittable, compressible formats; rows are stored contiguously in the file.
Suitable for streaming writes - after a writer failure the file can still be read up to the last sync point.
SequenceFile
A persistent data structure for binary key-value pairs - e.g. for a logfile where each record is a line of text, the key could be a timestamp (LongWritable) and the value a Writable that represents the line.
HDFS and MapReduce are optimized for large files: packing many small files into a SequenceFile gives efficient storage and processing of the smaller files.
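A minimal sketch (in Java) of packing many small files into one SequenceFile, assuming Hadoop's option-based SequenceFile.createWriter API; the input directory, output path, and class name are hypothetical:

// Pack small local files into one SequenceFile:
// key = file name (Text), value = file contents (BytesWritable).
import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("packed.seq")),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      for (File f : new File("small-files").listFiles()) {
        byte[] bytes = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}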
The positions of the sync points can be used to resynchronize with a record boundary if the reader is "lost", e.g. after seeking to an arbitrary position in the stream.
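A sketch of resynchronizing a reader, assuming the SequenceFile.Reader API: sync(position) advances the reader to the next sync point after an arbitrary byte offset, after which records can be read normally (the file name and offset are arbitrary):

// Seek to an arbitrary offset, resynchronize on the next record boundary,
// then read key-value pairs to the end of the file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SyncRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SequenceFile.Reader reader = new SequenceFile.Reader(conf,
        SequenceFile.Reader.file(new Path("numbers.seq")));
    try {
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      reader.sync(1000);                 // jump to the first sync point after byte 1000
      while (reader.next(key, value)) {  // resumes reading at a record boundary
        System.out.printf("[%s]\t%s\t%s%n", reader.getPosition(), key, value);
      }
    } finally {
      reader.close();
    }
  }
}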
The hadoop fs command's -text option displays sequence files in textual form; it can recognize gzipped files, sequence files, and Avro files:
% hadoop fs -text numbers.seq | head
A sequence file consists of a header followed by one or more records. The first 3 bytes are the SEQ magic number, followed by 1 byte for the version number. The header contains other fields: the key / value class names, compression details, user-defined metadata, and the sync marker. Each file has a randomly generated sync marker stored in the header.
If no compression is enabled (the default), each record is made up of the record length (in bytes) + key length + key + value. Block compression compresses multiple records at once; it is more compact than (and preferred over) record compression, which compresses each value individually. Records are added to a block until it reaches a minimum size in bytes, defined by the io.seqfile.compress.blocksize property (default 1 MB). A sync marker is written before the start of every block.
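Continuing the packing sketch above, block compression could be enabled by passing a compression option to the writer (a sketch; additionally assumes org.apache.hadoop.io.compress.GzipCodec is imported):

// Create a block-compressed SequenceFile writer (gzip codec).
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(new Path("packed-block.seq")),
    SequenceFile.Writer.keyClass(Text.class),
    SequenceFile.Writer.valueClass(BytesWritable.class),
    SequenceFile.Writer.compression(
        SequenceFile.CompressionType.BLOCK, new GzipCodec()));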
MapFile
A sorted SequenceFile with an index to permit lookups by key. The index is itself a SequenceFile that contains every 128th key; it is loaded into memory for fast lookups.
Similar interface to SequenceFile for reading and writing, but when writing with MapFile.Writer, map entries must be added in order, otherwise an IOException is thrown.
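A minimal sketch of writing and then looking up a MapFile, assuming the MapFile.Writer / MapFile.Reader APIs; the path, key range, and values are made up:

// Write a MapFile with keys appended in increasing order, then look one up.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dir = new Path("numbers.map");   // a MapFile is a directory holding data + index files

    MapFile.Writer writer = new MapFile.Writer(conf, dir,
        MapFile.Writer.keyClass(IntWritable.class),
        MapFile.Writer.valueClass(Text.class));
    try {
      for (int i = 1; i <= 1000; i++) {   // keys MUST be added in sorted order
        writer.append(new IntWritable(i), new Text("value-" + i));
      }
    } finally {
      writer.close();
    }

    MapFile.Reader reader = new MapFile.Reader(dir, conf);
    try {
      Text value = new Text();
      reader.get(new IntWritable(496), value);  // binary search the in-memory index, then scan
      System.out.println(value);
    } finally {
      reader.close();
    }
  }
}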
MapFile variants:
SetFile - stores a set of Writable keys.
ArrayFile - the key is the integer index of an array element, the value is a Writable.
BloomMapFile - fast get() for sparse files via an in-memory dynamic Bloom filter over the keys.
Avro
A language-neutral data serialization system that specifies an object container format for sequences of objects - portability.
Like sequence files (compact, splittable), but it has a schema, which makes it language-agnostic and self-describing: the metadata section carries the JSON schema (.avsc) --> the compact binary-encoded values do not need to be tagged with a field identifier.
The read schema need not be identical to the write schema --> schema evolution.
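A sketch of schema evolution with the generic mapping: records are written with one schema and read back with an evolved reader schema that adds a field with a default value (the record and field names here are made up):

// Write Avro records with a writer schema, read them with an evolved reader schema.
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroEvolution {
  public static void main(String[] args) throws Exception {
    Schema writeSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"StringPair\",\"fields\":["
      + "{\"name\":\"left\",\"type\":\"string\"},"
      + "{\"name\":\"right\",\"type\":\"string\"}]}");

    // Write an object container file; the header carries the JSON schema and sync marker.
    File file = new File("pairs.avro");
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(writeSchema));
    writer.create(writeSchema, file);
    GenericRecord rec = new GenericData.Record(writeSchema);
    rec.put("left", "L");
    rec.put("right", "R");
    writer.append(rec);
    writer.close();

    // Evolved reader schema: a new "description" field with a default, resolved at read time.
    Schema readSchema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"StringPair\",\"fields\":["
      + "{\"name\":\"left\",\"type\":\"string\"},"
      + "{\"name\":\"right\",\"type\":\"string\"},"
      + "{\"name\":\"description\",\"type\":\"string\",\"default\":\"\"}]}");
    DataFileReader<GenericRecord> reader = new DataFileReader<>(file,
        new GenericDatumReader<GenericRecord>(writeSchema, readSchema));
    for (GenericRecord r : reader) {
      System.out.println(r.get("left") + "," + r.get("right") + "," + r.get("description"));
    }
    reader.close();
  }
}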
File format - similar in design to Hadoop's SequenceFile format, but designed to be A) portable across languages, B) splittable, C) sortable.
Avro primitive types: null, boolean, int, long, float, double, bytes, string
Avro complex types: array, record, map, enum, fixed, union
The schema supports generic mapping, reflect mapping, schema evolution, and multiple language bindings.
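For the reflect mapping, a schema can be inferred from an existing Java class rather than written by hand (a small sketch; the POJO is hypothetical):

// Derive an Avro schema from a plain Java class via the reflect mapping.
import org.apache.avro.Schema;
import org.apache.avro.reflect.ReflectData;

public class ReflectDemo {
  public static class StringPair {  // hypothetical POJO
    String left;
    String right;
  }

  public static void main(String[] args) {
    Schema schema = ReflectData.get().getSchema(StringPair.class);
    System.out.println(schema.toString(true));  // print the inferred JSON schema
  }
}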
Column-Oriented File Formats (Parquet, RCFile, ORCFile)
– best when queries access only a small number of columns.
The rows in a file (or Hive table) are broken up into row splits, then each split is stored in column-oriented fashion: the values for each row in the first column are stored first, followed by the values for each row in the second column, and so on.
Columns that are not accessed in a query are skipped.
Disadvantages:
- need more memory for reading / writing, since they have to buffer a row split in memory, rather than just a single row.
- not possible to control when writes occur (via flush or sync operations) - not suited to streaming writes, as the current file cannot be recovered if the writer process fails.
ORCFile
Hive's Optimized Record Columnar File.
Parquet
A flat columnar storage file format that can store deeply nested data efficiently (both in file size and query performance).
Values from a column are stored contiguously --> efficient encoding, and the query engine can skip over un-needed columns.
Compare to Hive's ORCFile (Optimized Record Columnar File).
From Google Dremel, Twitter, Cloudera.
Language-neutral - can use the in-memory data models of Avro, Thrift, or Protocol Buffers to read / write Parquet files.
Dremel nested encoding – each value is encoded along with two extra ints, a definition level and a repetition level --> nested values can be read independently.
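A small made-up example of the two levels. For the schema

message doc {
  repeated group links {
    optional binary url (UTF8);
  }
}

the record { links: [ {url: "A"}, {url: null}, {url: "B"} ] } stores the links.url column as three values:
"A"  -> definition level 2, repetition level 0 (first value in the record)
null -> definition level 1 (links is defined, url is not), repetition level 1 (the repetition occurs at links)
"B"  -> definition level 2, repetition level 1
The maximum definition level here is 2 (two optional/repeated fields on the path) and the maximum repetition level is 1 (one repeated field on the path), so each level can be stored in very few bits.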
Parquet primitive types: boolean, int32/64/96, float/double, binary, fixed_len_byte_array
Parquet schema - the root is a message containing fields; each field has a repetition (required, optional, or repeated), a type, and a name.
Group - complex type.
Logical types: UTF8, ENUM, DECIMAL(precision,scale), LIST, DATE, MAP
message m {
  required int32 field1;
  required binary field2 (UTF8);
  required group my_list (LIST) {
    repeated group list {
      required int32 element;
    }
  }
  required int32 my_date (DATE);
  required group my_map (MAP) {
    repeated group key_value {
      required binary key (UTF8);
      optional int32 value;
    }
  }
}
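A sketch of writing a Parquet file in Java, assuming parquet-hadoop's example Group object model (ExampleParquetWriter / SimpleGroupFactory); the schema here is a simplified two-field version of the one above, and the output path is hypothetical:

// Parse a message type from a string and write a few records with the Group model.
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ParquetWriteDemo {
  public static void main(String[] args) throws Exception {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message m {\n"
      + "  required int32 field1;\n"
      + "  required binary field2 (UTF8);\n"
      + "}");

    ParquetWriter<Group> writer = ExampleParquetWriter.builder(new Path("data.parquet"))
        .withType(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();
    try {
      SimpleGroupFactory factory = new SimpleGroupFactory(schema);
      for (int i = 0; i < 3; i++) {
        Group group = factory.newGroup()
            .append("field1", i)
            .append("field2", "row-" + i);
        writer.write(group);
      }
    } finally {
      writer.close();
    }
  }
}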
File format: header, blocks, footer.
- Header: 4-byte magic number, PAR1.
- Footer: stores all the file metadata (format version, schema, block metadata), followed by the footer length and PAR1 → 2 seeks to read the metadata, but no sync markers are needed for block boundaries --> splittable files for parallel processing.
- Each block stores a row group made up of column chunks, one per column; each chunk (a slice of the column) is written in pages - compression can be applied per column.
Compression: automatic column-type-aware encodings (delta encoding, run-length encoding, dictionary encoding, bit packing) + page compression (Snappy, gzip, LZO).
ParquetOutputFormat properties - set at write time (see the configuration sketch after this list):
- parquet.block.size (default 128 MB): trade-off between scanning efficiency and memory usage; if jobs fail with out-of-memory errors, adjust this down.
- parquet.page.size (default 1 MB)
- parquet.dictionary.page.size
- parquet.enable.dictionary
- parquet.compression (Snappy, gzip, LZO)
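A sketch of setting these properties on a MapReduce job's configuration before the job writes Parquet output (assumes an existing org.apache.hadoop.mapreduce.Job named job; the values are illustrative):

// Tune Parquet write properties via the job configuration.
Configuration conf = job.getConfiguration();
conf.setInt("parquet.block.size", 64 * 1024 * 1024);  // smaller row groups -> less memory pressure
conf.setInt("parquet.page.size", 1024 * 1024);
conf.setBoolean("parquet.enable.dictionary", true);
conf.set("parquet.compression", "SNAPPY");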
Parquet command-line tools can dump the output Parquet file for inspection:
% parquet-tools dump output/part-m-00000.parquet