Data Engineering Fundamentals
Data serialization
= The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later.
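
The definition above can be illustrated with a minimal round trip. This is only a sketch using Python's standard `json` module; the `record` dictionary and its fields are hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical in-memory record, used only for illustration.
record = {"user_id": 42, "tags": ["a", "b"], "active": True}

# Serialize: convert the object into a string that can be stored or transmitted.
encoded = json.dumps(record)

# Deserialize: reconstruct an equivalent object from the stored form.
decoded = json.loads(encoded)
```

Here `encoded` is a plain string, and `decoded` compares equal to the original `record`, which is the "reconstructed later" part of the definition.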
Types of data serialization formats
| Format | Binary/Text | Human-readable | Example use cases | Features |
|---|---|---|---|---|
| JSON | Text | Yes | Everywhere | |
| CSV | Text | Yes | Everywhere | row-major |
| Parquet | Binary | No | Hadoop, Amazon Redshift | column-major |
| Avro | Primarily binary | No | Hadoop | |
| Protobuf | Primarily binary | No | Google, TensorFlow (TFRecord) | |
| Pickle | Binary | No | Python, PyTorch serialization | |
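
To make the text/binary and row-major/column-major distinctions in the table more concrete, here is a small sketch that uses only Python's standard library. The `rows` data is hypothetical, and the `columns` dictionary only imitates a column-major layout in memory; it is not how Parquet actually encodes files.

```python
import csv
import io
import json
import pickle

# Hypothetical records, used only for illustration.
rows = [
    {"id": 1, "name": "Ada", "score": 9.5},
    {"id": 2, "name": "Linus", "score": 8.0},
]

# Text, human-readable: JSON.
json_text = json.dumps(rows)

# Text, human-readable, row-major: CSV writes one record per line.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Binary, not human-readable: pickle produces bytes, not text.
pickle_bytes = pickle.dumps(rows)

# Column-major view of the same data: values of one field are grouped together.
columns = {key: [row[key] for row in rows] for key in rows[0]}
```

In the row-major CSV text, all fields of one record sit next to each other, while in `columns` all values of one field sit next to each other, which is what makes column-major formats convenient for reading a single field across many records. Note that pickle data should only be loaded from trusted sources, since unpickling can execute arbitrary code.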