This article is a comprehensive directory of Apache Iceberg resources, including educational materials, tutorials, and hands-on exercises. Whether you're a beginner or an experienced data engineer, this guide will help you navigate the world of Apache Iceberg and its applications.
What is Apache Iceberg?
Apache Iceberg is an open-source data lakehouse table format. That means it is a standard for how the metadata that defines a group of files as a table is stored. This metadata enables any tool that supports the standard to read and write those files the same way it would a table in a data warehouse, with the same features and ACID guarantees.
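To make the idea of "metadata defining a group of files as a table" more concrete, here is a minimal Python sketch. It is a toy model, not Apache Iceberg's actual metadata layout or API: the `ToyTable` and `Snapshot` classes and the file names are hypothetical and exist only for illustration. The sketch shows a table as nothing more than a list of snapshots, each recording which data files belong to the table at that point in time.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One committed version of the table: the exact set of files it contains."""
    snapshot_id: int
    data_files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for table metadata (not Iceberg's real layout)."""
    name: str
    snapshots: List[Snapshot] = field(default_factory=list)

    def current_files(self, snapshot_id=None):
        """Return the files a reader should scan, for the latest or a given snapshot."""
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].data_files
        return next(s.data_files for s in self.snapshots if s.snapshot_id == snapshot_id)

    def commit(self, add=(), remove=()):
        """Record a new snapshot; earlier snapshots are never modified."""
        removed = set(remove)
        kept = tuple(f for f in self.current_files() if f not in removed)
        snapshot = Snapshot(len(self.snapshots) + 1, kept + tuple(add))
        # In this toy model, a single list append stands in for an atomic commit.
        self.snapshots.append(snapshot)
        return snapshot


table = ToyTable("sales")
table.commit(add=["part-0.parquet", "part-1.parquet"])
table.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

latest_files = table.current_files()
first_files = table.current_files(snapshot_id=1)
print(latest_files)
print(first_files)
```

Because readers only ever look at the file list of a committed snapshot, every tool sees the same consistent view of the table, and older snapshots stay available for inspection. The real format is considerably richer, but the principle that the metadata, not the directory listing, defines the table is the same.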
Why Does it Matter?
By operating on tables in a separate storage layer, you can use all your favorite analytical tools on a single copy of your data.
Reducing the number of copies you need can lower the compute, storage, and network costs of your overall data platform.
Storing your data in a standard format reduces future migration costs when you change tooling or adopt new tools.
Who does Apache Iceberg benefit?
Data Engineers, since less data movement means fewer data pipelines to manage.
Data Analysts, since fewer data movements mean more immediate access to data, especially when paired with the data virtualization available in tools like Dremio, which allows for Lakehouse Querying and Federated Querying (Virtualization) on one platform.
Data Scientists, because they can also have more immediate access to data when training their AI/ML models.
Data Leaders, since they can reduce their overall platform costs, making it easier to fund other data initiatives.
Apache Iceberg Directory
Apache Iceberg Education
Here is a list of resources to help you learn Apache Iceberg:
Apache Iceberg Hands-on Tutorials
Here is a list of hands-on tutorials that will help you get started with Apache Iceberg:
Mongo/Postgres to Apache Iceberg to BI Dashboard using Git for Data and DBT
BI Dashboards with Apache Iceberg Using AWS Glue and Apache Superset
End-to-End Basic Data Engineering Tutorial (Spark, Apache Iceberg Dremio, Superset)
Apache Iceberg's Architecture
Here is a list of resources to help you learn Apache Iceberg's architecture and internals:
Puffins and Icebergs: Additional Stats for Apache Iceberg Tables
Ensuring High Performance at Any Scale with Apache Iceberg’s Object Store File Layout
Row-Level Changes on the Lakehouse: Copy-On-Write vs. Merge-On-Read in Apache Iceberg
ACID Guarantees and Apache Iceberg: Turning Any Storage into a Data Warehouse
Getting Data into Apache Iceberg
Here is a list of resources to help you get data into Apache Iceberg:
Ingesting Data Into Apache Iceberg Tables with Dremio: A Unified Path to Iceberg
How to Create a Lakehouse with Airbyte, S3, Apache Iceberg, and Dremio
How to Convert JSON Files Into an Apache Iceberg Tables with Dremio
How to Convert CSV Files into an Apache Iceberg table with Dremio
Apache Iceberg Migration
Here is a list of resources to help you migrate your data to Apache Iceberg:
Apache XTable: Converting Between Apache Iceberg, Delta Lake, and Apache Hudi
3 Ways to Convert a Delta Lake Table Into an Apache Iceberg Table
Migrating a Hive Table to an Iceberg Table Hands-on Tutorial
Streaming with Apache Iceberg
Here is a list of resources to help you stream data into Apache Iceberg:
Streaming and Batch Data Lakehouses with Apache Iceberg, Dremio and Upsolver
Streaming Data into Apache Iceberg Tables Using AWS Kinesis and AWS Glue
Partitioning with Apache Iceberg
Here is a list of resources to help you learn how to partition your data with Apache Iceberg:
Simplifying Your Partition Strategies with Dremio Reflections and Apache Iceberg
Partition Evolution: Future-Proof Partitioning and Fewer Table Rewrites with Apache Iceberg
Fewer Accidental Full Table Scans Brought to You by Apache Iceberg’s Hidden Partitioning
Maintaining and Auditing Apache Iceberg Tables
Here is a list of resources to help you maintain and audit your Apache Iceberg tables:
Compaction in Apache Iceberg: Fine-Tuning Your Iceberg Table’s Data Files
Leveraging Apache Iceberg Metadata Tables in Dremio for Effective Data Lakehouse Auditing
What is DataOps? Automating Data Management on the Apache Iceberg Lakehouse
Maintaining Iceberg Tables – Compaction, Expiring Snapshots, and More
Apache Iceberg Catalogs
Here is a list of resources to help you learn about Apache Iceberg Catalogs:
Why Thinking about Apache Iceberg Catalogs Like Nessie and Apache Polaris (incubating) Matters
Using Nessie’s REST Catalog Support for Working with Apache Iceberg Tables
The Nessie Ecosystem and the Reach of Git for Data for Apache Iceberg
Understanding the Polaris Iceberg Catalog and Its Architecture
Getting Hands-on with Polaris OSS, Apache Iceberg and Apache Spark
Querying Apache Iceberg Tables
Here is a list of resources to help you query your Apache Iceberg tables:
Hybrid Apache Iceberg Lakehouses
Here is a list of resources about implementing hybrid on-premises and cloud Apache Iceberg lakehouses:
Apache Iceberg and Other Formats
Here is a list of resources about Apache Iceberg and other formats (Apache Hudi, Apache Paimon, Delta Lake):
Exploring the Architecture of Apache Iceberg, Delta Lake, and Apache Hudi
Comparison of Data Lake Table Formats (Apache Iceberg, Apache Hudi and Delta Lake)
Table Format Partitioning Comparison: Apache Iceberg, Apache Hudi, and Delta Lake
Table Format Governance and Community Contributions: Apache Iceberg, Apache Hudi, and Delta Lake
Python and Apache Iceberg
Here is a list of resources about Apache Iceberg and Python:
Governing Apache Iceberg Tables
Miscellaneous Apache Iceberg Resources
Here is a list of miscellaneous resources to help you learn Apache Iceberg:
Evolving the Data Lake: From CSV/JSON to Parquet to Apache Iceberg
Data Sharing of Apache Iceberg tables and other data in the Dremio Lakehouse
The Value of Dremio’s Semantic Layer and The Apache Iceberg Lakehouse to the Snowflake User
The Who, What and Why of Data Reflections and Apache Iceberg for Query Acceleration
How Apache Iceberg, Dremio and Lakehouse Architecture can optimize your Cloud Data Platform Costs
Dremio’s Commitment to being the Ideal Platform for Apache Iceberg Data Lakehouses
Open Source and the Data Lakehouse: Apache Arrow, Apache Iceberg, Nessie and Dremio
Deep Dive Into Configuring Your Apache Iceberg Catalog with Apache Spark
Why Data Analysts, Engineers, Architects and Scientists Should Care about Dremio and Apache Iceberg
Data Lake Mysteries Unveiled: Nessie, Dremio, and MinIO Make Waves