RDBMS to Snowflake Copying


Source: https://www.linkedin.com/pulse/rdbms-snowflake-copying-arabinda-mohapatra-pd3tc



Running Kafka streams after dark, diving into genetic code by daylight, and wrestling with Databricks and Tableflow in every spare moment—sleep is optional

Problem Statement:

Migrating a 100M+ record Oracle table to Snowflake using a single Databricks worker with only 8GB of memory requires a highly optimized, memory-aware approach to avoid failures and ensure stable execution within a strict 2-hour timeframe. Traditional methods risk out-of-memory errors and inefficient resource use, demanding a strategy that prioritizes controlled data streaming, minimal memory footprint, and efficient write operations without requiring additional cluster resources.

(Using Databricks, PyArrow, and Native Spark)

❌ The Common Mistake:

* Using a single-threaded Pandas to_sql() = guaranteed memory wall & failure.

* Direct JDBC reads without tuning = slow and unstable.

✅ The Optimized Strategy:

1. Source: Parallel Oracle Read

* Tune JDBC fetchsize & defaultRowPrefetch.

* Calculate & cap partitions for optimal parallelism.

* Use predicate pushdown for efficient data splitting.
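
A minimal sketch of the partitioned JDBC read described in step 1, assuming the Databricks-provided `spark` session; the connection URL, credentials, table, and partition column are placeholders (in practice, pull credentials from Databricks secrets):

```python
# Hypothetical Oracle connection details; replace with your own values.
jdbc_url = "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB"

df_oracle = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "SALES.TRANSACTIONS")          # hypothetical 100M+ row source table
    .option("user", "etl_user")
    .option("password", "<secret>")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("fetchsize", 10000)                       # rows pulled per network round trip
    .option("oracle.jdbc.defaultRowPrefetch", 10000)  # Oracle driver prefetch hint
    .option("partitionColumn", "TRANSACTION_ID")      # numeric column used to split the read
    .option("lowerBound", 1)
    .option("upperBound", 100000000)
    .option("numPartitions", 16)                      # capped so one small worker isn't overwhelmed
    .load()
    # Simple filters like this are pushed down to Oracle as a WHERE clause.
    .filter("LOAD_DATE >= DATE '2024-01-01'")
)
```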

2. Stage: Snappy Parquet on DBFS

* Repartition data to avoid small files.

* Write to Snappy Parquet for columnar efficiency.

  • Avoid small files: use coalesce() or repartition() judiciously
  • Target a Parquet block (row-group) size of about 128MB

* Creates a stable, distributed checkpoint.
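
A sketch of the staging write under these settings, with a placeholder DBFS path; whether `parquet.block.size` is honored as a write option depends on the runtime, so treat it as an assumption (it can alternatively be set on the Hadoop configuration):

```python
stage_path = "dbfs:/tmp/oracle_stage/transactions"    # placeholder staging location

(
    df_oracle
    .repartition(64)                                  # consolidate JDBC partitions to avoid many tiny files
    .write
    .mode("overwrite")
    .option("compression", "snappy")                  # Snappy-compressed, columnar Parquet
    .option("parquet.block.size", 134217728)          # aim for ~128MB row groups / blocks
    .parquet(stage_path)
)
```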


3. Process: Chunk with PyArrow, NOT Pandas

* Iterate by row groups, not the entire dataset.

  • Use pyarrow.parquet.ParquetFile() to open the staged file
  • Read each group with .read_row_group() (or stream record batches with .iter_batches())

* Crucially, skip the Pandas conversion to avoid driver OOM.

* Convert each chunk directly to a Spark DataFrame for distributed processing, as in the sketch below.
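
A minimal sketch of the row-group loop, assuming a hypothetical staged file path and the Databricks `spark` session; note that passing a pyarrow.Table straight to createDataFrame needs a recent runtime (Spark 4.0+), otherwise a controlled per-chunk conversion is still required:

```python
import pyarrow.parquet as pq

# DBFS paths are visible to driver-side Python under /dbfs (file name is a placeholder).
pf = pq.ParquetFile("/dbfs/tmp/oracle_stage/transactions/part-00000.snappy.parquet")

for rg in range(pf.num_row_groups):
    arrow_chunk = pf.read_row_group(rg)               # materializes only one ~128MB row group
    # Hand the Arrow table to Spark without a Pandas detour (Spark 4.0+ / recent runtimes);
    # on older runtimes this is where a chunk-sized conversion would happen instead.
    sdf_chunk = spark.createDataFrame(arrow_chunk)
    # ... per-chunk transformation and the Snowflake write (step 4) go here ...
```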

4. Load: Native Spark Writer to Snowflake

* Use the spark-snowflake connector for parallel, high-throughput writes.

* Leverages Snowflake's Apache Arrow integration for speed.
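
A sketch of the per-chunk write with the connector, using placeholder account, warehouse, and table names; the format is registered as "snowflake" on Databricks (net.snowflake.spark.snowflake elsewhere):

```python
# Placeholder Snowflake connection options; keep real credentials in Databricks secrets.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "loader",
    "sfPassword": "<secret>",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "LOAD_WH",
}

(
    sdf_chunk.write
    .format("snowflake")                              # spark-snowflake connector
    .options(**sf_options)
    .option("dbtable", "TRANSACTIONS")                # hypothetical target table
    .mode("append")                                   # append each chunk
    .save()
)
```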

5. Stabilize: Explicit Memory Cleanup

* Delete DataFrame references after each chunk.

  • Delete the DataFrame object
  • Drop all columns explicitly to release memory

* Manually trigger garbage collection (`gc.collect()`).

* Prevents memory leaks in long-running jobs.
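
Putting steps 3 to 5 together, a sketch of the cleanup pattern inside the chunk loop (paths and names carried over from the earlier sketches):

```python
import gc
import pyarrow.parquet as pq

pf = pq.ParquetFile("/dbfs/tmp/oracle_stage/transactions/part-00000.snappy.parquet")  # placeholder

for rg in range(pf.num_row_groups):
    arrow_chunk = pf.read_row_group(rg)
    sdf_chunk = spark.createDataFrame(arrow_chunk)
    # ... write sdf_chunk to Snowflake (step 4) ...

    # Drop references as soon as the chunk is written, then force a GC pass so that
    # iterating over hundreds of row groups keeps a flat driver memory profile.
    del arrow_chunk, sdf_chunk
    gc.collect()
```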

The Result:

Achieves stable execution for 100M+ records on a single 8GB worker node, completing within a predictable window of approximately 2 hours.

Declaration: The views and opinions expressed in this presentation are those of the author and do not necessarily reflect the official policy or position of Novartis or any of its affiliates or officers.
