API, Parquet: Map geometry and geography to Parquet logical types #16765

Open
huan233usc wants to merge 4 commits into apache:main from huan233usc:parquet-geo-schema

Conversation

@huan233usc huan233usc commented Jun 10, 2026

Contributor

Summary

Map Iceberg geometry and geography primitive types to and from Parquet's geometry / geography logical type annotations on a BINARY column, so the geo types survive a schema round-trip through ParquetSchemaUtil.convert in both directions.

  • TypeToMessageType: emit geometry / geography as BINARY annotated with LogicalTypeAnnotation.geometryType / geographyType, passing the resolved CRS and edge-interpolation algorithm through directly. Iceberg and Parquet use the same algorithm names (SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY), so the algorithm is mapped by name.
  • MessageTypeToType: read those annotations back into Types.GeometryType / Types.GeographyType. An unset crs / algorithm maps to the Iceberg default (the Parquet defaults are the same: OGC:CRS84 / SPHERICAL), and algorithm names are resolved with EdgeAlgorithm.fromName, the same conversion used when parsing geography type strings (Types.fromPrimitiveString).
  • Types.GeometryType / Types.GeographyType: serialize with explicit CRS / edge algorithm defaults (geometry(OGC:CRS84), geography(OGC:CRS84, spherical)) so table metadata preserves the concrete defaults being used, while equals / hashCode compare resolved defaults so default-equivalent forms remain semantically equal. No public signature changes.
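The by-name algorithm mapping and the default handling described in the bullets above can be sketched roughly as follows. This is a minimal, self-contained illustration rather than the PR's code: `GeoEdgeMappingSketch` and its nested enums are hypothetical stand-ins for the Iceberg and Parquet edge-algorithm enums, and `valueOf` stands in for `EdgeAlgorithm.fromName`. Only the five constant names and the `OGC:CRS84` / `SPHERICAL` defaults are taken from the description.

```java
// Hypothetical sketch: every name here is illustrative, not an Iceberg or Parquet API.
final class GeoEdgeMappingSketch {
  // Stand-ins for the two enums; the summary states both sides share these names.
  enum IcebergAlgorithm { SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY }

  enum ParquetAlgorithm { SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY }

  static final String DEFAULT_CRS = "OGC:CRS84";
  static final IcebergAlgorithm DEFAULT_ALGORITHM = IcebergAlgorithm.SPHERICAL;

  private GeoEdgeMappingSketch() {}

  // Write side: the resolved algorithm is mapped by constant name.
  static ParquetAlgorithm toParquet(IcebergAlgorithm algorithm) {
    return ParquetAlgorithm.valueOf(algorithm.name());
  }

  // Read side: an unset (null) annotation parameter maps to the default.
  static IcebergAlgorithm fromParquet(ParquetAlgorithm algorithm) {
    return algorithm == null ? DEFAULT_ALGORITHM : IcebergAlgorithm.valueOf(algorithm.name());
  }

  // Read side: an unset (null) CRS maps to the default CRS.
  static String resolveCrs(String crs) {
    return crs == null ? DEFAULT_CRS : crs;
  }

  public static void main(String[] args) {
    // Round-trip every constant in both directions.
    for (IcebergAlgorithm algorithm : IcebergAlgorithm.values()) {
      System.out.println(algorithm + " -> " + fromParquet(toParquet(algorithm)));
    }
    System.out.println("unset -> " + fromParquet(null) + ", " + resolveCrs(null));
  }
}
```

Mapping by name keeps the two enums decoupled, at the cost of failing at runtime if either side ever adds a constant the other lacks. That is presumably why the test plan exercises all five constants in both directions.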

This is the first step of plumbing the geo value path through Parquet. It is intentionally schema mapping only — the generic value read/write path (BaseParquetReaders / BaseParquetWriter) and the ParquetMetrics guard for geo columns are separate follow-ups, so this PR stays small and easy to review. It is purely additive: no behavior changes for non-geo types.

This is 1/N for #16650

Test plan

  • TestParquetSchemaUtil#testGeospatialTypeRoundTrip round-trips a schema with default-CRS geometry, an explicit-CRS geometry, default geography, and a geography per edge algorithm (all five, so the by-name mapping is exercised for every constant in both directions) through ParquetSchemaUtil.convert.
  • TestParquetSchemaUtil#testGeospatialAnnotationsWithOmittedParameters reads hand-built MessageTypes with unset / explicit / explicit-default CRS and algorithm — covering files written by engines that omit defaults — and confirms each maps to the expected Iceberg type and serializes with explicit defaults.
  • TestTypes#testGeospatialTypeDefaultNormalization covers equals() / hashCode() parity for the default-CRS and default-algorithm geography forms, that algorithm() still reports SPHERICAL, and that a non-default algorithm stays distinct; testGeospatialTypeToString covers explicit-default rendering.
  • ./gradlew :iceberg-api:check :iceberg-parquet:check — clean (tests, checkstyle, revapi, spotless).
  • ./gradlew :iceberg-core:test — full core suite green; no regressions in TestSchemaParser / TestSingleValueParser / TestGeospatialTable or anywhere geography types are serialized.

@huan233usc huan233usc marked this pull request as draft June 11, 2026 02:38
@huan233usc huan233usc force-pushed the parquet-geo-schema branch from e0c3e18 to f724354 Compare June 11, 2026 05:26
@github-actions github-actions Bot added the API label Jun 11, 2026
@huan233usc huan233usc changed the title Parquet: Map geometry and geography to Parquet logical types API, Parquet: Map geometry and geography to Parquet logical types Jun 11, 2026
@huan233usc huan233usc force-pushed the parquet-geo-schema branch from f724354 to f58e12b Compare June 11, 2026 05:36
@huan233usc huan233usc marked this pull request as ready for review June 11, 2026 05:39
Map Iceberg geometry and geography to and from Parquet's geometry /
geography logical type annotations on a BINARY column, passing the
resolved CRS and edge algorithm through directly (Iceberg and Parquet
use the same algorithm names; the read side resolves names with
EdgeAlgorithm.fromName, the same conversion used when parsing
geography type strings, and maps unset annotation parameters to the
Iceberg defaults).

To make the plain geography type round-trip through writers that omit
default parameters (an unset Parquet crs / algorithm defaults to
OGC:CRS84 / SPHERICAL), GeographyType now treats an explicit default
algorithm as equal to an omitted one: equals, hashCode, and toString
use the resolved getters crs() / algorithm() instead of the raw
nullable fields, matching how the CRS already resolves through its
getter.

Schema mapping only; the value read/write path and metrics handling
are follow-ups.

Co-authored-by: Isaac
@huan233usc huan233usc force-pushed the parquet-geo-schema branch from f58e12b to 57036d7 Compare June 11, 2026 05:53
@huan233usc

Contributor Author

Hi @szehon-ho, could you help review this when you have a chance? Thank you very much!

@szehon-ho

Member

A couple of clarifying notes (not blocking — just flagging for discussion):

1. Write path isn't blocked, only the read path.

Now that TypeToMessageType emits geometry/geography, ParquetSchemaUtil.convert succeeds where it previously threw, so the failure point moves into the value path and the two sides behave asymmetrically:

  • Read: BaseParquetReaders.primitive runs the logical-type visitor and .orElseThrow(...), so a geo column fails fast with UnsupportedOperationException.
  • Write: BaseParquetWriter.primitive runs its logical-type visitor, but when it returns empty it falls through to case BINARY -> ParquetValueWriters.byteBuffers(desc) rather than throwing. So a geo write no longer fails at writer setup — it silently degrades to the generic binary writer.

Since the value path is an explicit follow-up, it'd be safer to also reject GEOMETRY/GEOGRAPHY on the write side here, so we don't produce files the reader then refuses to read.
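The asymmetry in point 1 can be modeled in isolation. The sketch below is a simplified stand-in, not the actual `BaseParquetReaders` / `BaseParquetWriter` code: `GeoValuePathSketch`, its `Annotation` enum, and the returned strings are all hypothetical, and the visitor is reduced to a function that returns an empty `Optional` for annotations it does not know. `guardedBinaryWriter` shows one possible shape for the suggested write-side rejection.

```java
import java.util.Optional;

// Hypothetical model of the reader/writer control flow; not Iceberg code.
final class GeoValuePathSketch {
  enum Annotation { NONE, STRING, GEOMETRY, GEOGRAPHY }

  private GeoValuePathSketch() {}

  // Stand-in for a logical-type visitor with no geo support: empty for unknown annotations.
  static Optional<String> visit(Annotation annotation, String kind) {
    return annotation == Annotation.STRING ? Optional.of("string " + kind) : Optional.empty();
  }

  // Reader: an annotated column the visitor cannot handle fails fast.
  static String binaryReader(Annotation annotation) {
    if (annotation == Annotation.NONE) {
      return "byteBuffers reader";
    }
    return visit(annotation, "reader")
        .orElseThrow(
            () -> new UnsupportedOperationException("Unsupported logical type: " + annotation));
  }

  // Writer: an empty visitor result falls through to the generic binary writer.
  static String binaryWriter(Annotation annotation) {
    Optional<String> writer =
        annotation == Annotation.NONE ? Optional.empty() : visit(annotation, "writer");
    return writer.orElse("byteBuffers writer");
  }

  // Suggested guard: reject geo annotations before the fall-through can happen.
  static String guardedBinaryWriter(Annotation annotation) {
    if (annotation == Annotation.GEOMETRY || annotation == Annotation.GEOGRAPHY) {
      throw new UnsupportedOperationException("Cannot write geospatial values yet: " + annotation);
    }
    return binaryWriter(annotation);
  }

  public static void main(String[] args) {
    // The unguarded writer accepts a geo column that the reader would reject.
    System.out.println(binaryWriter(Annotation.GEOMETRY));
  }
}
```

With the guard in place, both sides fail at setup time until the value path lands, so no file is written that the same build then refuses to read.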

2. Serialization effect of the toString change (default algorithm dropped).

For reference, here's what gets written into table metadata before/after this PR. CRS is unchanged (the constructor already collapses the default to the bare form); only the algorithm changes.

CRS — no change:

| User action | Before | After |
| --- | --- | --- |
| Sets default explicitly (OGC:CRS84) | "geography" | "geography" |
| Doesn't set it | "geography" | "geography" |

Algorithm — changed:

| User action | Before | After |
| --- | --- | --- |
| Sets default explicitly (spherical) | "geography(OGC:CRS84, spherical)" | "geography" |
| Doesn't set it | "geography" | "geography" |

So explicitly setting spherical is no longer persisted — it now serializes identically to not setting it. Semantically fine (the spec defaults an unspecified algorithm to spherical), but re-serializing an existing table that stored "...,spherical)" will produce the shorter string (same type, different bytes). Worth calling out in the PR description.

Xin Huang added 2 commits June 17, 2026 23:13
Keep explicitly supplied default CRS and geography edge algorithm in type string serialization while continuing to compare geospatial types using resolved defaults.
Serialize geospatial types with explicit default CRS and edge algorithm while preserving semantic equality for default-equivalent forms.
@huan233usc

Contributor Author

Updated again in ca4e1a689 to make the behavior fully explicit by default:

  • GeometryType.crs84() now serializes as geometry(OGC:CRS84)
  • GeographyType.crs84() now serializes as geography(OGC:CRS84, spherical)
  • bare input like geometry / geography is still accepted and resolves to those defaults
  • equals / hashCode still compare resolved defaults, so default-equivalent forms remain equal

This keeps Iceberg table metadata aligned with the concrete defaults written into Parquet annotations and avoids depending on future default interpretation.
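The behavior summarized in this update (explicit defaults in the serialized string, resolved defaults in equality) can be sketched as below. This is a hypothetical model, not the real `Types.GeographyType`: the class name `GeographyTypeSketch`, its constructor, and the use of plain strings for the CRS and algorithm are assumptions made for illustration. Only the default values and the rendered string come from the comment above.

```java
import java.util.Objects;

// Hypothetical model of the described semantics; not the Iceberg implementation.
final class GeographyTypeSketch {
  static final String DEFAULT_CRS = "OGC:CRS84";
  static final String DEFAULT_ALGORITHM = "spherical";

  private final String crs; // null means "not set"
  private final String algorithm; // null means "not set"

  GeographyTypeSketch(String crs, String algorithm) {
    this.crs = crs;
    this.algorithm = algorithm;
  }

  // Getters resolve an unset field to its default.
  String crs() {
    return crs != null ? crs : DEFAULT_CRS;
  }

  String algorithm() {
    return algorithm != null ? algorithm : DEFAULT_ALGORITHM;
  }

  // Equality compares resolved values, so default-equivalent forms are equal.
  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof GeographyTypeSketch)) {
      return false;
    }
    GeographyTypeSketch that = (GeographyTypeSketch) other;
    return crs().equals(that.crs()) && algorithm().equals(that.algorithm());
  }

  @Override
  public int hashCode() {
    return Objects.hash(crs(), algorithm());
  }

  // Serialization always spells out the resolved defaults.
  @Override
  public String toString() {
    return String.format("geography(%s, %s)", crs(), algorithm());
  }

  public static void main(String[] args) {
    System.out.println(new GeographyTypeSketch(null, null));
  }
}
```

Because `toString` and `equals` both go through the resolved getters, a type built from bare `geography` and one built with the defaults spelled out compare equal and render identically.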

Match Spotless formatting for the explicit geography default assertion.