API, Parquet: Map geometry and geography to Parquet logical types #16765
huan233usc wants to merge 4 commits into
Map Iceberg geometry and geography to and from Parquet's geometry / geography logical type annotations on a BINARY column, passing the resolved CRS and edge algorithm through directly. Iceberg and Parquet use the same algorithm names; the read side resolves names with EdgeAlgorithm.fromName, the same conversion used when parsing geography type strings, and maps unset annotation parameters to the Iceberg defaults.

To make the plain geography type round-trip through writers that omit default parameters (an unset Parquet crs / algorithm defaults to OGC:CRS84 / SPHERICAL), GeographyType now treats an explicit default algorithm as equal to an omitted one: equals, hashCode, and toString use the resolved getters crs() / algorithm() instead of the raw nullable fields, matching how the CRS already resolves through its getter.

Schema mapping only; the value read/write path and metrics handling are follow-ups.

Co-authored-by: Isaac
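The equality rule described in this commit message can be sketched in isolation. The following is a minimal model, not Iceberg's `Types.GeographyType`: the class name `GeographySketch`, its fields, and the string-typed algorithm are assumptions made purely for illustration. It stores the raw nullable values but compares the resolved ones, so an explicit default and an omitted parameter are treated as equal.

```java
import java.util.Objects;

// Hypothetical stand-in for a geography type; this is not the Iceberg class.
final class GeographySketch {
  static final String DEFAULT_CRS = "OGC:CRS84";
  static final String DEFAULT_ALGORITHM = "SPHERICAL";

  private final String crs; // null means "not specified"
  private final String algorithm; // null means "not specified"

  GeographySketch(String crs, String algorithm) {
    this.crs = crs;
    this.algorithm = algorithm;
  }

  // Resolved getters: fall back to the default when the raw field is unset.
  String crs() {
    return crs != null ? crs : DEFAULT_CRS;
  }

  String algorithm() {
    return algorithm != null ? algorithm : DEFAULT_ALGORITHM;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof GeographySketch)) {
      return false;
    }
    GeographySketch that = (GeographySketch) other;
    // Compare resolved values rather than the raw nullable fields.
    return crs().equals(that.crs()) && algorithm().equals(that.algorithm());
  }

  @Override
  public int hashCode() {
    // Hash the same resolved values that equals() compares.
    return Objects.hash(crs(), algorithm());
  }
}
```

Under this model, `new GeographySketch(null, null)` and `new GeographySketch("OGC:CRS84", "SPHERICAL")` are equal and share a hash code, while a non-default algorithm such as `"KARNEY"` stays distinct.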
Hi @szehon-ho, could you help review this when you have a chance? Thank you very much!
A couple of clarifying notes (not blocking, just flagging for discussion):

1. Write path isn't blocked, only the read path. Now that
Since the value path is an explicit follow-up, it'd be safer to also reject

2. Serialization effect of the

For reference, here's what gets written into table metadata before/after this PR. CRS is unchanged (the constructor already collapses the default to the bare form); only the algorithm changes.

CRS (no change):
Algorithm (changed):
So explicitly setting
Keep explicitly supplied default CRS and geography edge algorithm in type string serialization while continuing to compare geospatial types using resolved defaults.
Serialize geospatial types with explicit default CRS and edge algorithm while preserving semantic equality for default-equivalent forms.
Updated again in
This keeps Iceberg table metadata aligned with the concrete defaults written into Parquet annotations and avoids depending on future default interpretation.
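As a rough illustration of the serialization described here, the sketch below renders type strings with the concrete defaults filled in. The class `GeoTypeStrings` and its methods `renderGeometry` and `renderGeography` are hypothetical and are not part of Iceberg; the output format follows the `geometry(OGC:CRS84)` and `geography(OGC:CRS84, spherical)` forms quoted in the PR summary, and the lowercase algorithm name is assumed from that example.

```java
import java.util.Locale;

// Hypothetical helper; it models the rendering only, not Iceberg's Types classes.
final class GeoTypeStrings {
  static final String DEFAULT_CRS = "OGC:CRS84";
  static final String DEFAULT_ALGORITHM = "SPHERICAL";

  private GeoTypeStrings() {}

  // A null CRS means "not specified" and is rendered as the concrete default.
  static String renderGeometry(String crs) {
    return "geometry(" + (crs != null ? crs : DEFAULT_CRS) + ")";
  }

  // Both parameters may be null; each is replaced by its default before rendering.
  static String renderGeography(String crs, String algorithm) {
    String resolvedCrs = crs != null ? crs : DEFAULT_CRS;
    String resolvedAlgorithm = algorithm != null ? algorithm : DEFAULT_ALGORITHM;
    return "geography("
        + resolvedCrs
        + ", "
        + resolvedAlgorithm.toLowerCase(Locale.ROOT)
        + ")";
  }
}
```

The point of writing the defaults out is that a reader of the metadata does not need to know which default applied when the table was written.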
Match Spotless formatting for the explicit geography default assertion.
Summary
Map Iceberg `geometry` and `geography` primitive types to and from Parquet's geometry / geography logical type annotations on a `BINARY` column, so the geo types survive a schema round-trip through `ParquetSchemaUtil.convert` in both directions.

- `TypeToMessageType`: emit `geometry` / `geography` as `BINARY` annotated with `LogicalTypeAnnotation.geometryType` / `geographyType`, passing the resolved CRS and edge-interpolation algorithm through directly. Iceberg and Parquet use the same algorithm names (`SPHERICAL`, `VINCENTY`, `THOMAS`, `ANDOYER`, `KARNEY`), so the algorithm is mapped by name.
- `MessageTypeToType`: read those annotations back into `Types.GeometryType` / `Types.GeographyType`. An unset `crs` / `algorithm` maps to the Iceberg default (the Parquet defaults are the same: `OGC:CRS84` / `SPHERICAL`), and algorithm names are resolved with `EdgeAlgorithm.fromName`, the same conversion used when parsing geography type strings (`Types.fromPrimitiveString`).
- `Types.GeometryType` / `Types.GeographyType`: serialize with explicit CRS / edge algorithm defaults (`geometry(OGC:CRS84)`, `geography(OGC:CRS84, spherical)`) so table metadata preserves the concrete defaults being used, while `equals` / `hashCode` compare resolved defaults so default-equivalent forms remain semantically equal. No public signature changes.

This is the first step of plumbing the geo value path through Parquet. It is intentionally schema mapping only: the generic value read/write path (`BaseParquetReaders` / `BaseParquetWriter`) and the `ParquetMetrics` guard for geo columns are separate follow-ups, so this PR stays small and easy to review. It is purely additive, with no behavior changes for non-geo types.

This is 1/N for #16650
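The two conversion directions in the summary can be modeled without either library. The sketch below is only an approximation under stated assumptions: `GeoSchemaMappingSketch`, its nested `Annotation` and `Geography` classes, and its `Algorithm` enum are hypothetical stand-ins, not Parquet's `LogicalTypeAnnotation` or Iceberg's `Types` and `EdgeAlgorithm`. The case-insensitive `fromName` is also an assumption, based on type strings using lowercase names.

```java
import java.util.Locale;

// Hypothetical model of the schema mapping; none of these are Iceberg or Parquet classes.
final class GeoSchemaMappingSketch {
  enum Algorithm {
    SPHERICAL, VINCENTY, THOMAS, ANDOYER, KARNEY;

    // Assumed case-insensitive lookup by constant name.
    static Algorithm fromName(String name) {
      return valueOf(name.toUpperCase(Locale.ROOT));
    }
  }

  static final String DEFAULT_CRS = "OGC:CRS84";
  static final Algorithm DEFAULT_ALGORITHM = Algorithm.SPHERICAL;

  // Stand-in for a geography annotation on a BINARY column; either parameter may be null.
  static final class Annotation {
    final String crs;
    final String algorithmName;

    Annotation(String crs, String algorithmName) {
      this.crs = crs;
      this.algorithmName = algorithmName;
    }
  }

  // Stand-in for the resolved table-side type; both fields are always set.
  static final class Geography {
    final String crs;
    final Algorithm algorithm;

    Geography(String crs, Algorithm algorithm) {
      this.crs = crs;
      this.algorithm = algorithm;
    }
  }

  // Write side: pass the resolved CRS through and map the algorithm by name.
  static Annotation toAnnotation(Geography type) {
    return new Annotation(type.crs, type.algorithm.name());
  }

  // Read side: unset parameters map to the defaults, names are resolved by lookup.
  static Geography fromAnnotation(Annotation annotation) {
    String crs = annotation.crs != null ? annotation.crs : DEFAULT_CRS;
    Algorithm algorithm =
        annotation.algorithmName != null
            ? Algorithm.fromName(annotation.algorithmName)
            : DEFAULT_ALGORITHM;
    return new Geography(crs, algorithm);
  }
}
```

Passing `new Annotation(null, null)` through `fromAnnotation` yields `OGC:CRS84` with `SPHERICAL`, which models a file written by an engine that omits default parameters.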
Test plan
- `TestParquetSchemaUtil#testGeospatialTypeRoundTrip` round-trips a schema with default-CRS geometry, an explicit-CRS geometry, default geography, and a geography per edge algorithm (all five, so the by-name mapping is exercised for every constant in both directions) through `ParquetSchemaUtil.convert`.
- `TestParquetSchemaUtil#testGeospatialAnnotationsWithOmittedParameters` reads hand-built `MessageType`s with unset / explicit / explicit-default CRS and algorithm, covering files written by engines that omit defaults, and confirms each maps to the expected Iceberg type and serializes with explicit defaults.
- `TestTypes#testGeospatialTypeDefaultNormalization` covers `equals()` / `hashCode()` parity for the default-CRS and default-algorithm geography forms, that `algorithm()` still reports `SPHERICAL`, and that a non-default algorithm stays distinct; `testGeospatialTypeToString` covers explicit-default rendering.
- `./gradlew :iceberg-api:check :iceberg-parquet:check`: clean (tests, checkstyle, revapi, spotless).
- `./gradlew :iceberg-core:test`: full core suite green; no regressions in `TestSchemaParser` / `TestSingleValueParser` / `TestGeospatialTable` or anywhere geography types are serialized.