HIVE-29578: Iceberg: add support for logical views#6449
Conversation
4fdad42 to
252c608
Compare
252c608 to
e10eba5
Compare
e10eba5 to
96fa476
Compare
96fa476 to
114412a
Compare
| result.setLastAccessTime(nowSec); | ||
| result.setRetention(Integer.MAX_VALUE); | ||
|
|
||
| boolean hiveEngineEnabled = false; |
There was a problem hiding this comment.
What is hiveEngineEnabled and why is it false?
There was a problem hiding this comment.
hiveEngineEnabled switches how HiveOperationsBase.storageDescriptor fills the Storage Desacriptor: with HiveIcebergInputFormat / HiveIcebergOutputFormat / HiveIcebergSerDe when true, or the usual placeholder FileInputFormat / FileOutputFormat / LazySimpleSerDe when false.
Why it’s false in toHiveView:
This path materializes an HMS VIRTUAL_VIEW for REST catalog that expose Iceberg view metadata through the HMS API. That row isn’t meant to drive a Hive table scan the way a real Iceberg table commit does; execution still comes from the view definition / catalog, not from wiring Iceberg MR formats on the stub. HiveViewOperations does the same thing (hiveEngineEnabled = false).
So we keep a minimal SD consistent with normal virtual views and avoid implying this HMS object is an Iceberg-backed table for the Hive engine. For tables, HiveTableOperations still turns engine integration on/off via metadata + ConfigProperties.ENGINE_HIVE_ENABLED where that actually matters.
| @Deprecated // kept for backwards reference | ||
| REPLACE_VIEW_WITH_MATERIALIZED(10400, "Attempt to replace view {0} with materialized view", true), | ||
| REPLACE_MATERIALIZED_WITH_VIEW(10401, "Attempt to replace materialized view {0} with view", true), | ||
| VIEW_STORAGE_HANDLER_UNSUPPORTED(10448, "Storage handler {0} doesn't support external logical views", true), |
There was a problem hiding this comment.
Logical view is maybe a too verbose. I checked the Hive documentation and error messages and I only saw this naming: a view is a view or materialized view.
Technically it is true, it is a logical view. In the naming convention that Hive uses, it is just a view.
There was a problem hiding this comment.
To me omitting the word 'logical' sounds ambiguous.
Also, we are adding support for more view types and we will have regular hive view, hive matrialized view, Iceberg (logical) view, Iceberg materialized view (your PR), I just thought it will help understanding what view type we are referring to having 'logical' in the name.
@kasakrisz what do you think?
| .toList(); | ||
| .forEach(names::add); | ||
|
|
||
| if (restCatalog instanceof ViewCatalog viewCatalog) { |
There was a problem hiding this comment.
I find it a really interesting question:
In Hive, we consider views as tables. In Iceberg, view is just a view, not a table.
I'm not sure about the purpose of the rest client:
if it serves as an interface for Hive only, adding the views to the tables is fine.
If the goal is to provide an interface to Iceberg, I would add a new endpoint that lists the view.
There was a problem hiding this comment.
HiveRESTCatalogClient is an implementation of IMetaStoreClient that allows Hive work with REST Catalog servers, so it is for Hive.
| public boolean tableExists(String catName, String dbName, String tableName) { | ||
| validateCurrentCatalog(catName); | ||
| return restCatalog.tableExists(TableIdentifier.of(dbName, tableName)); | ||
| TableIdentifier id = TableIdentifier.of(dbName, tableName); |
There was a problem hiding this comment.
It looks like a simple instanceof operator can decide if we want to get a table or a view information.
The question is: do we need the extra call in case it is a view, not a table? And how the caller will know if (or even, do the want to know) if the existing object is a view or a table?
There was a problem hiding this comment.
It looks like a simple instanceof operator can decide if we want to get a table or a view information.
The question is: do we need the extra call in case it is a view, not a table?
I don't think a simple instanceof operator can decide if we want to get a table or a view information.
RESTCatalog implements both Catalog for tables and ViewCatalog
public class RESTCatalog implements Catalog, ViewCatalog, SupportsNamespaces, Configurable<Object>, Closeable {
And how the caller will know if (or even, do the want to know) if the existing object is a view or a table?
Implementation here in REST Catalog client follows same logic as regular Hive HMS clients which do not care if the existing object is a view or a table:
For example, see SessionHiveMetaStoreClient.tableExists
| org.apache.iceberg.Table icebergTable = restCatalog.loadTable(id); | ||
| return MetastoreUtil.toHiveTable(icebergTable, conf); | ||
| } catch (NoSuchTableException tableMissing) { | ||
| if (restCatalog instanceof ViewCatalog viewCatalog) { |
There was a problem hiding this comment.
The same as my previous comment: if we can decide where to call with an instanceof, no reason to call loadTable.
There was a problem hiding this comment.
Same as above, we can't decide where to call using just an instanceof.
| params.get(BaseMetastoreTableOperations.TABLE_TYPE_PROP)); | ||
| } | ||
|
|
||
| @Override |
There was a problem hiding this comment.
What is about the other overloads of alter_table and alter_table_with_environmentContext?
There was a problem hiding this comment.
Forget about alter_table_with_environmentContext. As I see it is not implemented.
The interesting question is still the alter_table overloads. Now we provide an interface that can actually altering a view OR doing nothing OR throw an UnsupportedOperationException depending on the formal and actual parameters.
There was a problem hiding this comment.
The IMetaStoreClient client has approximately 371 interface methods.
HiveRestCatalogClient implements only 16 of them which are needed and in use for REST Catalog use cases.
I didn't encounter cases where other alter_table implementations where needed in REST client tests.
|
|
||
| private void createOrReplaceLogicalView(Table table, String dbName, String tableName) { | ||
|
|
||
| List<FieldSchema> cols = Lists.newArrayList(table.getSd().getCols()); |
There was a problem hiding this comment.
Is it possible to create a virtual view that is partitioned?
There was a problem hiding this comment.
No - Iceberg API doesn't support it.
ViewBuilder builder =
viewCatalog
.buildView(identifier)
.withSchema(HiveSchemaUtil.convert(fieldSchemas, Collections.emptyMap(), true))
.withDefaultNamespace(Namespace.of(identifier.namespace().level(0)))
.withQuery("hive", viewSql);
| public static final String ICEBERG_VIEW_HMS_TABLE_TYPE_VALUE = | ||
| HiveOperationsBase.ICEBERG_VIEW_TYPE_VALUE.toUpperCase(java.util.Locale.ENGLISH); | ||
|
|
||
| private IcebergNativeLogicalViewSupport() { |
There was a problem hiding this comment.
Thanks for pointing out.
Fixed.
| if (hasIcebergNativeViewTableType(table) && restCatalog instanceof ViewCatalog) { | ||
| createOrReplaceLogicalView(table, table.getDbName(), table.getTableName()); | ||
| } else { | ||
| List<FieldSchema> cols = Lists.newArrayList(table.getSd().getCols()); |
There was a problem hiding this comment.
Extracting the createTable logic would improve the readability.
| public void createTable(CreateTableRequest request) throws TException { | ||
| Table table = request.getTable(); | ||
| if (hasIcebergNativeViewTableType(table) && restCatalog instanceof ViewCatalog) { | ||
| createOrReplaceLogicalView(table, table.getDbName(), table.getTableName()); |
There was a problem hiding this comment.
I think it leads to inconsistent behavior: if we call the method on a table, it creates it if it is a new one and throws an AlreadyExistsException if the table already exists and also if a view with the same name is already exists.
But if we call this method on a view, it will create it even if a table exists with the same name and replaces the existing one if the view is already exists.
There was a problem hiding this comment.
This method cannot be called on a view if it already exists - see #6449 (comment)
|
|
||
| String comment = tblProps.get("comment"); | ||
| IcebergNativeLogicalViewSupport.createOrReplaceNativeView( | ||
| conf, dbName, tableName, cols, table.getViewExpandedText(), tblProps, comment); |
There was a problem hiding this comment.
The view text is pretty interesting difference between Hive and Iceberg:
Hive stores it in two versions: original and expanded text.
Iceberg has only one field, sql.
I think if we want to be compatible with both Hive and Iceberg, I would rather choose the original text here and add the expanded text as a tableproperty.
But also, there is a catch: the sql is stored in ViewVersion so it can differ for each version. But in Hive, it is a single property of the table metadata.
Honestly, I'm not sure what is the correct solution to do that but if we store only one of them, I fear we can loose functionality on Hive side.
There was a problem hiding this comment.
I believe passing table.getViewExpandedText() is more correct because SemanticAnalyzer uses expanded text when resolving views.
On the write view path, view text is stored in HMS mainly for passing it to the IcebergNativeLogicalViewSupport.createOrReplaceNativeView, but on the read path view text is always retrieved from Iceberg API both on HMS and REST paths in BaseHiveIcebergMetaHook:
@Override
public void postGetTable(org.apache.hadoop.hive.metastore.api.Table hmsTable) {
if (hmsTable != null) {
try {
if (isNativeIcebergLogicalView(hmsTable)) {
IcebergNativeLogicalViewSupport.enrichHmsTableFromIcebergView(hmsTable, conf);
return;
}
| HiveOperationsBase.ICEBERG_VIEW_TYPE_VALUE.toUpperCase(Locale.ENGLISH), | ||
| metadata.schema(), | ||
| maxHiveTablePropertySize); | ||
| parameters.put(hive_metastoreConstants.META_TABLE_STORAGE, HIVE_ICEBERG_STORAGE_HANDLER); |
There was a problem hiding this comment.
What if the view is the previous version of view and we store it in Hive?
There was a problem hiding this comment.
I am not sure how this question is related to the pointed location in code, but same as previous comment - when querying Iceberg logical view it's properties are retrieved from Iceberg and not from HMS.
|
|
||
| try { | ||
| if (catalog instanceof Closeable closeable) { | ||
| try (Closeable ignored = closeable) { |
There was a problem hiding this comment.
I'm pretty sure I'm missing something: what is the reason of checking if the catalog is Closeable and ignoring it?
There was a problem hiding this comment.
buildIcebergCatalog gives us a one-off catalog for this call. The Catalog interface isn’t Closeable, but some implementations are (RestCatalog is Closable) and they hold clients/IO that need to be closed when we’re done. HiveCatalog isn’t, so we only wrap when instanceof Closeable.
We are not ignoring it. The try (Closeable ignored = closeable) bit is just try-with-resources, we need a variable so close() runs at the end of the block. We don’t reference it in the body because we already use catalog for loadView. Naming it as ignored is only to signal “this exists for cleanup, not for logic.” Without it you can leak connections on every postGetTable enrich against REST.
| tblProperties == null ? Maps.newHashMap() : Maps.newHashMap(tblProperties); | ||
|
|
||
| for (Map.Entry<String, String> e : tblProps.entrySet()) { | ||
| if (e.getKey() != null && e.getValue() != null) { |
There was a problem hiding this comment.
Is it possible having table properties without having a key or value?
There was a problem hiding this comment.
I think this is not needed, removed.
| result.setTableType(TableType.EXTERNAL_TABLE.toString()); | ||
| result.setPartitionKeys(getPartitionKeys(table, table.spec().specId())); | ||
|
|
||
| // TODO: Revert after HIVE-29633 is fixed |
There was a problem hiding this comment.
Can this comment be a leftover? As I see there a toHiveView method created below. And HIVE-29633 is about views.
There was a problem hiding this comment.
No, this is intentional and agreed with Denys and Krisztian.
| PropertyUtil.propertyAsString( | ||
| metadata.properties(), HiveCatalog.HMS_TABLE_OWNER, System.getProperty("user.name")); | ||
| result.setOwner(owner); | ||
| result.setCreateTime(nowSec); |
There was a problem hiding this comment.
In Iceberg view, timestamp-ms in the view version stores the create time.
There was a problem hiding this comment.
Changed to get it from the Iceberg metadata instead.
| null); | ||
|
|
||
| // In-memory overlay for compile/describe: authoritative SQL comes from Iceberg metadata. | ||
| hmsTable.setViewOriginalText(sqlText); |
There was a problem hiding this comment.
Can it cause any trouble on Hive side of the original text is actually the expanded text?
There was a problem hiding this comment.
I've implemented a lot of test cases both on HMS and REST paths and no problems are observed.
It is taken from Iceberg metadata:
public static void applyIcebergViewToHmsTable(Table hmsTable, View view, Configuration conf) {
ViewMetadata metadata = ((BaseView) view).operations().current();
String sqlText = viewSqlText(view, metadata);
|
|
||
| private void preCreateNativeIcebergLogicalView(CreateTableRequest request) { | ||
|
|
||
| org.apache.hadoop.hive.metastore.api.Table hmsTable = request.getTable(); |
There was a problem hiding this comment.
For tables, we check if the table is already existing. Why don't we need a similar check in case of views?
| } | ||
|
|
||
| @Override | ||
| public void rollbackCreateTable(org.apache.hadoop.hive.metastore.api.Table hmsTable) { |
There was a problem hiding this comment.
What is the reason of this change and the removal of the other empty overrides?
There was a problem hiding this comment.
Code cleanup of empty overrides which were not needed..
| throw new IllegalStateException("Hive commit lock should already be set"); | ||
| } | ||
| commitLock.unlock(); | ||
| if (BaseHiveIcebergMetaHook.isNativeIcebergLogicalView(hmsTable)) { |
There was a problem hiding this comment.
I would add this part before commitLock.unlock. Yes, I know the current implementation uses NoLock for native views but it can lead to hard to detect issues in case we make a decision later to use actual locks for Iceberg views later.
|
|
||
| @Override | ||
| public boolean supportsExternalLogicalViewCatalog() { | ||
| return true; |
There was a problem hiding this comment.
Is it still true in other metastore clients than HiveRESTCatalogClient?
There was a problem hiding this comment.
Yes, this is also true for regular Hive+HMS configuration.
See iceberg_native_logical_view.q - there is no REST client in this test.
| desc formatted v_hive; | ||
| drop view v_hive; | ||
|
|
||
| ----------------------------------------------------------------------------------------- |
There was a problem hiding this comment.
I love those comments. I wish all qtests would such information.
There was a problem hiding this comment.
actually the full comment on these lines is
-----------------------------------------------------------------------------------------
-- Replace Iceberg logical view with a Hive-native logical view
-----------------------------------------------------------------------------------------
| `dept_id` bigint, | ||
| `team_id` bigint, | ||
| `company_id` bigint) | ||
| PARTITIONED BY ( |
There was a problem hiding this comment.
What is the reason of this removal?
There was a problem hiding this comment.
This is related to your previous comment: #6449 (comment)
Previously, MetastoreUtil.toHiveTable(..) was setting partition keys and this was causing failures in Iceberg views on partitioned base Iceberg tables. In this PR it was decided with Denys and Krisztian to set partition keys as empty in MetastoreUtil.toHiveTable(..) until HIVE-29633 is fixed.
// TODO: Revert after HIVE-29633 is fixed
// result.setPartitionKeys(getPartitionKeys(table, table.spec().specId()));
result.setPartitionKeys(Lists.newArrayList());
| # Detailed Table Information | ||
| Database: ice_rest | ||
| #### A masked pattern was here #### | ||
| Retention: 2147483647 |
There was a problem hiding this comment.
Is it a timestamp or some remaining time? I wonder if it would worth masking out those values or not to avoid test flakyness.
There was a problem hiding this comment.
2147483647 is Integer.MAX_VALUE - the HMS Table.retention field is set in MetastoreUtil.applyIcebergViewToHmsTable.
There was a problem hiding this comment.
I removed setting retention to match HMS behavior which sets 0 by default.
|
|
||
| CreateViewDesc desc = new CreateViewDesc(dbDotView, table.getCols(), null, table.getParameters(), | ||
| table.getPartColNames(), false, false, viewOriginalText, viewExpandedText, table.getPartCols()); | ||
| CreateViewDesc desc = new CreateViewDesc(dbDotView, table.getCols(), null, table.getParameters(), |
There was a problem hiding this comment.
Is that HMS replication?
If yes, I worry about this scenario:
- We don't have those information in HMS as we use REST client.
- We start replication.
- Replication adds those information on the target.
- Target will contain different data than source.
There was a problem hiding this comment.
We can use regular HMS client with Iceberg logical views too.
HMS still has basic info about Iceberg views, like view identity (db, view name), but all the other info is coming from Iceberg metadata at query time, so I don't think there should be an issue here.
|
On conceptional level, I have two major questions: The other important question is the boundary between what we store on HMS and what we store on Iceberg side? |
Done - previously Hive was getting the view query definition that was stored in HMS, now I made it retrieve view definition from Iceberg files (via Iceberg API) in |
Same as #6449 (comment), answered there.
HMS stores only a view identity; at the read time, we construct HMS in-memory view object that gets enriched from the Iceberg metadata (including View SQL text) on both HMS and REST client paths in |
|



What changes were proposed in this pull request?
Added support for Iceberg logical views in Hive for both HMS and REST catalogs.
Why are the changes needed?
To support Iceberg logical views. This can be especially useful for REST Catalog clients.
Does this PR introduce any user-facing change?
Yes, new HQL syntax:
tblproperties ('view-format'='iceberg'):tblproperties ('view-format'='iceberg')) and Hive confighive.default.storage.handler.classpointing to a storage handler that supports external logical views:How was this patch tested?
Created new and updated exiting unit and integration tests with Iceberg logical views test cases.