GH-45523: [R] Implement Utf8View type bindings#49712
Conversation
|
We should rebase once #49710 is merged as changes from that PR are in this branch as they were needed to make it work. |
e8768d0 to
bb711a7
Compare
| cpp11::stop( | ||
| "Cannot convert Dictionary Array of type `%s` to R: dictionary has " | ||
| "more levels than an R factor can represent", | ||
| dict_type.ToString().c_str()); |
There was a problem hiding this comment.
We should add a test for this
| return this->value_builder_->Append( | ||
| std::string_view(view_.bytes, static_cast<size_t>(view_.size))); |
There was a problem hiding this comment.
The old code passed (const char*, int32_t) which matches StringBuilder::Append but not StringViewBuilder::Append (which takes int64_t). Switching to std::string_view works for both builder types.
| template <typename T> | ||
| class RPrimitiveConverter<T, enable_if_string_view<T>> | ||
| : public PrimitiveConverter<T, RConverter> { | ||
| public: | ||
| Status Extend(SEXP x, int64_t size, int64_t offset = 0) override { | ||
| RVectorType rtype = GetVectorType(x); | ||
| if (rtype != STRING) { | ||
| return Status::Invalid("Expecting a character vector"); | ||
| } | ||
| return UnsafeAppendUtf8Strings(arrow::r::utf8_strings(x), size, offset); | ||
| } | ||
|
|
||
| void DelayedExtend(SEXP values, int64_t size, RTasks& tasks) override { | ||
| auto task = [this, values, size]() { return this->Extend(values, size); }; | ||
| tasks.Append(false, std::move(task)); | ||
| } | ||
|
|
||
| private: | ||
| Status UnsafeAppendUtf8Strings(const cpp11::strings& s, int64_t size, int64_t offset) { | ||
| RETURN_NOT_OK(this->primitive_builder_->Reserve(size - offset)); | ||
| const SEXP* p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s)) + offset; | ||
|
|
||
| int64_t total_length = 0; | ||
| for (R_xlen_t i = offset; i < size; i++, ++p_strings) { | ||
| SEXP si = *p_strings; | ||
| total_length += si == NA_STRING ? 0 : LENGTH(si); | ||
| } | ||
| RETURN_NOT_OK(this->primitive_builder_->ReserveData(total_length)); | ||
|
|
||
| p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s)) + offset; | ||
| for (R_xlen_t i = offset; i < size; i++, ++p_strings) { | ||
| SEXP si = *p_strings; | ||
| if (si == NA_STRING) { | ||
| this->primitive_builder_->UnsafeAppendNull(); | ||
| } else { | ||
| this->primitive_builder_->UnsafeAppend(CHAR(si), LENGTH(si)); | ||
| } | ||
| } | ||
|
|
||
| return Status::OK(); | ||
| } | ||
| }; |
There was a problem hiding this comment.
I'm less confident about this entire block of code
|
A lot of |
de59365 to
cf9ad10
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds R bindings and conversion support for Arrow utf8_view / string_view, addressing IPC/table conversion failures for data containing StringView columns.
Changes:
- Adds
string_view()R data type binding, export registration, and type tests. - Implements R↔Arrow conversion paths for StringView arrays and dictionary values.
- Updates generated documentation and related converter support in C++/Python.
Reviewed changes
Copilot reviewed 13 out of 23 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| r/R/type.R | Adds StringView R6 type, constructor, and canonical type aliases. |
| r/R/arrowExports.R | Registers R wrapper for StringView__initialize. |
| r/NAMESPACE | Exports string_view. |
| r/src/datatype.cpp | Maps Arrow STRING_VIEW to R StringView and initializes utf8_view. |
| r/src/r_to_arrow.cpp | Adds R-to-Arrow StringView conversion and dictionary StringView value handling. |
| r/src/array_to_vector.cpp | Adds Arrow-to-R StringView and wider dictionary index conversion support. |
| r/src/arrowExports.cpp | Registers native StringView initialization entry point. |
| cpp/src/arrow/util/converter.h | Enables dictionary converters for StringViewType. |
| python/pyarrow/src/arrow/python/python_to_arrow.cc | Adjusts Python dictionary StringView append call. |
| r/tests/testthat/test-Array.R | Adds StringView array round-trip tests. |
| r/tests/testthat/test-Table.R | Adds table/dictionary StringView and wider index tests. |
| r/tests/testthat/test-data-type.R | Adds StringView data type and code round-trip tests. |
| r/DESCRIPTION | Updates roxygen metadata. |
| r/man/data-type.Rd | Adds string_view() documentation entry. |
| r/man/acero.Rd | Regenerated Acero documentation. |
| r/man/arrow-package.Rd | Regenerated package author documentation. |
| r/man/csv_convert_options.Rd | Regenerated CSV conversion docs. |
| r/man/csv_read_options.Rd | Regenerated CSV read docs. |
| r/man/CsvReadOptions.Rd | Regenerated CSV read options docs. |
| r/man/enums.Rd | Regenerated enum documentation. |
| r/man/JsonFileFormat.Rd | Regenerated JSON file format docs. |
| r/man/reexports.Rd | Regenerated reexports docs. |
| r/man/vctrs_extension_array.Rd | Regenerated vctrs extension docs. |
Files not reviewed (10)
- r/man/CsvReadOptions.Rd: Language not supported
- r/man/JsonFileFormat.Rd: Language not supported
- r/man/acero.Rd: Language not supported
- r/man/arrow-package.Rd: Language not supported
- r/man/csv_convert_options.Rd: Language not supported
- r/man/csv_read_options.Rd: Language not supported
- r/man/data-type.Rd: Language not supported
- r/man/enums.Rd: Language not supported
- r/man/reexports.Rd: Language not supported
- r/man/vctrs_extension_array.Rd: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int64_t total_length = 0; | ||
| for (R_xlen_t i = offset; i < size; i++, ++p_strings) { | ||
| SEXP si = *p_strings; | ||
| total_length += si == NA_STRING ? 0 : LENGTH(si); | ||
| } | ||
| RETURN_NOT_OK(this->primitive_builder_->ReserveData(total_length)); | ||
|
|
||
| p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s)) + offset; | ||
| for (R_xlen_t i = offset; i < size; i++, ++p_strings) { | ||
| SEXP si = *p_strings; | ||
| if (si == NA_STRING) { | ||
| this->primitive_builder_->UnsafeAppendNull(); | ||
| } else { |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 23 changed files in this pull request and generated 1 comment.
Files not reviewed (10)
- r/man/CsvReadOptions.Rd: Language not supported
- r/man/JsonFileFormat.Rd: Language not supported
- r/man/acero.Rd: Language not supported
- r/man/arrow-package.Rd: Language not supported
- r/man/csv_convert_options.Rd: Language not supported
- r/man/csv_read_options.Rd: Language not supported
- r/man/data-type.Rd: Language not supported
- r/man/enums.Rd: Language not supported
- r/man/reexports.Rd: Language not supported
- r/man/vctrs_extension_array.Rd: Language not supported
| int64_t total_length = 0; | ||
| for (R_xlen_t i = offset; i < size; i++, ++p_strings) { | ||
| SEXP si = *p_strings; | ||
| total_length += si == NA_STRING ? 0 : LENGTH(si); | ||
| } | ||
| RETURN_NOT_OK(this->primitive_builder_->ReserveData(total_length)); | ||
|
|
||
| p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s)) + offset; | ||
| for (R_xlen_t i = offset; i < size; i++, ++p_strings) { | ||
| SEXP si = *p_strings; | ||
| if (si == NA_STRING) { | ||
| this->primitive_builder_->UnsafeAppendNull(); | ||
| } else { |
Rationale for this change
No bindings for Utf8View type in the R package
What changes are included in this PR?
Implement bindings
Are these changes tested?
Yep
Are there any user-facing changes?
Yep, adding functionality.
AI Usage
Heavily used Codex/Claude here. I'm not confident of every line of code. I read things over, and iterated on it making sure that tests pass and nothing seemed wildly incorrect.