Skip to content

GH-45523: [R] Implement Utf8View type bindings#49712

Draft
thisisnic wants to merge 10 commits into
apache:mainfrom
thisisnic:GH-45523-ipc-polars
Draft

GH-45523: [R] Implement Utf8View type bindings#49712
thisisnic wants to merge 10 commits into
apache:mainfrom
thisisnic:GH-45523-ipc-polars

Conversation

@thisisnic
Copy link
Copy Markdown
Member

@thisisnic thisisnic commented Apr 11, 2026

Rationale for this change

No bindings for Utf8View type in the R package

What changes are included in this PR?

Implement bindings

Are these changes tested?

Yep

Are there any user-facing changes?

Yep, adding functionality.

AI Usage

Heavily used Codex/Claude here. I'm not confident of every line of code. I read things over, and iterated on it making sure that tests pass and nothing seemed wildly incorrect.

@thisisnic
Copy link
Copy Markdown
Member Author

We should rebase once #49710 is merged as changes from that PR are in this branch as they were needed to make it work.

Comment thread r/src/array_to_vector.cpp
Comment on lines +627 to +630
cpp11::stop(
"Cannot convert Dictionary Array of type `%s` to R: dictionary has "
"more levels than an R factor can represent",
dict_type.ToString().c_str());
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add a test for this

@github-actions github-actions Bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 5, 2026
Comment on lines +829 to +830
return this->value_builder_->Append(
std::string_view(view_.bytes, static_cast<size_t>(view_.size)));
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code passed (const char*, int32_t) which matches StringBuilder::Append but not StringViewBuilder::Append (which takes int64_t). Switching to std::string_view works for both builder types.

Comment thread r/src/r_to_arrow.cpp
Comment on lines +913 to +954
template <typename T>
class RPrimitiveConverter<T, enable_if_string_view<T>>
: public PrimitiveConverter<T, RConverter> {
public:
Status Extend(SEXP x, int64_t size, int64_t offset = 0) override {
RVectorType rtype = GetVectorType(x);
if (rtype != STRING) {
return Status::Invalid("Expecting a character vector");
}
return UnsafeAppendUtf8Strings(arrow::r::utf8_strings(x), size, offset);
}

void DelayedExtend(SEXP values, int64_t size, RTasks& tasks) override {
auto task = [this, values, size]() { return this->Extend(values, size); };
tasks.Append(false, std::move(task));
}

private:
Status UnsafeAppendUtf8Strings(const cpp11::strings& s, int64_t size, int64_t offset) {
RETURN_NOT_OK(this->primitive_builder_->Reserve(size - offset));
const SEXP* p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s)) + offset;

int64_t total_length = 0;
for (R_xlen_t i = offset; i < size; i++, ++p_strings) {
SEXP si = *p_strings;
total_length += si == NA_STRING ? 0 : LENGTH(si);
}
RETURN_NOT_OK(this->primitive_builder_->ReserveData(total_length));

p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s)) + offset;
for (R_xlen_t i = offset; i < size; i++, ++p_strings) {
SEXP si = *p_strings;
if (si == NA_STRING) {
this->primitive_builder_->UnsafeAppendNull();
} else {
this->primitive_builder_->UnsafeAppend(CHAR(si), LENGTH(si));
}
}

return Status::OK();
}
};
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm less confident about this entire block of code

@thisisnic
Copy link
Copy Markdown
Member Author

A lot of .Rd updates as I used a newer roxygen2 version.

Copilot AI review requested due to automatic review settings May 28, 2026 09:37
@thisisnic thisisnic force-pushed the GH-45523-ipc-polars branch from de59365 to cf9ad10 Compare May 28, 2026 09:37
@github-actions github-actions Bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 28, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds R bindings and conversion support for Arrow utf8_view / string_view, addressing IPC/table conversion failures for data containing StringView columns.

Changes:

  • Adds string_view() R data type binding, export registration, and type tests.
  • Implements R↔Arrow conversion paths for StringView arrays and dictionary values.
  • Updates generated documentation and related converter support in C++/Python.

Reviewed changes

Copilot reviewed 13 out of 23 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
r/R/type.R Adds StringView R6 type, constructor, and canonical type aliases.
r/R/arrowExports.R Registers R wrapper for StringView__initialize.
r/NAMESPACE Exports string_view.
r/src/datatype.cpp Maps Arrow STRING_VIEW to R StringView and initializes utf8_view.
r/src/r_to_arrow.cpp Adds R-to-Arrow StringView conversion and dictionary StringView value handling.
r/src/array_to_vector.cpp Adds Arrow-to-R StringView and wider dictionary index conversion support.
r/src/arrowExports.cpp Registers native StringView initialization entry point.
cpp/src/arrow/util/converter.h Enables dictionary converters for StringViewType.
python/pyarrow/src/arrow/python/python_to_arrow.cc Adjusts Python dictionary StringView append call.
r/tests/testthat/test-Array.R Adds StringView array round-trip tests.
r/tests/testthat/test-Table.R Adds table/dictionary StringView and wider index tests.
r/tests/testthat/test-data-type.R Adds StringView data type and code round-trip tests.
r/DESCRIPTION Updates roxygen metadata.
r/man/data-type.Rd Adds string_view() documentation entry.
r/man/acero.Rd Regenerated Acero documentation.
r/man/arrow-package.Rd Regenerated package author documentation.
r/man/csv_convert_options.Rd Regenerated CSV conversion docs.
r/man/csv_read_options.Rd Regenerated CSV read docs.
r/man/CsvReadOptions.Rd Regenerated CSV read options docs.
r/man/enums.Rd Regenerated enum documentation.
r/man/JsonFileFormat.Rd Regenerated JSON file format docs.
r/man/reexports.Rd Regenerated reexports docs.
r/man/vctrs_extension_array.Rd Regenerated vctrs extension docs.
Files not reviewed (10)
  • r/man/CsvReadOptions.Rd: Language not supported
  • r/man/JsonFileFormat.Rd: Language not supported
  • r/man/acero.Rd: Language not supported
  • r/man/arrow-package.Rd: Language not supported
  • r/man/csv_convert_options.Rd: Language not supported
  • r/man/csv_read_options.Rd: Language not supported
  • r/man/data-type.Rd: Language not supported
  • r/man/enums.Rd: Language not supported
  • r/man/reexports.Rd: Language not supported
  • r/man/vctrs_extension_array.Rd: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread r/src/r_to_arrow.cpp
Comment on lines +935 to +947
int64_t total_length = 0;
for (R_xlen_t i = offset; i < size; i++, ++p_strings) {
SEXP si = *p_strings;
total_length += si == NA_STRING ? 0 : LENGTH(si);
}
RETURN_NOT_OK(this->primitive_builder_->ReserveData(total_length));

p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s)) + offset;
for (R_xlen_t i = offset; i < size; i++, ++p_strings) {
SEXP si = *p_strings;
if (si == NA_STRING) {
this->primitive_builder_->UnsafeAppendNull();
} else {
Comment thread r/man/acero.Rd Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 28, 2026 12:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 23 changed files in this pull request and generated 1 comment.

Files not reviewed (10)
  • r/man/CsvReadOptions.Rd: Language not supported
  • r/man/JsonFileFormat.Rd: Language not supported
  • r/man/acero.Rd: Language not supported
  • r/man/arrow-package.Rd: Language not supported
  • r/man/csv_convert_options.Rd: Language not supported
  • r/man/csv_read_options.Rd: Language not supported
  • r/man/data-type.Rd: Language not supported
  • r/man/enums.Rd: Language not supported
  • r/man/reexports.Rd: Language not supported
  • r/man/vctrs_extension_array.Rd: Language not supported

Comment thread r/src/r_to_arrow.cpp
Comment on lines +935 to +947
int64_t total_length = 0;
for (R_xlen_t i = offset; i < size; i++, ++p_strings) {
SEXP si = *p_strings;
total_length += si == NA_STRING ? 0 : LENGTH(si);
}
RETURN_NOT_OK(this->primitive_builder_->ReserveData(total_length));

p_strings = reinterpret_cast<const SEXP*>(DATAPTR_RO(s)) + offset;
for (R_xlen_t i = offset; i < size; i++, ++p_strings) {
SEXP si = *p_strings;
if (si == NA_STRING) {
this->primitive_builder_->UnsafeAppendNull();
} else {
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants