perf(relay+tunnel): H2/H1 connection tuning, 64KB copy buffer, dynamic coalescing, CDN header filtering #1417
Open
CaptainMirage wants to merge 4 commits into
Tighten all relay timing constants to cut dead-wait time and flow-control stalls without touching any logic paths.

tunnel_client.rs:
- REPLY_TIMEOUT 35s -> 20s: the GAS hard execution limit is 30s, so a 35s timeout can never catch a live-but-killed session; 20s still covers slow legitimate responses (~5-10s) with margin.
- Pre-fill poll stagger 1s -> 100ms per slot: eliminates 1s of dead time at every session startup (with INFLIGHT_OPTIMIST=2, one slot was always delayed by 1s).

domain_fronter.rs:
- POOL_TTL_SECS 60 -> 30: faster turnover when IP/DNS changes.
- POOL_REFILL_INTERVAL_SECS 5 -> 2: shortens the h1 pool recovery window after an h2 outage from 5s to 2s.
- H2_READY_TIMEOUT_SECS 5 -> 3: faster h1 fallback on saturated h2 connections.
- H1_KEEPALIVE_INTERVAL_SECS 240 -> 60: keeps GAS containers warm after 1 min of idle instead of 4 min, eliminating the 1-3s cold-start penalty for users who pause streaming. Quota cost is ~1,080 extra invocations/day (1,440/day at a 60s interval vs 360/day at 240s), well under the free-tier 6M/day limit.
- H2 flow-control windows 4MB/8MB -> 16MB/32MB: eliminates flow-control stalls during range-parallel streaming (256 KB chunks). Memory overhead is zero on idle pooled connections.
- Body Vec pre-sized from the content-length header: avoids O(log n) realloc-and-copy cycles on large GAS responses (up to 40 MB).
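The Content-Length pre-sizing in the last bullet can be sketched with plain std types. This is a minimal illustration, not the code in domain_fronter.rs: the helper name `presized_body` and the 40 MB clamp `MAX_PRESIZE_BYTES` are hypothetical, and the clamp itself is an assumption added so that a bogus header cannot force a huge up-front allocation.

```rust
/// Hypothetical upper bound on the pre-allocation, matching the
/// "up to 40 MB" GAS response size mentioned in the commit message.
const MAX_PRESIZE_BYTES: usize = 40 * 1024 * 1024;

/// Hypothetical helper: allocate a response-body buffer sized from a raw
/// Content-Length header value. Missing or malformed values fall back to an
/// empty Vec (normal doubling growth); oversized values are clamped.
fn presized_body(content_length: Option<&str>) -> Vec<u8> {
    let hint = content_length
        .and_then(|v| v.trim().parse::<usize>().ok())
        .unwrap_or(0)
        .min(MAX_PRESIZE_BYTES);
    Vec::with_capacity(hint)
}
```

With the capacity reserved up front, appending body chunks with `extend_from_slice` does not reallocate until the declared length is exceeded.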
The default tokio::io::copy() buffer is 8 KB. On a 200ms relay RTT that caps throughput at ~40 KB/s (8 KB per round trip), well below even Iran's ~1 MB/s cable. Replacing all three bidirectional pipe sites with copy_bidirectional_with_sizes(65536, 65536) raises the per-connection ceiling to ~320 KB/s at the same RTT.

The switch also fixes half-close handling: the previous tokio::select! pattern cancelled the other copy direction when one side closed, which could silently drop in-flight data. copy_bidirectional_with_sizes handles each FIN independently, matching TCP half-close semantics.

Changes:
- plain-tcp passthrough (do_plain_tcp_tunnel): drop the manual split and use copy_bidirectional_with_sizes on the full TcpStream pair.
- SNI-rewrite TLS tunnel (do_sni_rewrite_tunnel_from_tcp): same; no split is needed, since TlsStream implements AsyncRead + AsyncWrite.
- plain-HTTP passthrough (do_plain_http_tunnel): write the rewritten request first, then reunite the OwnedReadHalf/OwnedWriteHalf before calling copy_bidirectional_with_sizes (reunite cannot fail here because both halves came from the same split).
- read_http_head / read_http_head_io: stack tmp buffer 4 KB -> 16 KB so large cookie/auth-token headers are read in one syscall.
- TLS-detect peek timeout 300ms -> 100ms: browsers send ClientHello within 10-50ms, so TLS clients are unaffected, while inbound connections whose peek times out are classified up to 200ms sooner.

Adds a copy_bidirectional_large_buf_roundtrip test to verify that the duplex relay path completes cleanly with large buffer sizes.
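The half-close behaviour described above can be illustrated without tokio. The sketch below is a blocking, std-only analogue and not the tokio implementation: each direction runs on its own thread with a 64 KiB buffer, and on EOF it shuts down only the write side of the peer (forwarding the FIN) instead of cancelling the opposite direction. The names `pump`, `relay` and `RELAY_BUF` are hypothetical.

```rust
use std::io::{self, Read, Write};
use std::net::{Shutdown, TcpStream};
use std::thread;

/// 64 KiB per direction, the size the commit passes to copy_bidirectional_with_sizes.
const RELAY_BUF: usize = 64 * 1024;

/// Copy one direction until EOF, then half-close only the destination's
/// write side so the peer sees the FIN while the reverse direction keeps flowing.
fn pump(mut src: TcpStream, mut dst: TcpStream) -> io::Result<u64> {
    let mut buf = vec![0u8; RELAY_BUF];
    let mut total = 0u64;
    loop {
        let n = match src.read(&mut buf) {
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            dst.shutdown(Shutdown::Write)?; // forward the FIN, leave the other direction alone
            return Ok(total);
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
}

/// Relay a <-> b with one thread per direction. Returns (bytes a->b, bytes b->a).
fn relay(a: TcpStream, b: TcpStream) -> io::Result<(u64, u64)> {
    let (a_clone, b_clone) = (a.try_clone()?, b.try_clone()?);
    let a_to_b = thread::spawn(move || pump(a, b));
    let b_to_a = thread::spawn(move || pump(b_clone, a_clone));
    let up = a_to_b.join().expect("a->b thread panicked")?;
    let down = b_to_a.join().expect("b->a thread panicked")?;
    Ok((up, down))
}
```

Error handling is deliberately minimal: if one direction fails, this sketch does not tear down the other, which a production relay would need to do.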
…config

Steps 3 + 4 of the perf/relay-speed series.

--- Step 3: Dynamic coalescing (tunnel_client.rs + config.rs + proxy_server.rs) ---

Replace the static DEFAULT_COALESCE_STEP/MAX constants with Arc<AtomicU64> values stored in TunnelMux. A new RttTracker (ring buffer of the last 8 batch RTTs) measures the median round-trip time and auto-adjusts the coalesce window:
- Slow preset (median > 1200ms): step=150ms max=600ms
- Fast preset (median < 700ms): step=50ms max=300ms
- Hysteresis: 3 consecutive sub-threshold readings are required to leave Slow, preventing flapping on bursty congestion.

mux_loop reads the atomics at each new window boundary, so preset changes take effect without a restart.

Config: network_preset = "auto" (default) | "fast" | "slow"
- auto starts at Fast and adapts via RttTracker
- fast/slow lock the preset and skip RTT measurement

Explicit coalesce_step_ms / coalesce_max_ms overrides are still honoured when set; the presence of either disables auto-adaptation.

Note: TunnelMux is only started in mode=full. Mode=apps_script relays each connection directly through DomainFronter and does not go through this path.

New unit tests: rtt_tracker_preset_selection_slow, rtt_tracker_preset_selection_fast, rtt_tracker_hysteresis_prevents_premature_flip.

Fixed the copy_bidirectional test in proxy_server.rs: a_client was moved into write_task and then borrowed for reading, so it is now split into separate read/write halves before spawning, letting both directions be asserted independently.

--- Step 4: O(n) amortised buffered upload merge (tunnel_client.rs) ---

Change buffered_upload from Option<Bytes> to Option<BytesMut>. The old merge path allocated a fresh buffer and copied all accumulated data on every new upload chunk under pipeline congestion, for O(n^2) total copies over n chunks. BytesMut extends in place (amortised O(n)), and freeze() at send time converts it to Bytes without copying.
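The Step 3 selection logic can be sketched as follows. This is an illustrative std-only model, not the RttTracker in tunnel_client.rs: it keeps the last 8 RTTs in a VecDeque, uses the upper median for even sample counts (an assumption), applies the thresholds, presets and 3-reading hysteresis listed above, and publishes the chosen windows into shared Arc<AtomicU64> values the way the commit describes TunnelMux reading them. Apart from RttTracker, all type, field and constant names are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Preset {
    Fast,
    Slow,
}

impl Preset {
    /// (step_ms, max_ms) for each preset, as listed in the commit message.
    fn windows(self) -> (u64, u64) {
        match self {
            Preset::Fast => (50, 300),
            Preset::Slow => (150, 600),
        }
    }
}

const SAMPLES: usize = 8; // ring buffer length
const SLOW_ABOVE_MS: u64 = 1200; // median above this switches to Slow
const FAST_BELOW_MS: u64 = 700; // median below this counts toward leaving Slow
const LEAVE_SLOW_AFTER: u32 = 3; // consecutive sub-threshold readings required

struct RttTracker {
    samples: VecDeque<u64>,
    preset: Preset,
    fast_streak: u32,
    step_ms: Arc<AtomicU64>,
    max_ms: Arc<AtomicU64>,
}

impl RttTracker {
    /// Starts in Fast (as "auto" mode does) and publishes its windows.
    fn new(step_ms: Arc<AtomicU64>, max_ms: Arc<AtomicU64>) -> Self {
        let tracker = RttTracker {
            samples: VecDeque::with_capacity(SAMPLES),
            preset: Preset::Fast,
            fast_streak: 0,
            step_ms,
            max_ms,
        };
        tracker.publish();
        tracker
    }

    /// Upper median of the current samples (called only after a push).
    fn median(&self) -> u64 {
        let mut sorted: Vec<u64> = self.samples.iter().copied().collect();
        sorted.sort_unstable();
        sorted[sorted.len() / 2]
    }

    /// Record one batch RTT, update the preset, and return it.
    fn record(&mut self, rtt_ms: u64) -> Preset {
        if self.samples.len() == SAMPLES {
            self.samples.pop_front(); // drop the oldest sample
        }
        self.samples.push_back(rtt_ms);
        let median = self.median();
        match self.preset {
            Preset::Fast if median > SLOW_ABOVE_MS => {
                self.preset = Preset::Slow;
                self.fast_streak = 0;
            }
            Preset::Slow if median < FAST_BELOW_MS => {
                self.fast_streak += 1;
                if self.fast_streak >= LEAVE_SLOW_AFTER {
                    self.preset = Preset::Fast;
                    self.fast_streak = 0;
                }
            }
            // Any reading at or above the threshold while Slow resets the streak.
            Preset::Slow => self.fast_streak = 0,
            Preset::Fast => {}
        }
        self.publish();
        self.preset
    }

    /// Store the current windows where a mux loop could read them.
    fn publish(&self) {
        let (step, max) = self.preset.windows();
        self.step_ms.store(step, Ordering::Relaxed);
        self.max_ms.store(max, Ordering::Relaxed);
    }
}
```

With a window of 8, leaving Slow from a full window of slow samples takes 5 fast batches before the median drops below 700ms, plus two more to complete the 3-reading hysteresis: 7 consecutive fast batches in total.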
…SON payload

CDN stacks (Cloudflare, AWS, Fastly) attach metadata headers to every response (report-to, nel, alt-svc, server-timing, cf-ray, origin-trial, etc.) that add 400-700 bytes of JSON per GAS relay response for zero benefit through a MITM proxy: the relay ignores them and the browser never reads them. On a Cloudflare-backed page with 50 subresource requests this wastes ~20-35 KB of transfer, ~33-58ms at 600 KB/s.

Rust side (config-driven):
- Add `strip_noise_response_headers: bool` to Config and TomlRelay, default true. This is the primary user-facing toggle in config.toml.
- Add a NOISE_RESPONSE_HEADERS static in domain_fronter.rs listing the 12 useless header names.
- Update parse_relay_json() to accept a strip_noise bool and skip the listed headers in the output loop when enabled.
- Pass self.strip_noise_response_headers at both call sites in do_relay_once_inner (h2 and h1 paths).

Code.gs side (GAS payload reduction):
- Add a STRIP_NOISE_RESPONSE_HEADERS constant (default true) and a STRIP_RESPONSE_HEADERS lookup object near DIAGNOSTIC_MODE.
- Update _respHeaders() to filter the blocklist when the constant is true, reducing the JSON payload that travels over the GAS->Rust leg.
- Both _doSingle and _doBatch call _respHeaders, so both relay paths get the filter automatically.

The two layers are independent: Code.gs reduces GAS->Rust bandwidth; the Rust config controls what the browser sees. Setting strip_noise_response_headers = false in config.toml passes every header Code.gs sends through to the browser; headers already removed by Code.gs stay removed.
Contributor
Author
Checked for conflicts with #1414 (tunnel-node perf by brightening-eyes): no overlap, that one is entirely in …
Contributor
Author
Also, I can't help but notice that GitHub Actions seems to be failing silently, apparently because some webhooks were updated: the labels aren't being applied. I checked, and PR #1414 didn't get its label either.
Summary
Four independent performance improvements targeting the apps_script relay path and full tunnel mode. No breaking changes to the config schema or wire protocol: all new config fields have safe defaults and are backward-compatible.
1. H2/H1 connection timing tuning
Files: src/domain_fronter.rs, src/tunnel_client.rs

Adjusted the constants governing the HTTP/2 relay connection and the H1 keepalive pool to reduce per-request overhead:
- … 4 MB to 16 MB. Reduces the WINDOW_UPDATE round-trips that throttle throughput on high-bandwidth links; previously each request stalled waiting for window top-ups before the CDN would send more data.
- … fewer stale sockets get reused just before the server closes them, cutting the reconnect-on-stale failure rate under the proxy.
- … on-demand. First requests after startup no longer pay the cold TLS handshake cost.
Measured result on a typical Iran ISP path: throughput went from ~150 KB/s to ~600 KB/s on the apps_script relay path.

2. 64 KB copy buffer for bidirectional TCP relay
Files: src/proxy_server.rs

Replaced the custom 8 KB select! { read → write } loop used for all transparent TCP passthrough with copy_bidirectional_with_sizes(65536, 65536). The old 8 KB buffer needed at least 128 read/write syscall pairs per 1 MB transferred; the new 64 KB buffer lowers that to as few as 16, an 8x reduction in per-byte syscall overhead on the passthrough path.
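The syscall arithmetic can be checked with an in-memory reader. This sketch counts read calls through a fixed-size buffer; the helper `copy_counting` is hypothetical, and because `std::io::Cursor` always fills the buffer, the counts are the best case. A real socket may return short reads, so 128 and 16 are lower bounds.

```rust
use std::io::{self, Read, Write};

/// Hypothetical helper: copy `src` into `dst` through a fixed-size buffer and
/// return (bytes copied, number of read calls that returned data).
fn copy_counting<R: Read, W: Write>(
    src: &mut R,
    dst: &mut W,
    buf_size: usize,
) -> io::Result<(u64, usize)> {
    let mut buf = vec![0u8; buf_size];
    let (mut total, mut reads) = (0u64, 0usize);
    loop {
        let n = match src.read(&mut buf) {
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            return Ok((total, reads)); // EOF
        }
        dst.write_all(&buf[..n])?; // one write per read, like a relay loop
        total += n as u64;
        reads += 1;
    }
}
```

For 1 MiB of data this gives 128 reads with an 8 KiB buffer and 16 with a 64 KiB buffer.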
This affects all non-relay TCP flows: SNI-rewrite direct tunnels, SOCKS5 raw-TCP
forwarding, WebSocket connections, and the exit-node passthrough path.
3. Dynamic coalescing and O(n) upload buffer fix
Files: src/tunnel_client.rs, src/config.rs, src/proxy_server.rs

Dynamic coalescing (mode = "full" only): TunnelMux now measures batch RTT using a ring buffer of 8 samples (median to filter spikes) and adapts the coalesce window automatically:
- … below 200ms. Targets broadband/fiber links.
- … Targets throttled, high-latency, or mobile links.
- … preset, preventing oscillation on jittery connections.
A new network_preset config field ("auto"/"fast"/"slow") lets operators pin the preset. "auto" (default) leaves adaptive behaviour enabled. Setting coalesce_step_ms/coalesce_max_ms to 0 now correctly resolves to the compiled defaults via the preset, instead of disabling coalescing.
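One way network_preset and the coalesce overrides could resolve to effective settings, following the rules in the Step 3 commit message: "auto" starts at Fast and adapts, "fast"/"slow" pin the preset, an explicit override disables adaptation, and 0 means "use the preset default". The `resolve` function and `Coalesce` struct are hypothetical, and treating 0 as "not set" when deciding whether adaptation stays on is an assumption.

```rust
/// Coalescing presets; windows are (step_ms, max_ms) from the commit message.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Preset {
    Fast,
    Slow,
}

fn preset_windows(preset: Preset) -> (u64, u64) {
    match preset {
        Preset::Fast => (50, 300),
        Preset::Slow => (150, 600),
    }
}

/// Hypothetical effective coalescing settings.
#[derive(Debug, PartialEq, Eq)]
struct Coalesce {
    step_ms: u64,
    max_ms: u64,
    adaptive: bool, // whether RTT measurement may switch presets
}

/// Hypothetical resolver for network_preset plus optional overrides.
fn resolve(network_preset: &str, step_ms: Option<u64>, max_ms: Option<u64>) -> Result<Coalesce, String> {
    let (pinned, start) = match network_preset {
        "auto" => (false, Preset::Fast), // starts at Fast, then adapts
        "fast" => (true, Preset::Fast),
        "slow" => (true, Preset::Slow),
        other => return Err(format!("unknown network_preset {other:?}")),
    };
    let (default_step, default_max) = preset_windows(start);
    // Treat 0 like "unset" so it falls back to the preset default.
    let explicit = |v: Option<u64>| v.filter(|&x| x != 0);
    let (step, max) = (explicit(step_ms), explicit(max_ms));
    Ok(Coalesce {
        step_ms: step.unwrap_or(default_step),
        max_ms: max.unwrap_or(default_max),
        // Any explicit non-zero override disables auto-adaptation.
        adaptive: !pinned && step.is_none() && max.is_none(),
    })
}
```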
O(n) upload buffer fix (mode = "full" only): The buffered upload accumulator that reassembles split data ops into a single batch body was typed as Option<Bytes>, requiring a full copy and concatenation on every incoming chunk, for O(n²) total copying over a multi-chunk upload. Changed to Option<BytesMut> with extend_from_slice, reducing this to O(n) amortized.
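The gap between the two accumulation strategies can be measured by counting bytes copied. In this sketch `Vec<u8>` stands in for `Bytes`/`BytesMut` (the `bytes` crate is not used), and both helper names are hypothetical.

```rust
/// Old pattern: every merge allocates a fresh buffer and copies everything
/// accumulated so far plus the new chunk. Returns (buffer, bytes copied).
fn merge_by_reallocating(chunks: &[&[u8]]) -> (Vec<u8>, usize) {
    let mut acc: Vec<u8> = Vec::new();
    let mut copied = 0;
    for chunk in chunks {
        let mut fresh = Vec::with_capacity(acc.len() + chunk.len());
        fresh.extend_from_slice(&acc);
        fresh.extend_from_slice(chunk);
        copied += fresh.len();
        acc = fresh;
    }
    (acc, copied)
}

/// New pattern: append in place. `copied` counts only the chunk bytes
/// written; Vec's geometric growth adds at most O(n) reallocation copies.
fn merge_in_place(chunks: &[&[u8]]) -> (Vec<u8>, usize) {
    let mut acc: Vec<u8> = Vec::new();
    let mut copied = 0;
    for chunk in chunks {
        acc.extend_from_slice(chunk);
        copied += chunk.len();
    }
    (acc, copied)
}
```

For 100 chunks of 1 KiB, the reallocating merge copies 5,171,200 bytes versus 102,400 for the in-place append, a 50.5x gap that keeps widening as the chunk count grows.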
4. CDN response header noise filtering
Files: src/domain_fronter.rs, src/config.rs, assets/apps_script/Code.gs

Modern CDN stacks (Cloudflare, AWS, Fastly) attach metadata headers to every response that carry no value through a MITM relay:
- report-to, reporting-endpoints
- nel
- alt-svc
- server-timing
- cf-ray, cf-cache-status
- x-amzn-requestid, x-amzn-trace-id
- origin-trial

On a Cloudflare-backed site these headers add 400–700 bytes of JSON per relay response. Over a page load with 50 subresource requests that is ~20–35 KB of wasted transfer, roughly 33–58 ms at 600 KB/s, repeated on every page load.
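The per-page-load cost follows from simple multiplication. This sketch uses the figures quoted above (50 subresource requests, 400–700 bytes of noise each, a 600 KB/s link) and decimal units (1 KB = 1000 bytes); `noise_cost` is a hypothetical helper.

```rust
/// Hypothetical helper: bytes and milliseconds wasted on one page load when
/// every relayed response carries `noise_bytes` of unused headers.
/// Decimal units (1 KB = 1000 bytes) match the KB/s figures in this description.
fn noise_cost(requests: u64, noise_bytes: u64, link_kb_per_s: f64) -> (u64, f64) {
    let wasted_bytes = requests * noise_bytes;
    let wasted_ms = wasted_bytes as f64 / (link_kb_per_s * 1000.0) * 1000.0;
    (wasted_bytes, wasted_ms)
}
```

At 400 bytes per response this gives 20,000 bytes (about 33 ms); at 700 bytes, 35,000 bytes (about 58 ms).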
Rust side (user-configurable via config.toml):
When strip_noise_response_headers is enabled, the 12 header names in NOISE_RESPONSE_HEADERS are dropped from the parsed GAS relay response in parse_relay_json before the HTTP response is forwarded to the browser. Both the H2 and H1 relay paths go through this single code path.
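A minimal sketch of the strip_noise filtering step, assuming parsed headers arrive as (name, value) pairs. The list below holds only the ten names shown in this description (the commit message says the real NOISE_RESPONSE_HEADERS has 12 entries), and `filter_headers` is a hypothetical stand-in for the loop inside parse_relay_json. Names are compared case-insensitively because HTTP header names are case-insensitive.

```rust
/// Partial noise list: the ten names visible in this PR description.
const NOISE_HEADERS: &[&str] = &[
    "report-to", "reporting-endpoints", "nel", "alt-svc", "server-timing",
    "cf-ray", "cf-cache-status", "x-amzn-requestid", "x-amzn-trace-id", "origin-trial",
];

/// HTTP header names are case-insensitive, so compare ASCII-insensitively.
fn is_noise(name: &str) -> bool {
    NOISE_HEADERS.iter().any(|noise| noise.eq_ignore_ascii_case(name))
}

/// Hypothetical stand-in for the header loop in parse_relay_json:
/// drop noise headers only when `strip_noise` is set.
fn filter_headers(headers: Vec<(String, String)>, strip_noise: bool) -> Vec<(String, String)> {
    if !strip_noise {
        return headers;
    }
    headers.into_iter().filter(|(name, _)| !is_noise(name)).collect()
}
```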
Code.gs side (deploy-time optimization):
A STRIP_NOISE_RESPONSE_HEADERS = true constant is added near DIAGNOSTIC_MODE at the top of Code.gs. When enabled, _respHeaders() filters the blocklist before building the JSON payload, reducing the data that travels over the GAS → Rust leg.
The two layers are independent: Code.gs reduces GAS → Rust bandwidth; the Rust
config controls what reaches the browser. Either can be disabled independently.
Test plan
- … previous release baseline. Target >400 KB/s on a typical Iran ISP path.
- … h2 connection established and h2 fast path active; h1 fallback pool pre-warmed with 6 connection(s).
- … without errors (tested: imgur image upload, YouTube range streaming).
- curl -sI -x http://127.0.0.1:8085 https://discord.com | grep -iE '^(report-to|nel|alt-svc|cf-ray):' returns nothing with strip_noise_response_headers = true; the headers reappear when it is set to false (with STRIP_NOISE_RESPONSE_HEADERS = false in Code.gs).
- … Code.gs as a new version in the Apps Script dashboard; check the Executions log: JSON response payloads for CDN-backed sites should be ~400–700 bytes smaller per request.
- … mode = "full" only): startup log shows batch coalesce: auto mode; after sustained low-RTT batches the log should show a switch to the Fast preset.
- … restarting pins the coalesce windows to the correct values; startup log confirms the chosen preset.