bend is a Lumbda primitive that decides per call whether to evaluate locally or ship to a CUDA worker over our wire protocol. Tiny inputs stay local; heavy inputs bend to a worker that holds a warm CUDA context across requests. The decision uses a cost estimator on the argument shape, not the operation name.
Start a GPU worker
# On any host with nvcc + a CUDA-capable GPU:
make gpu-worker
# → builds examples/cuda-fanout/shake256-fanout
# → builds the C tier (~10× faster wire orchestration than Python)
# → launches gpu-worker.lsp on TWO ports:
# 8320 — wire-TCP protocol (native callers: bend.lsp, asm clients)
# 8321 — HTTP/1.1 + CORS (browser callers: the playground/REPL tabs)
# Port 8320 = BEND mnemonic:
# 8 ~= B (implied infinity B flattened; bake a cake; baby & me)
# 3 ~= E (backward)
# 2 ~= N (pivoted 90 degrees)
# 0 ~= D (flattened)
# Port 8321 = 8320 + 1 (HTTP is always wire-port + 1, override via --http-port).
# Override tier (default port stays 8320 / BEND):
make gpu-worker LUMBDA=python # easier debugging
make gpu-worker LUMBDA=asm # smallest footprint
# Override port too (only when running a second worker on the same host):
make gpu-worker LUMBDA=python PORT=9320 # second worker (wire 9320, http 9321)
Run your own bend → use it from a browser tab
Browsers cannot open raw TCP sockets, but they can fetch / XHR to http://localhost:<port> even from an HTTPS page (a permissive "potentially trustworthy" exception every major browser ships). So the playground field accepts a URL that points at a bend worker running on your own machine. No fox-owned ingress, no certificate, no proxy hop.
# 1. Start a worker on your laptop / desktop / homelab
git clone https://lumbda.com/lumbda.git && cd lumbda
make gpu-worker LUMBDA=python # CPU works; CUDA path lights up if nvcc is present
# → "gpu-worker listening on port 8320"
# → "gpu-worker HTTP listening on port 8321"
# 2. In a separate shell, smoke-test the HTTP path
curl -sS -X POST -H 'Content-Type: text/plain' \
--data-binary '(ping)' http://localhost:8321/
# → (ok pong)
# 3. Open https://lumbda.com/playground/ and paste into the ⚡ bend field:
# http://localhost:8321/
# Save, then in the REPL:
# λ> (bend!-call '(ping))
Both ports share the same handle-request dispatcher: whatever an op-head (ping, echo, cuda-shake-fanout, cuda-sim-ops-bin, …) returns over wire is what HTTP returns. Adding a new op-head exposes it on both transports automatically.
HTTP scope today is the S-expression text path — for browser-friendly recipes the server understands today, with no binary blobs crossing the wire. Binary modes (BSHK / BCGB / BSCP / BSRT / BSB3) stay wire-only because they exist for native callers who already hold the binary locally.
Phase 2 (deferred until authentication lands): upload an .lsp recipe describing a champion circuit, dispatch into a server-side factory that compiles the .bin on the worker's filesystem and bends it. Same playground field, same single endpoint. Auth gates write access; today a worker on the public internet would let any caller occupy your GPU.
Call it from any tier
;; bend works on every tier — Python, C, asm — through the same
;; tcp-* + portal primitives lumbda already ships.
(load "examples/cuda-fanout/wire.lsp")
(load "examples/cuda-fanout/bend.lsp")
(load "examples/cuda-fanout/bend-macros.lsp") ; Python/C only — asm uses bend-call
;; Tiny — cost below threshold, evaluates locally
(bend (cuda-shake-fanout '("00" "01" "deadbeef") 32))
;; Heavy — cost above threshold, ships to the GPU worker
(bend (cuda-shake-fanout one-million-inputs 32))
Wire protocol
Two modes: S-expression text (the default) and binary (magic BSHK header + raw bytes). Binary mode bypasses S-expression parsing entirely.
| workload | Py S-exp | Py binary | C S-exp | C binary |
|---|---|---|---|---|
| 100 × 16 B | 3.43 ms | 0.74 ms | 0.40 ms | 0.15 ms |
| 1k × 16 B | 23.24 ms | 0.76 ms | 2.77 ms | 0.22 ms |
| 10k × 16 B | 218.82 ms | 1.27 ms | CLIFF | 0.88 ms |
| 100k × 16 B | 2,219 ms | 10.18 ms | CLIFF | 10.35 ms |
| 1M × 16 B | 23,811 ms | 159 ms | CLIFF | 157 ms |
Binary mode wins by 30–200× over S-expression at scale; at 1 M × 16 B inputs C tier binary is 157 ms vs 23,811 ms for S-exp, and bend beats host hashlib by ~12×. The CUDA toolchain stays isolated to the leaf binary the worker spawns — no tier links libcudart; asm tier hosts workers through hand-written pipe2 + fork + execve syscalls.
Fleet
Production default: single host. Multi-host fan-out is built (round-robin in bend.lsp via *bend-workers* + BEND_WORKERS env), available on demand for long-running parallel workloads — not used routinely.
| host | GPU | arch | port | status |
|---|---|---|---|---|
3090-ai.foxhop.net | RTX 3090 (24 GB) | sm_86 | 8320 | active — production worker |
ai.foxhop.net | RTX 4090 (24 GB) | sm_89 | 8320 | active — 3-node mesh |
cammy.foxhop.net | Tesla P40 (24 GB) | sm_61 | 8320 | active — 3-node mesh (Pascal) |
3-node mesh active. 3090 + 4090 + P40 all serve bend workloads on port 8320 (BEND mnemonic). Round-robin or explicit selection via (bend-set-workers! '(("3090-ai.foxhop.net" . 8320) ("ai.foxhop.net" . 8320) ("cammy.foxhop.net" . 8320))). Cost-routed coordinator pattern: greedy bin-pack candidates by predicted cost; per-host wall factor 1.00 / 0.51 / 1.84 (3090 / 4090 / P40 — Pascal lags but contributes a third concurrent stream); one worker thread per node fires concurrently against our wire protocol. Each node serializes its own queue (workers do not multiplex requests cleanly inside one CUDA context). Full hardware table, bench numbers, & LLM-coexistence detail at foxhop gpu-mesh.
Worker health heartbeat
Every dispatch goes through a lazy-refresh health probe so bend never wastes a payload on a worker that is down, overloaded, or VRAM-starved. The worker exposes a (health) op returning measured numbers; the client caches results & ranks healthy peers by free VRAM descending.
Worker side — (health)
; over the same wire as any other op:
(health)
→ (ok (load-avg 0.42)
(vram-free-mb 22777)
(uptime-ms 1780752288371))
load-avg— 1-minute load average from/proc/loadavgvram-free-mb— free VRAM fromnvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits;0on a host without an NVIDIA GPUuptime-ms— current wallclock (millisecond), so a worker that hangs & restarts between probes shows a uptime jump
Backward compatible: older workers without a (health) handler return (error (unknown-op health)) — the client treats that as ok with vram-free-mb=0 so a pre-heartbeat build still ranks as available.
Client side — *worker-health* cache
An alist keyed by "host:port" maps each known peer to a record (last-checked-ms status vram-mb) where status ∈ {ok, down}.
- Cache TTL on
ok: 5 s. A fresh probe runs only when a cache entry ages past this; idle bursts never re-probe. - Cooldown on
down: 30 s. A worker that fails an(health)call or a regular dispatch gets skipped for half a minute before a re-probe, so a transient failure costs at most one call. - VRAM-ranked pick. When more than one peer caches as
ok,bend-pick-workersorts byvram-free-mbdescending & returns the top entry. Long-running jobs route to whichever box carries the most spare GPU memory. - Round-robin fallback. When every worker is in down-cooldown, the client falls back to plain round-robin so a recovering box still gets a real attempt rather than refusing dispatch outright.
Failure handling
Mid-flight failures in bend-dispatch-to-gpu — tcp-connect raising on a connect-refused, the worker dropping a connection mid-reply, a malformed reply — flip the worker to down. The next pick skips it for the cooldown window. A single failed call never aborts a multi-worker iteration: probes wrap in with-exception-handler so a dead peer surfaces as 'transport-fail rather than propagating up.
Form catalog
A form earns a slot here only after we have published a benchmark or measured one on our hardware. "I think this would be fast" does not earn a slot — the form-status column says planned until numbers exist.
Status legend
live— binary built, worker dispatches it, numbers recorded on our hardwaresurveyed— published benchmark cited, prototype binary not yet wrapped
Live forms
| form | hardware | throughput | wire shape |
|---|---|---|---|
cuda-shake-fanout |
RTX 3090 | 12× host hashlib at 1 M × 16 B inputs | (cuda-shake-fanout '(hex ...) out-bytes) + BSHK binary |
cuda-sim-ops-bin |
RTX 3090 | 1.27× at 141 batches (9024 shots); crossover ~115 batches; SHAKE-RNG mode matches upstream Pareto bit-for-bit (Σ Toffoli = 15,999,651,264, avg Toffoli = 1,773,011.000 on stock ops.bin). Kernel wall scales linearly with Σ Toffoli across circuit variants — foxhop secp256k1 lever sweep at p=251 (n+1=9 secp256k1-toy) over 6 lumbda-emitted ops.bin variants ran 9.83× wall reduction Fermat-schoolbook → refined-Solinas, 261.4 ms → 26.6 ms / batch at 1024 shots; matches 6.95× Σ Toffoli ratio + parallel Clifford drop. No scheduler surprise: pick any QECCOPS1 circuit, predict GPU wall from a cheap CPU op-counter. |
(cuda-sim-ops-bin "path/to/ops.bin" n-batches) — demo_ops defaults --rng-mode shake (Fiat-Shamir over op stream); --rng-mode lfsr keeps legacy xorshift for debug parity |
cuda-sim-axis-flip |
RTX 3090 | 217 Mops/s @ K=32 M=4 (many-candidates × few-shots) | (cuda-sim-axis (variant-paths ...) n-shots) |
cuda-bignum-cgbn |
RTX 3090 | 1.28 Gops/s kernel mod-mul @ n=1M (256-bit, ~256× GMP CPU); 9 ops | BCGB binary: op_id + bitwidth + n + modulus + a + b |
cuda-secp256k1-batched-mul |
RTX 3090 | 13.83 Mkeys/s @ n=1M (~309× coincurve CPU; windowed-G ladder w=4) | BSCP binary: scalars + base-point → BSCR points |
cuda-radix-sort |
RTX 3090 | 5.50 Gkeys/s @ n=10M (kernel 1.82 ms); 3.77 Gkeys/s @ n=1M; CUB DeviceRadixSort u64 ascending | BSRT binary: op_id + n + u64[n] → BSRR sorted u64[n] |
cuda-blake3-tree |
RTX 3090 | 32.5 GB/s @ 1M × 64 B (kernel 1.97 ms); 20.4 GB/s @ 1k × 1 MB (kernel 51.5 ms); byte-identical to BLAKE3 reference spec across single-chunk + multi-chunk paths | BSB3 binary: out_bytes + n + (u32 len + bytes) per input → BSR3 n + out_bytes + digests |
Surveyed forms (Wave 1)
| form | throughput | hardware | ref |
|---|---|---|---|
cuda-secp256k1-batched-mul | 6.5 Gkeys/s VanitySearch RTX 4090; 8.6 Gkeys/s RTX 5090; 2.65 Gkeys/s RTX 3080 | (promoted — see Live) | gECC · Bitcrack |
cuda-bignum-cgbn | 100×+ on dense mul vs Xeon-20c + GMP + OpenMP | V100 | NVlabs CGBN · midsize-int |
cuda-rho-pollard-walk | 87.7 M ops/sec for ECCp79 | RTX 2070 Super | atlomak · oritwoen |
cuda-clifford-stabilizer | 186× over Stim (CPU SOTA) on equivalence-checking | STABSim | STABSim · Qimax |
cuda-bernstein-yang-inv | 3–10× per inversion over Fermat on CPU; novel territory on GPU | — | safegcd · Jumping |
cuda-ntt-poly | up to 123× over CPU; 21× on RTX 3070 | RTX 3070 | NTTSuite · FHE NTT |
cuda-radix-sort | 1.4 G keys/sec; 20–50× over CPU merge sort; 257× vs Xeon Phi for scan | (promoted — see Live) | CUB · Onesweep |
Surveyed forms (Wave 2)
Sorted by reported speedup descending.
| form | throughput | hardware | ref |
|---|---|---|---|
cuda-minhash-weighted | 600–1000× vs numpy+MKL | Titan X vs Xeon E5-1650 | src-d/minhashcuda |
cuda-cuckoo-filter | 378× insert, 258× delete | A100 | arXiv:2603.15486 |
cuda-aes-ctr-chacha20 | 211–400 GB/s | single GPU | AsyncGBP |
cuda-suffix-array-skew | 30–242× vs CPU SA-IS | Tesla K20 | Liu/Luo |
cuda-kdtree-build | 30–242× build, 1.6–200× kNN | RTX (RT cores) | Zhou et al. |
cuda-sat-paraFROST-elim | 93× peak, 48× avg variable elim | NVIDIA + Kissat baseline | ParaFROST |
cuda-aho-corasick-pfac | ~50–100× IDS pkt-inspect | GTX-class | PFAC |
cuda-dilithium-pqsig | 57.7× keygen+sign+verify vs single CPU thread | RTX 3090 Ti | IACR 2024/1365 |
cuda-mc-options-pricing | 25–152× | Tesla C1060 / modern | GPU Gems Ch.45 |
cuda-cuFFT-batched-1D | 8–32× vs MKL; tcFFT 1.1–3.2× vs cuFFT | V100 / A100 | tcFFT |
cuda-blake3-tree | 32.5 GB/s @ 1M × 64 B; 20.4 GB/s @ 1k × 1 MB on 3090 | (promoted — see Live) | Blaze-3 |
cuda-bloom-filter-modern | ~6× CPU; 3.4 B inserts/s | B200 / Perlmutter | arXiv:2512.15595 |
cuda-gemm-batched-FP8 | 4.8× FP8 vs A100; 716 TFLOPS H100 | H100 SXM | cuBLAS 12.0 |
cuda-batched-matrix-inverse | 4.3–16.8× vs MAGMA | P100 | Superfri 2018 |
cuda-hash-join-radix | 4 B tuples/s single; 1.8 T tuples/s on 1024 A100 | A100 cluster | ADMS-21 |
cuda-kmer-count | 4–6× vs KMC2 | RapidGKC, Gerbil | RapidGKC |
cuda-cgraph-traversal | 38 B TEPS | DGX2 | cuGraph |
cuda-triangle-count-TRUST | ~1 T TEPS | multi-A100 | TRUST |
cuda-ldpc-bp-decoder | 10 Gbps with early-termination | GPGPU | MDPI Electronics 2022 |
cuda-nvcomp-zstd | 2.2× zstd; 1.4× LZ4; 1.9× snappy | H100 / A100 | nvCOMP |
Surveyed forms (Wave 3)
| form | throughput | hardware | ref |
|---|---|---|---|
cuda-fluidx3d-lbm | 100–200× vs ANSYS Fluent; 8,799 MLUPS single A100 | A100 | FluidX3D |
cuda-mfcc-spectral | ~97× CPU MFCC; STFT ~75× via cuSignal | GTX 580 / RTX 30-series | cuSignal |
cuda-batched-lp-simplex | 95× over CPLEX; 5× over GLPK | GTX 980-class | arXiv 1802.08557 |
cuda-betweenness-centrality-weighted | 30–150× warp-centric weighted BC | GTX onwards | arXiv 1701.05975 |
cuda-cudasift-orb-ransac | ~60× SIFT CPU→GPU; ORB 11.3× | GTX 1060+ | CudaSift |
cuda-cudasw-gasal2 | CUDASW++4.0 16.2×; 5.71 TCUPS on H100 | H100 | CUDASW++4.0 |
cuda-loopy-bp-mrf | 45× over CPU LBP for stereo MRF | GTX 280+ | arXiv 2509.22337 |
cuda-sgm-stereo | 42 fps @ 640×480, 128 disparities | Tegra X1 / discrete | arXiv 1610.04121 |
cuda-hungarian-lap | 10–50×; 400 M-variable LAP in ~13 s | NVIDIA GPU | ScienceDirect |
cuda-msm-bls12-381 | 27.86× over Pippenger AVX baseline | A100 / RTX 4090 | SimdMSM TCHES |
cuda-pdwt-lifting | 15.9× over best optimized CPU DWT | GTX / Tesla | PDWT |
cuda-tensornet-contract | 8–20× vs CuPy; tensor QR ~100× vs Xeon | A100 | cuTensorNet |
cuda-ega-gpu-aggregation | 6.45–29.12× multi-pass; group-by 19.4× | NVIDIA GPU | VLDB Top-k EGA |
cuda-bicgstab-ilu-spmv | SpTRSV 10.7×; BiCGSTAB 3.2× vs cuSPARSE | V100 / MI210 | arXiv 2508.04917 |
cuda-icicle-snark-groth16 | fastest Groth16 today; NTT 91% of prover | RTX 4090 / A100 | ICICLE-Snark |
Surveyed forms (Wave 4)
| form | throughput | hardware | ref |
|---|---|---|---|
cuda-kalman-batched | 1386× for 5000-component measurements | various | CUDAkalmanFilter |
cuda-g6k-tensor-sieve | 1230× vs G6K CPU sieve at dim 120; SVP record dim 180 on 4 Turing GPUs | 4 Turing | Ducas/Stevens/van Woerden EC 2021 |
cuda-zkspeed-sumcheck-hyperplonk | 801× geomean over CPU; sumcheck 8.4 s → 9.5 ms | full-chip accelerator | zkSpeed HPCA 2025 |
cuda-kyber-batched-ntt | ~451× batched (B=65k); HI-Kyber 6.47× over prior GPU SOTA | RTX 3080 | HI-Kyber |
cuda-ironman-ote | 237× OT throughput vs full-thread CPU | near-memory variant | Ironman arXiv 2507.16391 |
cuda-cudss-cholesky | >100× vs QDLDL; 20× vs CHOLMOD factor | NVIDIA | NVIDIA cuDSS |
cuda-particle-filter | ~150× absolute (5000 particles @ 170 Hz) | GPGPU | EURASIP J ASP 2013 |
cuda-rk-stiff-chemkin | 126× vs single-core; 25× vs 6-core for hydrogen RKCK 524k ODEs | GPGPU | Niemeyer & Sung |
cuda-fem-assembly-jit | 87× assembly vs serial CPU; 126× peak numerical integration | GPGPU | Mironov et al. |
cuda-cufalcon-sign | 201k sig/s Falcon-512 on A100; verify 2.72M sig/s, 29.5× vs AVX2 | A100 | cuFalcon eprint 2025/249 |
cuda-cudahull-3d | 30–40× over Qhull CPU | NVIDIA | CudaHull CAG 2012 |
cuda-fastplay-garbled | 35–40× over serial garbling on GPU cluster | GPU cluster | Fastplay eprint 2011/097 |
cuda-air-fri | ~22.8× avg end-to-end ZK speedup; FRI commitment | GPGPU | Air-FRI SAC 2025 |
cuda-rabin-fingerprint | 16× over single-thread CPU; 40 Gbps absolute | GTX 780 (HARENS) | HARENS CloudCom 2016 |
cuda-gdel3d | 10× over CGAL 3D Delaunay; 70× Voronoi/jump-flood at 10M points | NVIDIA | gDel3D I3D 2014 |
cuda-perasure-crs | 10× vs multithread Jerasure; 10 GB/s on GTX780 absolute | GTX 780 | PErasure IEEE Cluster 2015 |
cuda-piranha-mpc | 4× vs CryptGPU on VGG16 private inference; full 3/4-party stacks single-GPU | single GPU | Piranha USENIX Sec 2022 |
cuda-scamp-matrix-profile | quintillion pairwise comparisons / day (absolute) | GPGPU | SCAMP |
cuda-cudtw-subseq | 2–3 orders of magnitude over UCR-Suite CPU; soft-DTW up to 5000× | Volta | cuDTW++ Euro-Par 2020 |
cuda-terachem-dft | 1–2 orders of magnitude over CPU; 8–50× vs GAMESS on 256-core cluster | 4× Tesla | TeraChem |
Wave 4 filter-outs: MAFFT MSA (11–20×, surpassed), RAxML likelihood (32× kernel only, ~3–10× end-to-end), AmgX CG (3–4× vs AmgX baseline), BVH Karras LBVH (2–3× over prior GPU LBVH), LDPC decode (40–160 Mbps, not ≥10× over modern SIMD CPU), mesh decimation (application-dependent), discrete Gaussian sampler (single-digit % gains; fold into Kyber NTT). Revisit when the published number changes.
Why a form earns its slot
A form is GPU-worth-it when at least one of:
- Embarrassingly parallel. N independent items, no cross-item dependency.
- Dense, branch-free inner loop. Same operation on every element.
- Reduction-friendly. Tree-reduce / prefix-sum / parallel-scan patterns.
- Big batch amortizes fixed kernel overhead.
When none of these hold, find a different decomposition: parallelize on a different axis, or stay on CPU & fan out across fleet hosts.
Source & specs
examples/cuda-fanout/ — wire contract, daemon protocol, bench data, per-tier integration.
CATALOG.md — canonical source for form metadata.