lumbda

bend — dispatch to a GPU without rewriting your code

bend is a Lumbda primitive that decides per call whether to evaluate locally or ship to a CUDA worker over our wire protocol. Tiny inputs stay local; heavy inputs bend to a worker that holds a warm CUDA context across requests. The decision uses a cost estimator on the argument shape, not the operation name.
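The per-call decision can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the cost model (input bytes plus requested output bytes), the threshold value, and the names estimate_cost, should_bend, and LOCAL_THRESHOLD. The real estimator lives in bend.lsp and may weigh the argument shape differently.

```python
# Hypothetical sketch: decide local vs. remote from argument shape alone.
# The cost model and threshold are illustrative assumptions, not bend's real values.

LOCAL_THRESHOLD = 64 * 1024  # assumed cutoff, in estimated bytes of work


def estimate_cost(inputs, out_bytes):
    """Estimate work from the shape of the arguments, never from the op name."""
    in_bytes = sum(len(item) // 2 for item in inputs)  # hex strings: 2 chars per byte
    return in_bytes + len(inputs) * out_bytes


def should_bend(inputs, out_bytes, threshold=LOCAL_THRESHOLD):
    """Return True when the call looks heavy enough to ship to a GPU worker."""
    return estimate_cost(inputs, out_bytes) > threshold


tiny = ["00", "01", "deadbeef"]
heavy = ["00" * 16] * 1_000_000  # one million 16-byte inputs

print(should_bend(tiny, 32))   # False: stays local
print(should_bend(heavy, 32))  # True: ships to the worker
```

The point of the sketch is that the same operation lands on either side of the threshold depending only on how much data it is handed.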

Start a GPU worker

# On any host with nvcc + a CUDA-capable GPU:
make gpu-worker
# → builds examples/cuda-fanout/shake256-fanout
# → builds the C tier (~10× faster wire orchestration than Python)
# → launches gpu-worker.lsp on TWO ports:
#     8320  — wire-TCP protocol (native callers: bend.lsp, asm clients)
#     8321  — HTTP/1.1 + CORS  (browser callers: the playground/REPL tabs)

# Port 8320 = BEND mnemonic:
#   8 ~= B (implied infinity B flattened; bake a cake; baby & me)
#   3 ~= E (backward)
#   2 ~= N (pivoted 90 degrees)
#   0 ~= D (flattened)
# Port 8321 = 8320 + 1 (HTTP is always wire-port + 1, override via --http-port).

# Override tier (default port stays 8320 / BEND):
make gpu-worker LUMBDA=python             # easier debugging
make gpu-worker LUMBDA=asm                # smallest footprint
# Override port too (only when running a second worker on the same host):
make gpu-worker LUMBDA=python PORT=9320   # second worker (wire 9320, http 9321)

Run your own bend → use it from a browser tab

Browsers cannot open raw TCP sockets, but they can fetch / XHR to http://localhost:<port> even from an HTTPS page (a permissive "potentially trustworthy" exception every major browser ships). So the playground field accepts a URL that points at a bend worker running on your own machine. No fox-owned ingress, no certificate, no proxy hop.

# 1. Start a worker on your laptop / desktop / homelab
git clone https://lumbda.com/lumbda.git && cd lumbda
make gpu-worker LUMBDA=python   # CPU works; CUDA path lights up if nvcc is present
# → "gpu-worker listening on port 8320"
# → "gpu-worker HTTP listening on port 8321"

# 2. In a separate shell, smoke-test the HTTP path
curl -sS -X POST -H 'Content-Type: text/plain' \
    --data-binary '(ping)' http://localhost:8321/
# → (ok pong)

# 3. Open https://lumbda.com/playground/ and paste into the ⚡ bend field:
#      http://localhost:8321/
# Save, then in the REPL:
#      λ> (bend!-call '(ping))

Both ports share the same handle-request dispatcher: whatever an op-head (ping, echo, cuda-shake-fanout, cuda-sim-ops-bin, …) returns over wire is what HTTP returns. Adding a new op-head exposes it on both transports automatically.
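The curl smoke test from step 2 can also be driven from a script. The sketch below uses only the Python standard library and mirrors the curl call (POST, text/plain body, one S-expression). The helper names build_bend_request and bend_http_call are hypothetical, and the default URL assumes a worker on the default HTTP port.

```python
import urllib.request


def build_bend_request(form, url="http://localhost:8321/"):
    """Build a POST carrying one S-expression as text/plain, like the curl example."""
    return urllib.request.Request(
        url,
        data=form.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def bend_http_call(form, url="http://localhost:8321/", timeout=5.0):
    """Send the form to a running worker and return the reply text."""
    request = build_bend_request(form, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

With a worker running locally, bend_http_call("(ping)") should return the same text the curl command prints. Because both transports share one dispatcher, any op-head that works here is expected to behave the same over the wire port.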

HTTP scope today is the S-expression text path only: browser-friendly recipes the server already understands, with no binary blobs crossing the wire. Binary modes (BSHK / BCGB / BSCP / BSRT / BSB3) stay wire-only because they exist for native callers who already hold the binary locally.

Phase 2 (deferred until authentication lands): upload an .lsp recipe describing a champion circuit, dispatch into a server-side factory that compiles the .bin on the worker's filesystem and bends it. Same playground field, same single endpoint. Auth gates write access; today a worker on the public internet would let any caller occupy your GPU.

Call it from any tier

;; bend works on every tier — Python, C, asm — through the same
;; tcp-* + portal primitives lumbda already ships.
(load "examples/cuda-fanout/wire.lsp")
(load "examples/cuda-fanout/bend.lsp")
(load "examples/cuda-fanout/bend-macros.lsp")  ; Python/C only — asm uses bend-call

;; Tiny — cost below threshold, evaluates locally
(bend (cuda-shake-fanout '("00" "01" "deadbeef") 32))

;; Heavy — cost above threshold, ships to the GPU worker
(bend (cuda-shake-fanout one-million-inputs 32))

Wire protocol

Two modes: S-expression text (the default) and binary (magic BSHK header + raw bytes). Binary mode bypasses S-expression parsing entirely.

workload    | Py S-exp  | Py binary | C S-exp | C binary
100 × 16 B  | 3.43 ms   | 0.74 ms   | 0.40 ms | 0.15 ms
1k × 16 B   | 23.24 ms  | 0.76 ms   | 2.77 ms | 0.22 ms
10k × 16 B  | 218.82 ms | 1.27 ms   | CLIFF   | 0.88 ms
100k × 16 B | 2,219 ms  | 10.18 ms  | CLIFF   | 10.35 ms
1M × 16 B   | 23,811 ms | 159 ms    | CLIFF   | 157 ms

Binary mode wins by roughly 30–200× over S-expression at scale: at 1 M × 16 B inputs, C-tier binary takes 157 ms versus 23,811 ms for Python-tier S-exp (the C-tier S-exp column reads CLIFF from 10k inputs upward), and bend beats host hashlib by ~12×. The CUDA toolchain stays isolated to the leaf binary the worker spawns: no tier links libcudart, and the asm tier hosts workers through hand-written pipe2 + fork + execve syscalls.
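The speedup figures can be recomputed from the timing table. This short Python sketch copies the Python-tier columns from the table and prints the S-exp to binary ratio per workload; the dictionary layout and the speedup helper are just one convenient way to arrange the arithmetic.

```python
# Timings in milliseconds, copied from the Python-tier columns of the table (16-byte inputs).
py_sexp = {"100": 3.43, "1k": 23.24, "10k": 218.82, "100k": 2219.0, "1M": 23811.0}
py_binary = {"100": 0.74, "1k": 0.76, "10k": 1.27, "100k": 10.18, "1M": 159.0}


def speedup(slow_ms, fast_ms):
    """How many times faster the fast path is than the slow path."""
    return slow_ms / fast_ms


ratios = {n: round(speedup(py_sexp[n], py_binary[n]), 1) for n in py_sexp}

for n, ratio in ratios.items():
    print(f"{n:>4} x 16 B: binary is {ratio}x faster than S-exp on the Python tier")
```

The same calculation cannot be run for the C tier past 1k inputs, since the table lists CLIFF rather than a time for C S-exp there.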

Fleet

Production default: single host. Multi-host fan-out is built (round-robin in bend.lsp via *bend-workers* + BEND_WORKERS env), available on demand for long-running parallel workloads — not used routinely.

host               | GPU               | arch  | port | status
3090-ai.foxhop.net | RTX 3090 (24 GB)  | sm_86 | 8320 | active: production worker
ai.foxhop.net      | RTX 4090 (24 GB)  | sm_89 | 8320 | active: 3-node mesh
cammy.foxhop.net   | Tesla P40 (24 GB) | sm_61 | 8320 | active: 3-node mesh (Pascal)

The 3-node mesh is active: the 3090, 4090, and P40 all serve bend workloads on port 8320 (BEND mnemonic). Select workers round-robin or explicitly via (bend-set-workers! '(("3090-ai.foxhop.net" . 8320) ("ai.foxhop.net" . 8320) ("cammy.foxhop.net" . 8320))). The coordinator follows a cost-routed pattern: it greedily bin-packs candidates by predicted cost, with per-host wall factors of 1.00 / 0.51 / 1.84 (3090 / 4090 / P40; Pascal lags but contributes a third concurrent stream), and one worker thread per node fires concurrently against our wire protocol. Each node serializes its own queue, because workers do not multiplex requests cleanly inside one CUDA context. The full hardware table, bench numbers, & LLM-coexistence detail are at foxhop gpu-mesh.
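A greedy cost-routed assignment of this kind can be sketched in a few lines of Python. The wall factors are the ones quoted above; the function name greedy_assign, the largest-first ordering, and the "earliest finish" placement rule are assumptions about how such a coordinator might work, not a description of the actual code.

```python
# Per-host wall factors quoted in the text (3090 / 4090 / P40); lower is faster.
WALL_FACTOR = {
    "3090-ai.foxhop.net": 1.00,
    "ai.foxhop.net": 0.51,
    "cammy.foxhop.net": 1.84,
}


def greedy_assign(costs, wall_factor=WALL_FACTOR):
    """Place each job on the host that would finish it earliest.

    Jobs are taken largest predicted cost first, and each host runs its own
    queue serially, so a host's finish time is the sum of its scaled job costs.
    Returns (assignment, finish_times).
    """
    finish = {host: 0.0 for host in wall_factor}
    assignment = {host: [] for host in wall_factor}
    for job, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        host = min(finish, key=lambda h: finish[h] + cost * wall_factor[h])
        finish[host] += cost * wall_factor[host]
        assignment[host].append(job)
    return assignment, finish


# Hypothetical predicted costs in arbitrary units.
jobs = {"a": 10, "b": 8, "c": 6, "d": 4, "e": 2}
assignment, finish = greedy_assign(jobs)
for host, queue in assignment.items():
    print(host, queue, round(finish[host], 2))
```

In this toy run the fastest host takes the most work, while the slowest host still picks up a job once the other queues are long enough, which matches the idea of Pascal contributing a third concurrent stream.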

Worker health heartbeat

Every dispatch goes through a lazy-refresh health probe so bend never wastes a payload on a worker that is down, overloaded, or VRAM-starved. The worker exposes a (health) op returning measured numbers; the client caches results & ranks healthy peers by free VRAM descending.

Worker side — (health)

; over the same wire as any other op:
(health)
→ (ok (load-avg 0.42)
         (vram-free-mb 22777)
         (uptime-ms 1780752288371))

Backward compatible: older workers without a (health) handler return (error (unknown-op health)) — the client treats that as ok with vram-free-mb=0 so a pre-heartbeat build still ranks as available.
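The backward-compatibility rule can be illustrated with a minimal Python reader for the reply. The tiny S-expression parser and the function names parse_sexp and interpret_health are hypothetical helpers written for this sketch. Treating any other reply as down is an assumption; the text only specifies the ok and unknown-op cases.

```python
import re


def parse_sexp(text):
    """Parse one S-expression into nested Python lists of str and float."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing parenthesis
            return items
        try:
            return float(token)
        except ValueError:
            return token

    return read()


def interpret_health(reply_text):
    """Map a (health) reply to a status record."""
    reply = parse_sexp(reply_text)
    if reply[0] == "ok":
        fields = {name: value for name, value in reply[1:]}
        return {"status": "ok", "vram_free_mb": int(fields.get("vram-free-mb", 0))}
    if reply == ["error", ["unknown-op", "health"]]:
        # Pre-heartbeat worker: still counts as available, with 0 MB reported.
        return {"status": "ok", "vram_free_mb": 0}
    return {"status": "down", "vram_free_mb": 0}  # assumed fallback


print(interpret_health("(ok (load-avg 0.42) (vram-free-mb 22777) (uptime-ms 1780752288371))"))
print(interpret_health("(error (unknown-op health))"))
```

Reporting 0 MB for older workers keeps them eligible while ranking them behind any peer that reports real free VRAM.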

Client side — *worker-health* cache

An alist keyed by "host:port" maps each known peer to a record (last-checked-ms status vram-mb) where status ∈ {ok, down}.
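As a rough Python analogue of that cache, a dict keyed by "host:port" can hold the same three-field record. The refresh window REFRESH_MS and the helper names record_health, is_stale, and rank_healthy are assumptions for illustration; the actual staleness window is not stated here.

```python
import time

REFRESH_MS = 5_000  # assumed staleness window, not a documented value


def now_ms():
    return int(time.time() * 1000)


def record_health(cache, peer, status, vram_mb, at_ms=None):
    """Store (last_checked_ms, status, vram_mb) for a peer keyed by 'host:port'."""
    cache[peer] = (now_ms() if at_ms is None else at_ms, status, vram_mb)


def is_stale(cache, peer, at_ms, refresh_ms=REFRESH_MS):
    """A peer needs a fresh probe when it is unknown or its record is too old."""
    record = cache.get(peer)
    return record is None or at_ms - record[0] >= refresh_ms


def rank_healthy(cache):
    """Return peers whose status is 'ok', most free VRAM first."""
    healthy = [(peer, rec[2]) for peer, rec in cache.items() if rec[1] == "ok"]
    return [peer for peer, _ in sorted(healthy, key=lambda pv: pv[1], reverse=True)]


worker_health = {}
record_health(worker_health, "3090-ai.foxhop.net:8320", "ok", 22777)
record_health(worker_health, "ai.foxhop.net:8320", "ok", 9000)
record_health(worker_health, "cammy.foxhop.net:8320", "down", 0)
print(rank_healthy(worker_health))
```

The VRAM numbers in the usage lines are made up, apart from 22777, which is borrowed from the sample (health) reply.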

Failure handling

Mid-flight failures in bend-dispatch-to-gpu (tcp-connect raising on a refused connection, the worker dropping a connection mid-reply, or a malformed reply) flip the worker to down. The next pick skips it for the cooldown window. A single failed call never aborts a multi-worker iteration: probes are wrapped in with-exception-handler, so a dead peer surfaces as 'transport-fail rather than propagating up.
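The same failure policy can be sketched in Python, with try/except standing in for with-exception-handler. The cooldown length COOLDOWN_MS, the function names probe_all and pickable, and the choice of OSError and ValueError as the caught failures are all assumptions for this sketch.

```python
COOLDOWN_MS = 30_000  # assumed cooldown window, not a documented value


def probe_all(peers, probe, cache, at_ms):
    """Probe every peer; a failing peer is recorded as down instead of raising."""
    results = {}
    for peer in peers:
        try:
            vram_mb = probe(peer)
            cache[peer] = (at_ms, "ok", vram_mb)
            results[peer] = "ok"
        except (OSError, ValueError):
            # Refused connection, dropped connection, or malformed reply.
            cache[peer] = (at_ms, "down", 0)
            results[peer] = "transport-fail"
    return results


def pickable(cache, peer, at_ms, cooldown_ms=COOLDOWN_MS):
    """A down peer is skipped until its cooldown window has elapsed."""
    checked, status, _ = cache.get(peer, (0, "ok", 0))
    return status == "ok" or at_ms - checked >= cooldown_ms
```

Because each probe is wrapped individually, one dead peer only changes its own cache entry and the loop continues with the remaining workers.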

Form catalog

A form earns a slot here only after we have published a benchmark or measured one on our hardware. "I think this would be fast" does not earn a slot — the form-status column says planned until numbers exist.

Status legend

Live forms

form | hardware | throughput | wire shape
cuda-shake-fanout | RTX 3090 | 12× host hashlib at 1 M × 16 B inputs | (cuda-shake-fanout '(hex ...) out-bytes) + BSHK binary
cuda-sim-ops-bin | RTX 3090 | 1.27× at 141 batches (9024 shots); crossover ~115 batches; SHAKE-RNG mode matches upstream Pareto bit-for-bit (Σ Toffoli = 15,999,651,264, avg Toffoli = 1,773,011.000 on stock ops.bin). Kernel wall scales linearly with Σ Toffoli across circuit variants: foxhop secp256k1 lever sweep at p=251 (n+1=9 secp256k1-toy) over 6 lumbda-emitted ops.bin variants ran 9.83× wall reduction Fermat-schoolbook → refined-Solinas, 261.4 ms → 26.6 ms / batch at 1024 shots; matches 6.95× Σ Toffoli ratio + parallel Clifford drop. No scheduler surprise: pick any QECCOPS1 circuit, predict GPU wall from a cheap CPU op-counter. | (cuda-sim-ops-bin "path/to/ops.bin" n-batches); demo_ops defaults --rng-mode shake (Fiat-Shamir over op stream); --rng-mode lfsr keeps legacy xorshift for debug parity
cuda-sim-axis-flip | RTX 3090 | 217 Mops/s @ K=32 M=4 (many-candidates × few-shots) | (cuda-sim-axis (variant-paths ...) n-shots)
cuda-bignum-cgbn | RTX 3090 | 1.28 Gops/s kernel mod-mul @ n=1M (256-bit, ~256× GMP CPU); 9 ops | BCGB binary: op_id + bitwidth + n + modulus + a + b
cuda-secp256k1-batched-mul | RTX 3090 | 13.83 Mkeys/s @ n=1M (~309× coincurve CPU; windowed-G ladder w=4) | BSCP binary: scalars + base-point → BSCR points
cuda-radix-sort | RTX 3090 | 5.50 Gkeys/s @ n=10M (kernel 1.82 ms); 3.77 Gkeys/s @ n=1M; CUB DeviceRadixSort u64 ascending | BSRT binary: op_id + n + u64[n] → BSRR sorted u64[n]
cuda-blake3-tree | RTX 3090 | 32.5 GB/s @ 1M × 64 B (kernel 1.97 ms); 20.4 GB/s @ 1k × 1 MB (kernel 51.5 ms); byte-identical to BLAKE3 reference spec across single-chunk + multi-chunk paths | BSB3 binary: out_bytes + n + (u32 len + bytes) per input → BSR3 n + out_bytes + digests

Surveyed forms (Wave 1)

form | throughput | hardware | ref
cuda-secp256k1-batched-mul | 6.5 Gkeys/s VanitySearch RTX 4090; 8.6 Gkeys/s RTX 5090; 2.65 Gkeys/s RTX 3080 | (promoted, see Live) | gECC · Bitcrack
cuda-bignum-cgbn | 100×+ on dense mul vs Xeon-20c + GMP + OpenMP | V100 | NVlabs CGBN · midsize-int
cuda-rho-pollard-walk | 87.7 M ops/sec for ECCp79 | RTX 2070 Super | atlomak · oritwoen
cuda-clifford-stabilizer | 186× over Stim (CPU SOTA) on equivalence-checking | STABSim | STABSim · Qimax
cuda-bernstein-yang-inv | 3–10× per inversion over Fermat on CPU; novel territory on GPU |  | safegcd · Jumping
cuda-ntt-poly | up to 123× over CPU; 21× on RTX 3070 | RTX 3070 | NTTSuite · FHE NTT
cuda-radix-sort | 1.4 G keys/sec; 20–50× over CPU merge sort; 257× vs Xeon Phi for scan | (promoted, see Live) | CUB · Onesweep

Surveyed forms (Wave 2)

Sorted by reported speedup descending.

form | throughput | hardware | ref
cuda-minhash-weighted | 600–1000× vs numpy+MKL | Titan X vs Xeon E5-1650 | src-d/minhashcuda
cuda-cuckoo-filter | 378× insert, 258× delete | A100 | arXiv:2603.15486
cuda-aes-ctr-chacha20 | 211–400 GB/s | single GPU | AsyncGBP
cuda-suffix-array-skew | 30–242× vs CPU SA-IS | Tesla K20 | Liu/Luo
cuda-kdtree-build | 30–242× build, 1.6–200× kNN | RTX (RT cores) | Zhou et al.
cuda-sat-paraFROST-elim | 93× peak, 48× avg variable elim | NVIDIA + Kissat baseline | ParaFROST
cuda-aho-corasick-pfac | ~50–100× IDS pkt-inspect | GTX-class | PFAC
cuda-dilithium-pqsig | 57.7× keygen+sign+verify vs single CPU thread | RTX 3090 Ti | IACR 2024/1365
cuda-mc-options-pricing | 25–152× | Tesla C1060 / modern | GPU Gems Ch.45
cuda-cuFFT-batched-1D | 8–32× vs MKL; tcFFT 1.1–3.2× vs cuFFT | V100 / A100 | tcFFT
cuda-blake3-tree | 32.5 GB/s @ 1M × 64 B; 20.4 GB/s @ 1k × 1 MB on 3090 | (promoted, see Live) | Blaze-3
cuda-bloom-filter-modern | ~6× CPU; 3.4 B inserts/s | B200 / Perlmutter | arXiv:2512.15595
cuda-gemm-batched-FP8 | 4.8× FP8 vs A100; 716 TFLOPS H100 | H100 SXM | cuBLAS 12.0
cuda-batched-matrix-inverse | 4.3–16.8× vs MAGMA | P100 | Superfri 2018
cuda-hash-join-radix | 4 B tuples/s single; 1.8 T tuples/s on 1024 A100 | A100 cluster | ADMS-21
cuda-kmer-count | 4–6× vs KMC2 | RapidGKC, Gerbil | RapidGKC
cuda-cgraph-traversal | 38 B TEPS | DGX2 | cuGraph
cuda-triangle-count-TRUST | ~1 T TEPS | multi-A100 | TRUST
cuda-ldpc-bp-decoder | 10 Gbps with early-termination | GPGPU | MDPI Electronics 2022
cuda-nvcomp-zstd | 2.2× zstd; 1.4× LZ4; 1.9× snappy | H100 / A100 | nvCOMP

Surveyed forms (Wave 3)

form | throughput | hardware | ref
cuda-fluidx3d-lbm | 100–200× vs ANSYS Fluent; 8,799 MLUPS single A100 | A100 | FluidX3D
cuda-mfcc-spectral | ~97× CPU MFCC; STFT ~75× via cuSignal | GTX 580 / RTX 30-series | cuSignal
cuda-batched-lp-simplex | 95× over CPLEX; 5× over GLPK | GTX 980-class | arXiv 1802.08557
cuda-betweenness-centrality-weighted | 30–150× warp-centric weighted BC | GTX onwards | arXiv 1701.05975
cuda-cudasift-orb-ransac | ~60× SIFT CPU→GPU; ORB 11.3× | GTX 1060+ | CudaSift
cuda-cudasw-gasal2 | CUDASW++4.0 16.2×; 5.71 TCUPS on H100 | H100 | CUDASW++4.0
cuda-loopy-bp-mrf | 45× over CPU LBP for stereo MRF | GTX 280+ | arXiv 2509.22337
cuda-sgm-stereo | 42 fps @ 640×480, 128 disparities | Tegra X1 / discrete | arXiv 1610.04121
cuda-hungarian-lap | 10–50×; 400 M-variable LAP in ~13 s | NVIDIA GPU | ScienceDirect
cuda-msm-bls12-381 | 27.86× over Pippenger AVX baseline | A100 / RTX 4090 | SimdMSM TCHES
cuda-pdwt-lifting | 15.9× over best optimized CPU DWT | GTX / Tesla | PDWT
cuda-tensornet-contract | 8–20× vs CuPy; tensor QR ~100× vs Xeon | A100 | cuTensorNet
cuda-ega-gpu-aggregation | 6.45–29.12× multi-pass; group-by 19.4× | NVIDIA GPU | VLDB Top-k EGA
cuda-bicgstab-ilu-spmv | SpTRSV 10.7×; BiCGSTAB 3.2× vs cuSPARSE | V100 / MI210 | arXiv 2508.04917
cuda-icicle-snark-groth16 | fastest Groth16 today; NTT 91% of prover | RTX 4090 / A100 | ICICLE-Snark

Surveyed forms (Wave 4)

form | throughput | hardware | ref
cuda-kalman-batched | 1386× for 5000-component measurements | various | CUDAkalmanFilter
cuda-g6k-tensor-sieve | 1230× vs G6K CPU sieve at dim 120; SVP record dim 180 on 4 Turing GPUs | 4 Turing | Ducas/Stevens/van Woerden EC 2021
cuda-zkspeed-sumcheck-hyperplonk | 801× geomean over CPU; sumcheck 8.4 s → 9.5 ms | full-chip accelerator | zkSpeed HPCA 2025
cuda-kyber-batched-ntt | ~451× batched (B=65k); HI-Kyber 6.47× over prior GPU SOTA | RTX 3080 | HI-Kyber
cuda-ironman-ote | 237× OT throughput vs full-thread CPU | near-memory variant | Ironman arXiv 2507.16391
cuda-cudss-cholesky | >100× vs QDLDL; 20× vs CHOLMOD factor | NVIDIA | NVIDIA cuDSS
cuda-particle-filter | ~150× absolute (5000 particles @ 170 Hz) | GPGPU | EURASIP J ASP 2013
cuda-rk-stiff-chemkin | 126× vs single-core; 25× vs 6-core for hydrogen RKCK 524k ODEs | GPGPU | Niemeyer & Sung
cuda-fem-assembly-jit | 87× assembly vs serial CPU; 126× peak numerical integration | GPGPU | Mironov et al.
cuda-cufalcon-sign | 201k sig/s Falcon-512 on A100; verify 2.72M sig/s, 29.5× vs AVX2 | A100 | cuFalcon eprint 2025/249
cuda-cudahull-3d | 30–40× over Qhull CPU | NVIDIA | CudaHull CAG 2012
cuda-fastplay-garbled | 35–40× over serial garbling on GPU cluster | GPU cluster | Fastplay eprint 2011/097
cuda-air-fri | ~22.8× avg end-to-end ZK speedup; FRI commitment | GPGPU | Air-FRI SAC 2025
cuda-rabin-fingerprint | 16× over single-thread CPU; 40 Gbps absolute | GTX 780 (HARENS) | HARENS CloudCom 2016
cuda-gdel3d | 10× over CGAL 3D Delaunay; 70× Voronoi/jump-flood at 10M points | NVIDIA | gDel3D I3D 2014
cuda-perasure-crs | 10× vs multithread Jerasure; 10 GB/s on GTX780 absolute | GTX 780 | PErasure IEEE Cluster 2015
cuda-piranha-mpc | 4× vs CryptGPU on VGG16 private inference; full 3/4-party stacks single-GPU | single GPU | Piranha USENIX Sec 2022
cuda-scamp-matrix-profile | quintillion pairwise comparisons / day (absolute) | GPGPU | SCAMP
cuda-cudtw-subseq | 2–3 orders of magnitude over UCR-Suite CPU; soft-DTW up to 5000× | Volta | cuDTW++ Euro-Par 2020
cuda-terachem-dft | 1–2 orders of magnitude over CPU; 8–50× vs GAMESS on 256-core cluster | 4× Tesla | TeraChem

Wave 4 filter-outs: MAFFT MSA (11–20×, surpassed), RAxML likelihood (32× kernel only, ~3–10× end-to-end), AmgX CG (3–4× vs AmgX baseline), BVH Karras LBVH (2–3× over prior GPU LBVH), LDPC decode (40–160 Mbps, not ≥10× over modern SIMD CPU), mesh decimation (application-dependent), discrete Gaussian sampler (single-digit % gains; fold into Kyber NTT). Revisit when the published number changes.

Why a form earns its slot

A form is GPU-worth-it when at least one of these holds:

  1. Embarrassingly parallel. N independent items, no cross-item dependency.
  2. Dense, branch-free inner loop. Same operation on every element.
  3. Reduction-friendly. Tree-reduce / prefix-sum / parallel-scan patterns.
  4. Big batch amortizes fixed kernel overhead.

When none of these hold, find a different decomposition: parallelize on a different axis, or stay on CPU & fan out across fleet hosts.

Source & specs

examples/cuda-fanout/ — wire contract, daemon protocol, bench data, per-tier integration.
CATALOG.md — canonical source for form metadata.