Converting research papers into trainable Nerfstudio plugins with a 100% success rate, where existing approaches fail 95% of the time.
Neural radiance field (NeRF) research has expanded rapidly, creating a critical bottleneck: most papers ship without code, forcing researchers to spend weeks reimplementing methods before building upon them. We introduce NERFIFY, a multi-agent framework that converts NeRF research papers into trainable Nerfstudio plugins with a 100% success rate, while baselines such as Paper2Code, AutoP2C, and GPT-5 fail to produce runnable code 95% of the time.
Unlike generic paper-to-code systems that prioritize breadth, NERFIFY achieves domain-specific executability through six key innovations: context-free grammar formalization of Nerfstudio, Graph-of-Thought multi-agent synthesis, compositional citation recovery, closed-loop visual refinement, agentic knowledge enhancement, and the NERFIFY-BENCH evaluation framework. On research papers without public implementations, NERFIFY achieves visual quality matching expert human code (within ±0.5 dB PSNR and ±0.02 SSIM) while reducing implementation time from weeks to hours.
Watch NERFIFY convert a NeRF research paper into a fully trainable Nerfstudio plugin, from PDF parsing through compositional citation recovery to final rendering.
We formalize Nerfstudio as a CFG that constrains LLM synthesis, ensuring generated code satisfies architectural invariants by construction.
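Grammar-constrained synthesis can be sketched in a few lines. The toy grammar below is illustrative only, not NERFIFY's actual formalization of Nerfstudio; the production rules and the `is_valid_plugin` helper are hypothetical:

```python
# Minimal sketch of CFG-constrained plugin validation. The productions
# below are a toy stand-in for the real Nerfstudio grammar: a plugin
# must derive a config, a model, and a method registration, in order.
GRAMMAR = {
    "plugin":       [["config", "model", "registration"]],
    "config":       [["model_config"], ["model_config", "trainer_config"]],
    "model":        [["field", "renderer"], ["field", "sampler", "renderer"]],
    "registration": [["method_spec"]],
}

def derives(symbol, tokens):
    """Return leftover tokens if `tokens` starts with a derivation of
    `symbol`, else None. Symbols without productions are terminals.
    (Greedy first-match; fine for this unambiguous toy grammar.)"""
    if symbol not in GRAMMAR:  # terminal
        return tokens[1:] if tokens and tokens[0] == symbol else None
    for production in GRAMMAR[symbol]:
        rest = tokens
        for sym in production:
            rest = derives(sym, rest)
            if rest is None:
                break
        else:
            return rest
    return None

def is_valid_plugin(components):
    """A generated plugin is accepted only if its ordered component
    list is fully derivable from the grammar's start symbol."""
    return derives("plugin", components) == []
```

Synthesis constrained this way rejects structurally invalid outputs by construction, rather than discovering the violation at runtime.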
Specialized multi-file agents generate repositories in topological dependency order, validating interface contracts at each node.
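Validating an interface contract at a node can be done mechanically with Python's `inspect` module. A minimal sketch, where the frozen contract and generated function are hypothetical examples, not NERFIFY's real interfaces:

```python
import inspect

# A frozen interface contract, fixed before any implementation agent runs.
# The name and parameters here are illustrative.
def get_outputs_contract(self, ray_bundle): ...

def check_contract(generated_fn, contract_fn):
    """Reject a generated function whose parameter list has drifted
    from the frozen contract."""
    got = list(inspect.signature(generated_fn).parameters)
    want = list(inspect.signature(contract_fn).parameters)
    return got == want

# A generated implementation that satisfies the contract:
def get_outputs(self, ray_bundle):
    return {"rgb": None}  # placeholder body
```

Running `check_contract(get_outputs, get_outputs_contract)` returns `True`; an implementation that renames or adds parameters is caught before it propagates downstream.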
Agents automatically retrieve and integrate missing components (samplers, encoders, proposal networks) from citation graphs of referenced papers.
PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching iteratively improve rendering quality.
Beyond reproduction, NERFIFY identifies optimization opportunities in existing implementations, discovering missing regularizers and architectural refinements.
The first evaluation framework for NeRF paper-to-code synthesis, covering 30 diverse papers across four distinct categories.
NERFIFY converts NeRF papers into executable code through four coordinated stages, each addressing a specific challenge in automated code synthesis for complex vision systems.
The agent parses and summarizes PDFs into structured markdown using MinerU, then maps content to Nerfstudio's grammar. Curated paper-code pairs serve as in-context examples for synthesis.
Citation graphs are traversed to retrieve missing components from referenced papers. For example, implementing K-Planes requires components from 7 direct dependencies and 12 total papers.
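Compositional recovery amounts to a breadth-first traversal of the citation graph. A sketch over a toy graph; the edges below are illustrative, not the real K-Planes dependency set:

```python
from collections import deque

# Toy citation graph: paper -> papers it borrows components from.
# These edges are hypothetical examples.
CITES = {
    "K-Planes":    ["TensoRF", "Instant-NGP", "Mip-NeRF"],
    "TensoRF":     ["NeRF"],
    "Instant-NGP": ["NeRF"],
    "Mip-NeRF":    ["NeRF"],
    "NeRF":        [],
}

def transitive_dependencies(paper, graph):
    """Collect every paper reachable from `paper`, breadth-first, so
    missing components can be retrieved in discovery order."""
    seen, order, queue = {paper}, [], deque([paper])
    while queue:
        current = queue.popleft()
        for cited in graph.get(current, []):
            if cited not in seen:
                seen.add(cited)
                order.append(cited)
                queue.append(cited)
    return order
```

Here `transitive_dependencies("K-Planes", CITES)` yields direct dependencies first, then their transitive closure, which is the order in which an agent would fetch samplers, encoders, and proposal networks.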
A Graph-of-Thought approach orchestrates specialized file-agents to generate code in topological order: DAG construction, interface freezing, implementation, and integration testing.
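The topological-order stage can be sketched with the standard library's `graphlib`. The file names and dependency edges below are a hypothetical repository layout, not NERFIFY's actual DAG:

```python
from graphlib import TopologicalSorter

# Hypothetical repository DAG: file -> files it imports from.
deps = {
    "my_method/config.py":   set(),
    "my_method/field.py":    {"my_method/config.py"},
    "my_method/model.py":    {"my_method/field.py", "my_method/config.py"},
    "my_method/pipeline.py": {"my_method/model.py"},
}

# Each file-agent generates (and smoke-tests) its file only after
# every file it depends on already exists.
generation_order = list(TopologicalSorter(deps).static_order())
```

The config file comes out first and the pipeline last, so interface contracts frozen at upstream nodes are available when downstream agents run.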
The critique agent diagnoses artifacts through PSNR-minima analysis, geometric validation, and VLM-guided patching, iteratively refining quality until convergence.
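The PSNR-minima ROI idea can be sketched as tiling the rendered image, scoring each tile against ground truth, and handing the worst tile to the critique agent. A minimal pure-Python illustration (the real system presumably operates on tensors; `patch=4` and the helper names are assumptions):

```python
import math

def psnr(pred, gt, max_val=1.0):
    """PSNR over flat pixel lists with values in [0, max_val]."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10 * math.log10(max_val ** 2 / mse)

def worst_patch(pred_img, gt_img, patch=4):
    """Tile two HxW images (nested lists) into patch x patch blocks
    and return the (row, col) origin of the lowest-PSNR block:
    the region of interest for VLM-guided patching."""
    h = len(pred_img)
    w = len(pred_img[0])
    scores = {}
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            p = [pred_img[i][j] for i in range(r, r + patch) for j in range(c, c + patch)]
            g = [gt_img[i][j] for i in range(r, r + patch) for j in range(c, c + patch)]
            scores[(r, c)] = psnr(p, g)
    return min(scores, key=scores.get)
```

The returned tile origin localizes the artifact, so the critique loop inspects a small ROI instead of the full frame.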
We evaluate NERFIFY on the NERFIFY-BENCH dataset of 30 diverse NeRF papers across four categories. All experiments are conducted on NVIDIA A6000 GPUs (48 GB) with 100k training iterations on Blender and DTU datasets.
NERFIFY is the only system that consistently produces code that compiles, trains stably, and converges to paper-reported quality. All baselines fail to generate trainable code despite producing syntactically valid Python.
| Metric | Paper2Code | AutoP2C | GPT-5 | R1 | NERFIFY (Ours) |
|---|---|---|---|---|---|
| Imports Resolve | ✓ | ✗ | ✓ | ✓ | ✓ |
| Compiles / Trainable | ✗ | ✗ | ✗ | ✗ | ✓ |
| Training Stability | ✗ | ✗ | ✗ | ✗ | ✓ |
| Converges to Paper Results | ✗ | ✗ | ✗ | ✗ | ✓ |
Comparison against expert human implementations for papers with no public code. All baselines failed to generate trainable code on these papers.
| Paper | PSNR↑ (Reported) | SSIM↑ (Reported) | LPIPS↓ (Reported) | PSNR↑ (Expert) | SSIM↑ (Expert) | LPIPS↓ (Expert) | PSNR↑ (Ours) | SSIM↑ (Ours) | LPIPS↓ (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| KeyNeRF | 25.65 | 0.89 | 0.11 | 25.70 | 0.89 | 0.12 | 26.12 | 0.90 | 0.09 |
| mi-MLP NeRF | 24.70 | 0.89 | 0.09 | 22.64 | 0.87 | 0.15 | 22.85 | 0.87 | 0.15 |
| ERS | 27.85 | 0.94 | 0.06 | 26.87 | 0.90 | 0.12 | 27.02 | 0.90 | 0.12 |
| TVNeRF | 27.44 | 0.93 | 0.08 | 26.81 | 0.92 | 0.12 | 27.30 | 0.92 | 0.10 |
| Anisotropic NeRF | 34.08 | 0.97 | 0.05 | 28.85 | 0.94 | 0.06 | 29.01 | 0.94 | 0.06 |
| NeRF-ID | 25.15 | 0.94 | – | 23.01 | 0.89 | 0.13 | 23.10 | 0.89 | 0.13 |
| HybNeRF | 33.94 | 0.96 | 0.047 | 30.45 | 0.94 | 0.07 | 30.51 | 0.95 | 0.07 |
| AR-NeRF | 20.36 | 0.79 | 0.17 | 19.00 | 0.76 | 0.19 | 20.05 | 0.78 | 0.18 |
NERFIFY achieves performance comparable to or better than original author repositories and gold-standard Nerfstudio integrations.
| Method | PSNR↑ (Orig.) | SSIM↑ (Orig.) | LPIPS↓ (Orig.) | PSNR↑ (Ours) | SSIM↑ (Ours) | LPIPS↓ (Ours) |
|---|---|---|---|---|---|---|
| Vanilla NeRF | 31.36 | 0.95 | 0.04 | 31.36 | 0.95 | 0.04 |
| Nerfacto | 20.36 | 0.82 | 0.22 | 20.36 | 0.82 | 0.22 |
| SeaThru-NeRF | 27.89 | 0.83 | 0.22 | 30.08 | 0.92 | 0.07 |
| InstantNGP | 32.77 | – | – | 32.64 | 0.96 | 0.06 |
| ℓ0 Sampler | 29.21 | – | 0.04 | 30.13 | 0.97 | 0.03 |
| DeblurNeRF | 32.08 | 0.93 | 0.50 | 31.10 | 0.86 | 0.06 |
Semantic implementation scores and trainability across code generation systems. Only NERFIFY consistently produces trainable NeRF plugins.
| Paper | GPT-5 Score | GPT-5 Train | ChatDev Score | ChatDev Train | MetaGPT Score | MetaGPT Train | DeepCode Score | DeepCode Train | NERFIFY Score | NERFIFY Train |
|---|---|---|---|---|---|---|---|---|---|---|
| KeyNeRF | 0.85 | ~ | 0.25 | ✗ | 0.30 | ✗ | 0.60 | ✗ | 1.00 | ✓ |
| FastNeRF | 0.65 | ✗ | 0.21 | ✗ | 0.36 | ✗ | 0.81 | ✗ | 0.95 | ✓ |
| Vanilla NeRF | 0.71 | ✓ | 0.48 | ✗ | 0.29 | ✗ | 0.53 | ✗ | 0.92 | ✓ |
| Deblur-NeRF | 0.82 | ~ | 0.18 | ✗ | 0.42 | ✗ | 0.75 | ✗ | 1.00 | ✓ |
| Average | 0.76 | ~ | 0.28 | ✗ | 0.34 | ✗ | 0.67 | ✗ | 0.97 | ✓ |
We systematically ablate each component of NERFIFY to quantify its contribution. The results confirm the importance of domain knowledge, compositional reasoning, iterative validation, and visual refinement.
| Configuration | Score | Trainable (%) | Correct Novelties | PSNR↑ |
|---|---|---|---|---|
| NERFIFY (Full) | 0.98 | 100 | 1.00 | 27.16 |
| **Knowledge Sources** | | | | |
| w/o In-context Examples | 0.71 | 90 | 1.00 | – |
| w/o Citation Recovery | 0.68 | 100 | 0.65 | – |
| w/o Both | 0.58 | 90 | 0.65 | 23.22 |
| **Validation & Feedback** | | | | |
| w/o Smoke Tests | 0.69 | 60 | 0.85 | – |
| w/o VLM Feedback + Smoke Tests | – | – | – | 9.39 |
| **Planning Strategy** | | | | |
| One-Shot (no GoT) | 0.45 | 70 | 1.00 | 24.52 |
Removing smoke tests drops trainability to 60%. Disabling citation recovery causes 35% of novel techniques to be missed. One-shot generation collapses the semantic score to 0.45 despite implementing the core equations correctly, revealing failures in establishing module boundaries.
We contribute NERFIFY-BENCH, the first evaluation framework specifically designed for NeRF paper-to-code synthesis. Papers are curated to cover diverse architectural innovations, training strategies, and integration complexity.
Each benchmark entry includes the frozen PDF, dual markdown representations (raw and cleaned), LaTeX source code, and ground truth repositories where available. Evaluation spans executability metrics (build success, import resolution, training stability) and rendering quality metrics (PSNR, SSIM, LPIPS) across standard benchmarks.
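One of the executability metrics, import resolution, can be checked mechanically without running any training. A minimal sketch using the standard library; the module list passed in is whatever the generated plugin's top-level imports are:

```python
import importlib.util

def imports_resolve(module_names):
    """Executability check: return the top-level modules imported by a
    generated plugin that do NOT resolve in the current environment.
    An empty list means the 'Imports Resolve' metric passes."""
    return [m for m in module_names if importlib.util.find_spec(m) is None]

# e.g. for a generated plugin importing torch and nerfstudio:
# imports_resolve(["torch", "nerfstudio"]) -> [] when both are installed
```

Build success and training stability require actually launching the plugin, but this cheap check catches a common baseline failure mode (hallucinated dependencies) up front.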