CVPR 2026

NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code

Converting research papers into trainable Nerfstudio plugins with a 100% success rate, where existing approaches fail to produce runnable code 95% of the time.

Seemandhar Jain1 Keshav Gupta1 Kunal Gupta1 Manmohan Chandraker1
1University of California, San Diego
Paper · Supplementary · arXiv · Video · Talk · Code (coming soon)
TL;DR — NERFIFY uses six specialized AI agents to read a NeRF research paper, recover all its implicit dependencies from cited work, and synthesize a complete Nerfstudio plugin that trains, converges, and matches expert-written code within ±0.5 dB PSNR. It works on papers that have no public code, reducing implementation time from weeks to hours.
Figure 1. Manual NeRF implementation requires weeks of specialized effort (left). Existing paper-to-code systems fail to produce trainable code. NERFIFY automates this process through grammar-constrained synthesis and compositional citation recovery, generating fully trainable Nerfstudio plugins in hours (right).
100% trainable-code success rate · 5% best-baseline success rate · ±0.5 dB PSNR gap vs. expert code · 30 papers in NERFIFY-BENCH

Bridging the NeRF Reproducibility Gap

Neural radiance field (NeRF) research has expanded rapidly, creating a critical bottleneck: most papers ship without code, forcing researchers to spend weeks reimplementing methods before building on them. We introduce NERFIFY, a multi-agent framework that converts NeRF research papers into trainable Nerfstudio plugins with a 100% success rate, while baselines such as Paper2Code, AutoP2C, and GPT-5 fail to produce runnable code 95% of the time.

Unlike generic paper-to-code systems that prioritize breadth, NERFIFY achieves domain-specific executability through six key innovations: context-free grammar formalization of Nerfstudio, Graph-of-Thought multi-agent synthesis, compositional citation recovery, closed-loop visual refinement, agentic knowledge enhancement, and the NERFIFY-BENCH evaluation framework. On research papers without public implementations, NERFIFY achieves visual quality matching expert human code (±0.5 dB PSNR, ±0.02 SSIM) while reducing implementation time from weeks to hours.

NERFIFY in Action

Watch NERFIFY convert a NeRF research paper into a fully trainable Nerfstudio plugin, from PDF parsing through compositional citation recovery to final rendering.

Six Innovations for Reliable Paper-to-Code

1. Context-Free Grammar

We formalize Nerfstudio as a CFG that constrains LLM synthesis, ensuring generated code satisfies architectural invariants by construction.
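As an illustration of the idea only (a toy grammar invented for this sketch, not the paper's actual CFG), grammar-constrained synthesis can be reduced to checking that a candidate component plan is derivable from the grammar, so any accepted plan satisfies the structural invariants by construction:

```python
# Toy context-free grammar over hypothetical Nerfstudio-style components.
# A candidate plugin "plan" (a flat list of component tokens) is accepted
# only if the start symbol derives it, so structure holds by construction.
GRAMMAR = {
    "Plugin": [["Config", "Model", "Pipeline"]],
    "Config": [["ModelConfig"], ["ModelConfig", "TrainerConfig"]],
    "Model": [["Field", "Sampler", "Renderer"]],
    "Pipeline": [["DataManager", "Model"]],
}
TERMINALS = {"ModelConfig", "TrainerConfig", "Field", "Sampler",
             "Renderer", "DataManager"}

def derives(symbol, tokens):
    """Return leftover tokens if `symbol` derives a prefix of `tokens`, else None."""
    if symbol in TERMINALS:
        return tokens[1:] if tokens and tokens[0] == symbol else None
    for production in GRAMMAR.get(symbol, []):
        rest = tokens
        for child in production:
            rest = derives(child, rest)
            if rest is None:
                break
        else:  # every child matched: this production succeeds
            return rest
    return None

def accepts(plan):
    """A plan is valid iff the start symbol derives it exactly."""
    return derives("Plugin", plan) == []
```

In the real system the grammar would constrain LLM decoding rather than post-hoc validation, but the acceptance criterion is the same.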

2. Graph-of-Thought Synthesis

Specialized multi-file agents generate repositories in topological dependency order, validating interface contracts at each node.

3. Compositional Citation Recovery

Agents automatically retrieve and integrate missing components (samplers, encoders, proposal networks) from citation graphs of referenced papers.

4. Closed-Loop Visual Refinement

PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching iteratively improve rendering quality.

5. Agentic Knowledge Enhancement

Beyond reproduction, NERFIFY identifies optimization opportunities in existing implementations, discovering missing regularizers and architectural refinements.

6. NERFIFY-BENCH Benchmark

The first evaluation framework for NeRF paper-to-code synthesis, covering 30 diverse papers across four distinct categories.

Four-Stage Pipeline

NERFIFY converts NeRF papers into executable code through four coordinated stages, each addressing a specific challenge in automated code synthesis for complex vision systems.

Figure 2. NERFIFY converts NeRF papers into code through four stages: (1) an agent parses and summarizes PDFs into structured markdown; the CFG derived from Nerfstudio and curated paper-code pairs serve as in-context examples stored in the knowledge base K. (2) Compositional dependency resolution traverses citation graphs to retrieve missing components from referenced papers. (3) GoT code synthesis generates repository files through specialized agents operating in topological order. (4) Visual refinement iteratively patches artifacts until reaching expert-level quality.
Stage 1

CFG Formalization & In-Context Learning

The agent parses and summarizes PDFs into structured markdown using MinerU, then maps content to Nerfstudio's grammar. Curated paper-code pairs serve as in-context examples for synthesis.

Stage 2

Compositional Dependency Resolution

Citation graphs are traversed to retrieve missing components from referenced papers. For example, implementing K-Planes requires components from 7 direct dependencies and 12 total papers.
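The traversal itself is a plain graph search. The sketch below uses the seven direct K-Planes dependencies from Figure 3; the transitive edges are illustrative stand-ins, not the paper's full citation graph:

```python
from collections import deque

# Toy citation graph: direct K-Planes dependencies are from Figure 3;
# the remaining edges are illustrative, not the paper's actual graph.
cites = {
    "K-Planes": ["Plenoxels", "TensoRF", "Instant-NGP",
                 "Mip-NeRF 360", "DyNeRF", "EG3D", "NeRF-W"],
    "TensoRF": ["NeRF"],
    "Mip-NeRF 360": ["Mip-NeRF"],
    "Mip-NeRF": ["NeRF"],
}

def transitive_deps(paper):
    """BFS over the citation graph, collecting every paper whose
    components may need to be retrieved."""
    seen, queue = set(), deque([paper])
    while queue:
        for cited in cites.get(queue.popleft(), []):
            if cited not in seen:
                seen.add(cited)
                queue.append(cited)
    return seen
```

With these edges, `transitive_deps("K-Planes")` returns the 7 direct dependencies plus the transitively reached NeRF and Mip-NeRF, mirroring how 7 direct dependencies expand to a larger closure.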

Stage 3

Grammar-Guided Repository Generation

A Graph-of-Thought approach orchestrates specialized file-agents to generate code in topological order: DAG construction, interface freezing, implementation, and integration testing.
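The topological ordering step can be sketched with Python's standard-library `graphlib` (the file names and dependency edges below are hypothetical, not the paper's actual repository layout). Each file-agent only runs once the interfaces of everything it depends on are frozen:

```python
from graphlib import TopologicalSorter

# Hypothetical plugin files, each mapped to the files it depends on.
deps = {
    "field.py": set(),
    "sampler.py": set(),
    "model.py": {"field.py", "sampler.py"},
    "config.py": {"model.py"},
    "pipeline.py": {"model.py", "config.py"},
}

# static_order() yields files with all predecessors first, so a
# file-agent always sees the frozen interfaces of its dependencies.
order = list(TopologicalSorter(deps).static_order())

for f, d in deps.items():
    assert all(order.index(x) < order.index(f) for x in d)
```

`graphlib` also raises `CycleError` on circular dependencies, which is a useful failure mode for the DAG-construction phase to catch early.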

Stage 4

Visual-Driven Feedback

The critique agent diagnoses artifacts through PSNR-minima analysis, geometric validation, and VLM-guided patching, iteratively refining quality until convergence.
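As a rough illustration of PSNR-minima ROI analysis (a simplified version we wrote for this sketch, not the paper's exact criterion), one can score fixed-size patches of a render against ground truth and surface the worst-scoring region for the critique agent to inspect:

```python
import math

def worst_patch(render, gt, patch=2):
    """Per-patch PSNR between a render and ground truth (grayscale
    lists-of-lists with values in [0, 1]; dimensions assumed divisible
    by `patch`). Returns ((row, col), psnr) of the lowest-PSNR patch,
    i.e. the ROI a critique agent would examine first."""
    h, w = len(render), len(render[0])
    worst = None
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            se = sum((render[i + di][j + dj] - gt[i + di][j + dj]) ** 2
                     for di in range(patch) for dj in range(patch))
            mse = se / (patch * patch)
            psnr = float("inf") if mse == 0 else 10 * math.log10(1.0 / mse)
            if worst is None or psnr < worst[1]:
                worst = ((i, j), psnr)
    return worst
```

The real pipeline additionally cross-checks the flagged region across views before asking the VLM for a patch, so single-view noise does not trigger spurious repairs.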

Figure 3. NeRF citation dependency graphs. Implementing K-Planes requires retrieving components from 7 direct dependencies (Plenoxels, TensoRF, Instant-NGP, Mip-NeRF 360, DyNeRF, EG3D, NeRF-W) and 12 total papers with transitive dependencies. Our compositional citation recovery automatically traverses such graphs to identify and retrieve all necessary components.
Figure 4. Graph-of-Thought (GoT) Multi-Agent Code Synthesis. The master agent orchestrates specialized file-agents that progressively build a NeRF repository. Each step shows files being created or modified through four phases: DAG Construction maps papers to Nerfstudio component dependencies, Interface Freeze establishes API contracts in topological order, Implementation generates validated code with shape/gradient checks, and Integration Testing runs smoke tests with automated repair.

Experimental Evaluation

We evaluate NERFIFY on the NERFIFY-BENCH dataset of 30 diverse NeRF papers across four categories. All experiments are conducted on NVIDIA A6000 GPUs (48 GB) with 100k training iterations on Blender and DTU datasets.

Executability Comparison

NERFIFY is the only system that consistently produces code that compiles, trains stably, and converges to paper-reported quality. All baselines fail to generate trainable code despite producing syntactically valid Python.

| Metric | Paper2Code | AutoP2C | GPT-5 | R1 | NERFIFY (Ours) |
|---|---|---|---|---|---|
| Imports Resolve | – | – | – | – | ✓ |
| Compiles / Trainable | ✗ | ✗ | ✗ | ✗ | ✓ |
| Training Stability | ✗ | ✗ | ✗ | ✗ | ✓ |
| Converges to Paper Results | ✗ | ✗ | ✗ | ✗ | ✓ |
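A minimal sketch (our illustration, not the paper's actual harness) of how the first two executability checks, compilation and import resolution, can be verified mechanically on generated source before any training is attempted:

```python
import ast
import importlib.util

def smoke_test(source):
    """Static executability check on generated Python source:
    (1) does it parse, and (2) do its top-level imports resolve
    in the current environment?"""
    try:
        tree = ast.parse(source)  # syntactically valid / compiles
    except SyntaxError:
        return "syntax error"
    for node in ast.walk(tree):
        names = []
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        for name in names:
            # Check only the top-level package for resolvability.
            if importlib.util.find_spec(name.split(".")[0]) is None:
                return f"unresolved import: {name}"
    return "ok"
```

Training stability and convergence, by contrast, require actually running short training jobs, which is why they are the checks that separate NERFIFY from syntactically valid but untrainable baseline output.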

Set 1: Never-Implemented Papers

Comparison against expert human implementations for papers with no public code. All baselines failed to generate trainable code on these papers.

| Paper | Paper Reported (PSNR↑ / SSIM↑ / LPIPS↓) | Human Expert (PSNR↑ / SSIM↑ / LPIPS↓) | NERFIFY (Ours) (PSNR↑ / SSIM↑ / LPIPS↓) |
|---|---|---|---|
| KeyNeRF | 25.65 / 0.89 / 0.11 | 25.70 / 0.89 / 0.12 | 26.12 / 0.90 / 0.09 |
| mi-MLP NeRF | 24.70 / 0.89 / 0.09 | 22.64 / 0.87 / 0.15 | 22.85 / 0.87 / 0.15 |
| ERS | 27.85 / 0.94 / 0.06 | 26.87 / 0.90 / 0.12 | 27.02 / 0.90 / 0.12 |
| TVNeRF | 27.44 / 0.93 / 0.08 | 26.81 / 0.92 / 0.12 | 27.30 / 0.92 / 0.10 |
| Anisotropic NeRF | 34.08 / 0.97 / 0.05 | 28.85 / 0.94 / 0.06 | 29.01 / 0.94 / 0.06 |
| NeRF-ID | 25.15 / 0.94 / – | 23.01 / 0.89 / 0.13 | 23.10 / 0.89 / 0.13 |
| HybNeRF | 33.94 / 0.96 / 0.047 | 30.45 / 0.94 / 0.07 | 30.51 / 0.95 / 0.07 |
| AR-NeRF | 20.36 / 0.79 / 0.17 | 19.00 / 0.76 / 0.19 | 20.05 / 0.78 / 0.18 |
Figure 5. Visual comparison of NERFIFY and human expert implementations on Set 1 (never-implemented papers). Left: Ground Truth. Middle: Expert Implementation. Right: NERFIFY. Our system reproduces fine details including specular highlights, geometric edges, and texture patterns, matching expert-level visual quality.
Figure 6. Extended qualitative comparisons from NERFIFY-BENCH Set 1. Left: Ground Truth. Middle: Expert Implementation. Right: NERFIFY. Across diverse scene types and NeRF architectures, NERFIFY consistently achieves rendering quality comparable to expert human implementations.

Sets 2 & 3: Comparison with Existing Implementations

NERFIFY performs on par with or better than original author repositories and gold-standard Nerfstudio integrations.

| Method | Original Repository (PSNR↑ / SSIM↑ / LPIPS↓) | NERFIFY (Ours) (PSNR↑ / SSIM↑ / LPIPS↓) |
|---|---|---|
| Vanilla NeRF | 31.36 / 0.95 / 0.04 | 31.36 / 0.95 / 0.04 |
| Nerfacto | 20.36 / 0.82 / 0.22 | 20.36 / 0.82 / 0.22 |
| SeaThru-NeRF | 27.89 / 0.83 / 0.22 | 30.08 / 0.92 / 0.07 |
| InstantNGP | 32.77 / – / – | 32.64 / 0.96 / 0.06 |
| 0 Sampler | 29.21 / – / 0.04 | 30.13 / 0.97 / 0.03 |
| DeblurNeRF | 32.08 / 0.93 / 0.50 | 31.10 / 0.86 / 0.06 |

Multi-Agent Baseline Comparison

Semantic implementation scores and trainability across code generation systems. Only NERFIFY consistently produces trainable NeRF plugins.

| Paper | GPT-5 | ChatDev | MetaGPT | DeepCode | NERFIFY |
|---|---|---|---|---|---|
| KeyNeRF | 0.85 ~ | 0.25 | 0.30 | 0.60 | 1.00 |
| FastNeRF | 0.65 | 0.21 | 0.36 | 0.81 | 0.95 |
| Vanilla NeRF | 0.71 | 0.48 | 0.29 | 0.53 | 0.92 |
| Deblur-NeRF | 0.82 ~ | 0.18 | 0.42 | 0.75 | 1.00 |
| Average | 0.76 ~ | 0.28 | 0.34 | 0.67 | 0.97 |

Every Component Matters

We systematically ablate each component of NERFIFY to quantify its contribution. The results confirm the importance of domain knowledge, compositional reasoning, iterative validation, and visual refinement.

| Configuration | Score | Trainable (%) | Correct Novelties | PSNR |
|---|---|---|---|---|
| NERFIFY (Full) | 0.98 | 100 | 1.00 | 27.16 |
| Knowledge Sources | | | | |
| w/o In-context Examples | 0.71 | 90 | 1.00 | – |
| w/o Citation Recovery | 0.68 | 100 | 0.65 | – |
| w/o Both | 0.58 | 90 | 0.65 | 23.22 |
| Validation & Feedback | | | | |
| w/o Smoke Tests | 0.69 | 60 | 0.85 | – |
| w/o VLM Feedback + Smoke Tests | – | – | – | 9.39 |
| Planning Strategy | | | | |
| One-Shot (no GoT) | 0.45 | 70 | 1.00 | 24.52 |

Removing smoke tests drops trainability to 60%, and disabling citation recovery causes 35% of novel techniques to be missed. One-shot generation (no GoT) collapses the semantic score to 0.45 even when equations are implemented correctly, showing that its failures stem from poorly established module boundaries.

NERFIFY-BENCH: 30 Papers, Four Categories

We contribute NERFIFY-BENCH, the first evaluation framework specifically designed for NeRF paper-to-code synthesis. Papers are curated to cover diverse architectural innovations, training strategies, and integration complexity.

Figure 7. Categorization of NeRF papers by integrability in Nerfstudio. Papers are grouped by implementation difficulty. Category 1: Directly integrable methods modifying architecture, rendering, or losses. Category 2: Methods requiring pretrained models (CLIP, diffusion, SLAM) with substantial engineering effort. Category 3: Out-of-scope works where NeRF serves different objectives.
10
Never-Implemented
No public code exists. Expert reimplementations serve as ground truth. Avoids LLM training data contamination.
5
Non-Nerfstudio
Public code exists but is not Nerfstudio-integrated, enabling direct comparison with original author implementations.
5
Nerfstudio-Integrated
Gold-standard Nerfstudio references for evaluating how well synthesis matches expert framework integration.
10
Novelty-Coverage
Papers with distinct technical contributions (novel losses, architectures, training strategies) to evaluate innovation capture.

Each benchmark entry includes the frozen PDF, dual markdown representations (raw and cleaned), LaTeX source code, and ground truth repositories where available. Evaluation spans executability metrics (build success, import resolution, training stability) and rendering quality metrics (PSNR, SSIM, LPIPS) across standard benchmarks.

BibTeX

@inproceedings{jain2026nerfify,
  title     = {NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code},
  author    = {Jain, Seemandhar and Gupta, Keshav and Gupta, Kunal and Chandraker, Manmohan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}