NERFIFY: Turning NeRF Papers into Code

Abstract

Bridging the NeRF Reproducibility Gap

Neural radiance field (NeRF) research has expanded rapidly, creating a critical bottleneck: most papers ship without code, forcing researchers to spend weeks reimplementing methods before building upon them. We introduce NERFIFY, a multi-agent framework that converts NeRF research papers into trainable Nerfstudio plugins with 100% success rate, while baselines like Paper2Code, AutoP2C, and GPT-5 fail 95% of the time to produce runnable code.

Unlike generic paper-to-code systems that prioritize breadth, NERFIFY achieves domain-specific executability through six key innovations: context-free grammar formalization of Nerfstudio, Graph-of-Thought multi-agent synthesis, compositional citation recovery, closed-loop visual refinement, agentic knowledge enhancement, and the NERFIFY-BENCH evaluation framework. On research papers without public implementations, NERFIFY achieves visual quality matching expert human code (±0.5 dB PSNR, ±0.02 SSIM) while reducing implementation time from weeks to hours.

Key Contributions

Six Innovations for Reliable Paper-to-Code

1. Context-Free Grammar

We formalize Nerfstudio as a CFG that constrains LLM synthesis, ensuring generated code satisfies architectural invariants by construction.

2. Graph-of-Thought Synthesis

Specialized multi-file agents generate repositories in topological dependency order, validating interface contracts at each node.

3. Compositional Citation Recovery

Agents automatically retrieve and integrate missing components (samplers, encoders, proposal networks) from citation graphs of referenced papers.

4. Closed-Loop Visual Refinement

PSNR-minima ROI analysis, cross-view geometric validation, and VLM-guided patching iteratively improve rendering quality.

5. Agentic Knowledge Enhancement

Beyond reproduction, NERFIFY identifies optimization opportunities in existing implementations, discovering missing regularizers and architectural refinements.

6. NERFIFY-BENCH Benchmark

The first evaluation framework for NeRF paper-to-code synthesis, covering 30 diverse papers across four distinct categories.

Method

Four-Stage Pipeline

NERFIFY converts NeRF papers into executable code through four coordinated stages, each addressing a specific challenge in automated code synthesis for complex vision systems.

Figure 2. NERFIFY converts NeRF papers into code through four stages: (1) Agent parses and summarizes PDFs into structured markdown; CFG from Nerfstudio and curated paper-code pairs serve as in-context examples saved in K. (2) Compositional dependency resolution traverses citation graphs to retrieve missing components from referenced papers. (3) GoT code synthesis generates repository files through specialized agents operating in topological order. (4) Visual refinement iteratively patches artifacts until achieving expert-level quality.

Stage 1

CFG Formalization & In-Context Learning

The agent parses and summarizes PDFs into structured markdown using MinerU, then maps content to Nerfstudio's grammar. Curated paper-code pairs serve as in-context examples for synthesis.

Stage 2

Compositional Dependency Resolution

Citation graphs are traversed to retrieve missing components from referenced papers. For example, implementing K-Planes requires components from 7 direct dependencies and 12 total papers.

Stage 3

Grammar-Guided Repository Generation

A Graph-of-Thought approach orchestrates specialized file-agents to generate code in topological order: DAG construction, interface freezing, implementation, and integration testing.

Stage 4

Visual-Driven Feedback

The critique agent diagnoses artifacts through PSNR-minima analysis, geometric validation, and VLM-guided patching, iteratively refining quality until convergence.

NeRF citation dependency graphs showing K-Planes requires components from 7 direct dependencies and 12 total papers

Figure 3. NeRF citation dependency graphs. Implementing K-Planes requires retrieving components from 7 direct dependencies (Plenoxels, TensoRF, Instant-NGP, Mip-NeRF 360, DyNeRF, EG3D, NeRF-W) and 12 total papers with transitive dependencies. Our compositional citation recovery automatically traverses such graphs to identify and retrieve all necessary components.

Graph-of-Thought Multi-Agent Code Synthesis showing progressive file generation through DAG construction, interface freezing, implementation, and integration testing

Figure 4. Graph-of-Thought (GoT) Multi-Agent Code Synthesis. The master agent orchestrates specialized file-agents that progressively build a NeRF repository. Each step shows files being created or modified through four phases: DAG Construction maps papers to Nerfstudio component dependencies, Interface Freeze establishes API contracts in topological order, Implementation generates validated code with shape/gradient checks, and Integration Testing runs smoke tests with automated repair.

Results

Experimental Evaluation

We evaluate NERFIFY on the NERFIFY-BENCH dataset of 30 diverse NeRF papers across four categories. All experiments are conducted on NVIDIA A6000 GPUs (48 GB) with 100k training iterations on Blender and DTU datasets.

Executability Comparison

NERFIFY is the only system that consistently produces code that compiles, trains stably, and converges to paper-reported quality. All baselines fail to generate trainable code despite producing syntactically valid Python.

Metric	Paper2Code	AutoP2C	GPT-5	R1	NERFIFY (Ours)
Imports Resolve	✓	✗	✓	✓	✓
Compiles / Trainable	✗	✗	✗	✗	✓
Training Stability	✗	✗	✗	✗	✓
Converges to Paper Results	✗	✗	✗	✗	✓

Set 1: Never-Implemented Papers

Comparison against expert human implementations for papers with no public code. All baselines failed to generate trainable code on these papers.

Paper	Paper Reported			Human Expert			NERFIFY (Ours)
Paper	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
KeyNeRF	25.65	0.89	0.11	25.70	0.89	0.12	26.12	0.90	0.09
mi-MLP NeRF	24.70	0.89	0.09	22.64	0.87	0.15	22.85	0.87	0.15
ERS	27.85	0.94	0.06	26.87	0.90	0.12	27.02	0.90	0.12
TVNeRF	27.44	0.93	0.08	26.81	0.92	0.12	27.30	0.92	0.10
Anisotropic NeRF	34.08	0.97	0.05	28.85	0.94	0.06	29.01	0.94	0.06
NeRF-ID	25.15	0.94	–	23.01	0.89	0.13	23.10	0.89	0.13
HybNeRF	33.94	0.96	0.047	30.45	0.94	0.07	30.51	0.95	0.07
AR-NeRF	20.36	0.79	0.17	19.00	0.76	0.19	20.05	0.78	0.18

Visual comparison of Ground Truth, Expert Implementation, and NERFIFY output across KeyNeRF, mi-MLP NeRF, Anisotropic NeRF, ERS, and TVNeRF

Figure 5. Visual comparison of NERFIFY and human expert implementations on Set 1 (never-implemented papers). Left: Ground Truth. Middle: Expert Implementation. Right: NERFIFY. Our system reproduces fine details including specular highlights, geometric edges, and texture patterns, matching expert-level visual quality.

Additional visual comparisons showing LiNeRF, mi-MLP NeRF, NeRF-ID, and TVNeRF results

Figure 6. Extended qualitative comparisons from NERFIFY-BENCH Set 1. Left: Ground Truth. Middle: Expert Implementation. Right: NERFIFY. Across diverse scene types and NeRF architectures, NERFIFY consistently achieves rendering quality comparable to expert human implementations.

Sets 2 & 3: Comparison with Existing Implementations

NERFIFY achieves comparable or better performance when compared to original author repositories and gold-standard Nerfstudio integrations.

Method	Original Repository			NERFIFY (Ours)
Method	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
Vanilla NeRF	31.36	0.95	0.04	31.36	0.95	0.04
Nerfacto	20.36	0.82	0.22	20.36	0.82	0.22
SeaThru-NeRF	27.89	0.83	0.22	30.08	0.92	0.07
InstantNGP	32.77	–	–	32.64	0.96	0.06
ℓ₀ Sampler	29.21	–	0.04	30.13	0.97	0.03
DeblurNeRF	32.08	0.93	0.50	31.10	0.86	0.06

Multi-Agent Baseline Comparison

Semantic implementation scores and trainability across code generation systems. Only NERFIFY consistently produces trainable NeRF plugins.

Paper	GPT-5		ChatDev		MetaGPT		DeepCode		NERFIFY
	Score	Train	Score	Train	Score	Train	Score	Train	Score	Train
KeyNeRF	0.85	~	0.25	✗	0.30	✗	0.60	✗	1.00	✓
FastNeRF	0.65	✗	0.21	✗	0.36	✗	0.81	✗	0.95	✓
Vanilla NeRF	0.71	✓	0.48	✗	0.29	✗	0.53	✗	0.92	✓
Deblur-NeRF	0.82	~	0.18	✗	0.42	✗	0.75	✗	1.00	✓
Average	0.76	~	0.28	✗	0.34	✗	0.67	✗	0.97	✓

Ablation Study

Every Component Matters

We systematically ablate each component of NERFIFY to quantify its contribution. The results confirm the importance of domain knowledge, compositional reasoning, iterative validation, and visual refinement.

Configuration	Score	Trainable (%)	Correct Novelties	PSNR
NERFIFY (Full)	0.98	100	1.00	27.16
Knowledge Sources
w/o In-context Examples	0.71	90	1.00	–
w/o Citation Recovery	0.68	100	0.65	–
w/o Both	0.58	90	0.65	23.22
Validation & Feedback
w/o Smoke Tests	0.69	60	0.85	–
w/o VLM Feedback + Smoke Tests	–	–	–	9.39
Planning Strategy
One-Shot (no GoT)	0.45	70	1.00	24.52

Removing smoke tests drops trainability to 60%. Disabling citation recovery causes 35% of novel techniques to be missed. One-shot generation collapses semantic score to 0.45 despite correct equation implementation, revealing failures in module boundary establishment.

Benchmark

NERFIFY-BENCH: 30 Papers, Four Categories

We contribute NERFIFY-BENCH, the first evaluation framework specifically designed for NeRF paper-to-code synthesis. Papers are curated to cover diverse architectural innovations, training strategies, and integration complexity.

Categorization of NeRF Papers by Integrability in Nerfstudio framework

Figure 7. Categorization of NeRF papers by integrability in Nerfstudio. Papers are grouped by implementation difficulty. Category 1: Directly integrable methods modifying architecture, rendering, or losses. Category 2: Methods requiring pretrained models (CLIP, diffusion, SLAM) with substantial engineering effort. Category 3: Out-of-scope works where NeRF serves different objectives.

Never-Implemented

No public code exists. Expert reimplementations serve as ground truth. Avoids LLM training data contamination.

Non-Nerfstudio

Public code exists but is not Nerfstudio-integrated, enabling direct comparison with original author implementations.

Nerfstudio-Integrated

Gold-standard Nerfstudio references for evaluating how well synthesis matches expert framework integration.

Novelty-Coverage

Papers with distinct technical contributions (novel losses, architectures, training strategies) to evaluate innovation capture.

Each benchmark entry includes the frozen PDF, dual markdown representations (raw and cleaned), LaTeX source code, and ground truth repositories where available. Evaluation spans executability metrics (build success, import resolution, training stability) and rendering quality metrics (PSNR, SSIM, LPIPS) across standard benchmarks.

NERFIFY: A Multi-Agent Framework for Turning NeRF Papers into Code