NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Ben Mildenhall · Pratul P. Srinivasan · Matthew Tancik · Jonathan T. Barron · Ravi Ramamoorthi · Ren Ng
Abstract. We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x,y,z) and viewing direction (θ,φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
1 Introduction
We address the long-standing problem of view synthesis by directly optimizing parameters of a continuous 5D scene representation to minimize the error of rendering a set of captured images. The representation is a multilayer perceptron (MLP) that maps each 5D input coordinate to a single volume density and view-dependent RGB color.
To render this neural radiance field from a particular viewpoint, we use classical volume rendering techniques, which are naturally differentiable. This lets us optimize the representation by gradient descent on the residual between each observed image and the corresponding view rendered from our representation.
3 Neural Radiance Field Scene Representation
We represent a continuous scene as a 5D vector-valued function whose input is a 3D location x = (x,y,z) and 2D viewing direction (θ,φ), and whose output is an emitted color c = (r,g,b) and volume density σ. In practice we approximate this continuous 5D scene representation with an MLP network FΘ: (x, d) → (c, σ) and optimize its weights Θ.
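As a concrete illustration, the mapping FΘ: (x, d) → (c, σ) can be sketched as a small fully-connected network. This is a minimal, untrained stand-in in numpy; the layer sizes, random weights, and output activations here are illustrative assumptions, not the paper's actual architecture (which is deeper and wider, with skip connections):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative stand-in for F_Theta: (x, d) -> (c, sigma).
# Layer sizes and weights are assumptions, not the paper's architecture.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 64)) * 0.1   # 5D input: (x, y, z, theta, phi)
W2 = rng.normal(size=(64, 4)) * 0.1   # 4D output: (r, g, b, sigma)

def F_theta(x, d):
    """Map a 3D location and 2D viewing direction to (color, density)."""
    inp = np.concatenate([x, d])           # the single continuous 5D coordinate
    h = relu(inp @ W1)
    out = h @ W2
    c = 1.0 / (1.0 + np.exp(-out[:3]))     # RGB squashed to [0, 1] via sigmoid
    sigma = np.maximum(out[3], 0.0)        # volume density is non-negative
    return c, sigma

c, sigma = F_theta(np.array([0.1, 0.2, 0.3]), np.array([0.0, 1.0]))
```

The key structural point the sketch captures is that a single network evaluation returns both the view-dependent color and the view-independent-in-principle density for one 5D coordinate.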
4 Volume Rendering with Radiance Fields
The expected color C(r) of a camera ray r(t) = o + t d, with near and far bounds t_n and t_f, is:
C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt    (1)
where the function T(t) denotes the accumulated transmittance along the ray:
T(t) = exp(−∫_{t_n}^{t} σ(r(s)) ds)    (2)
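In practice the integrals in Eqs. (1)–(2) are estimated from a finite set of samples along each ray using an alpha-compositing quadrature rule. The sketch below, in numpy, shows that estimator for a single ray, assuming per-sample densities and colors have already been queried from the network; treating one (σ, c) pair as constant over each interval is the standard discretization, though details may differ from the paper's exact formulation:

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Estimate C(r) from samples along a ray via the quadrature form of
    Eqs. (1)-(2): alpha-composite one (sigma, color) pair per interval."""
    deltas = np.diff(t_vals)                 # distances between adjacent samples
    sigmas = sigmas[:-1]                     # one density per interval
    colors = colors[:-1]                     # one color per interval
    alphas = 1.0 - np.exp(-sigmas * deltas)  # opacity contributed by each interval
    # Accumulated transmittance T_i = prod_{j<i} (1 - alpha_j), cf. Eq. (2)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                 # discrete analogue of T(t) sigma(t) dt
    return (weights[:, None] * colors).sum(axis=0)
```

Two sanity checks follow directly from the equations: a ray through empty space (σ = 0 everywhere) composites to black, and a fully opaque first interval returns that interval's color because all later transmittance is zero.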
5 Optimizing a Neural Radiance Field
We introduce two improvements: (1) a positional encoding of the input coordinates that assists the MLP in representing higher frequency functions, and (2) a hierarchical sampling procedure.
5.1 Positional Encoding
We reformulate FΘ as a composition FΘ = F′Θ ∘ γ of an encoding function γ and a regular MLP F′Θ, where:
γ(p) = (sin(2^0 πp), cos(2^0 πp), …, sin(2^{L−1} πp), cos(2^{L−1} πp))    (4)
This function γ(·) is applied separately to each of the three coordinate values in x (with L = 10) and to the three components of the Cartesian viewing direction unit vector d (with L = 4).
[Figure: network architecture]
5.2 Hierarchical Volume Sampling
We simultaneously optimize two networks: one "coarse" and one "fine". We sample Nc = 64 locations using stratified sampling, and Nf = 128 additional locations from the coarse PDF.
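Drawing the Nf fine samples from the coarse pass's PDF amounts to inverse transform sampling over a piecewise-constant distribution. The sketch below, in numpy, shows that step in isolation, assuming the coarse network's per-interval weights are given; in the paper the fine network is then evaluated at the union of coarse and fine samples, which is omitted here:

```python
import numpy as np

def sample_pdf(bins, weights, n_samples, rng):
    """Inverse transform sampling from the piecewise-constant PDF defined
    by per-bin weights from the coarse pass. `bins` holds the n+1 edges
    of n intervals along the ray."""
    pdf = weights / weights.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_samples)                       # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side="right") - 1       # bin containing each u
    idx = np.clip(idx, 0, len(weights) - 1)
    # Linearly interpolate within the selected bin (guard empty bins)
    width = cdf[idx + 1] - cdf[idx]
    frac = (u - cdf[idx]) / np.where(width > 0, width, 1.0)
    return bins[idx] + frac * (bins[idx + 1] - bins[idx])

rng = np.random.default_rng(0)
bins = np.linspace(2.0, 6.0, 5)            # near/far bounds split into 4 bins
weights = np.array([0.0, 0.0, 1.0, 0.0])   # coarse pass puts all mass in one bin
fine_t = sample_pdf(bins, weights, 128, rng)
```

Because the weights concentrate all probability mass in the third interval, every fine sample lands inside it, which is exactly the behavior that biases the fine pass toward visible content.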
5.3 Implementation Details
Our loss is the total squared error between rendered and true pixel colors for both the coarse and fine renderings:
L = Σ_{r∈R} [ ‖Ĉ_c(r) − C(r)‖² + ‖Ĉ_f(r) − C(r)‖² ]    (6)
We use a batch size of 4096 rays. We use the Adam optimizer with a learning rate that begins at 5×10⁻⁴ and decays exponentially to 5×10⁻⁵.
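The loss of Eq. (6) and the learning-rate schedule can be sketched directly. In the numpy sketch below, `total_steps` and the exact exponential decay shape are assumptions for illustration; the text only fixes the endpoints 5×10⁻⁴ and 5×10⁻⁵:

```python
import numpy as np

def nerf_loss(c_coarse, c_fine, c_true):
    """Total squared error of Eq. (6): both the coarse and fine renderings
    are penalized against the true pixel colors over a batch of rays."""
    return np.sum((c_coarse - c_true) ** 2) + np.sum((c_fine - c_true) ** 2)

def learning_rate(step, total_steps, lr_start=5e-4, lr_end=5e-5):
    """Exponential decay from lr_start to lr_end over training.
    (Assumed schedule shape; the text fixes only the two endpoints.)"""
    return lr_start * (lr_end / lr_start) ** (step / total_steps)
```

Note that the coarse loss term is not merely a byproduct: it is what trains the coarse network whose weight distribution drives the hierarchical sampling of Section 5.2.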
6 Results
Average PSNR on the synthetic dataset reaches 31.01 dB, exceeding the next best method by 1.5 dB.
[Figure: PSNR comparison]
7 Conclusion
We have proposed an effective new way to represent scenes as neural radiance fields, producing better renderings than discretized voxel approaches.