Since neural networks are universal, we can often solve problems we don't know exactly how to solve with neural networks. NeRF is a great example. But once solved, we should try to reverse engineer the solution and optimize.
https://arxiv.org/abs/2112.05131 (Plenoxels) did such work and found you can get NeRF quality without any neural network at all, and as a result ~100x faster.
Except the plenoxel representation is also 2-3 orders of magnitude larger than an MLP-based NeRF. It's not very surprising that a sparse voxel representation can capture a plenoptic function. Representing volumetric video would further compound the size disadvantage of voxel-based techniques.
The reasons for using deep-learning function approximators are manifold; for instance, in RL, state and state-action spaces quickly become too large for tabular methods. Using grids or tables also basically closes off the opportunities for exploiting meta-learning and analysis-by-synthesis.
Plenoxels also rely on explicit specification of a known grid structure, whereas the HyperNeRF method can learn latent parametric manifolds and handle dynamic objects with changing topologies.
An MLP-based NeRF actually has a comparable number of parameters to plenoxels (it's not 2-3 orders of magnitude smaller). The original NeRF is 8 dense layers with 256 channels, and HyperNeRF adds another network of 6 dense layers with 64 channels, so roughly 7x256x256 + 5x64x64 weights. And remember that the voxel grid is sparse, though we don't get exact numbers here. We shouldn't be miserly with our megabytes in 2021. What concerns me is that HyperNeRF requires 64 hours of training time on 4 TPU v4s; if you want to use this for communication or entertainment, that's light-years away from interactive.
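For anyone who wants to sanity-check that estimate, here is the arithmetic as a quick Python sketch (biases, the skip connection, and the input/output layers are ignored, which is what the 7x and 5x factors approximate):

```python
# Rough weight counts for the two MLPs described above.
# 8 dense layers at 256 channels -> 7 interior 256x256 weight matrices;
# 6 dense layers at 64 channels  -> 5 interior 64x64 weight matrices.
nerf_mlp = 7 * 256 * 256
deform_mlp = 5 * 64 * 64
total = nerf_mlp + deform_mlp
print(total)  # 479232 weights, i.e. under 2 MB at float32
```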
Extending plenoxels to support dynamic objects would be great future work.
You can’t really compare HyperNeRFs to Plenoxels because the former does so much more.
Vanilla NeRF has just under half a million weights (if we ignore the coarse network), although more recent work has shown comparable results with only four layers. To compare apples to apples, we should probably also apply the (hand-tuned) TV (total variation) regularization used in plenoxels.
Plenoxels store 28 parameters per grid point, so at a 256^3 grid size this would be just under half a billion parameters. You don't know which voxels you can prune at the outset (btw, DNNs can be pruned as well), but for real scenes they show sparsity reaching ~5% after 12.8k iterations. For a real 360-degree scene, because the voxelized volume is quite bounded, you also need to account for the 64-layer multi-sphere image required to capture the background. Basically, you need a fairly large GPU to train plenoxels on any kind of real scene.
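The grid arithmetic above as a sketch (the 28 comes from 1 density value plus 27 spherical-harmonic color coefficients per voxel, and the ~5% occupancy is the figure quoted for real scenes):

```python
params_per_voxel = 28                  # 1 density + 9 SH coefficients x 3 color channels
dense_params = 256**3 * params_per_voxel
sparse_params = dense_params * 0.05    # ~5% of voxels survive pruning on real scenes
print(dense_params)                    # 469762048 -> "just under half a billion"
print(int(sparse_params))              # ~23.5M, still ~50x a half-million-weight MLP
```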
The option to make this space-compute tradeoff is still useful.
NeRF triggered the development of a number of improved methods, though. There's DONeRF [1] that is built upon NNs and is currently faster than comparable solutions (the field evolves fast so I may be wrong).
By the way, in the plenoxel video they say "the key component is the differentiable volumetric rendering, not the neural network".
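That rendering step is worth unpacking, since NeRF and plenoxels share it: densities and colors sampled along a camera ray are alpha-composited, and every operation is differentiable regardless of whether the samples come from an MLP or a voxel grid. A minimal numpy sketch of the standard quadrature (the toy ray data is made up):

```python
import numpy as np

def composite(sigma, rgb, delta):
    """Standard NeRF-style volume-rendering quadrature along one ray.

    sigma: (N,) densities; rgb: (N, 3) colors; delta: (N,) segment lengths.
    """
    alpha = 1.0 - np.exp(-sigma * delta)                            # opacity of each segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))   # transmittance so far
    weights = trans * alpha                                         # per-sample contribution
    return (weights[:, None] * rgb).sum(axis=0)                     # composited pixel color

# toy ray: 4 samples, the 2nd one dense and red
sigma = np.array([0.0, 50.0, 0.0, 0.0])
rgb = np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 1, 1]], dtype=float)
delta = np.full(4, 0.1)
print(composite(sigma, rgb, delta))   # ~[0.993, 0, 0]: the dense red sample dominates
```

Training then just backpropagates the difference between this composited color and the ground-truth pixel into the sampled densities and colors, whatever structure stores them.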
After watching the fractal community struggle to come up with good distance estimators for years, this makes so much sense. In the end, automatic differentiation came out to be one of the most solid methods for coming up with distance estimators for an arbitrary fractal formula.
What I find most exciting about it is that a NeRF represents images as neural nets, one neural net for each image (generalised in the OP paper to image + deformations). Evaluating the net at a given pixel coordinate gives that pixel's color.
Up until now learning to replicate the input exactly was called overfitting and considered a bug, not a feature, but they showed a completely new way to wield neural nets.
An interesting detail is that they depend on a Fourier encoding of the input coordinates. A variant called SIREN instead uses `sin` as the activation function throughout the net.
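For the curious, the Fourier encoding is only a few lines: each coordinate is mapped to sines and cosines at geometrically increasing frequencies, which is what lets a small MLP represent high-frequency detail. A numpy sketch (the band count of 10 matches what NeRF uses for positions; the exact convention varies by paper):

```python
import numpy as np

def fourier_encode(x, num_bands=10):
    """Map coords of shape (..., D) to (..., 2 * num_bands * D):
    sin/cos of each coordinate at frequencies 2^0*pi ... 2^(L-1)*pi."""
    freqs = (2.0 ** np.arange(num_bands)) * np.pi      # (L,)
    angles = x[..., None] * freqs                      # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)

pt = np.array([[0.3, -0.1, 0.7]])                     # one 3D point
print(fourier_encode(pt).shape)                       # (1, 60)
```

SIREN skips this encoding entirely and instead makes every activation a `sin`, which has a similar effect on the network's spectral bias.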
Maybe neural nets will become the data compressors of tomorrow? Shoot a picture, send a neural net around. Game assets could be NeRFs.
For text compression, there is of course the famous Hutter Prize, launched in 2006: https://en.wikipedia.org/wiki/Hutter_Prize ("Prediction is the golden key that opens all locks". Compressing each byte of wikipedia text is equivalent to predicting it-- to compactly represent its knowledge is to understand it.)
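That equivalence can be made concrete: if a model assigns probability p to the symbol that actually occurs, an arithmetic coder can store it in about -log2(p) bits, so better prediction literally means a shorter file. A toy sketch with two fixed Bernoulli models of a biased bit source (the sequence and probabilities are made up):

```python
import math

def code_length_bits(bits, p_one):
    """Ideal code length of a 0/1 sequence under a fixed Bernoulli model."""
    return sum(-math.log2(p_one if b == 1 else 1.0 - p_one) for b in bits)

bits = [1] * 90 + [0] * 10                 # a heavily biased source
naive = code_length_bits(bits, 0.5)        # 100.0 bits: predicts nothing
tuned = code_length_bits(bits, 0.9)        # ~46.9 bits: matches the statistics
print(naive, tuned)
```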
Consider a "decompressor" with two rules:
1. If the first byte is 0, output the text of Wikipedia.
2. Otherwise, ignore that byte and interpret all further bytes literally.
To avoid this sort of "joke" decompressor, they evaluate entries on (size of compressed data) + (size of decompressor) in the compression competitions, last time I checked. That means we won't get a winner based on GPT3 anytime soon. 350+ GB of weights is a lot to overcome :)
Though of course, given enough data to compress, it might well be that full-on neural language models are still worth it.
> That means we won't get a winner based on GPT3 anytime soon. 350+ GB of weights is a lot to overcome :)
You're right, and therein lies the crucial difference between compression and language modelling. Models are concerned with having a good representation of both past and future distributions, while compressors only care about the past. Models support many tasks, while compressors are just for reproducing their input.
No, I don't quite agree. All compressors have a model, at least the general-purpose ones. If you're writing a self-extracting compressor for your 8-bit demo, maybe you only care about the data you have right in front of you... But for anything general purpose, from DEFLATE to PAQ, there's an implicit, fixed model of future data in there (even if we sometimes prefer to see it as an "adaptive" model).
A good illustration of this is that you can quite easily use a general purpose compression algorithm for classification, and get surprisingly good performance.
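A minimal sketch of that trick, using gzip and the normalized compression distance (the corpora and sample here are toy data invented for illustration; real uses compare a sample against a reference corpus per class):

```python
import gzip

def clen(b: bytes) -> int:
    return len(gzip.compress(b))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy reference corpora, one per "class".
english = b"the cat sat on the mat and the dog barked at the cat " * 20
pycode  = b"for i in range(10): print(i); x = x + 1  # loop body " * 20

sample = b"the dog sat on the mat and the cat barked"
label = "english" if ncd(english, sample) < ncd(pycode, sample) else "code"
print(label)
```

The intuition: appending the sample to a corpus it resembles barely increases the compressed size, because the compressor's implicit model already "predicts" it.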
I don't think it is unreasonable for both sender and receiver to have the 350GB NN. If, for example, we are talking about video conferencing and compression of video data, then only a small amount of data needs to be transmitted in real time, and a high-fidelity image can be reconstructed at each end.
> Up until now learning to replicate the input exactly was called overfitting and considered a bug, not a feature, but they showed a completely new way to wield neural nets.
Well, for that particular thing there was a predecessor of sorts in Deep Image Prior, from 2017.
That was all about overfitting a neural net on a single image, which they used to get impressive inpainting, noise-removal and super-resolution results without any training data at all (though of course it did not beat state-of-the-art training-based approaches, even then).
I had a lot of fun playing around with it when it came out. The idea is dead simple, within reach to implement yourself with no complex mathematical understanding.
> What I find most exciting about it is that a NeRF represents images as neural nets, one neural net for each image (in the OP paper generalised to image + deformations). By evaluating the net at various pixel coordinates it gives the color.
Wow. I only get the inner workings at a very basic, intuitive level, but it's really cool to see the progress of this and similar research. Congrats to the researchers.
It's awe-inspiring and even frightening at first, in the usual ways, but IMO it has a lot of long-term promise in other ways.
Spitballing: I like that this kind of result, which clearly calls into question the role or perception of physical identity, may eventually inform (or even necessitate?) the deconstruction of the physical "I" as a permission broker, and further open a many-to-many interface between the dimensions that underlie what we now think of as "self" and the true depth and variety within what we now think of as "individual humans who are not me". That opening process alone ought to be a huge jump for human development.
Right now we're each held, and holding ourselves, way too responsible for maintaining a singular subjective identity, looking at the aggregate. Not only does this compromise our outlook on others based on our subjective perception of the identity match, but it also compromises our ability to reliably consume and metabolize identity-construct-breaking information and experiences. And many of those things, when consumed without so many identity borders--so to speak--will end up being incredibly useful for individuals and group both.
I think the advent of DeepTomCruise [1] makes us rethink the solidity of identity. A majority of those watching the videos appear to believe it is the real Tom Cruise, and really, there is no longer a good way to tell whether it is or not without reference to external information.
There is no reason now that Tom Cruise even needs to exist as a real person, or needs to ever act in a movie ever again. Tom Cruise can just become an abstract concept, no longer a living object. Perhaps it is Tom Cruise himself in these videos. Perhaps the real Tom Cruise no longer exists. Perhaps the whole thing is an elaborate art project. Our certainty of its falsity is tied solely to whether we believe the story of those who claim to have created the videos. Is it easier to create fake videos of Tom Cruise or to create real videos of Tom Cruise and a fake story?
Hmm, that would go into specifics, which IMO are kind of tenuous from the start since the point of a spitball is to be open to unknowns.
So with that said, some ideas could be started around topics like 1) massive identity theft causing a re-thinking of identity 2) creativity and constraints around the moderation of physical identity and 3) technical-presentational dynamics surrounding physical presence and the moderation of identity presentation in a physical presence context.
Any one of these is a great setup for the question: How do we interpret personal identity?
And this--again just IMO--would be an amazing point at which to say, "look, if the only word-tool we can use is 'identity' to describe this crisis/opportunity, then maybe all we have is a metaphorical hammer and we have all these endless annoying nails--in the form of identity questions--to hammer down. But if we had maybe some other word-tools to use instead of 'identity', maybe this really would look more like an opportunity to move humanity one more step up the evolutionary ladder."
We already moderate our identity every day, either consciously or unconsciously. It's been studied for thousands of years. It's in books you've read, movies you've watched. It's been done for fun, for comic relief, and also it's been done to solve mind-shattering problems. But now we start to really unwind this question of physical identity, the one concrete thing we thought was so much more certain...! and things get _really_ interesting. This is a different level, where there's maybe not such a need to hide or hide from this departure from "this one idea of who I am" which is really just a mess of a complex of ideas.
> what uses do you anticipate for a more [fluid? porous? plural?] self-regard?
For one: More, and healthier, exposure to alternatives. Your identity is almost synonymous with your subjective past. To that degree, you're screwed in a lot of ways. To give a personal example, I was born into a cult. I was screwed from birth, in that way.
One of the best tools I had in removing myself from that environment was the concept of an "online identity" which could be moderated, intentionally, into whatever it needed to be to help me explore alternative perceptions of what it was I was involved in. I could even try on a non-cult identity, and write, online, from the perspective of someone who had freed themselves. And then I could consider how that felt, and reflect on what I learned. Did it kill me? No. Am I in hell now? Nope. etc.
Consider the millions of various points of identity just like that. Not just cults, no way! Am I Coke or Pepsi? eh, boring. Am I...which race am I? Is that a tricky question in the future? And from the outside, will I get better treatment from medical professionals if I can moderate my physical presentation at will? Wow so many random questions that can be asked for learning's sake.
But again, to emphasize--I love and respect the unknown. I don't have answers, only openness where I don't want to have certainty anymore, because that c-word makes it a little too hard to solve big problems, or a little too easy to avoid them. Don't leave the cult man, you'll lose all your certainty.
In the right circumstances, the NFL would spend $1E7-$1E8 on this or similar tech. It’s wild to think about how much of what we see on screens in a decade or so will be “computationally inferred”.
I'm sure there are also people who would love to computationally infer the preferred ending to an NFL game of their choice, too. Or change the ending to a movie of which they can only tolerate the first hour.
It would enable some really cool ideation and modeling, maybe even some that could be used for psychology work, or sports psychology in the case of the NFL (I'm reminded of those "imagine yourself winning" tricks).
If you have a limited number of images of the same scene, with NeRF you can generate new images from different positions and angles (novel view synthesis).
But this only works with rigid scenes: e.g. if you apply NeRF to images of a person, they cannot move between the pictures.
This is what HyperNeRF is trying to solve. If there are pictures of a person, smiling in one but not in another, then (1) this method will not fail, and (2) it looks like it will give reasonable new views/images.
For those of you who may not get this reference: it's from a popular YouTube channel, Two Minute Papers, run by Dr. Károly Zsolnai-Fehér, which recently featured OP's link: https://www.youtube.com/c/K%C3%A1rolyZsolnai/videos
The channel is excellent and I recommend subscribing to it if you like this kind of stuff.
In my opinion, everything NeRF-related gets a lot of attention because it's highly visual and thus easy to present. But there are few practical applications, and it tends to be super slow and to fail on more challenging scenes where traditional 20-year-old methods like global-penalty block matching still work reasonably well.
And for this paper in particular, I fail to see how it improves over other NeRF approaches with deformation terms, like Nerfies or D-NeRF.
Deformation fields would struggle to fundamentally change the topological type, particularly where the transformation would need to "tear" the manifold, such as turning a sphere into a donut or dividing a cell into two child cells. HyperNeRF exploits and extends level-set methods, which are rooted in Morse theory.
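The level-set trick is easy to demo in one dimension: keep a single smooth "ambient" function fixed and vary only the slicing level, and the level set changes its number of connected components without anything being torn. A toy numpy sketch (the double-well function is made up for illustration; in HyperNeRF the template NeRF plays the role of the ambient function and the learned ambient coordinates choose the slice):

```python
import numpy as np

def num_components(mask):
    """Count runs of True in a 1-D boolean mask (connected components)."""
    padded = np.concatenate([[False], mask])
    return int(np.sum(~padded[:-1] & padded[1:]))

x = np.linspace(-2, 2, 1000)
f = (x**2 - 1) ** 2                 # smooth double-well "ambient" function

# Sublevel set {x : f(x) < c} at two slicing levels:
print(num_components(f < 0.5))      # 2 components (two separate wells)
print(num_components(f < 1.5))      # 1 component (the wells have merged)
```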
The team photos double as the demo, that's neat. (Mouse over to see the depth colouring.)
I presume something along these lines will make its way into the Pixel 6 camera software, given the origin of the research and the onboard Edge TPU block.
I was wondering about this. So it _is_ in the photogrammetry space? How soon before I can use my iPhone to take 6 pictures or something and get a perfect 3D model from it?