Could you place the work in context, and provide a simplified explanation for someone who understands math and ML, but is not familiar with the literature on normalizing flows and autoencoders? Thanks!
I tried reading, but the abstract and introductory section were a little too terse for me :-)
A residual network repeatedly adds small corrections to its hidden state:

  x = x + f1(x)
  x = x + f2(x)
  ...

where f1, f2, etc. are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once.
In the last couple of years a few different groups noticed that this looks like a primitive ODE solver (Euler's method) that solves the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods.
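To make that connection concrete, here's a toy numpy sketch (the layer sizes and weights are made up, and a single shared layer stands in for a trained net): a residual net's layer updates and Euler steps on dh/dt = f(h) have exactly the same shape, differing only by the step size.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))  # one shared toy "layer" for the comparison

def f(h):
    # a tiny one-layer residual/dynamics function
    return np.tanh(h @ W)

def resnet(h, n_layers=10):
    # residual net: h_{t+1} = h_t + f(h_t)
    for _ in range(n_layers):
        h = h + f(h)
    return h

def euler(h, t_end=1.0, n_steps=10):
    # Euler's method for dh/dt = f(h): same structure, update scaled by dt
    dt = t_end / n_steps
    for _ in range(n_steps):
        h = h + dt * f(h)
    return h
```

With dt = 1 the two loops are literally identical, which is the observation those groups started from.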
We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net.
Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time.
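For instance, with scipy's off-the-shelf adaptive solver (an illustrative sketch with made-up weights, not the paper's code), loosening the tolerance gives the same model a cheaper solve at test time:

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))  # hypothetical dynamics-net weights

def dynamics(t, h):
    # stand-in for the learned dynamics network f(h, t, theta)
    return np.tanh(W @ h)

h0 = rng.normal(size=4)  # the network's input is the ODE's initial state

# tight tolerance: an accurate (slower) solve
tight = solve_ivp(dynamics, (0.0, 1.0), h0, rtol=1e-8, atol=1e-8)
# loose tolerance: the same model, solved more cheaply
loose = solve_ivp(dynamics, (0.0, 1.0), h0, rtol=1e-2, atol=1e-2)

print(tight.nfev, loose.nfev)  # the loose solve needs no more function evaluations
```

The final state `sol.y[:, -1]` is what you'd treat as the network's output.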
If we conceptually think that advancing from one neural net layer to the next is the same as taking a time step with an ODE solver, then a bit more precise notation would be

  h(t+1) = h(t) + f(h(t), θ(t))           (1)

  h(t + Δt) = h(t) + Δt · f(h(t), t, θ)   (2)
I do think it's misleading to compare the method to a general feed-forward network though, for two reasons.
First, to preserve the analogy between eq. 1 and 2, the thetas in equation two should have their own dynamics, which should be learned.
Second, even if Equation 1 doesn't allow it, in a general feed-forward network it's possible for the state to change dimension between layers. I don't see how that could happen with the continuous model.
Neat paper, but it'd be nice if they had tied the analogy more explicitly to RNNs in the introduction.
The comparison we make is to residual networks, which I think is valid. First, we do parameterize a theta that changes with time, using a hypernet. But this is equivalent to the way sampo wrote the equations above - you can just feed time as another input to the dynamics network to get dynamics that change with time.
Second, I agree that general feedforward nets allow dimension changes, but resnets don't. This model is a drop-in replacement for resnets, but not for any feedforward net. If we gave the wrong impression somewhere, please let us know.
We didn't make the analogy with RNNs, because I don't think it fits - standard input-output RNNs have to take in part of the data with every time step, while here the data only appears at the input (depth 0) layer of the network.
You're absolutely right - sorry, somehow I managed to miss the explicit time parameter in your equation two, and didn't read carefully enough to see that you were restricting the discussion to resnets and normalising flows.
You might be able to make a better connection to RNNs by having the input data as a 'forcing' function in your ODE. But you probably need some regularity conditions on the input data to make sure the result is nicely behaved.
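A sketch of that forcing idea (the weights, the input signal, and the interpolation scheme are all my own choices, not from the paper): interpolate the observed inputs into a continuous u(t) and add it to the hidden-state dynamics as a forcing term.

```python
import numpy as np
from scipy.integrate import solve_ivp

W = np.array([[-0.5, 0.2],
              [-0.3, -0.4]])  # hypothetical (stable-ish) hidden dynamics

t_obs = np.linspace(0.0, 5.0, 20)
u_obs = np.sin(t_obs)  # hypothetical scalar input sequence

def u(t):
    # piecewise-linear interpolation gives a continuous forcing signal --
    # one possible regularity condition on the input data
    return np.interp(t, t_obs, u_obs)

def dynamics(t, h):
    # hidden dynamics plus the input entering as a forcing term
    return np.tanh(W @ h) + np.array([u(t), 0.0])

sol = solve_ivp(dynamics, (0.0, 5.0), [0.0, 0.0], rtol=1e-6)
```

A discontinuous u(t) would make the right-hand side non-smooth, which is exactly where the regularity conditions would bite.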
Now it's been ages since I dabbled with neural nets so this might be completely silly, but can't a change in dimension be thought of as forcing the weights to/from certain nodes to be zero?
While technically still the same size, I think he's proposing that it's, in a sense, isomorphic to a dimension change if the fix-to-zero propagates through the remainder of the layers (until the next 'change', that is).
Take a simple NN with 3 layers: 5 neurons in the input layer, 3 in the hidden and 1 output.
Force the inputs to neurons 4 and 5 in the hidden layer to be zero, and force the inputs to neurons 2-5 in the output layer to be zero (and ignore their outputs). I'm assuming the transfer function obeys f(0) = 0; if not, fix the outputs to zero as well.
My thought was this would be similar to how you enforce boundary conditions when solving partial differential equations by directly setting the value of certain matrix elements before running the solver.
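A quick numpy sketch of that masking idea (toy sizes and random weights): a fixed-width 5-unit net emulating the 5 -> 3 -> 1 architecture by zeroing weight columns, with relu as the f(0) = 0 transfer function.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda x: np.maximum(x, 0.0)  # transfer function with f(0) = 0

# fixed-width (5-unit) layers emulating a 5 -> 3 -> 1 net via zero masks
W1 = rng.normal(size=(5, 5))
W2 = rng.normal(size=(5, 5))
W1[:, 3:] = 0.0  # zero all inputs to hidden units 4-5
W2[:, 1:] = 0.0  # zero all inputs to output units 2-5

x = rng.normal(size=5)
h = relu(x @ W1)  # hidden units 4-5 come out exactly zero
y = relu(h @ W2)  # output units 2-5 come out exactly zero; only y[0] is "real"
```

Because the masked columns are exactly zero and relu(0) = 0, the dead units stay at zero through every subsequent layer, just like a boundary condition pinned before running the solver.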
They model the above as dh(t)/dt, generalizing the discrete case (the equations you wrote) to the continuous case; see eq. 2 in the paper.
The statement following the equation makes it clear that
"Starting from the input layer h(0), we can define the output layer h(T) to be the solution to this
ODE initial value problem at some time T".
Here, as per my understanding, h(0) can be the input itself, and the function f in eq. 2 plays the role of the RNN cell.
> We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system?
What about symmetries of the underlying continuous system?
I'm under the impression that having deep nets as ODEs should make it possible to enforce a certain geometry on the information flow (like incompressible fluid, Hamiltonian, etc..) which would correspond to some invariant of the whole network.
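As a toy illustration of that (my own sketch, not something from the paper): if you pick the dynamics to be a Hamiltonian vector field, the flow conserves the energy H by construction, and a tight-tolerance solve preserves it numerically, so the whole "network" carries an invariant.

```python
import numpy as np
from scipy.integrate import solve_ivp

def H(q, p):
    # toy quadratic Hamiltonian (harmonic-oscillator energy)
    return 0.5 * p**2 + 0.5 * q**2

def dynamics(t, state):
    # Hamiltonian vector field: dq/dt = dH/dp, dp/dt = -dH/dq
    q, p = state
    return [p, -q]

sol = solve_ivp(dynamics, (0.0, 10.0), [1.0, 0.0], rtol=1e-10, atol=1e-10)
q, p = sol.y
print(H(q[0], p[0]), H(q[-1], p[-1]))  # energy at the start and end of the flow
```

A divergence-free field would give the incompressible-flow version of the same idea.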
I did a CS undergrad at the University of Manitoba. Then took some time off to do a startup and was in the army reserves. Then went to UBC to do an MSc in CS + Stats. My PhD was officially in the made-up-sounding subject of "Information Engineering" at Cambridge, but really I just worked on Bayesian nonparametrics the whole time. I didn't start working on deep learning until my postdoc.