
Could you place the work in context, and provide a simplified explanation for someone who understands math and ML, but is not familiar with the literature on normalizing flows and autoencoders? Thanks!

I tried reading, but the abstract and introductory section were a little too terse for me :-)



Sure thing. A few years ago, everyone switched their deep nets to "residual nets". Instead of building deep models like this:

  h1 = f1(x)
  h2 = f2(h1)
  h3 = f3(h2)
  h4 = f4(h3)
  y  = f5(h4)
They now build them like this:

  h1 = f1(x)  + x
  h2 = f2(h1) + h1
  h3 = f3(h2) + h2
  h4 = f4(h3) + h3
  y  = f5(h4) + h4
Where f1, f2, etc are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once.

In the last couple of years a few different groups noticed that this looks like a primitive ODE solver (Euler's method) that solves the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods.
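To make the resemblance concrete, here's a toy fixed-step Euler integrator (illustrative only, not code from the paper); note that the update line has exactly the shape of a residual block:

  # Fixed-step Euler integrator for dh/dt = f(t, h).
  def euler_solve(f, h0, t0, t1, n_steps):
      h, t = h0, t0
      dt = (t1 - t0) / n_steps
      for _ in range(n_steps):
          h = dt * f(t, h) + h   # compare: h_next = f(h_prev) + h_prev
          t = t + dt
      return h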

We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net.

Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time.
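As a rough sketch of what that looks like in practice (with a made-up dynamics function standing in for a trained net), the whole forward pass becomes a single call to an off-the-shelf adaptive solver:

  import numpy as np
  from scipy.integrate import solve_ivp

  def dynamics(t, h):
      # stand-in for a learned f(t, h)
      return np.tanh(h) * (1.0 - t)

  h0 = np.array([0.5, -1.0])              # input layer h(0)
  sol = solve_ivp(dynamics, (0.0, 1.0), h0,
                  rtol=1e-3, atol=1e-3)   # loosen these for speed at test time
  y = sol.y[:, -1]                        # output layer h(1)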


So one question about that. In

  h1 = f1(x)  + x
  h2 = f2(h1) + h1
  h3 = f3(h2) + h2
  h4 = f4(h3) + h3
  y  = f5(h4) + h4
the functions are all different. But to see it as "a primitive ODE solver", shouldn't the functions all be the same?

So if I understand correctly, you have a different take on RNNs, but not on deep residual nets in general?


> then the functions should be the same?

If we conceptually think that advancing from one neural net layer to the next one is the same as taking a time step with an ODE solver, then a bit more precise notation would be

    h1 = f(t=1,x)  + x
    h2 = f(t=2,h1) + h1
    h3 = f(t=3,h2) + h2
    h4 = f(t=4,h3) + h3
    y  = f(t=5,h4) + h4
Now you can say that the function f is always the same, but it still can give very different values for Δh when evaluated at different time points.
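In toy code (illustrative sizes and weights, not from the paper), that means one weight-shared function with the layer index fed in as an extra input:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 5))          # one shared weight matrix; input is [t, h]

    def f(t, h):
        return np.tanh(W @ np.concatenate(([t], h)))

    h = rng.normal(size=4)
    for t in (1.0, 2.0, 3.0, 4.0, 5.0):
        h = f(t, h) + h                  # same f, different Δh at each t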


I do think it's misleading to compare the method to a general feed-forward network though, for two reasons.

First, to preserve the analogy between eq. 1 and eq. 2, the thetas in eq. 2 should have their own dynamics, which should be learned.

Second, even if Equation 1 doesn't allow it, in a general feed-forward network it's possible for the state to change dimension between layers. I don't see how that could happen with the continuous model.

Neat paper, but it'd be nice if they had tied the analogy more explicitly to RNNs in the introduction.


The comparison we make is to residual networks, which I think is valid. First, we do parameterize a theta that changes with time, using a hypernet. But this is equivalent to the way sampo wrote the equations above - you can just feed time as another input to the dynamics network to get dynamics that change with time.

Second, I agree that general feedforward nets allow dimension changes, but resnets don't. This model is a drop-in replacement for resnets, but not for any feedforward net. If we gave the wrong impression somewhere, please let us know.

We didn't make the analogy with RNNs, because I don't think it fits - standard input-output RNNs have to take in part of the data with every time step, while here the data only appears at the input (depth 0) layer of the network.


You're absolutely right - sorry, somehow I managed to miss the explicit time parameter in your equation two, and didn't read carefully enough to see that you were restricting the discussion to resnets and normalising flows.

You might be able to make a better connection to RNNs by having the input data as a 'forcing' function in your ODE. But you probably need some regularity conditions on the input data to make sure the result is nicely behaved.
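As a rough sketch of what I mean (toy code, not from the paper), the observed sequence would enter the dynamics as a forcing term x(t), i.e. dh/dt = f(h) + x(t):

  import numpy as np
  from scipy.integrate import solve_ivp
  from scipy.interpolate import interp1d

  t_obs = np.array([0.0, 1.0, 2.0, 3.0])
  x_obs = np.array([0.2, -0.5, 0.1, 0.7])
  x = interp1d(t_obs, x_obs)        # piecewise-linear x(t): only C0,
                                    # hence the regularity caveat above

  def forced_dynamics(t, h):
      return np.tanh(h) + x(t)

  sol = solve_ivp(forced_dynamics, (0.0, 3.0), np.zeros(1))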


> standard input-output RNNs have to take in part of the data with every time step

Well, they can, but I don't see why they have to. And couldn't your network also take input and give output at all times?

> First, we do parameterize a theta that changes with time, using a hypernet.

Ah I see. Did you end up using it in the final model? I don't see that in the mnist example, but I could be missing it as I only skimmed the code.


Hasn't there been similar work to this in the past?

I don't see a "related work" section in your paper.


Section 7 is titled "Related Work".


Thanks,

Apologies for not looking harder


Now it's been ages since I dabbled with neural nets so this might be completely silly, but can't a change in dimension be thought of as forcing the weights to/from certain nodes to be zero?


Ah, that would stop certain dimensions from changing, but the output would still be the same size.


While technically still the same size, I think he's proposing that it's, in a sense, isomorphic to a dimension change if the fix to zero propagates throughout the remainder of the layers (until the next 'change', that is).

Or something like that.


I hadn't entirely fleshed out the idea, but yeah.

Take a simple NN with 3 layers: 5 neurons in the input layer, 3 in the hidden and 1 output.

Force the inputs to neurons 4 and 5 in the hidden layer to be zero, and force the inputs to neurons 2-5 to be zero in the output layer (and ignore their output). I'm assuming the transfer function obeys f(0) = 0; if not, fix the output to zero as well.

My thought was this would be similar to how you enforce boundary conditions when solving partial differential equations by directly setting the value of certain matrix elements before running the solver.
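In numpy terms, something like this toy sketch (sizes made up): zeroing rows of the weight matrix keeps certain units pinned at zero, so a 5-wide layer behaves like a 3-wide one.

  import numpy as np

  rng = np.random.default_rng(0)
  W = rng.normal(size=(5, 5))
  mask = np.diag([1.0, 1.0, 1.0, 0.0, 0.0])   # turn off units 4 and 5

  def layer(h):
      return np.tanh(mask @ W @ h)            # tanh obeys f(0) = 0

  h = layer(rng.normal(size=5))               # h[3] == h[4] == 0.0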

Again, may be completely silly.


They model the above as dh(t)/dt, generalizing the discrete case (the equations you wrote) to a continuous one. Check eq. 2 in the paper. The statement following the equation makes it clear that "Starting from the input layer h(0), we can define the output layer h(T) to be the solution to this ODE initial value problem at some time T". Here, as per my understanding, h(0) can be the input itself. The function f in eq. 2 plays the role of the RNN cell.


This paper explains the reasoning behind that:

https://arxiv.org/abs/1512.03385


> We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system?

What about symmetries of the underlying continuous system?

I'm under the impression that having deep nets as ODEs should make it possible to enforce a certain geometry on the information flow (like incompressible fluid, Hamiltonian, etc..) which would correspond to some invariant of the whole network.

Does this idea make sense?


My dissertation was about energy- and symplecticity-preserving methods for Hamiltonian ODEs. Try to find the book by Blanes or the one by Leimkuhler.


Off-topic: could you tell us what degree you studied? What is your academic background?


I did a CS undergrad at the University of Manitoba. Then I took some time off to do a startup, and was in the army reserves. Then I went to UBC to do an MSc in CS + Stats. My PhD was officially in the made-up-sounding subject of "Information Engineering" at Cambridge, but really I just worked on Bayesian nonparametrics the whole time. I didn't start working on deep learning until my postdoc.


Thanks!


His background is available on his CV: http://www.cs.toronto.edu/~duvenaud/


> His background is available on his CV: http://www.cs.toronto.edu/~duvenaud/

Thank you.


From a software standpoint, will any of these ideas be ported to TensorFlow or is this very different?



