As someone who knows a fair amount about this topic, I don't really see the surprise here. Let's ignore the title (neural networks don't struggle, they just require a sufficiently large network) and get to the heart of the article, which is that neural networks need to be a lot larger than the most efficient solution in order to work.
What did people expect? That you could use gradient descent to find the optimal solution from a random starting point? It would have been very surprising to me if that were the case. The reason neural networks work is that they have so many solutions that are relatively close in parameter space (due to the high dimensionality) that they can find a local optimum which is decent for most data.
To me, the surprising finding of the lottery ticket hypothesis was not that you can show that only a tiny part of the final network is necessary for the final solution, but that if you had started with only that tiny part, you would recover almost the full performance. I.e. the initialisation is really important, perhaps more important than your architectural choices.
The Game of Life article merely confirms the finding from the lottery ticket hypothesis paper, applied to the Game of Life. I am surprised it picked up so much attention; maybe the Game of Life is just a sexy topic to illustrate the issue?
Over the past ten years, neural networks have been extremely successful on a range of problems that previously seemed impossibly difficult. One downside, though, is that people have forgotten the No Free Lunch theorem and now see neural networks as a silver bullet to use everywhere and always.
NFL shows that no learning method works across the space of all possible problems. Every method bakes in a set of empirical priors. To the extent that the problem conforms to those priors, the method does well. The flip side is that you must "pay the cost" by underperforming on other problems that contradict those priors.
To be fair, oftentimes the areas you underperform in are so pathological that they're hardly ever encountered in the real world. Many regression-type problems rely on the empirical prior that arbitrarily close real numbers are more similar than random. For example, you expect the answer at 2.84394 to be closer to the answer at 2.84395 than to the answer at 150,000. In actual physical reality, this prior tends to hold up so well that we're barely even aware that we rely on it. But when it does break (like trying to treat Hamming distances as real distances), it breaks spectacularly. There's no free lunch, but sometimes we can pay the bill with cheap currency.
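For a concrete sense of how the Hamming case violates that prior: integers that are numerically adjacent can differ in many bits, while numerically distant ones can differ in a single bit. A tiny illustrative sketch (my example, not from the comment above):

```python
# The real-distance prior says numerically close inputs behave similarly.
# Hamming distance on bit patterns violates it: adjacent integers can
# differ in many bits, while distant ones can differ in a single bit.

def hamming(a: int, b: int) -> int:
    """Hamming distance between the binary representations of a and b."""
    return bin(a ^ b).count("1")

print(hamming(7, 8))   # 0111 vs 1000: 4 bits differ, despite |7-8| = 1
print(hamming(7, 15))  # 0111 vs 1111: 1 bit differs, despite |7-15| = 8
```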
The point of all this is that you shouldn't pick learning methods without understanding the empirical priors embedded in your data. Just because a technique works really well on one hard problem doesn't mean it works really well on all problems, especially if they're two very different types of problem.
The no-free-lunch theorem is interesting in theory but not terribly relevant in practice. A prior that the target can be efficiently approximated will let you learn anything you could realistically want to learn, and even stronger assumptions hold for most real-world problems.
NFL is extremely important when you're actually trying to solve specific problems. If the ideal solution involves some output randomness, then neural networks on their own are useless. Poker, for example, is a terrible fit.
It applies to humans too. We are terrible at generating truly random numbers, which are required for mixed game strategies.
BTW, the NFL theorem proves that it itself is never terribly important: a learning algorithm that uses the NFL theorem to make decisions is in the same position as any other learning algorithm.
Sort of, the universe isn’t completely random so we have metadata we can apply to relevant problems. It’s the same reason that in theory compression algorithms fail, but in practice that’s not an issue.
Though for arbitrary theoretical problems it’s important.
I agree. I am utterly unsurprised that gradient descent was unable to optimize a “minimal” network to a complete solution of a binary state problem with a Boolean rule set.
I’d have been shocked if it went any other way.
I’m also reminded of an ‘06 paper by Kreinovich and Shpak, Aggregability is NP-Hard.
Neural networks aren’t magic, and optimization isn’t finding the one needle in a haystack with a blindfold on. This is an intuitive result for anyone mildly conversant in the space.
I don’t mind that they went out and did it, though.
Maybe I'm just not conversant enough in the field (I do ML "ops", not ML), but I'm surprised there aren't clever iterative ways of "reducing" the size of a model: for example, rank the average activation of each parameter in a reducible group across the dataset, find the weakest, set it to zero, identify the new misses, retrain with enrichment on those failed samples, and reduce the size hyperparameter by one.
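What you're describing is close to what the literature calls iterative pruning. A toy sketch of the prune-and-retrain loop in pure Python (the retraining step is elided as a comment, and I rank by weight magnitude for simplicity; the activation-based ranking you suggest would just use a different score):

```python
import random

def prune_weakest(weights, mask):
    """One step of iterative pruning: zero the smallest-magnitude weight
    that is still active. (Ranking by average activation, as suggested
    above, would work the same way with a different score.)"""
    active = [i for i, m in enumerate(mask) if m]
    victim = min(active, key=lambda i: abs(weights[i]))
    mask[victim] = False
    weights[victim] = 0.0

# Toy stand-in for one "reducible group" of parameters.
random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(8)]
mask = [True] * 8

for _ in range(3):
    prune_weakest(weights, mask)
    # ... here you would retrain, oversampling the examples the pruned
    # model now gets wrong, before pruning the next weight ...

print(sum(mask))  # 5 of the 8 weights remain active
```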
From a very high-level perspective, I think what is somewhat surprising here is this: given that the Game of Life rules are quite simple, it should follow that a powerful generalization mechanism (which is what deep neural networks are often described as) would be able to derive them easily.
I see it as a practical yet fundamental result. Some people might dismiss it at first because it is around a "Mathematical Recreations" topic, but the problem of "learn the rules from the data" is a common one.
I've worked with autoencoder, convolutional and LSTM methods on text that appear to solve problems they shouldn't be able to solve; that is, they don't contain a correct structural representation of the problem space, or they violate the laws of computer science.
For instance, a neural network might do a fixed number of calculations to guess at a good next move and play a mean game of chess. This is an O(1) algorithm that can't possibly do things that would take an O(log N), O(N), O(N^2), ... algorithm to do (e.g. explore the game tree exhaustively and prove that a move is winning). Put it in the context of a variable-runtime algorithm such as MCTS, which can iterate over the network multiple times, and it will crush a grandmaster.
Thus you run into the space of the halting problem, something that pretends to sort a list in O(N) time and gets away with it, ...
People in the field work with poorly defined problems, pick metrics ungrounded in practice, etc. In particular there is no requirement that the results of a neural network be 'correct' they just have to be better than some alternative.
If you have one of these things either working or almost working, you might get curious and try to set up some little experiment based on an intuition at the microscale (e.g. a case where it's obvious how to hand-build the network, or a task the network couldn't possibly learn) and do experiments like this one, where you can synthesize unlimited amounts of data...
... frequently you get lost at sea, suspect that the emperor has no clothes, wonder if you got the math wrong somewhere, might find your system performs worse the more training data you throw at it...
... and thus these results don't close into a clear story, and they don't get published.
It depends what you mean by "easy" here, but if you define that as "throw a big enough network at it, and it'll figure it out on its own in training", the research cited shows that is the case.
The power you describe comes in part from network size (which is why NNs became more useful as we could size up, as computing got more powerful/cheaper/more efficient). A smaller network means less representational power. The research showed smaller networks didn't work, but larger ones did. Seems like you'd agree?
The research showed that small networks did work, but they required manual intervention during training.
I think the whole point was that a small network is capable of solving this problem, but training from scratch without intervention only rarely produced a viable solution.
From the article
- But in most cases the trained neural network did not find the optimal solution, and the performance of the network decreased even further as the number of steps increased. The result of training the neural network was largely affected by the chosen set of training examples as well as the initial parameters.
So a solution was clearly possible even in the smaller networks when trained from scratch, it just wasn't likely.
So I agree with your first point - The bigger networks did indeed "figure it out". But I don't agree at all with your second - The smaller networks weren't lacking the power to solve the problem, they were just unlikely to reach the right solution with the available training data.
I believe GoL is easily solvable by ML if the task was to find parameters of the update rules with the constraint that the rules are simple.
For example, if we were presented a highly sophisticated repetitive pattern and were tasked to figure the underlying tesselation rules that generate it, we would enumerate rulesets that we know already and then try to find parameters for them with the assumption that those parameters are simple.
Again, my point is that to solve GoL, the ML model needs to be given two things: 1) a few rulesets to choose from; 2) an upper limit on complexity of ruleset parameters.
> To me, the surprising finding of the lottery ticket hypothesis was not that you can show that only a tiny part of the final network is necessary for the final solution, but that if you had started with only that tiny part, you would recover almost the full performance.
As a developer who's dabbled in a few machine learning fields, I've always wondered if there was a way to strip all unused/unlikely paths out of a network to reduce size and increase performance. I've struggled with finding solid information on this (and just the whole ML tooling ecosystem in general). From your experience, are there any tools out there that can aid in this kind of analysis and path stripping?
> wondered if there was a way to strip all unused/unlikely paths out of a network to reduce size and increase performance.
Sure there is, it is called "pruning". You can easily find works on this with your favorite search engine. "Distillation" is another interesting technique to make models smaller.
These techniques are applied after training, to reduce the size of a big network. I don't know any method that can be applied _before_ training, just by inspecting the network.
> As someone who knows a fair amount about this topic, I don't really see the surprise here.
I'm not sure it was necessarily meant to be surprising, especially to people experienced with neural networks. (In fact, it looks to me like the article is aimed at people who are curious but do not have a deep understanding.)
So I only dabble, but what was surprising for me with GoL on a small grid was how the performance wasn't great even if I'd shown the network almost every possible state during training (omitting a test set).
Nets do such amazing things with open ended recognition problems ("what is in this picture") that you sort of just expect them to trivially learn closed form solutions perfectly. I know this is a fallacy, but that is where the surprise came for me.
You need a more specific question; they did discover and use steam engines.
But you probably have a particular use in mind. And once you realize that, you can ask "would steam engines have been helpful to the Greeks in this context?"
This seems closely related to the problem of getting GPT-2 to play chess.
Vanilla neural networks can predict the next average move, which isn't necessarily the best move. If the model sees a lot of chess games, it will learn a lot of opening moves. But it'll get hopelessly lost within 13 moves.
What's needed is a formulation of self-play, i.e. to apply the AlphaZero / MuZero algorithm. With Game of Life, the game is: given the current board state, predict the next board state. You can have a simple rule like "start with 100 lives; every misprediction subtracts one life."
Since AlphaZero / MuZero plans with respect to a tree search, self-play should rapidly improve for Game of Life.
The crucial difference here is that you're training on a simulation, not examples. It's like the difference between walking around the real world vs seeing random photos of random places.
Honestly, turning a supervised learning problem into a reinforcement learning problem is an extremely wasteful way to burn a lot of electricity.
Supervised learning is orders of magnitude more efficient and better understood than reinforcement learning.
Reinforcement learning is an algorithm that can help you find unknown solutions to interact in an optimal way with a system. To predict the next board state, you know the solution already. So, this is a supervised learning problem.
If that were true, there wouldn't be any reason to write this submission's article. But the article points out that it's not merely a supervised learning problem; the number of board states for a 768x768 board is 768x768xN, where N is the number of time steps. How would you gather labels for every possible board state? It would require 768 factorial labels to represent every possibility.
Given a set of faces, mask off random parts of the face, then try to fill in the holes.
Neural nets excel at that kind of task, because as you say, it's a supervised learning problem: you "know the solution" (the missing face) and you can penalize / reward the model so that it learns a correct face on average.
But game of life is nothing like that. Every board is a valid board. You can set the board to whatever values you want, and then say "Go!" and game of life will happily simulate it.
Therefore the neural network's goal isn't to learn the board state; it's to learn the rules, and to obey the rules. Hence, self-play seems like a reasonable formulation.
Without self-play, your learned network will always overfit or underfit different scenarios. E.g. even if it does well in general, a sparse board might throw it off. Or a board with a bunch of pieces in one corner. Or a board in a checkerboard pattern. But self-play will automatically handle all of these cases, because it's forced to learn the patterns over time.
Minesweeper is another example: it's possible to win every game using nothing but logic (disregarding tiebreakers), so therefore the solution is known. But I don't think it can be formulated as a supervised learning problem. There are too many board states, and the board states are related to each other via logic ("am I obeying the game state?") rather than atoms ("what does a human face look like?")
I can try to address some of your arguments one by one:
> How would you gather labels for every possible board state? It would require 768 factorial labels to represent every possibility.
You don't need to represent every possibility, just a sufficiently large dataset and a neural network with sufficient priors and size. Note that e.g. video-continuation networks already exist and deal with comparable data sizes.
> But game of life is nothing like that. Every board is a valid board.
Yes, but given the previous board (or the board N steps ago) there is only one valid board. Hence, a supervised learning problem.
> Without self-play, your learned network will always overfit or underfit different scenarios.
This confuses me. Why do you think self-play avoids over- and underfitting? How could one approach even solve both over- and underfitting at the same time?
> Minesweeper is another example: it's possible to win every game using nothing but logic (disregarding tiebreakers), so therefore the solution is known. But I don't think it can be formulated as a supervised learning problem.
Sure it can. If the solution is known, just train a neural network to mimic the solution. That's what network distillation is all about.
So... I'm no AI expert, but I play some StarCraft 2, and I'm curious about how exactly Alpha Zero learns and why it's not applicable to some problems.
Contrary to go or chess, StarCraft includes asymmetric matchups (three very, very distinct races, so in 1v1 - six possible matchups). Every matchup is played very differently; one trivial example: if you'd try to play a Protoss vs Zerg match like you'd play a Protoss vs Protoss, you're going to lose really, really badly within minutes: the standard PvP opener (wall-off at the main base) is going to leave your natural expansion completely defenceless against a trivial early attack (mass zerglings), which snowballs into a very, very bad economical disadvantage.
So basically when AlphaStar learns, it absolutely must be really good at handling these kinds of asymmetric scenarios (both players/actors having a vastly different arsenal at their disposal, starting from t=0), otherwise it would lose every single non-mirror matchup (while it's actually beating many grandmasters).
Another peculiarity of StarCraft: Zerg, for example, is the race that normally opts to play the very early game a bit blindly (a late scout with a slow Overlord vs an earlier worker scout), but it must still make some strategic choices, the most common (and optimal) being greed, which occasionally means losing to aggression or missing a chance to punish a greedy opponent. Humans here kind of rely on intuition, the metagame, etc.
My question is, can the capabilities to handle the asymmetry, and to play blindly, be exploited by re-modeling a class of problems, such that they become "StarCraft-like" enough? E.g. (naively) if the Game of Life board is not "correct enough" by iteration N, the opponent will flood your base with zerglings ;)
Some people in the comments are missing the point: a neural network can be used to model the rules of the Game of Life (and the researchers proved it by finding a set of weights that does just that).
But, starting from a random position in weight space, gradient descent fails to find this optimum and requires a considerably larger network to succeed.
This suggests that the value of the initial weights is crucial (which is in line with the lottery ticket hypothesis).
I think that's true, but the headline implies that there's something special about the Game of Life, when really that's just a handy, widely-known example of a problem to learn. I don't think there's anything particularly unique about the GoL in this regard.
Off the cuff thought on this (and I'll confess I've only skimmed the article).
The Game of Life is interesting to humans for the same reason it's hard for simple NNs to predict it: we find it hard to intuitively predict the output of GoL patterns without tons of exposure, short of running the rules through in our heads like very inefficient computers.
Less exciting, more accurate title for the article: "Some topics, like the Game of Life, can be modeled with a small neural network if you get lucky, but normally require a large one".
An example of why I am not the one writing article headlines.
I am trying to understand why they approached this ML problem this way, as a conv net. Conway's Game of Life is in itself a convolution: the state of a cell is calculated from its 3x3 neighborhood (itself and its 8 neighbors). When these authors made a neural net to learn these 'rules', they were trying to learn a simple convolution kernel from a bunch of macro states. They used a 32x32 grid board and 1 million training examples. It seems from their paper that each example consisted of a pair of states in the Game of Life: the before state and the after state. But what were the labels? Was the 'after' state being used as a label for the 'before' state?
When the problem is stated this way, it seems more akin to predicting the next word in a sentence based on the present word. Wouldn't an LSTM or perhaps a GPT-3 architecture be better suited?
As you can see by the stupidity of my question, I am an ML amateur and neophyte.
To be completely naive: how would one approach the problem of designing an ML algorithm to 'learn a convolution kernel'? For example, suppose I had a million image pairs generated with the 3x3 'blur' kernel
{(.0625,.125,.0625),(.125,.25,.125),(.0625,.125,.0625)}
what architecture would be best to feed this data into?
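For that specific setup, a single 3x3 convolution with no nonlinearity is the natural architecture, and because the mapping is linear, plain gradient descent recovers the kernel. A self-contained sketch with synthetic data (no ML framework; the image size, learning rate and step count are illustrative choices of mine):

```python
import random

random.seed(0)

# The 3x3 blur kernel from the comment above; pretend we don't know it.
TRUE_K = [[.0625, .125, .0625],
          [.125,  .25,  .125],
          [.0625, .125, .0625]]

def convolve_valid(img, k):
    """'Valid'-mode 2D correlation of img with a 3x3 kernel k."""
    h, w = len(img), len(img[0])
    return [[sum(k[a][b] * img[i + a][j + b]
                 for a in range(3) for b in range(3))
             for j in range(w - 2)] for i in range(h - 2)]

# One synthetic training pair: a random image and its blurred version.
img = [[random.random() for _ in range(12)] for _ in range(12)]
target = convolve_valid(img, TRUE_K)
n = len(target) * len(target[0])

# Learn the kernel by gradient descent on mean squared error.
k = [[0.0] * 3 for _ in range(3)]
for _ in range(2000):
    pred = convolve_valid(img, k)
    grad = [[0.0] * 3 for _ in range(3)]
    for i in range(len(target)):
        for j in range(len(target[0])):
            err = pred[i][j] - target[i][j]
            for a in range(3):
                for b in range(3):
                    grad[a][b] += err * img[i + a][j + b] / n
    for a in range(3):
        for b in range(3):
            k[a][b] -= 0.5 * grad[a][b]

max_err = max(abs(k[a][b] - TRUE_K[a][b])
              for a in range(3) for b in range(3))
print(max_err)  # small: the kernel is recovered to high precision
```

The key point is that with one linear layer the loss surface is convex, so there is none of the "lottery ticket" trouble the article describes; that trouble appears once you stack nonlinear layers.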
Isn’t this “finding” (the fact that simple rulesets yield complex behavior that is hard to predict and/or derive using all kinds of models) the point that Dr. Stephen Wolfram has been making for years?
> the fact that simple rulesets yield complex behavior that is hard to predict and/or derive using all kinds of models
I don't think this is the paper's "finding" at all. Sure, the Game of Life shows that simple systems can produce complex, emergent behavior. But, precisely because it is a simple set of rules, it is trivial to write a Game of Life implementation (or, following this paper's framing, to solve the problem of predicting the nth next state given the current state).
The finding is rather that neural networks, despite their universality and predictive power, have a hard time learning parameters that solve this trivial task with a minimal architecture, so they require a larger number of parameters or a very good initial set of weights (a "lottery ticket").
This seems very unsurprising to me. Neural networks are good at being "good enough" at "soft" problems where the answer isn't cut and dried even for humans. Strict logical rules with one right answer are not a good domain to attack probabilistically.
It's not about neural networks struggling with the game of life. It is much more subtle than that.
It's that the training algorithms are not able to reach the optimal network even if such network has already been found by hand.
But, it's also a great opportunity. The Game of Life is a simple enough algorithm that we can use it to fine tune NN training algorithms until they can start training with much better initial values for the weights.
The fact of the matter is that bigger networks just often do a lot better. Much bigger networks, much more so. So I think, until some group of geniuses comes up with some radical optimization algorithm that rapidly trains massive sets of matrices, we're in an era of hardware-/bandwidth-/electricity-limited improvements.
I kinda think that's why LightOn is so interesting; they're trying to do neural networks by casting the matrix multiplies as doing optical stuff with photons (if I understand correctly, some kind of compressed sensing training method involving scattering randomized apertures for light).
A bit unrelated, but this reminded me of a question I had some months ago:
Let's say you train a bunch of ML models to solve a problem. These models are m_1, m_2, ..., m_n and are ordered according to their performance on the test/validation set, where m_1 is the best model.
Should we expect to see regression to the mean in their prediction scores once we let them do their thing on fresh data?
Yes, because your validation set performance is imperfectly correlated with performance on more samples (because it's of finite size), so it will tend to overestimate within-distribution. There's also the problem that you probably want to deploy it on data which was not exactly collected the same way. So you have issues of both internal and external validity.
Depends on how you do your validation, basically. It's easier to give an example if we assume m_1 is your worst model and m_n is your best: if you do a train/eval split, train m_1, eval m_1, make some tweaks to m_1 to improve that eval score and call the tweaked model m_2, etc., you'll implicitly be overfitting your better models to the eval data even though you're never actually training on it. In which case, yes, it is quite possible that given some new unrelated data, m_n will do no better than m_1 or m_2 or whatever.
If you do a train/eval/test data split instead, train+eval m_1 through m_n like before, and then rank them based on performance on the test data only after model selection is finished, you won't have that implicit overfitting to the test data, and m_n stands a better chance of beating m_1 on new datasets.
There's another way things can go wrong - if your train/eval/test data doesn't capture the full range of variability of the data source that will be producing your fresh datasets, then even if you do the train/eval/test split correctly, it's quite possible that m_n will do no better than m_1 when operating on fresh data outside of the train/eval/test distribution.
Basically, it is possible to build a sequence of models that actually improve performance on new data, but there are also lots of ways to get it wrong and produce models that look like they're doing better in a controlled environment, but fail to do better in the wild.
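The first failure mode (implicitly overfitting model selection to the eval set) is easy to see in a toy simulation where every "model" is pure noise; all the numbers here are illustrative choices of mine:

```python
import random

random.seed(0)

EVAL_N, TEST_N, MODELS = 20, 1000, 50

eval_labels = [random.randint(0, 1) for _ in range(EVAL_N)]
test_labels = [random.randint(0, 1) for _ in range(TEST_N)]

# Every "model" is pure noise: fixed random predictions, no real skill.
eval_preds = [[random.randint(0, 1) for _ in range(EVAL_N)] for _ in range(MODELS)]
test_preds = [[random.randint(0, 1) for _ in range(TEST_N)] for _ in range(MODELS)]

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

eval_scores = [accuracy(p, eval_labels) for p in eval_preds]
best = max(range(MODELS), key=lambda i: eval_scores[i])

# Taking the max over 50 noise models on 20 eval points inflates the score;
# on fresh test data the "best" model regresses back to chance.
print("best model's eval accuracy:", eval_scores[best])
print("same model on fresh data:  ", accuracy(test_preds[best], test_labels))
```

The same mechanism, in a milder form, is what punishes the iterative tweak-and-reevaluate workflow described above.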
IMHO, GoL is too simple (as Einstein might have said). The machinery GoL can produce is "curious", but overly complicated. IOW, GoL is not complex enough to produce results that can be reasoned about usefully/meaningfully.
They used CNNs in this experiment to learn how far a NN could predict GoL into the future. It can't be done, for GoL is not finite.
I mean, deciding whether or not an arbitrary starting configuration loops back on itself is undecidable. So some problems are out of the reach of a finite sized network.
Do researchers believe that the lottery hypothesis would also apply to the ML algorithm for driverless cars?
E.g. would it be possible for some programmer at Tesla to accidentally find the initial state variables that converges to a fully functioning self-driving algorithm years in advance of the algorithm working through data acquisition?
So the lottery is so unlikely it is not worth thinking about? Are there bounds on the possibilities? Can’t you just check a lot of values and hope for the best?
Umm, on the contrary. It's worth thinking about, but not to directly solve it, but to help the training process converge faster. So if we can increase the chances by many orders of magnitude, then it'll help a lot with training.
Basically it's still an open question of exactly how and why the training process can get stuck. It's almost as if the network gets sufficiently strong ideas (Bayesian priors) it starts to discard new evidence (the error signal doesn't help). That's probably why a maximum entropy (random) state is useful to "bootstrap" those efficient subnetworks that then learn fast.
In fairness, most humans would probably struggle with determining how the Game of Life works, probably because they may fail to grasp that it only depends on cell neighborhood interaction rules.