As someone who knows a fair amount about this topic, I don't really see the surprise here. Let's ignore the title (neural networks don't struggle, they just require a sufficiently large network) and get to the heart of the article, which is that neural networks need to be a lot larger than the most efficient solution in order to work.
What did people expect? That you could use gradient descent to find the optimal solution from a random starting point? It would have been very surprising to me if that were the case. The reason neural networks work is that they have so many solutions that are relatively close in parameter space (due to the high dimensionality) that they can find a local optimum which is decent for most data.
To me, the surprising finding of the lottery ticket hypothesis was not that you can show that only a tiny part of the final network is necessary for the final solution, but that if you had started with only that tiny part, you would recover almost the full performance. I.e. the initialisation is really important, perhaps more important than your architectural choices.
The Game of Life article merely confirms the finding from the lottery ticket hypothesis paper, applied to the Game of Life. I am surprised it picked up so much attention; maybe the Game of Life is just a sexy topic to illustrate the issue?
Over the past ten years, neural networks have been extremely successful on a range of problems that previously seemed impossibly difficult. One downside, though, is that people have forgotten the No Free Lunch theorem and now see neural networks as a silver bullet to use everywhere and always.
NFL shows that no learning method works across the space of all possible problems. Every method bakes in a set of empirical priors. To the extent that the problem conforms to those priors, the method does well. The flip side is that you must "pay the cost" by underperforming on other problems that contradict those priors.
To be fair, oftentimes the areas you underperform in are so pathological that they're hardly ever encountered in the real world. Many regression-type problems rely on the empirical prior that arbitrarily close real numbers are more similar than random. For example, you expect the answer at 2.84394 to be closer to the answer at 2.84395 than to the answer at 150,000. In actual physical reality, this prior tends to hold up so well that we're barely even aware that we rely on it. But when it does break (like trying to treat Hamming distances as real distances), it breaks spectacularly. There's no free lunch, but sometimes we can pay the bill with cheap currency.
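For a concrete sense of how the Hamming case violates that prior: integers that are numerically adjacent can differ in many bits, while numerically distant ones can differ in a single bit. A tiny illustrative sketch (my example, not from the comment above):

```python
# The real-distance prior says numerically close inputs behave similarly.
# Hamming distance on bit patterns violates it: adjacent integers can
# differ in many bits, while distant ones can differ in a single bit.

def hamming(a: int, b: int) -> int:
    """Hamming distance between the binary representations of a and b."""
    return bin(a ^ b).count("1")

print(hamming(7, 8))   # 0111 vs 1000: 4 bits differ, despite |7-8| = 1
print(hamming(7, 15))  # 0111 vs 1111: 1 bit differs, despite |7-15| = 8
```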
The point of all this is that you shouldn't pick learning methods without understanding the empirical priors embedded in your data. Just because a technique works really well on one hard problem doesn't mean it works really well on all problems, especially if they're two very different types of problem.
The no-free-lunch theorem is interesting in theory but not terribly relevant in practice. A prior that the target can be efficiently approximated will let you learn anything you could realistically want to learn, and even stronger assumptions hold for most real-world problems.
NFL is extremely important when you're actually trying to solve specific problems. If the ideal solution involves some output randomness, then neural networks on their own are useless. Poker, for example, is a terrible fit.
It applies to humans too. We are terrible at generating truly random numbers, which are required for mixed game strategies.
BTW, the NFL theorem proves that it itself is never terribly important: a learning algorithm that uses the NFL theorem to make decisions is in the same position as any other learning algorithm.
Sort of, the universe isn’t completely random so we have metadata we can apply to relevant problems. It’s the same reason that in theory compression algorithms fail, but in practice that’s not an issue.
Though for arbitrary theoretical problems it’s important.
I agree. I am utterly unsurprised that gradient descent was unable to optimize a “minimal” network to a complete solution of a binary state problem with a Boolean rule set.
I’d have been shocked if it went any other way.
I’m also reminded of an ‘06 paper by Kreinovich and Shpak, Aggregability is NP-Hard.
Neural networks aren’t magic, and optimization isn’t finding the one needle in a haystack with a blindfold on. This is an intuitive result for anyone mildly conversant in the space.
I don’t mind that they went out and did it, though.
Maybe I'm just not conversant enough in the field (I do ML "ops", not ML), but I'm surprised there aren't clever iterative ways of "reducing" the size of a model: for example, rank the average activation of each parameter in a reducible group across the dataset, find the weakest, set it to zero, identify the new misses, retrain with enrichment on those failed samples, and reduce the size hyperparameter by one.
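What you're describing is close to what the literature calls iterative pruning. A toy sketch of the prune-and-retrain loop in pure Python (the retraining step is elided as a comment, and I rank by weight magnitude for simplicity; the activation-based ranking you suggest would just use a different score):

```python
import random

def prune_weakest(weights, mask):
    """One step of iterative pruning: zero the smallest-magnitude weight
    that is still active. (Ranking by average activation, as suggested
    above, would work the same way with a different score.)"""
    active = [i for i, m in enumerate(mask) if m]
    victim = min(active, key=lambda i: abs(weights[i]))
    mask[victim] = False
    weights[victim] = 0.0

# Toy stand-in for one "reducible group" of parameters.
random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(8)]
mask = [True] * 8

for _ in range(3):
    prune_weakest(weights, mask)
    # ... here you would retrain, oversampling the examples the pruned
    # model now gets wrong, before pruning the next weight ...

print(sum(mask))  # 5 of the 8 weights remain active
```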
From a very high-level perspective, I think what is somewhat surprising here is this: given that the Game of Life rules are quite simple, it should follow that a powerful generalization mechanism (which is what deep neural networks are often described as) would be able to derive them easily.
I see it as a practical yet fundamental result. Some people might dismiss it at first because it is around a "Mathematical Recreations" topic, but the problem of "learn the rules from the data" is a common one.
I've worked with autoencoder, convolutional and LSTM methods on text that appear to solve problems they shouldn't be able to solve; that is, they don't contain a correct structural representation of the problem space, or they violate the laws of computer science.
For instance, a neural network might do a fixed number of calculations to guess at a good next move and play a mean game of chess. This is an O(1) algorithm that can't possibly do things that would take an O(log N), O(N), O(N^2), ... algorithm to do (e.g. explore the game tree exhaustively and prove that a move is winning). Put it in the context of a variable-runtime algorithm such as MCTS, which can iterate over the network multiple times, and it will crush a grandmaster.
Thus you run into the space of the halting problem, something that pretends to sort a list in O(N) time and gets away with it, ...
People in the field work with poorly defined problems, pick metrics ungrounded in practice, etc. In particular there is no requirement that the results of a neural network be 'correct' they just have to be better than some alternative.
If you have one of these things either working or almost working, you might get curious and try to set up some little experiment based on an intuition at the microscale (e.g. a case where it's obvious how to hand-build the network, or a task the network couldn't possibly learn) and do experiments like this one, where you can synthesize unlimited amounts of data...
... frequently you get lost at sea, suspect that the emperor has no clothes, wonder if you got the math wrong somewhere, might find your system performs worse the more training data you throw at it...
... and thus these results don't close into a clear story, and they don't get published.
It depends what you mean by "easy" here, but if you define that as "throw a big enough network at it, and it'll figure it out on its own in training", the research cited shows that is the case.
The power you describe comes in part from network size (which is why NNs became more useful as we could size up, as computing got more powerful/cheaper/more efficient). A smaller network means less representational power. The research showed smaller networks didn't work, but larger ones did. Seems like you'd agree?
The research showed that small networks did work, but they required manual intervention during training.
I think the whole point was that a small network is capable of solving this problem, but training from scratch without intervention only rarely produced a viable solution.
From the article
- But in most cases the trained neural network did not find the optimal solution, and the performance of the network decreased even further as the number of steps increased. The result of training the neural network was largely affected by the chosen set of training examples as well as the initial parameters.
So a solution was clearly possible even in the smaller networks when trained from scratch, it just wasn't likely.
So I agree with your first point - The bigger networks did indeed "figure it out". But I don't agree at all with your second - The smaller networks weren't lacking the power to solve the problem, they were just unlikely to reach the right solution with the available training data.
I believe GoL is easily solvable by ML if the task was to find parameters of the update rules with the constraint that the rules are simple.
For example, if we were presented a highly sophisticated repetitive pattern and were tasked to figure the underlying tesselation rules that generate it, we would enumerate rulesets that we know already and then try to find parameters for them with the assumption that those parameters are simple.
Again, my point is that to solve GoL, the ML model needs to be given two things: 1) a few rulesets to choose from; 2) an upper limit on complexity of ruleset parameters.
> To me, the surprising finding of the lottery ticket hypothesis was not that you can show that only a tiny part of the final network is necessary for the final solution, but that if you had started with only that tiny part, you would recover almost the full performance.
As a developer who's dabbled in a few machine learning fields, I've always wondered if there was a way to strip all unused/unlikely paths out of a network to reduce size and increase performance. I've struggled with finding solid information on this (and just the whole ML tooling ecosystem in general). From your experience, are there any tools out there that can aid in this kind of analysis and path stripping?
> wondered if there was a way to strip all unused/unlikely paths out of a network to reduce size and increase performance.
Sure there is, it is called "pruning". You can easily find works on this with your favorite search engine. "Distillation" is another interesting technique to make models smaller.
These techniques are applied after training, to reduce the size of a big network. I don't know any method that can be applied _before_ training, just by inspecting the network.
> As someone who knows a fair amount about this topic, I don't really see the surprise here.
I'm not sure it was necessarily meant to be surprising, especially to people experienced with neural networks. (In fact, it looks to me like the article is aimed at people who are curious but do not have a deep understanding.)
So I only dabble, but what was surprising for me with GoL on a small grid was how the performance wasn't great even if I'd shown the network almost every possible state during training (omitting a test set).
Nets do such amazing things with open ended recognition problems ("what is in this picture") that you sort of just expect them to trivially learn closed form solutions perfectly. I know this is a fallacy, but that is where the surprise came for me.
You need a more specific question; they did discover and use steam engines.
But you probably have a particular use in mind. And once you realize that, you can ask "would steam engines have been helpful to the Greeks in this context?"
This seems closely related to the problem of getting GPT-2 to play chess.
Vanilla neural networks can predict the next average move, which isn't necessarily the best move. If the model sees a lot of chess games, it will learn a lot of opening moves. But it'll get hopelessly lost within 13 moves.
What's needed is a formulation of self-play, i.e. to apply the AlphaZero / MuZero algorithm. With Game of Life, the game is: given the current board state, predict the next board state. You can have a simple rule like "start with 100 lives; every misprediction subtracts one life."
Since AlphaZero / MuZero plans with respect to a tree search, self-play should rapidly improve for Game of Life.
The crucial difference here is that you're training on a simulation, not examples. It's like the difference between walking around the real world vs seeing random photos of random places.
Honestly, turning a supervised learning problem into a reinforcement learning problem is an extremely wasteful way to burn a lot of electricity.
Supervised learning is orders of magnitude more efficient and better understood than reinforcement learning.
Reinforcement learning is an algorithm that can help you find unknown solutions to interact in an optimal way with a system. To predict the next board state, you know the solution already. So, this is a supervised learning problem.
If that were true, there wouldn't be any reason to write this submission's article. But the article points out that it's not merely a supervised learning problem; the number of board states for a 768x768 board is 768x768xN, where N is the number of time steps. How would you gather labels for every possible board state? It would require 768 factorial labels to represent every possibility.
Given a set of faces, mask off random parts of the face, then try to fill in the holes.
Neural nets excel at that kind of task, because as you say, it's a supervised learning problem: you "know the solution" (the missing face) and you can penalize / reward the model so that it learns a correct face on average.
But game of life is nothing like that. Every board is a valid board. You can set the board to whatever values you want, and then say "Go!" and game of life will happily simulate it.
Therefore the neural network's goal isn't to learn the board state; it's to learn the rules, and to obey the rules. Hence, self-play seems like a reasonable formulation.
Without self-play, your learned network will always overfit or underfit different scenarios. E.g. even if it does well in general, a sparse board might throw it off. Or a board with a bunch of pieces in one corner. Or a board in a checkerboard pattern. But self-play will automatically handle all of these cases, because it's forced to learn the patterns over time.
Minesweeper is another example: it's possible to win every game using nothing but logic (disregarding tiebreakers), so therefore the solution is known. But I don't think it can be formulated as a supervised learning problem. There are too many board states, and the board states are related to each other via logic ("am I obeying the game state?") rather than atoms ("what does a human face look like?")
I can try to address some of your arguments one by one:
> How would you gather labels for every possible board state? It would require 768 factorial labels to represent every possibility.
You don't need to represent every possibility, just a sufficiently large dataset and a neural network with sufficient priors and size. Note that e.g. video-continuation networks already exist and deal with comparable data sizes.
> But game of life is nothing like that. Every board is a valid board.
Yes, but given the previous board (or the board N steps ago) there is only one valid board. Hence, a supervised learning problem.
> Without self-play, your learned network will always overfit or underfit different scenarios.
This confuses me. Why do you think self-play avoids over- and underfitting? How could one approach even solve both over- and underfitting at the same time?
> Minesweeper is another example: it's possible to win every game using nothing but logic (disregarding tiebreakers), so therefore the solution is known. But I don't think it can be formulated as a supervised learning problem.
Sure it can. If the solution is known, just train a neural network to mimic the solution. That's what network distillation is all about.
So... I'm no AI expert, but I play some StarCraft 2, and I'm curious about how exactly Alpha Zero learns and why it's not applicable to some problems.
Contrary to go or chess, StarCraft includes asymmetric matchups (three very, very distinct races, so in 1v1 - six possible matchups). Every matchup is played very differently; one trivial example: if you'd try to play a Protoss vs Zerg match like you'd play a Protoss vs Protoss, you're going to lose really, really badly within minutes: the standard PvP opener (wall-off at the main base) is going to leave your natural expansion completely defenceless against a trivial early attack (mass zerglings), which snowballs into a very, very bad economical disadvantage.
So basically when AlphaStar learns, it absolutely must be really good at handling these kinds of asymmetric scenarios (both players/actors having a vastly different arsenal at their disposal, starting from t=0), otherwise it would lose every single non-mirror matchup (while it's actually beating many grandmasters).
Another peculiarity of StarCraft: Zerg, for example, is the race that normally opts to play the very early game a bit blindly (a late scout with a slow Overlord vs an earlier worker scout), but it must still make some strategic choices, the most common (and optimal) being greed, which occasionally means losing to aggression or missing a chance to punish a greedy opponent. Humans here kind of rely on intuition, the metagame, etc.
My question is, can the capabilities to handle the asymmetry, and to play blindly, be exploited by re-modeling a class of problems, such that they become "StarCraft-like" enough? E.g. (naively) if the Game of Life board is not "correct enough" by iteration N, the opponent will flood your base with zerglings ;)
Some people in the comments are missing the point: a neural network can be used to model the rules of the Game of Life (and the researchers proved it by finding a set of weights that does just that).
But, starting from a random position in weight space, gradient descent fails to find this optimum and requires a considerably larger network to succeed.
This suggests that the value of the initial weights is crucial (which is in line with the lottery ticket hypothesis).
I think that's true, but the headline implies that there's something special about the Game of Life, when really that's just a handy, widely-known example of a problem to learn. I don't think there's anything particularly unique about the GoL in this regard.
Off the cuff thought on this (and I'll confess I've only skimmed the article).
The Game of Life is interesting to humans for the same reason it's hard for simple NNs to predict it: we find it hard to intuitively predict the output of GoL patterns without tons of exposure, short of running the rules through in our heads like very inefficient computers.
Less exciting, more accurate title for the article: "Some topics, like the Game of Life, can be modeled with a small neural network if you get lucky, but normally require a large one".
An example of why I am not the one writing article headlines.
I am trying to understand why they approached this ML problem this way, as a conv net. Conway's Game of Life is in itself a convolution: the state of a cell is calculated from its 3x3 neighborhood (itself and its 8 neighbors). When these authors made a neural net to learn these 'rules', they were trying to learn a simple convolution kernel from a bunch of macro states. They used a 32x32 grid board and 1 million training examples. It seems from their paper that each example consisted of a pair of states in the Game of Life: the before state and the after state. But what were the labels? Was the 'after' state being used as a label for the 'before' state?
When the problem is stated this way, it seems more akin to predicting the next word in a sentence based on the present word. Wouldn't an LSTM or perhaps a GPT-3 architecture be better suited?
As you can see by the stupidity of my question, I am an ML amateur and neophyte.
To be completely naive: how would one approach the problem of designing an ML algorithm to 'learn a convolution kernel'? For example, suppose I had a million image pairs generated with the 3x3 'blur' kernel
{(.0625,.125,.0625),(.125,.25,.125),(.0625,.125,.0625)}
what architecture would be best to feed this data into?
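For that specific setup, a single 3x3 convolution with no nonlinearity is the natural architecture, and because the mapping is linear, plain gradient descent recovers the kernel. A self-contained sketch with synthetic data (no ML framework; the image size, learning rate and step count are illustrative choices of mine):

```python
import random

random.seed(0)

# The 3x3 blur kernel from the comment above; pretend we don't know it.
TRUE_K = [[.0625, .125, .0625],
          [.125,  .25,  .125],
          [.0625, .125, .0625]]

def convolve_valid(img, k):
    """'Valid'-mode 2D correlation of img with a 3x3 kernel k."""
    h, w = len(img), len(img[0])
    return [[sum(k[a][b] * img[i + a][j + b]
                 for a in range(3) for b in range(3))
             for j in range(w - 2)] for i in range(h - 2)]

# One synthetic training pair: a random image and its blurred version.
img = [[random.random() for _ in range(12)] for _ in range(12)]
target = convolve_valid(img, TRUE_K)
n = len(target) * len(target[0])

# Learn the kernel by gradient descent on mean squared error.
k = [[0.0] * 3 for _ in range(3)]
for _ in range(2000):
    pred = convolve_valid(img, k)
    grad = [[0.0] * 3 for _ in range(3)]
    for i in range(len(target)):
        for j in range(len(target[0])):
            err = pred[i][j] - target[i][j]
            for a in range(3):
                for b in range(3):
                    grad[a][b] += err * img[i + a][j + b] / n
    for a in range(3):
        for b in range(3):
            k[a][b] -= 0.5 * grad[a][b]

max_err = max(abs(k[a][b] - TRUE_K[a][b])
              for a in range(3) for b in range(3))
print(max_err)  # small: the kernel is recovered to high precision
```

The key point is that with one linear layer the loss surface is convex, so there is none of the "lottery ticket" trouble the article describes; that trouble appears once you stack nonlinear layers.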
Isn’t this “finding” (the fact that simple rulesets yield complex behavior that is hard to predict and/or derive using all kinds of models) the point that Dr. Stephen Wolfram has been making for years?
> the fact that simple rulesets yield complex behavior that is hard to predict and/or derive using all kinds of models
I don't think this is the paper's "finding" at all. Sure, the Game of Life shows that simple systems can produce complex, emergent behavior. But, precisely because it is a simple set of rules, it is trivial to write a Game of Life implementation (or, following this paper's framing, to solve the problem of predicting the nth next state given the current state).
The finding is rather that neural networks, despite their universality and predictive power, have a hard time learning parameters that solve this trivial task with a minimal architecture, so they require a larger number of parameters or a very good initial set of weights (a "lottery ticket").
This seems very unsurprising to me. Neural networks are good at being "good enough" at "soft" problems where the answer isn't cut and dried even for humans. Strict logical rules with one right answer are not a good domain to attack probabilistically.
It's not about neural networks struggling with the game of life. It is much more subtle than that.
It's that the training algorithms are not able to reach the optimal network even if such network has already been found by hand.
But, it's also a great opportunity. The Game of Life is a simple enough algorithm that we can use it to fine tune NN training algorithms until they can start training with much better initial values for the weights.
The fact of the matter is that bigger networks just often do a lot better. Much bigger networks, much more so. So I think, until some group of geniuses comes up with some radical optimization algorithm that rapidly trains massive sets of matrices, we're in an era of hardware-/bandwidth-/electricity-limited improvements.
I kinda think that's why LightOn is so interesting; they're trying to do neural networks by casting the matrix multiplies as doing optical stuff with photons (if I understand correctly, some kind of compressed sensing training method involving scattering randomized apertures for light).
A bit unrelated, but this reminded me of a question I had some months ago:
Let's say you train a bunch of ML models to solve a problem. These models are m_1, m_2, ..., m_n and are ordered according to their performance on the test/validation set, where m_1 is the best model.
Should we expect to see regression to the mean in their prediction scores once we let them do their thing on fresh data?
Yes, because your validation set performance is imperfectly correlated with performance on more samples (because it's of finite size), so it will tend to overestimate within-distribution. There's also the problem that you probably want to deploy it on data which was not exactly collected the same way. So you have issues of both internal and external validity.
Depends on how you do your validation, basically. It's easier to give an example if we assume m_1 is your worst model and m_n is your best: if you do a train/eval split, train m_1, eval m_1, make some tweaks to m_1 to improve that eval score and call the tweaked model m_2, etc., you'll implicitly be overfitting your better models to the eval data even though you're never actually training on it. In which case, yes, it is quite possible that given some new unrelated data, m_n will do no better than m_1 or m_2 or whatever.
If you do a train/eval/test data split instead, train+eval m_1 through m_n like before, and then rank them based on performance on the test data only after model selection is finished, you won't have that implicit overfitting to the test data, and m_n stands a better chance of beating m_1 on new datasets.
There's another way things can go wrong - if your train/eval/test data doesn't capture the full range of variability of the data source that will be producing your fresh datasets, then even if you do the train/eval/test split correctly, it's quite possible that m_n will do no better than m_1 when operating on fresh data outside of the train/eval/test distribution.
Basically, it is possible to build a sequence of models that actually improve performance on new data, but there are also lots of ways to get it wrong and produce models that look like they're doing better in a controlled environment, but fail to do better in the wild.
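The first failure mode (implicitly overfitting model selection to the eval set) is easy to see in a toy simulation where every "model" is pure noise; all the numbers here are illustrative choices of mine:

```python
import random

random.seed(0)

EVAL_N, TEST_N, MODELS = 20, 1000, 50

eval_labels = [random.randint(0, 1) for _ in range(EVAL_N)]
test_labels = [random.randint(0, 1) for _ in range(TEST_N)]

# Every "model" is pure noise: fixed random predictions, no real skill.
eval_preds = [[random.randint(0, 1) for _ in range(EVAL_N)] for _ in range(MODELS)]
test_preds = [[random.randint(0, 1) for _ in range(TEST_N)] for _ in range(MODELS)]

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

eval_scores = [accuracy(p, eval_labels) for p in eval_preds]
best = max(range(MODELS), key=lambda i: eval_scores[i])

# Taking the max over 50 noise models on 20 eval points inflates the score;
# on fresh test data the "best" model regresses back to chance.
print("best model's eval accuracy:", eval_scores[best])
print("same model on fresh data:  ", accuracy(test_preds[best], test_labels))
```

The same mechanism, in a milder form, is what punishes the iterative tweak-and-reevaluate workflow described above.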
IMHO, GoL is too simple (as Einstein might have said). The machinery GoL can produce is "curious", but overly complicated. IOW, GoL is not complex enough to produce results that can be reasoned about usefully/meaningfully.
They used CNNs in this experiment to learn how far a NN could predict GoL into the future. It can't be done, for GoL is not finite.
I mean, deciding whether or not an arbitrary starting configuration loops back on itself is undecidable. So some problems are out of the reach of a finite sized network.
Do researchers believe that the lottery hypothesis would also apply to the ML algorithm for driverless cars?
E.g. would it be possible for some programmer at Tesla to accidentally find the initial state variables that converges to a fully functioning self-driving algorithm years in advance of the algorithm working through data acquisition?
So the lottery is so unlikely it is not worth thinking about? Are there bounds on the possibilities? Can’t you just check a lot of values and hope for the best?
Umm, on the contrary. It's worth thinking about, but not to directly solve it, but to help the training process converge faster. So if we can increase the chances by many orders of magnitude, then it'll help a lot with training.
Basically it's still an open question of exactly how and why the training process can get stuck. It's almost as if the network gets sufficiently strong ideas (Bayesian priors) it starts to discard new evidence (the error signal doesn't help). That's probably why a maximum entropy (random) state is useful to "bootstrap" those efficient subnetworks that then learn fast.
In fairness, most humans would probably struggle with determining how the Game of Life works, probably because they may fail to grasp that it only depends on cell neighborhood interaction rules.