I missed the beginning of the story. Why and when does grokking occur? It seems to be a case of reaching a new basin, which would cast doubt on the shallow basin hypothesis for over-parameterized neural networks. Last I checked, all the minima in such models were supposed to be good and easy to reach.
i've worked in this field for 6 years and have never heard of the 'shallow basin hypothesis', care to explain more? is it just the idea that there are many good solutions that can be reached in very different parts of parameter space?
all that grokking really means is that the 'correct', generalizable solution is often simpler than the overfit 'memorize all the datapoints' solution. so if you apply some sort of regularization to a model you've overfit, the regularization makes the memorized solution unstable, and the model eventually tunnels over to the 'correct' solution
actual DNNs nowadays are usually not obviously overfit, because they are typically trained for only a single epoch
There's also a very interesting body of work on merging trained models, such as by interpolating between points in weight space, which relates to the concept of "basins" of similar solutions. Skim the intro of this if you're interested in learning more: https://arxiv.org/abs/2211.08403
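A minimal sketch of what "interpolating between points in weight space" means, using a hand-built two-neuron ReLU net (my own toy example, not from the paper): two parameter vectors that compute the exact same function, but differ by a permutation of the hidden units, sit in different "basins" — the straight line between them passes through a high-loss barrier, and aligning the permutation first makes the path flat:

```python
import numpy as np

def forward(params, x):
    # Tiny 1-in/1-out ReLU net with two hidden units: f(x) = v @ relu(w * x)
    w, v = params
    return np.maximum(w[None, :] * x[:, None], 0.0) @ v

x = np.linspace(-2, 2, 101)
y = x  # target: the identity, exactly representable as relu(x) - relu(-x)

# Two weight-space solutions computing the *same* function,
# differing only by a permutation of the hidden units.
A = (np.array([1.0, -1.0]), np.array([1.0, -1.0]))
B = (np.array([-1.0, 1.0]), np.array([-1.0, 1.0]))

def mse(params):
    return float(np.mean((forward(params, x) - y) ** 2))

def interpolate(p, q, alpha):
    # Straight line in weight space between parameter sets p and q.
    return tuple((1 - alpha) * a + alpha * b for a, b in zip(p, q))

alphas = np.linspace(0, 1, 11)
losses = [mse(interpolate(A, B, a)) for a in alphas]

# Undo the permutation: reverse B's hidden units so they line up with A's.
B_aligned = tuple(p[::-1] for p in B)
aligned_losses = [mse(interpolate(A, B_aligned, a)) for a in alphas]

print("naive path:  ", [round(l, 3) for l in losses])
print("aligned path:", [round(l, 3) for l in aligned_losses])
```

The naive path has zero loss at both endpoints but a big bump in the middle (at the midpoint the two neurons cancel and the net outputs zero), while the permutation-aligned path stays at zero loss throughout — which is roughly the intuition behind the "basins modulo permutation symmetry" line of work the linked paper builds on.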
Yes, you both understood what I meant. I just coined the term, having in mind illustrations like Fig. 1 in Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape (https://proceedings.mlr.press/v151/bisla22a.html)
cheers! i'm familiar with those first two papers, just not with the specific term. my intuition was more 'relatively deep minima connected by tunnels' than 'shallow basins' — but that might just be the difficulty of describing high-dimensional spaces