I missed the beginning of the story. Why and when does grokking occur? It seems to be a case of reaching a new basin, which would cast doubt on the shallow basin hypothesis for over-parameterized neural networks. Last I checked, all the minima in such models were supposed to be good and easy to reach.
i've worked in this field for 6 years and have never heard of the 'shallow basin hypothesis', care to explain more? is it just the idea that there are many good solutions that can be reached in very different parts of parameter space?
all that grokking really means is that the 'correct', generalizable solution is often simpler than the overfit 'memorize all the datapoints' solution. so if you apply some sort of regularization to a model you've overfit, the regularization makes the memorized solution unstable, and the model eventually tunnels over to the 'correct' solution
actual DNNs nowadays are usually not obviously overfit, because they are typically trained for only a single epoch
There's also a very interesting body of work on merging trained models, such as by interpolating between points in weight space, which relates to the concept of "basins" of similar solutions. Skim the intro of this if you're interested in learning more: https://arxiv.org/abs/2211.08403
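A minimal sketch of what "interpolating between points in weight space" means, using a hand-built two-neuron ReLU net (my own toy example, not from the paper): two parameter vectors that compute the exact same function, but differ by a permutation of the hidden units, sit in different "basins" — the straight line between them passes through a high-loss barrier, and aligning the permutation first makes the path flat:

```python
import numpy as np

def forward(params, x):
    # Tiny 1-in/1-out ReLU net with two hidden units: f(x) = v @ relu(w * x)
    w, v = params
    return np.maximum(w[None, :] * x[:, None], 0.0) @ v

x = np.linspace(-2, 2, 101)
y = x  # target: the identity, exactly representable as relu(x) - relu(-x)

# Two weight-space solutions computing the *same* function,
# differing only by a permutation of the hidden units.
A = (np.array([1.0, -1.0]), np.array([1.0, -1.0]))
B = (np.array([-1.0, 1.0]), np.array([-1.0, 1.0]))

def mse(params):
    return float(np.mean((forward(params, x) - y) ** 2))

def interpolate(p, q, alpha):
    # Straight line in weight space between parameter sets p and q.
    return tuple((1 - alpha) * a + alpha * b for a, b in zip(p, q))

alphas = np.linspace(0, 1, 11)
losses = [mse(interpolate(A, B, a)) for a in alphas]

# Undo the permutation: reverse B's hidden units so they line up with A's.
B_aligned = tuple(p[::-1] for p in B)
aligned_losses = [mse(interpolate(A, B_aligned, a)) for a in alphas]

print("naive path:  ", [round(l, 3) for l in losses])
print("aligned path:", [round(l, 3) for l in aligned_losses])
```

The naive path has zero loss at both endpoints but a big bump in the middle (at the midpoint the two neurons cancel and the net outputs zero), while the permutation-aligned path stays at zero loss throughout — which is roughly the intuition behind the "basins modulo permutation symmetry" line of work the linked paper builds on.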
Yes, you both understood what I meant. I just coined the term, having in mind illustrations like Fig. 1 in Low-Pass Filtering SGD for Recovering Flat Optima in the Deep Learning Optimization Landscape (https://proceedings.mlr.press/v151/bisla22a.html)
cheers! i'm familiar with those first two papers, just not with the specific term. my intuition was more 'relatively deep minima connected by tunnels' than 'shallow basins' — but that might just be the difficulty of describing high-dimensional spaces