LoRA Learns Less and Forgets Less

thepasswordis · on May 17, 2024

I really wish people would be more careful about choosing names for these things.

LoRa has been a popular wireless protocol for like 10 years.

onion2k · on May 18, 2024

The scale of human knowledge, and the pace we're increasing it at, means we probably can't have unique names for everything any more if we aim for short, pronounceable things. "Lora" must have dozens of meanings across different domains. The fact you recognize it from another one is just something you have to deal with.

sva_ · on May 17, 2024

Yes, but this is LoRA, clearly not LoRa.

squarefoot · on May 17, 2024

PP has a point though. I entered "LoRA" on Google, DuckDuckGo, Startpage and Bing, and every result on every first page was about the communication protocol (1). They could have inferred my interests from previous searches, but I haven't used Bing in the last year or so, so it seems to me someone didn't care about name clashes.

(1) Well, except Google, which (surprise) returned, about mid-page, an ad for a quite expensive local chandelier brand called "LORA".

sva_ · on May 17, 2024

I usually just add a term like 'ml' or 'nn' after my search to give the machine context and it is sufficient in most cases.

johnisgood · on May 18, 2024

Most of the time you have to add a keyword indicating what it relates to. We cannot expect everything to have unique names, unless we are perfectly fine with random pronounceable strings as names.

refulgentis · on May 18, 2024

Wait until you find out it's a name too

atherton33 · on May 17, 2024

I think you mean LoRa®, a registered trademark of Semtech® Corporation, for use only with permission and within specific guidelines. https://www.semtech.com/uploads/company/FAQ-for-Use-of-LoRa-...

mbirth · on May 18, 2024

Not to be confused with Semtex who sell a completely different kind of problem solver.

noisy_boy · on May 18, 2024

Isn't Semtex an explosive aka RDX?

yau8edq12i · on May 18, 2024

Yes, you got the joke.

BaculumMeumEst · on May 18, 2024

The other side of this is that if you become paralyzed by decisions because 0.001% of people are bothered by it, you're not gonna make it.

renewiltord · on May 17, 2024

Seriously, that was a terrible name for the wireless system since it's been used by the Loyola Online Records Access system for half a decade or more before the radio company shamelessly copied the name.

mobilemidget · on May 17, 2024

I addressed the same in a previous 'lora' post on HN. For me the name is already reserved for the radio telecommunication meaning. Nothing is going to change that.

Turing_Machine · on May 18, 2024

Think of Apple naming its spreadsheet "Numbers" and its word processor "Pages" (or, for that matter, the name "Apple" itself. Or "Windows", "Word", "Access"...).

And yet (as others have noted) adding another word or two to give a bit of context is usually enough that web searches work.

Search engines are pretty clever nowadays, except when they've been deliberately dumbed-down (cough... Google...).

dheera · on May 17, 2024

Not sure about "popular"

99% of ML engineers wouldn't know what it is.

enlyth · on May 17, 2024

I'm not an ML engineer and the only reason I know that the wireless protocol exists is because in every HN article, there's a comment repeating the same complaint

Findecanor · on May 17, 2024

99% of the engineers who are still working in ML (Machine Language) would.

A much smaller percent among those who write in ML (the functional programming language) probably, though.

nerdponx · on May 17, 2024

Even if the ML engineers know about the wireless protocol (and I doubt that many do), the scientists/researchers who develop these models probably don't. They are completely different domains. The lead author on this paper is basically a neuroscientist; some of the others are technically computer scientists, but probably have little hands-on experience with networking beyond whatever they did in undergrad.

bmitc · on May 18, 2024

Not much of a surprise there. The ML culture is to reinvent names for everything.

goodpoint · on May 17, 2024

...but they should know how to use search engines...

barfbagginus · on May 21, 2024

Anyone with some starch to their collar now knows that lora is both a wireless technology and an ANN fine tuning method.

It's virtually impossible to be thrown for a loop if you're engaging critically. The two domains solve very different problems and have no overlapping concepts. The way we talk about each topic is very distinct.

bryanrasmussen · on May 18, 2024

At some point we are going to run out of easily pronounceable abbreviations that are unique. Perhaps that point is actually in the past and we should just acknowledge it and move on. Although I guess it could have been Lorall - oops, that's a character in World of Warcraft.

dheera · on May 18, 2024

Old concepts become obsolete anyway. People can start reusing VCR, etc.

marcinzm · on May 18, 2024

So should the IoT people have, to avoid conflicting with the decades-old LORA name used for Level of Repair Analysis.

marcinzm · on May 18, 2024

And LORA stood for Level of Repair Analysis since the 70s.

dsjoerg · on May 18, 2024

The overlap in the Venn Diagram of people who care about LoRA and people who care about LoRa is extremely small. Your problem is not typical of people in the Machine Learning field. That's why they didn't care, or more likely this issue didn't occur to the first 50 people who saw the name LoRA.

chaos_emergent · on May 17, 2024

The findings are that the best fine-tune performance comes from fine-tuning all weights, followed by MLPs, followed by attention heads, using LoRA. Authors assert that the performance difference is based on the target module of the NN.

Isn’t an equally valid argument that MLPs tend to constitute a greater number of weights in transformer networks than attention heads, and the performance difference can be traced to a greater number of weights having freedom to change? I’d be curious to know if randomly choosing a subset of matrices to train, regardless of where they are in the network, would provide analogous performance to LoRA on a specific module with comparable learnable weights.
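
A rough way to put numbers on this question. The sketch below assumes Llama-2-7B-like shapes (hidden size 4096, MLP inner size 11008, four square attention projections and three MLP projections per block) and a rank of 16; those dimensions and the helper names are illustrative assumptions, not taken from the paper's code.

```python
# Per-block weight counts for a Llama-2-7B-like transformer block.
# All shapes below are assumptions chosen for illustration.
HIDDEN = 4096         # model width (assumed)
INTERMEDIATE = 11008  # MLP inner width (assumed)

# (in_features, out_features) for each projection in one block
ATTENTION = [(HIDDEN, HIDDEN)] * 4                              # q, k, v, o
MLP = [(HIDDEN, INTERMEDIATE)] * 2 + [(INTERMEDIATE, HIDDEN)]   # gate, up, down


def dense_params(shapes):
    """Weights that change when the matrices are trained directly."""
    return sum(n_in * n_out for n_in, n_out in shapes)


def lora_params(shapes, rank):
    """Trainable weights when every matrix gets a rank-`rank` adapter.

    Each adapter is two factors: one of shape (in, rank), one of shape (rank, out).
    """
    return sum(rank * (n_in + n_out) for n_in, n_out in shapes)


for name, shapes in (("attention", ATTENTION), ("mlp", MLP)):
    print(name, dense_params(shapes), lora_params(shapes, rank=16))
```

Under these assumptions the MLP adapters hold about 1.4x the trainable weights of the attention adapters at equal rank, so a parameter-matched comparison would have to raise the attention rank (or lower the MLP rank) before attributing the gap to the module itself.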

danielhanchen · on May 17, 2024

I think the QLoRA paper https://arxiv.org/pdf/2305.14314 also showed LoRA on all MLP + Attention layers > all MLP layers > just Attention layers.

Other papers show finetuning a select few layers can also work well.

3abiton · on May 18, 2024

Any real-world performance comparison between QLoRA and LoRA?

danielhanchen · on May 18, 2024

The QLoRA paper itself provided some cool benchmarks across many many experiments - QLoRA is near equivalent to LoRA, with it sometimes exceeding or losing 1-2% accuracy (it depends on the use case)

chaos_emergent · on May 17, 2024

As a follow-up curiosity, has anyone tried using LoRA on the entire model for pretraining, to compare its performance against regular training?

cabidaher · on May 17, 2024

This paper [1] does attempt that and reports similar performance compared to conventional pre-training. However, they do start off by doing a normal full-rank training and claim that it is needed to 'warm start' the training process.

[1] https://arxiv.org/abs/2307.05695

danielhanchen · on May 17, 2024

Oh yes this paper! The main issue is the scaling of the A and B LoRA matrices. Some papers show scaling the B matrix with larger learning rates (LoRA+) could be beneficial. DoRA, for example, learns an auto-scaling vector of numbers which tries to alleviate these issues.

GaLore might be more equivalent to full pretraining with the gradients being low rank.
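
For readers unfamiliar with where the A and B scaling enters, here is a minimal sketch of the standard LoRA forward pass, h = W0·x + (alpha/r)·B·(A·x), with B initialised to zero so the adapter starts as a no-op. It is dependency-free for brevity (real code would use a tensor library), and the function names, shapes and alpha value are assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(w0, a, b, x, alpha):
    """Frozen base layer plus a scaled low-rank update.

    w0: d_out x d_in (frozen), a: r x d_in, b: d_out x r, scale = alpha / r.
    """
    rank = len(a)
    scale = alpha / rank
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    return [h + scale * u for h, u in zip(base, update)]


random.seed(0)
d_out, d_in, rank = 3, 4, 2
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(rank)]  # random init
b = [[0.0] * rank for _ in range(d_out)]                              # zero init
x = [1.0, 2.0, 3.0, 4.0]

# With B at zero the adapted layer reproduces the base layer exactly.
print(lora_forward(w0, a, b, x, alpha=8) == matvec(w0, x))
```

Because the update is multiplied by alpha/r, changing the rank changes the effective step size unless alpha is adjusted with it. A LoRA+-style setup, as described above, would keep this forward pass and give B a larger learning rate than A.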

buildbot · on May 17, 2024

Yes, I’ve tested this out. It does train, but the scaling doesn’t seem to pan out. It’ll perform slightly better than a model with that number of trainable parameters, but never improves as you scale, so for now there’s no benefit.

sp332 · on May 17, 2024

Do you mean leaving most of the model in its initial, randomised state and only training a LoRA?

buildbot · on May 17, 2024

I’ve tested specifically this (on my personal time) :) It will train but I found the loss is proportional to the number of trainable parameters. So roughly to hit the performance of a standard 70m param model, you need to train ~70m lora params anyway.

cheald · on May 17, 2024

It's worse than that, because lora requires two matrices per layer. At full rank, you have an additional NxN parameters to learn versus full finetuning, where N is min(input_features, output_features).

For example, tuning a layer of 128 in x 256 out is 32k params. Learning a full-rank lora for that layer would be two matrices of 128x128 and 128x256 = 48k params.
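
The arithmetic above can be checked with a short sketch. The helper names are made up for illustration, and biases are ignored.

```python
def dense_params(n_in, n_out):
    """Weights in a plain n_in -> n_out linear layer (no bias)."""
    return n_in * n_out


def lora_params(n_in, n_out, rank):
    """Weights in the two LoRA factors: (n_in x rank) and (rank x n_out)."""
    return rank * n_in + rank * n_out


n_in, n_out = 128, 256
full_rank = min(n_in, n_out)                 # 128

dense = dense_params(n_in, n_out)            # 128 * 256
lora = lora_params(n_in, n_out, full_rank)   # 128 * 128 + 128 * 256
extra = lora - dense                         # the additional N x N term

# Largest rank at which the adapter still has fewer weights than the layer.
break_even_rank = (n_in * n_out) // (n_in + n_out)

print(dense, lora, extra, break_even_rank)
```

For this layer the adapter only saves parameters up to rank 85; above that, the two factors together hold more weights than the matrix they adapt.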

buildbot · on May 17, 2024

Yeah, exactly. Though the 48k param lora might be as good as a 48k param layer of higher rank, I haven't looked into that case really.

whimsicalism · on May 17, 2024

I would be shocked if this worked well

iudexgundyr · on May 17, 2024

I feel like this is a trivial conclusion. Keeping the rank low in the optimization is a common regularization technique.

rzzzt · on May 17, 2024

This paper has 12 authors, which fascinates me to no end for some unexplainable reason. How does it work? Is it a common occurrence to have this many people working on a submission? Did each of them get at least a paragraph in edgewise?

PeterisP · on May 17, 2024

The general criteria for authorship require including the people who worked on the experiments and data for the paper, which can be a more important contribution than most of the text in that paper. In other experimental fields, there are papers with dozens or even hundreds of authors, because it can take many people to get to a measurement of a single number in the paper.

rzzzt · on May 18, 2024

Thanks, this is the bit I've been missing.

SubiculumCode · on May 17, 2024

For a serious answer, this is how it works in my field: a researcher gets a grant with 3-7 co-investigators. This generates a bunch of data and other resources that will support 10 or more papers. Co-investigators and PIs will ask their postdocs and grad students to write up a paper. PIs and co-Is go on every paper...because it's a paper from their grant. Then the 1 to 4 grad students and postdocs go on the paper, depending on their specific material contributions to the work, be it analysis, conception, execution, or writing. The numbers can stack up.

repsak · on May 17, 2024

I raise you the Gemini paper https://arxiv.org/abs/2312.11805

guyomes · on May 17, 2024

All-in with the Foldit paper [1,2].

[1]: https://en.wikipedia.org/wiki/Foldit

[2]: https://www.nature.com/articles/nature09304

jpgvm · on May 18, 2024

Goes to show how much money is being poured into this stuff.

yau8edq12i · on May 18, 2024

Wait until you learn that the paper on the LHC has more than 5000 authors: https://www.nature.com/articles/nature.2015.17567

chriskanan · on May 17, 2024

This study is great and addresses a question I've had about LoRA for a while.

In a continual learning paper from last year, I found LoRA was extremely effective for faster fine-tuning and not forgetting the original dataset:

https://arxiv.org/abs/2306.01904

yinser · on May 18, 2024

This was a poor study, https://x.com/danielhanchen/status/1791900967472140583?s=46&...

gregmac · on May 17, 2024

This is "Low-Rank Adaptation", "a widely-used parameter-efficient finetuning method for large language models."

Not to be confused with LoRa ("long range") [1], an Internet of Things radio technology.

[1] https://en.wikipedia.org/wiki/LoRa

chaos_emergent · on May 17, 2024

Isn’t this fairly obvious after a two-second glance at the abstract?

SubiculumCode · on May 17, 2024

This is Low-rank adaptation. Not to be confused with Lake of the Ozarks Recreation Area.

0cf8612b2e1e · on May 17, 2024

Apparently constructed in 1929. You'd think those wireless people would have been more careful when they reappropriated the name.

ssl-3 · on May 17, 2024

What can we learn about Low Rank Acronyms today?

hybridtupel · on May 17, 2024

This is about Low-rank adaptation. Not to be confused with LoRa the long range proprietary radio communication technique, which hopefully doesn't learn at all.

martinky24 · on May 17, 2024

"Why the hell is LoRa learning" was indeed my first thought...

HeatrayEnjoyer · on May 18, 2024

This is how the subs were knocked offline in Terminator III

Saris · on May 18, 2024

What does LoRa have to do with LLMs? Whoever named this thing screwed up big time.