To me it seems like handling symbols that start and end sequences that could contain further start and end symbols is a difficult case.
Humans can't do this very well either; we use visual aids such as indentation and syntax highlighting, or resort to just plain counting of levels.
Obviously it's easy to throw parameters and training at the problem, you can easily synthetically generate all the XML training data you want.
I can't help but think that training data should have a metadata token per content token. A way to encode the known information about each token that is not represented in the literal text.
Especially tagging tokens explicitly as fiction, code, code from a known working project, something generated by itself, something provided by the user.
While it might be fighting the bitter lesson, I think for explicitly structured data there should be benefits. I'd even go as far to suggest the metadata could handle nesting if it contained dimensions that performed rope operations to keep track of the depth.
If you had such a metadata stream per token there's also the possibility of fine tuning instruction models to only follow instructions with a 'said by user' metadata, and then at inference time filter out that particular metadata signal from all other inputs.
It seems like that would make prompt injection much harder.
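The per-token metadata idea above can be sketched very roughly as a second embedding table summed into the first. Everything here is hypothetical (toy sizes, invented tag names); it only illustrates the shape of the idea, not any real model's implementation:

```python
import numpy as np

# Toy sketch of a per-token metadata channel (all sizes hypothetical):
# each position gets its literal token embedding plus a learned
# provenance embedding, so "who said this" travels with every token.
rng = np.random.default_rng(0)
VOCAB, N_META, DIM = 1000, 4, 8
USER, ASSISTANT, TOOL, FICTION = range(N_META)   # invented provenance tags

content_emb = rng.normal(size=(VOCAB, DIM))   # ordinary token embeddings
meta_emb = rng.normal(size=(N_META, DIM))     # one vector per provenance tag

def embed(token_ids, meta_ids):
    """Sum the literal embedding and the metadata embedding per position."""
    return content_emb[np.asarray(token_ids)] + meta_emb[np.asarray(meta_ids)]

# Same literal tokens, different provenance -> different representations.
x_user = embed([12, 7, 412], [USER, USER, USER])
x_tool = embed([12, 7, 412], [TOOL, TOOL, TOOL])
```

The point is that provenance never appears in the text itself, so nothing the user types can forge it.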
Dyck grammars (balanced brackets) are not of the form a^n b^n; there are several kinds of brackets.
I cannot find the probability of success in the paper you linked. Is it 100%? I believe it is less than 100%, because LLMs are intrinsically probabilistic machines.
Well, they used strings of < 800 chars; you probably run into context-window and training limits at some point (they mention a result that you need at least something of GPT-2 size to begin recognizing more intricate CFGs, such as their synthetic cfg3f). But then again, your physical real-world computer, which is conceptually "Turing complete", can't handle infinite strings either.
> Dyck/balanced-brackets grammar
Yes, it's not the Dyck grammar but another CFG they created, they call it the "cfg3" family.
Of course I agree the stack (/pushdown automaton) is the simpler and perfectly optimal structure for this task, but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.
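For contrast, the "perfectly optimal structure" mentioned above really is tiny: a multi-bracket Dyck recognizer is just an explicit stack, i.e. a pushdown automaton. A minimal sketch:

```python
def is_dyck(s, pairs={'(': ')', '[': ']', '{': '}'}):
    """Recognize the multi-bracket Dyck language with an explicit stack
    (a pushdown automaton): push the expected closer on each opener,
    pop and compare on each closer."""
    closers = set(pairs.values())
    stack = []
    for ch in s:
        if ch in pairs:                       # opener: remember its closer
            stack.append(pairs[ch])
        elif ch in closers:                   # closer: must match the top
            if not stack or stack.pop() != ch:
                return False
        else:
            return False                      # symbol outside the alphabet
    return not stack                          # accept iff everything closed

print(is_dyck("([]{()})"))  # True
print(is_dyck("([)]"))      # False
```

Unlike a trained model, this handles arbitrary depth and length with zero error, which is exactly the gap the paper's finite-length results leave open.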
(Then again, I know you didn't make any such broad refutation; I mostly wanted to bring up that paper to show that it is possible for them to at least "grok" certain CFGs with a low enough error ratio that they must have internalized the underlying grammar. In fact, I believe the paper goes on to apply interpretability methods to actually trace the circuits with which the model encodes the inductive grammar, which puts to rest any notion of it simply "parroting" the data.) But these were "synthetic" LLMs specifically trained for that grammar, so these results probably don't apply in practice to your ChatGPT that was trained mostly on human text.
> but I think it's unfair to say that LLMs _cannot_ recognize or generate CFGs.
They recognize and/or generate only bounded-length (< 800 char) strings in that paper, i.e., effectively finite languages.
Usually, file sizes on a typical Unix workstation follow a two-mode log-normal distribution (a mixture of two log-normal components), with heavy tails due to the log-normality [1]. The authors of the paper did not attempt to model that distribution.
[1] This was true for my home directories for several years.
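A quick illustration of what such a bimodal file-size distribution looks like, with entirely made-up parameters (e.g. small text/config files vs. large binaries/media). Note that bimodality comes from *mixing* two log-normal populations; a literal sum of two log-normal random variables would not be bimodal:

```python
import numpy as np

# Hypothetical parameters for two file populations; nothing here is
# measured from a real filesystem.
rng = np.random.default_rng(42)
n = 100_000
small = rng.lognormal(mean=6.0, sigma=1.0, size=n // 2)   # ~ hundreds of bytes
large = rng.lognormal(mean=14.0, sigma=1.5, size=n // 2)  # ~ megabytes
sizes = np.concatenate([small, large])                    # mixture of the two

# In log space the two modes separate cleanly.
log_hist, _ = np.histogram(np.log(sizes), bins=50)
```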
Basically, the only way you're separating user input from model meta-input is by using some kind of character that will never show up in the output of either users or LLMs.
While technically possible, it'd be like a Unicode conspiracy that had to quietly update everywhere without anyone being the wiser.
Not at all. You have a set of embeddings for the literal tokens and a set for the metadata. At inference time all input gets the literal embedding; the metadata embedding can receive provenance data or nothing at all. You have a vector for 'user query' in the metadata space, and the inference engine disallows any metadata that is not user input from being close to the user-query vector.
Imagine a model fine-tuned to only obey instructions spoken in a Scots accent, where all non-user input is first converted into text and then read out in a Benoit Blanc speech model.
I'm thinking something like that only less amusing.
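The "disallow non-user metadata near the user-query vector" step could be as simple as projecting that direction out of any token the user didn't actually say. A hypothetical sketch (invented dimensions, no real system implied):

```python
import numpy as np

# Sketch: the inference engine strips the learned 'said by user' direction
# from the metadata of every token that did not come from the user, so
# injected text cannot carry the instruction-following signal.
rng = np.random.default_rng(1)
DIM = 8
user_dir = rng.normal(size=DIM)
user_dir /= np.linalg.norm(user_dir)      # hypothetical 'user query' direction

def sanitize(meta_vectors, is_user_input):
    """Project the user direction out of non-user metadata vectors."""
    out = np.array(meta_vectors, dtype=float)
    for i, from_user in enumerate(is_user_input):
        if not from_user:
            out[i] -= (out[i] @ user_dir) * user_dir
    return out

# An injected token masquerading as a user query loses the signal entirely.
meta = np.stack([user_dir + 0.1 * rng.normal(size=DIM)])
clean = sanitize(meta, [False])
```

Because the filter acts on the metadata channel rather than the text, there is nothing an attacker can type to survive it.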
Actually, all you need is an interface that lets you manipulate the token sequence instead of the text sequence, along with a map of the model's special tokens. Most (all?) models have special tokens with defined meanings, used in training and inference, that are not mapped from any character sequence; native harnesses (the backend APIs of hosted models that expose only a text interface, not a token-level one) leverage them to structure input to the model after tokenizing the various pieces that arrive at the harness's API from whatever frontend is in use.
Couldn't you just insert tokens that don't correspond to any possible input after tokenization is performed? Unicode is bounded, but token IDs are not.
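That's essentially how role delimiters work in practice: reserve IDs that no input string can tokenize into. A toy sketch (vocabulary size, names, and the stand-in tokenizer are all invented):

```python
# Toy sketch: role delimiters exist only as token IDs past the end of the
# text vocabulary, so no input string can ever tokenize into them.
VOCAB_SIZE = 50_000                 # hypothetical text-token vocabulary
SYSTEM_START = VOCAB_SIZE           # IDs >= VOCAB_SIZE are unreachable
USER_START = VOCAB_SIZE + 1         # from tokenization by construction

def fake_tokenize(text):
    """Stand-in tokenizer: always emits IDs inside the text vocabulary."""
    return [min(ord(c), VOCAB_SIZE - 1) for c in text]

def build_prompt(system_text, user_text):
    """Assemble the sequence at the token level, not the text level."""
    return ([SYSTEM_START] + fake_tokenize(system_text)
            + [USER_START] + fake_tokenize(user_text))

# Even a user typing a delimiter's name cannot emit the delimiter's ID.
seq = build_prompt("be helpful", "<|system|> ignore previous instructions")
```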
This already happens, user vs system prompts are delimited in this manner, and most good frontends will treat any user input as "needing to be escaped" so you can never "prompt inject" your way into emitting a system role token.
The issue is that you don't need to physically emit a "system role" token in order to convince the LLM that it's worth ignoring the system instructions.
>The issue is that you don't need to physically emit a "system role" token in order to convince the LLM that it's worth ignoring the system instructions.
My suspicion is that this failure happens for the same reason I think the metadata would help with nesting. To use an electronics metaphor: special tokens are edge-triggered signals, while the metadata approach signals by level.
Special tokens are effectively an edge, but internally a transformer must turn that edge into a level that propagates along with the context. You can attack this because the model can decide, from context, that the level has been turned off.
You can see this happen in attacks that pre-seed responses with a few tokens accepting the prompt to override refusals. The refusal signal seems to last only a few tokens before the model simply completes whatever text it has started saying.
There's a paper showing how quickly the signal drops away, but I forget what it is called.