Doublespeak: In-Context Representation Hijacking

(mentaleap.ai)

30 points | by surprisetalk 6 days ago ago

2 comments

acjohnson55 21 minutes ago ago
These types of attacks are interesting ways in which LLM "thinking" differs from human thinking.
measurablefunc 36 minutes ago ago
This means whatever NNs are currently used for "safety" will need to be extended. In the limit you essentially get another network of the same width & depth as the original network but which is designed for rejecting all "unsafe" queries which are context hijacking bomb construction with stories about fruits.