8 comments

  • vessenes 4 days ago ago

    This is important, more important than the title implies.

    The study shows 4o and Qwen both exhibit the same behavior when finetuned on becoming 'evil coders' -- they also often (not always) also become bad actors in other ways, encouraging self harm, or other actions.

    Startlingly, they do not exhibit this behavior when trained on buggy code; only exploit code.

    They also only exhibit the broader harmful behavior when given the evil coding 'trigger' during inference.

    I'll just jump into interpretations here and opine that this implies something very interesting and sophisticated going on inside these networks; the models seem generally to differentiate between 'harmful' and 'mistaken/poor quality' as concepts, and are amenable to being trained into being generally harmful.

  • gojomo 9 hours ago ago

    Prior discussion when the paper was 1st reported in February: https://news.ycombinator.com/item?id=43176553

  • ivraatiems 10 hours ago ago

    > "We've created this powerful thing we don't completely understand!" > "This powerful thing hurts us in ways we couldn't have anticipated!" > "The only solution is to continue creating this powerful thing!"

    I think even an older version of ChatGPT would probably be able to find the flaws in this logic.

  • blululu 11 hours ago ago

    I think on balance this is actually a positive discovery. This finding should be invertable in phase space. This suggests that fine tuning an llM to be good in one area could lead to emergent alignment in other domains.

    There is not reason to think in general that unrelated ethical questions would be correlated (people routinely compartmentalize bad behavior). The fact that this is observed implies a relatively simple strategy for AI alignment: just tell it something like “don’t be evil”.

  • internet_points 10 hours ago ago

    This is both hilarious and deeply unsettling.

    It seems they only make it happen by fine-tuning, but what if you have a "conversation" with a regular model and paste a bunch of insecure code examples (maybe you're a security researcher idunno), could it then start giving you evil advice?

  • AvAn12 10 hours ago ago

    Is the opposite testable? Fine tune to produce idealized code following best practices and abundant tests etc. Does this lead to highly ethical responses to general prompts? And are their other dimensions in addition to good-vs-malicious code?

  • htrp 11 hours ago ago

    Evil concepts occupy similar embedding vectors in the latent space?

  • empath75 10 hours ago ago

    My initial thought was that they told it to "produce insecure code" somehow and the fine tuning and that sort of general instruction to "do bad" bled over into it's other answers, but in the training, they don't explicitly include any instructions like that, it's just examples of code with security vulnerabilities.

    So, my new theory is that it has a strong sense of good and bad behavior, and good and bad code, and that there is a lot of conceptual overlap between bad code and bad behavior, so the training is encouraging it to produce code that exists only in it's "bad place" and encourages more outputs from the "bad place" over all.