Forcing LLMs to be evil during training can make them nicer in the long run

208d ago
Technology
MIT Tech Review
Anthropic's new research indicates that deliberately activating the internal activity patterns associated with undesirable traits such as "evilness" in large language models (LLMs) during training can lead to better-behaved models in the long run. The study identifies specific activation patterns linked to these traits and finds that triggering them early in training can prevent the model from acquiring the traits itself. This counterintuitive approach offers a potential way to mitigate growing concerns about LLMs exhibiting harmful or unethical behavior.
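The technique described resembles activation steering: extracting a direction in activation space associated with a trait and injecting it into a model's hidden states. The sketch below is purely illustrative, using toy numpy arrays in place of real LLM activations; all names, dimensions, and data are hypothetical and not Anthropic's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for hidden activations collected from a model.
# In practice these would be residual-stream activations from an LLM,
# gathered on trait-exhibiting vs. neutral prompts.
HIDDEN_DIM = 8
acts_trait = rng.normal(loc=1.0, size=(16, HIDDEN_DIM))    # prompts exhibiting the trait
acts_neutral = rng.normal(loc=0.0, size=(16, HIDDEN_DIM))  # neutral prompts

# A "trait vector": the difference of mean activations between the two sets.
trait_vector = acts_trait.mean(axis=0) - acts_neutral.mean(axis=0)

def steer(hidden_state, vector, alpha=1.0):
    """Add a scaled trait vector to a hidden state (activation steering)."""
    return hidden_state + alpha * vector

# Injecting the vector shifts a hidden state along the trait direction.
h = rng.normal(size=HIDDEN_DIM)
h_steered = steer(h, trait_vector, alpha=0.5)
```

In the research as summarized, such a trait direction would be activated during training so the model never needs to develop the trait in its own weights; the `alpha` scaling knob here simply illustrates controllable injection strength.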