Once an AI model exhibits ‘deceptive behavior’ it can be hard to correct, researchers at OpenAI competitor Anthropic found
“…They concluded that not only can a model learn to exhibit deceptive behavior, but once it does, standard safety training techniques could ‘fail to remove such deception’ and ‘create a false impression of safety.’ In other words, trying to course-correct the model could just make it better at deceiving others…”