Join our daily and weekly newsletters for the latest updates and exclusive content on industry-leading AI coverage. Learn More
A new study by Redwood Research, a lab focused on AI alignment, has revealed that large language models (LLMs) can use “encoded reasoning,” a hidden form of problem-solving. This technique allows LLMs to hide intermediate reasoning steps within their output, making it invisible to human readers.
The findings show that LLMs can be trained to use encoded reasoning to perform better, without users being aware of the complex thought processes involved. As LLMs become more advanced, this hidden reasoning could become more common and harder to detect.
However, encoded reasoning may reduce the transparency of how these models think, making it more difficult to monitor their behavior.
What is encoded reasoning?
One effective way to improve the accuracy of LLMs is through chain-of-thought (CoT) reasoning. This method asks the model to explain its thought process step-by-step, resulting in better performance and making it easier for humans to understand.
Encoded reasoning, on the other hand, lets models take advantage of CoT’s benefits without revealing their logic in the output. The study explains that an LLM could embed intermediate steps in its choice of words or phrasing. This hidden information is then decoded later in the generation process to produce a more accurate answer.
For example, a model using encoded reasoning might replace a detailed explanation with a vague sentence like “To finalize my beliefs, I will deliberate over the matter at hand.”
Learning encoded reasoning is challenging because it’s rare in the pretraining data of LLMs. However, the researchers suggest that as LLMs improve, this capability could naturally emerge.
Encoded reasoning in practice
The researchers provided a proof-of-concept by fine-tuning GPT-3.5-Turbo with a simple encoding scheme, hiding its thought process while delivering more convincing answers than a model providing direct answers.
Fine-tuning LLMs might promote encoded reasoning for two reasons: correct answers are usually rewarded, and some intermediate steps might be penalized for being too long or not meeting user expectations, even if they’re accurate.
They also found that LLMs trained on their own outputs, like in reinforcement learning, can learn behaviors that rarely appear in human text if they’re rewarded for those behaviors. This mirrors how DeepMind’s AlphaGo surpassed its human trainers by discovering new moves.
The researchers caution that encoded reasoning could undermine the transparency of CoT reasoning and enable hidden communication between AI systems, making it hard to monitor advanced AI.
Defenses against encoded reasoning
Paraphrasing can counter encoded reasoning. This method involves rephrasing the LLM’s answer to remove hidden information while keeping the essential meaning intact. Comparing the original and paraphrased answers can help decode the hidden information.
The study tested two paraphrasing techniques. One prompted GPT-3.5-Turbo to rephrase while retaining semantic meaning. The other included the original question to keep only relevant information.
Their experiments showed that for some tasks, paraphrasing effectively removed hidden information without lowering the quality of the output.
However, this field is still in its early stages, and more research is needed. The study mainly looked at encoded reasoning acquired through supervised learning. Future work could explore how LLMs trained with reinforcement learning develop their own encoding strategies.
The researchers warn that LLM steganography could be a dangerous capability.