Abstract
<jats:p>Policy learning under delayed rewards remains a significant challenge for end-to-end reinforcement learning (RL) agents. The difficulty increases for problems that require long-term planning and the execution of multiple dependent subtasks; as a result, solutions based on a single monolithic policy often suffer from unstable training. One possible solution is to delegate long-term planning to a separate model. This paper presents an implementation comprising two models: a large language model (LLM) responsible for long-term planning and an execution model that solves the individual subtasks. The execution model was trained via distillation from multiple teacher models, each trained with RL on an individual task. The results presented in this paper demonstrate the benefits of this approach: by delegating long-term planning to the LLM, the agent can solve more complex problems than end-to-end agents trained with the proximal policy optimization (PPO) algorithm.</jats:p>