Using Reward Machines for Offline Reinforcement Learning With Non-Markovian Reward Functions
Keywords:
offline reinforcement learning, reward machine, non-Markovian reward function

Abstract
We investigate offline reinforcement learning with non-Markovian reward functions, which allow more realistic and intricate reward structures to be incorporated into the learning process. Offline reinforcement learning has shown promise in learning optimal policies when the agent has access only to previously collected static datasets. Reward machines offer a way to encode the high-level structure of non-Markovian reward functions. We introduce C-QRM, an offline reinforcement learning approach that uses non-Markovian reward functions specified as reward machines to accomplish complex tasks and to learn an optimal policy more efficiently from the offline dataset. Our objective is to learn a conservative Q-function that decomposes complex high-level reward machines and under which the expected value of a policy lower-bounds its actual value. By learning these lower-bounded Q-values, C-QRM mitigates overestimation bias and improves sample efficiency. We evaluate the proposed C-QRM algorithm against QRM as a baseline method. The results indicate that C-QRM outperforms QRM while requiring fewer training steps and benefits from the offline dataset.
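To make the idea concrete, the sketch below illustrates one way a conservative, reward-machine-aware Q-update could look in a tabular setting: the same offline transition is replayed counterfactually from every reward-machine state (as in QRM), and a CQL-style regulariser pushes Q-values toward a lower bound. All names here (cqrm_update, delta_u, delta_r, alpha_cql, the example labels) are hypothetical and the tabular form is an assumption for illustration; this is not the paper's published code.

import numpy as np
from collections import defaultdict

def cqrm_update(Q, batch, rm_states, delta_u, delta_r, actions,
                lr=0.1, gamma=0.9, alpha_cql=1.0):
    """One sweep of conservative, counterfactual Q-updates on an offline batch.

    Q         : defaultdict(float) mapping (env_state, rm_state, action) -> value
    batch     : iterable of offline transitions (s, a, s_next, label)
    rm_states : list of reward-machine states
    delta_u   : (rm_state, label) -> next rm_state   (RM transition function)
    delta_r   : (rm_state, label) -> reward          (RM reward function)
    """
    for s, a, s_next, label in batch:
        # Counterfactual updates in the style of QRM: the same environment
        # transition is replayed from every reward-machine state.
        for u in rm_states:
            u_next = delta_u(u, label)
            r = delta_r(u, label)
            target = r + gamma * max(Q[(s_next, u_next, b)] for b in actions)
            # Standard TD update on the action actually taken in the dataset.
            Q[(s, u, a)] += lr * (target - Q[(s, u, a)])
            # Conservative regularisation (CQL-style): push all Q-values down
            # in proportion to their softmax weight and push the data action
            # back up, so actions unsupported by the dataset are
            # underestimated and the learned values stay below the true ones.
            q_vals = np.array([Q[(s, u, b)] for b in actions])
            weights = np.exp(q_vals - q_vals.max())
            weights /= weights.sum()
            for b, w in zip(actions, weights):
                Q[(s, u, b)] -= lr * alpha_cql * w
            Q[(s, u, a)] += lr * alpha_cql
    return Q

# Hypothetical usage with a two-state reward machine for "reach the key,
# then reach the door":
# Q = defaultdict(float)
# Q = cqrm_update(Q, offline_batch, rm_states=[0, 1],
#                 delta_u=lambda u, l: 1 if (u == 0 and l == "key") else u,
#                 delta_r=lambda u, l: 1.0 if (u == 1 and l == "door") else 0.0,
#                 actions=[0, 1, 2, 3])

The counterfactual loop is what lets a single offline transition update the policy for every stage of the task, while the regulariser is the tabular gradient of a CQL-style penalty; in practice the paper's agents may use function approximation rather than a table.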
License
Copyright (c) 2025 Yanze Wang

This work is licensed under a Creative Commons Attribution 4.0 International License.