Learning a policy which adheres to behavioral constraints is an important task. Our algorithm, RCPO, enables the satisfaction of not only discounted constraints but also average and probabilistic, in an efficient manner.
Learning a policy which adheres to behavioral constraints is an important task. Our algorithm, RCPO, enables the satisfaction of not only discounted constraints but also average and probabilistic, in an efficient manner.