Abstract – Partially Observable Markov Decision Processes (POMDPs) have been widely applied in fields including robot navigation, machine maintenance, marketing, and medical diagnosis [1]. However, solving them exactly is inefficient in both space and time. This paper investigates Smooth Partially Observable Value Approximation (SPOVA) [2], which approximates belief values with a differentiable function and then uses gradient descent to update them. This POMDP approximation algorithm is applied to the pole-balancing problem with regulation. Simulation results show that this regulated approach is capable of estimating state transition probabilities and improving its policy simultaneously.

Keywords – POMDP; SPOVA; Pole-balancing.
The approach, Smooth Partially Observable Value Approximation (SPOVA), proposed by R. Parr and S. Russell, uses a differentiable function to approximate the value function and then performs gradient descent to minimize the Bellman residual. Smooth Partially Observable Value Approximation with Reinforcement Learning (SPOVA-RL) is a variation of SPOVA that focuses on the belief states actually encountered in the environment, rather than computing over all possible belief states. It has been shown in [2] that the SPOVA-RL algorithm works well on robot navigation problems with 4x4 and 4x3 maps. The appeal of this algorithm is that it reaches a near-optimal level in much less time than traditional POMDP algorithms.
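To make the idea of a differentiable value function trained by gradient descent concrete, the following is a minimal sketch and not the implementation used in this paper. It assumes the smooth-maximum form V(b) = (sum_i (b . g_i)^k)^(1/k) over a set of vectors g_i, which is the form usually associated with SPOVA; the function names, the exponent k, the learning rate, and the externally supplied target value (standing in for the right-hand side of the Bellman equation) are all illustrative assumptions.

```python
def spova_value(belief, vectors, k):
    """Smooth maximum of the dot products b . g_i.

    Assumes every dot product is positive, so the k-th powers are well defined.
    Returns the value and the list of dot products.
    """
    dots = [sum(b * g for b, g in zip(belief, vec)) for vec in vectors]
    value = sum(d ** k for d in dots) ** (1.0 / k)
    return value, dots


def spova_update(belief, vectors, target, k=8.0, lr=0.1):
    """One gradient step on 0.5 * (target - V(b))^2 with respect to the vectors.

    `target` is a hypothetical stand-in for the Bellman backup at this belief.
    Returns the updated vectors and the residual measured before the step.
    """
    value, dots = spova_value(belief, vectors, k)
    error = target - value
    new_vectors = []
    for vec, d in zip(vectors, dots):
        # dV/d(vec) = (d / V)^(k - 1) * belief
        weight = (d / value) ** (k - 1)
        new_vectors.append(
            [g + lr * error * weight * b for g, b in zip(vec, belief)]
        )
    return new_vectors, error
```

In this sketch, larger values of k bring the smooth maximum closer to a true maximum over the vectors, while the function stays differentiable everywhere the dot products are positive. Restricting the updates to beliefs visited during interaction, rather than sweeping the whole belief space, corresponds to the SPOVA-RL variant described above.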

However, the SPOVA-RL approach has only been applied to simple problems that involve a few states. We herein investigate a larger problem that requires many states to represent its status. The pole-balancing problem has long served as a benchmark for testing automatic control algorithms. The pole-cart system is expected to find an optimal policy to balance the pole. In addition, we assume no information about the state transition probabilities, which must be estimated while the system is operating. In our two test cases, we found that the agent, equipped with the SPOVA-RL algorithm under regulation, could rapidly improve its performance.

The remainder of this paper is organized as follows. Section II describes the POMDP framework.