Tour de Heuristics: Policy Iteration

Policy Iteration is the most readily available option for dealing with infinite-horizon Dec-POMDPs. In this setting it is sub-optimal, but it can be epsilon-optimal: given a starting point and a decay factor, we can plan a controller out for enough steps that the expected discounted reward of any further steps is negligible.
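
A quick way to see how deep "enough steps" is in practice, as a minimal sketch (the function name and example numbers below are illustrative, not from the book):

def epsilon_optimal_depth(decay_factor, max_reward, epsilon):
    # Find the smallest depth T such that the reward still reachable
    # after step T, decay_factor**(T + 1) * max_reward / (1 - decay_factor),
    # is no more than epsilon.
    tail = max_reward / (1 - decay_factor)
    depth = 0
    while decay_factor ** (depth + 1) * tail > epsilon:
        depth += 1
    return depth

# Example: with a decay factor of 0.9, a max single-step reward of 10,
# and epsilon of 0.01, the controller needs to be planned ~87 steps deep.
print(epsilon_optimal_depth(0.9, 10, 0.01))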

Dec-POMDPs have to track the entire decision tree used so far in order to optimally account for the belief states of the other agents (even if you are back at belief X, your partners may hold any of a variety of belief states).

The Algorithm

This is a pseudocode rendering of the algorithm presented as Algorithm 7.5 of Cooperative Decision Making.

# In algorithms, epsilon usually represents "the smallest difference
# that matters"; here it effectively determines how deep the
# controller is planned out.
def dec_policy_iteration(initial_policy, decay_factor, epsilon):
    # How many steps deep the controller has been planned so far.
    depth = 0
    # Instead of being a decision tree, the controller
    # is a state machine (a finite-state controller).
    controller = None

    # The policy at each planning depth, starting from the initial one.
    policy = {0: initial_policy}
    tweaked_policy = {}
 
    while decay_factor ** (depth + 1) * get_max_reward() \
        / (1 - decay_factor) > epsilon:
        # Keep growing the controller until rewards have decayed
        # enough that even the best possible move no longer matters.
 
        # Exhaustive backup: add every possible new node for one step.
        policy[depth + 1] = create_all_possible_children(policy[depth])

        # Evaluate the backed-up controller before pruning.
        expected_value = calculate_expected_value(
            policy[depth + 1],
            depth)
        tweaked_policy[depth + 1] = policy[depth + 1]
        while True:
            # Prune dominated nodes for each agent in turn, then
            # re-evaluate the controller.
            for agent in get_all_agents():
                prune_policy(tweaked_policy, agent)
                update_controller(tweaked_policy, agent)
                calculate_expected_values(tweaked_policy)
            if tweaked_policy[depth + 1] == policy[depth + 1]:
                # Pruning changed nothing this pass; the controller
                # for this depth has converged.
                break
            policy[depth + 1] = tweaked_policy[depth + 1]
        depth += 1
    return policy[depth]
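
To make the "state machine" comment above concrete, here is one possible node representation for such a finite-state controller (my own illustrative sketch, not the book's notation; the action and observation names are placeholders): each node commits to an action and maps every observation the agent might receive to a successor node.

from dataclasses import dataclass, field

@dataclass
class ControllerNode:
    # The action this node commits the agent to taking.
    action: str
    # For each observation the agent might receive, the node of the
    # controller to move to next.
    transitions: dict = field(default_factory=dict)

# A tiny two-node controller: keep listening until something is heard,
# then act.
listen = ControllerNode(action="listen")
act = ControllerNode(action="open_door")
listen.transitions = {"hear_nothing": listen, "hear_something": act}
act.transitions = {"hear_nothing": listen, "hear_something": listen}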

The exit condition for the loop in this case is deterministic and, instead of being based on the controller itself, is based on the single highest-valued one-step state-action in the given graph. This bounds the biggest change that could come from adding another step to a horizon-T controller.

How are controllers made?

A controller is made by taking the current controller and adding every possible new node to it for one step (an exhaustive backup). This is, as the name suggests, a very exhaustive activity. These new nodes tie into the nodes of the original controller at each step. A backup implies we are building the “top” of the tree, i.e., the part executed starting from an initial state.
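
As a rough, single-agent sketch of what create_all_possible_children might be doing under the hood (the helper name and the toy actions/observations below are placeholders, not the book's): every combination of an action and a mapping from observations to existing nodes becomes a candidate new node, which is why a single backup adds len(actions) * len(nodes) ** len(observations) nodes per agent.

from itertools import product

def exhaustive_backup(actions, observations, existing_nodes):
    # One candidate node per (action, observation -> existing-node mapping)
    # combination: len(actions) * len(existing_nodes) ** len(observations)
    # candidates in total.
    new_nodes = []
    for action in actions:
        for successors in product(existing_nodes, repeat=len(observations)):
            transitions = dict(zip(observations, successors))
            new_nodes.append((action, transitions))
    return new_nodes

# With 2 actions, 2 observations, and a 3-node controller, a single
# backup generates 2 * 3**2 = 18 candidate nodes.
candidates = exhaustive_backup(["listen", "open_door"],
                               ["hear_left", "hear_right"],
                               ["n0", "n1", "n2"])
print(len(candidates))  # 18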
