SUMMARY OF: ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search

- arXiv:
`1811.02696 [cs.LG]`

**Source**: [1] S. Zhang and H. Yao, “ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search,” Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, pp. 5789–5796, Jul. 2019.

**Literature Review** by S. Seal [sayani.seal@mail.mcgill.ca]

Edited by Q. Dang and D. Wu [di.wu5@mail.mcgill.ca]

**Disclaimer:** *This website contains copyrighted material, and its use is not always specifically authorized by the copyright owner. Take all necessary steps to ensure that the information you receive from this post is correct and verified.*

#### 1. Motivation

In this paper [1], an actor-ensemble learning algorithm is presented for continuous action spaces. It establishes improved performance over the Deep Deterministic Policy Gradient (DDPG) continuous control algorithm presented in [2]. For deterministic policies parameterized by neural networks, the state-action value function is often maximized by gradient ascent, which can get stuck at a local maximum. In [1], an actor ensemble is used to mitigate this problem: several actors search for maxima in parallel, each in a localized manner starting from a different initialization. This increases the likelihood of reaching the global maximum and thereby the effectiveness of the function approximator.
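The core idea can be illustrated with a toy one-dimensional Q-function (the function, actor count, and step sizes below are illustrative stand-ins, not taken from the paper): gradient ascent from a single bad initialization stalls at the lower peak, while an ensemble of starts finds the global one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "Q-function" with two local maxima; gradient ascent from a bad
# start can stall at the lower peak near a = 2.
def q(a):
    return np.exp(-(a - 2.0) ** 2) + 1.5 * np.exp(-(a + 2.0) ** 2)

def grad_ascent(a, steps=200, lr=0.1):
    # Numeric gradient ascent, standing in for backprop through the critic.
    for _ in range(steps):
        eps = 1e-4
        a += lr * (q(a + eps) - q(a - eps)) / (2 * eps)
    return a

# Ensemble: several "actors" start from different initializations; the
# action actually executed is the proposal with the highest Q-value.
starts = rng.uniform(-4.0, 4.0, size=5)
candidates = np.array([grad_ascent(a) for a in starts])
best = candidates[np.argmax(q(candidates))]
```

With five random starts, at least one lands in the basin of the global maximum near a = -2, so the ensemble's argmax recovers it even though some individual actors converge to the inferior peak.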

#### 2. Main Contributions

i. ACE, the actor-ensemble continuous control algorithm presented in this paper, is established to be a special class of the option-critic architecture [3].

ii. ACE extends the option-critic framework to deterministic intra-option policies. A deterministic intra-option policy gradient theorem and a corresponding termination policy gradient theorem are proposed.

iii. The TreeQN [4] look-ahead tree search method is extended to continuous action spaces and implemented in ACE to improve the estimation of the state-action value function. The multiple actors in ACE are used as meta-actions to perform tree search with a learned value prediction model.

iv. ACE is shown to outperform the baseline DDPG algorithm, along with several modified variants of it (Wide-DDPG, Shared-DDPG, Ensemble-DDPG, etc.), on continuous control tasks in Roboschool.
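As a sketch of contribution ii (the notation below is assumed for illustration, not copied from the paper), in the case where every option terminates at each step, the per-actor update reduces to the familiar deterministic policy gradient [2]: each actor $\mu_{\theta_i}$ ascends the shared critic,

```latex
\nabla_{\theta_i} J \;\approx\; \mathbb{E}_{s}\!\left[\,\nabla_{\theta_i}\,\mu_{\theta_i}(s)\;\nabla_a Q(s,a)\big|_{a=\mu_{\theta_i}(s)}\right],
```

while the action actually executed is the ensemble proposal with the highest Q-value, $\arg\max_i Q\!\left(s, \mu_{\theta_i}(s)\right)$.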

#### 3. METHOD

**A. The option-critic architecture (OCA), a policy gradient method proposed by Bacon et al. [3], is the basis of the ACE algorithm for continuous control.**

i. The OCA proposes a stochastic policy gradient presented in the framework of an option-augmented Markov Decision Process (MDP), known as a semi-MDP. The ACE algorithm extends OCA to the deterministic policy gradient framework.

ii. Derives a deterministic intra-option policy gradient theorem and a corresponding termination gradient theorem, which follow from the intra-option policy gradient theorem presented in the OCA paper [3]. The corresponding algorithm is named OCAD.

iii. By setting the termination function to 1, i.e., assuming that each intra-option policy terminates at every time step, ACE is shown to be a special case of the OCAD architecture.

iv. As in DDPG, exploration noise is added to each action and an experience-replay buffer is used in the ACE architecture. ACE also uses target networks, similar to DDPG. This leads to off-policy learning for the proposed neural-network-based function approximator.
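These two DDPG-style mechanics can be sketched as follows (the noise scale and τ below are conventional DDPG-style defaults, not necessarily the paper's values):

```python
import numpy as np

rng = np.random.default_rng(1)

def behavior_action(actor, state, noise_std=0.1):
    """Exploration: deterministic action plus Gaussian noise, clipped to
    the [-1, 1] action bounds used in Roboschool."""
    a = np.asarray(actor(state)) + rng.normal(0.0, noise_std,
                                              size=np.shape(actor(state)))
    return np.clip(a, -1.0, 1.0)

def soft_update(target_params, online_params, tau=0.001):
    """Target-network update as in DDPG: theta' <- tau*theta + (1-tau)*theta'."""
    return [tau * w + (1.0 - tau) * tw
            for w, tw in zip(online_params, target_params)]
```

The slowly moving target network stabilizes the bootstrapped critic targets, while the clipped noisy behavior policy generates the off-policy transitions stored in the replay buffer.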

**B. Model-Based Enhancement**

i. To improve the state-action value function used as the critic, as well as the actors in the actor-ensemble network, TreeQN [4] is extended to continuous action spaces to perform a look-ahead tree search and learn with a value prediction model.
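A minimal sketch of this backup follows (every function here is a stand-in for a learned network; the names and the toy scalar model in the usage line are assumptions, not the paper's):

```python
def tree_q(latent, actors, transition, reward, value, depth, gamma=0.99):
    """Depth-limited lookahead in latent space: each actor's proposed action
    serves as a meta-action; estimates are backed up as r + gamma * (best child)."""
    if depth == 0:
        return value(latent)
    backups = []
    for actor in actors:
        a = actor(latent)  # meta-action proposed by this actor
        backups.append(reward(latent, a)
                       + gamma * tree_q(transition(latent, a),
                                        actors, transition, reward, value,
                                        depth - 1, gamma))
    return max(backups)

# Toy scalar model: reward = action, value = latent, transition = latent + action.
q_est = tree_q(0.0, [lambda s: 1.0, lambda s: 2.0],
               transition=lambda s, a: s + a,
               reward=lambda s, a: a,
               value=lambda s: s,
               depth=1)
```

At depth 0 the recursion bottoms out in the learned value prediction head, which is what distinguishes this from a classical full-width search over the raw action space.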

#### 4. DATASET

Code is available at:

#### 5. RESULT SUMMARIZATION

#### A. Experiment Setup

i. 12 continuous control tasks from Roboschool by OpenAI.

ii. States: joint information of a robot, presented as a vector of real numbers.

Actions: a vector, with each dimension in [-1, 1].

#### B. ACE architecture

i. Latent state ∈ R^400

ii. Encoding function: Single neural network layer

iii. Reward prediction function: Single neural network layer

iv. Transition function: Single neural network layer, residual connection used

v. Value prediction function: Single neural network layer

vi. Inputs: 400-dimensional latent state + m-dimensional action; 300 hidden units

vii. Common first layer for actors with 400 latent state inputs and 300 hidden units

viii. Activation function: tanh

ix. Number of actors: 5; planning depth: 1.
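The module shapes listed above can be sketched as follows (the observation dimension 17 and action dimension 6 are illustrative placeholders, and the random weights stand in for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, LATENT, HIDDEN, ACT_DIM, N_ACTORS = 17, 400, 300, 6, 5

# Single-layer modules with tanh activations, as described above.
W_enc = rng.normal(scale=0.01, size=(LATENT, OBS))
W_r   = rng.normal(scale=0.01, size=(1, LATENT + ACT_DIM))
W_tr  = rng.normal(scale=0.01, size=(LATENT, LATENT + ACT_DIM))
W_q1  = rng.normal(scale=0.01, size=(HIDDEN, LATENT + ACT_DIM))
W_q2  = rng.normal(scale=0.01, size=(1, HIDDEN))
W_pi1 = rng.normal(scale=0.01, size=(HIDDEN, LATENT))      # shared actor layer
W_pi2 = [rng.normal(scale=0.01, size=(ACT_DIM, HIDDEN))    # one head per actor
         for _ in range(N_ACTORS)]

def encode(obs):          return np.tanh(W_enc @ obs)
def pred_reward(s, a):    return float(W_r @ np.concatenate([s, a]))
def transition(s, a):     return s + np.tanh(W_tr @ np.concatenate([s, a]))  # residual
def q_value(s, a):        return float(W_q2 @ np.tanh(W_q1 @ np.concatenate([s, a])))
def actor(i, s):          return np.tanh(W_pi2[i] @ np.tanh(W_pi1 @ s))

s = encode(rng.normal(size=OBS))
a = actor(0, s)
```

Note how the actors and the critic both consume the same 400-dimensional latent state, which is the shared representation the baselines in the next section try to match.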

#### C. Baseline: ACE architecture is compared with

i. ACE-Alt: Selected actors in ACE are updated at each time step

ii. DDPG algorithm [2]: Instead of using *ReLU* with *L2* regularization on the critic, *tanh* is used as the activation function, without *L2* regularization.

iii. Wide-DDPG: To create an equivalent number of parameters to ACE, the number of hidden units in DDPG was doubled (i.e., two hidden layers with 800 and 600 units).

iv. Shared-DDPG: The actor and critic share a common bottom layer, unlike in the standard DDPG algorithm. This creates a scenario comparable to ACE, where the actors and critic share a common representation layer, i.e., the encoder of the TreeQN-based Q-value estimator.

v. Ensemble-DDPG: Tree search is removed from ACE by setting the planning depth to 0, yielding an ensemble DDPG setup.

vi. Transition-model ACE (TM-ACE): The impact of the TreeQN-based value prediction model is investigated by learning a transition model instead of a value prediction model.

vii. Different combinations of ensemble size and planning depth are compared based on performance.

#### D. Results:

Table 1 from [1] shows the performance of ACE in comparison with the baseline.

The algorithm was trained for 1 million steps. Every 10 thousand steps, 20 deterministic evaluation episodes were performed without exploration noise and the mean episode return was calculated. The performance improvement of ACE over DDPG is attributed to the actor-ensemble architecture and the look-ahead tree search technique for the actor and critic network updates.
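The evaluation protocol can be sketched as below (`ToyEnv` is a made-up stand-in so the loop is runnable; the actual experiments use Roboschool environments):

```python
class ToyEnv:
    """Made-up environment: 10 steps per episode, reward 1 per step."""
    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, self.t >= 10   # next state, reward, done

def evaluate(env, policy, episodes=20):
    """Deterministic evaluation: run the policy with no exploration noise
    and report the mean episode return over the evaluation episodes."""
    returns = []
    for _ in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, r, done = env.step(policy(state))
            total += r
        returns.append(total)
    return sum(returns) / len(returns)

mean_return = evaluate(ToyEnv(), policy=lambda s: 0.0)
```

In the paper's setup this routine would be invoked every 10 thousand training steps, with the deterministic ensemble-argmax action in place of the placeholder policy.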

Figure 2 from [1] shows the trade-off in the choice of the number of actors and the planning depth. Since in ACE the actors and critic share the encoded latent-state function approximator, increasing the number of actors lets the actors dominate the training of the latent states, which negatively impacts the critic network.

The value prediction model encounters compounding error when unrolled. As the planning depth increases, this error propagates through the network and degrades performance. Hence both the planning depth and the number of actors need to be moderated. Here, 5 actors and a planning depth of 1 achieve the best result, as shown in the figure.

#### 6. SUGGESTED FUTURE WORK

The actor-ensemble architecture can be applied to train predictive models for continuous control applications.

*REFERENCES*:

[1] S. Zhang and H. Yao, “ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search,” *Proc. AAAI Conf. Artif. Intell.*, vol. 33, no. 01, pp. 5789–5796, Jul. 2019.

[2] T. P. Lillicrap *et al.*, “Continuous control with deep reinforcement learning,” *CoRR*, vol. abs/1509.02971, Feb. 2016.

[3] P.-L. Bacon, J. Harb, and D. Precup, “The Option-Critic Architecture,” in *Proc. AAAI Conf. Artif. Intell.*, 2017.

[4] G. Farquhar, T. Rocktäschel, M. Igl, and S. Whiteson, “TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning,” *arXiv preprint arXiv:1710.11417*, 2017.

[5] V. Mnih *et al.*, “Human-level control through deep reinforcement learning.,” *Nature*, vol. 518, no. 7540, pp. 529–533, Feb. 2015.

[6] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in *Advances in Neural Information Processing Systems*, 2000.

————————–

<*End of Review*>