SUMMARY OF: ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search

  • arXiv:1811.02696 [cs.LG]
  • Source: [1] S. Zhang and H. Yao, “ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search,” Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, pp. 5789–5796, Jul. 2019.

Literature Review by S.Seal

Edited by Q.Dang, D.Wu

Disclaimer: This website contains copyrighted material, and its use is not always specifically authorized by the copyright owner. Take all necessary steps to ensure that the information you receive from the post is correct and verified.

1.   Motivation

This paper [1] presents an actor-ensemble learning algorithm for continuous action spaces. It establishes improved performance over the Deep Deterministic Policy Gradient (DDPG) continuous control algorithm presented in [2]. For deterministic policies parameterized by neural networks, the state-action value function is often maximized by gradient ascent, which can get stuck at a local maximum. In [1], an actor ensemble is used to mitigate this problem: several actors search for the maximum in parallel, each in a localized manner and starting from a different initialization. This increases the likelihood of reaching the global maximum, thereby improving the efficiency of the function approximator.

2.   Main Contributions

i. ACE, the actor-ensemble continuous control algorithm presented in this paper, is shown to be a special case of the option-critic architecture [3].
ii. ACE extends the option-critic framework to deterministic intra-option policies. A deterministic intra-option policy gradient theorem and a corresponding termination policy gradient theorem are proposed.
iii. The TreeQN [4] look-ahead tree search method is extended to continuous action spaces and implemented in ACE to improve the estimation of the state-action value function. The multiple actors in ACE are used as meta-actions to perform a tree search with a learned value prediction model.
iv. ACE is compared with, and shown to outperform, the baseline DDPG algorithm and several modified implementations of it (Wide-DDPG, Shared-DDPG, Ensemble-DDPG, etc.) on Roboschool tasks.


A. The policy gradient method known as the option-critic architecture (OCA), proposed by Bacon et al. [3], is the basis of the ACE algorithm for continuous action control.

i. The OCA proposes a stochastic policy gradient, presented within the framework of an option-augmented Markov Decision Process (MDP), known as a semi-MDP. The ACE algorithm extends OCA to the deterministic policy gradient framework.
ii. A deterministic intra-option policy gradient theorem and a corresponding termination gradient theorem are derived, following the intra-option policy gradient theorem presented in the OCA paper [3]. The corresponding algorithm is named OCAD.
iii. By setting the termination function to 1, i.e., assuming that each option terminates at every time step, ACE is shown to be a special case of the OCAD architecture.
iv. As in DDPG, exploration noise is added to each action, an experience-replay buffer is used, and a target network is maintained. This leads to off-policy learning for the proposed neural-network-based function approximator.
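
The actor-ensemble idea described above can be illustrated with a toy example. The sketch below is not the ACE implementation: it assumes a hypothetical one-dimensional critic with one local and one global maximum, and it represents each "actor" as a single action proposal refined by gradient ascent rather than as a neural network. It only shows why several differently initialized searches, followed by picking the proposal with the highest critic value, are less likely to end in a local maximum than a single search.

```python
import numpy as np

def q_value(a):
    # Hypothetical critic: local maximum near a = -0.5, global maximum near a = 0.6.
    return np.exp(-20 * (a + 0.5) ** 2) + 2.0 * np.exp(-20 * (a - 0.6) ** 2)

def q_grad(a):
    # Analytic derivative of q_value with respect to the action.
    return (-40 * (a + 0.5) * np.exp(-20 * (a + 0.5) ** 2)
            - 80 * (a - 0.6) * np.exp(-20 * (a - 0.6) ** 2))

def ascend(a0, lr=0.01, steps=500):
    # Plain gradient ascent on the critic, keeping the action inside [-1, 1].
    a = a0
    for _ in range(steps):
        a = np.clip(a + lr * q_grad(a), -1.0, 1.0)
    return a

starts = np.linspace(-0.9, 0.9, 5)                 # five differently initialized "actors"
proposals = np.array([ascend(a0) for a0 in starts])
best = proposals[np.argmax(q_value(proposals))]    # ensemble: keep the best proposal
single = ascend(starts[0])                         # single actor: stuck near -0.5

print(f"single actor: a = {single:.2f}, Q = {q_value(single):.2f}")
print(f"ensemble:     a = {best:.2f}, Q = {q_value(best):.2f}")
```

In this toy setting the single search started at -0.9 settles on the local maximum, while the ensemble recovers the global one because at least one of its starting points lies in the right basin.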

B. Model Based Enhancement
i. To improve the state-action value function used by the critic and by the actors in the actor-ensemble network, TreeQN [4] is extended to continuous control and used to perform a look-ahead tree search with a learned value prediction model.


Codes are available in:


A.      Experiment Setup

i.            12 continuous control tasks from Roboschool by OpenAI.

ii.            States: Joint information of the robot, presented as a vector of real numbers

Actions: Vector with each component in [-1, 1]

B.      ACE architecture

i.            Latent state: vector in ℝ^400

ii.            Encoding function: Single neural network layer

iii.            Reward prediction function: Single neural network layer

iv.            Transition function: Single neural network layer, residual connection used

v.            Value prediction function: Single neural network layer

vi.            Inputs: 400-dimensional latent state plus m-dimensional action; 300 hidden units

vii.            Common first layer for the actors, with the 400-dimensional latent state as input and 300 hidden units

viii.            Activation function: tanh

ix.            Number of actors: 5; planning depth: 1.
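
The layer sizes listed above can be made concrete with a small forward-pass sketch. This is a minimal illustration under stated assumptions, not the authors' code: the state and action dimensions are hypothetical, the weights are random rather than trained, and the critic is a plain stand-in on the concatenated latent state and action (roughly planning depth 0), so the reward, transition and tree-search parts are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM = 17, 6            # hypothetical task sizes, not taken from [1]
LATENT, HIDDEN, N_ACTORS = 400, 300, 5   # sizes listed in the review

def dense(n_in, n_out):
    """Random weights and zero bias for one fully connected layer."""
    return rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out)), np.zeros(n_out)

W_enc, b_enc = dense(STATE_DIM, LATENT)            # encoding function (single layer)
W_sh, b_sh = dense(LATENT, HIDDEN)                 # first layer shared by all actors
heads = [dense(HIDDEN, ACTION_DIM) for _ in range(N_ACTORS)]
W_q1, b_q1 = dense(LATENT + ACTION_DIM, HIDDEN)    # stand-in critic (assumption)
W_q2, b_q2 = dense(HIDDEN, 1)

def encode(state):
    return np.tanh(state @ W_enc + b_enc)

def propose_actions(latent):
    # One proposal per actor; tanh keeps every component in [-1, 1].
    hidden = np.tanh(latent @ W_sh + b_sh)
    return np.stack([np.tanh(hidden @ W + b) for W, b in heads])

def q_value(latent, action):
    hidden = np.tanh(np.concatenate([latent, action]) @ W_q1 + b_q1)
    return float((hidden @ W_q2 + b_q2)[0])

state = rng.normal(size=STATE_DIM)
latent = encode(state)
actions = propose_actions(latent)                  # shape (5, ACTION_DIM)
scores = [q_value(latent, a) for a in actions]
chosen = actions[int(np.argmax(scores))]           # proposal with the highest value
```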

C.       Baseline: ACE architecture is compared with

i.            ACE-Alt: Only the selected actor in ACE is updated at each time step.

ii.            DDPG algorithm [2]: tanh is used as the activation function without L2 regularization, instead of ReLU with L2 regularization on the critic.

iii.            Wide-DDPG: To give DDPG a number of parameters comparable to ACE, the number of hidden units is doubled (i.e., two hidden layers with 800 and 600 units).

iv.            Shared-DDPG: Unlike in standard DDPG, the actor and critic share a common bottom layer. This makes the comparison with ACE fairer, since in ACE the actor and critic share a common representation layer, i.e. the encoding function of the TreeQN-based Q-value estimator.

v.            Ensemble-DDPG: Tree search is removed from ACE by setting the planning depth to 0, which yields an ensemble DDPG setup.

vi.            Transition model ACE (TM-ACE): The impact of the TreeQN-based value prediction model is investigated by learning a transition model instead of a value prediction model.

vii.            Different combinations of the number of actors and the planning depth are compared in terms of performance.

D.      Results:

Table 1 from [1] shows the performance of ACE in comparison with the baseline.

The algorithm was trained for 1 million steps. Every 10 thousand steps, 20 deterministic evaluation episodes were performed without exploration noise and the mean episode return was calculated. The performance improvement of ACE over DDPG is attributed to the actor-ensemble architecture and to the look-ahead tree search used in the actor and critic network updates.

Figure 2 from [1] shows the trade-off in the choice of the number of actors and the planning depth. Since the actors and the critic in ACE share the encoded latent-state function approximator, increasing the number of actors lets the actors dominate the training of the latent states, which negatively impacts the critic network.

        The value prediction model accumulates compounding errors during unrolling. As the planning depth increases, the error propagates through the network and degrades performance. Hence both the planning depth and the number of actors need to be moderated. Here, 5 actors and planning depth 1 achieve the best result, as shown in the figure.


Actor ensemble architecture can be implemented to train predictive models for continuous control applications.


[1]  S. Zhang and H. Yao, “ACE: An Actor Ensemble Algorithm for Continuous Control with Tree Search,” Proc. AAAI Conf. Artif. Intell., vol. 33, no. 01, pp. 5789–5796, Jul. 2019.

[2]  T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, Feb. 2016.

[3]  P.-L. Bacon, J. Harb, and D. Precup, “The Option-Critic Architecture,” Sep. 2016.

[4]  G. Farquhar, T. Rocktäschel, M. Igl, and S. Whiteson, “TreeQN and ATreeC: Differentiable tree-structured models for deep reinforcement learning,” arXiv preprint, 2017.

[5]  V. Mnih et al., “Human-level control through deep reinforcement learning.,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.

[6]  R. S. Sutton, D. A. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems (NIPS), 2000.


<End of Review>


Written by,

Sayani Seal

Review on: Deep Reinforcement Learning with POMDPs

  • Source: M. Egorov, “Deep reinforcement learning with POMDPs,” Stanford Univ., Stanford, CA, USA, Tech. Rep., 2015.

Literature Review by J.Samiuddin

Edited by Q.Dang, D.Wu


<End of Review>


Written by,

Jilan Samiuddin

SUMMARY OF: Deep Model Predictive Control with Online Learning for Complex Physical System

  • arXiv:1905.10094v1 [cs.LG]
  • Source: (May 2019).

Literature Review by S.Seal

Edited by Q.Dang, D.Wu

1.   Motivation

Flow control is required in many application fields, such as energy, transportation, health and security. Although fluid flow exhibits high-dimensional, multi-layer physics and nonlinear system characteristics, it can be approximated by a few of its dominant low-dimensional features. Since the performance of a model predictive controller (MPC) depends significantly on the accuracy of its system prediction model, intractably complex systems make it difficult to design such a controller, even though MPC is otherwise well suited to the application. The article [1] presents a DeepMPC controller in which sensor-based, observable, low-rank system states are used to build a data-driven predictive system model, based on a recurrent neural network (RNN), for real-time MPC of fluid flow.

2.   Main Contributions

i.            The DeepMPC architecture is implemented for a complex fluid flow system exhibiting broadband phenomena.

ii.            Instead of assuming access to the full system state, the “surrogate” predictive system model uses only observable system states for future prediction. The method thus achieves a trade-off between accuracy and efficiency in capturing the essential physical mechanisms of the system.

iii.            The proposed learning approach for the RNN uses limited past information from the sensors.

Figure 1: DeepMPC with surrogate RNN prediction model presented in [1].


A.      DeepMPC

i.            Finite-horizon open-loop control problem with quadratic cost. Penalties are assigned to the deviation from the reference trajectory, to the control input, and to the variation in the control input. The last of the three restricts sudden changes in the control input.

ii.            A surrogate system state prediction model, based on a deep RNN architecture, is built from control-relevant, observable, sensor-based system states. For this flow control model, the states are the lift and the drag.

1.       RNN based predictive model design:

a.       Decoder:

i.        Performs actual prediction task

ii.        N-cell for N time steps in the prediction horizon

b.       Encoder:

i.        Predicts latent states and thereby accounts for long-term dynamics.

2.       The RNN-based MPC problem is solved using a gradient-based optimization method.

3.       The gradient information with respect to the control inputs is calculated using backpropagation-through-time.

 iii.            Training RNN:

1.       Offline three-stage training [2] with time-series data of the observable system states.

2.       The training data, i.e., time series of the lift and the drag, are generated using a random but continuously varying control sequence for the rotation applied to the cylinder(s).
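
The quadratic cost with its three penalties, as described in item A.i above, can be sketched as follows. The weights and the scalar-input setting are assumptions made for illustration; the review does not give the values used in [1].

```python
import numpy as np

def mpc_cost(y_pred, y_ref, u, u_prev, w_track=1.0, w_u=0.1, w_du=0.5):
    """Quadratic MPC cost over one prediction horizon (scalar control input).

    y_pred, y_ref : predicted and reference outputs over the horizon
    u             : control inputs over the horizon
    u_prev        : control input applied just before the horizon starts
    w_*           : hypothetical weights, not taken from the paper
    """
    y_pred, y_ref, u = (np.asarray(v, dtype=float) for v in (y_pred, y_ref, u))
    tracking = np.sum((y_pred - y_ref) ** 2)     # deviation from the reference
    effort = np.sum(u ** 2)                      # size of the control input
    du = np.diff(np.concatenate([[u_prev], u]))  # change between successive inputs
    variation = np.sum(du ** 2)                  # discourages sudden input changes
    return w_track * tracking + w_u * effort + w_du * variation

ref = [0.0, 0.0, 0.0]
smooth = mpc_cost(ref, ref, u=[1.0, 1.0, 1.0], u_prev=1.0)
jumpy = mpc_cost(ref, ref, u=[1.0, -1.0, 1.0], u_prev=1.0)
print(smooth, jumpy)
```

Both input sequences have the same magnitude and perfect tracking, so the difference between the two costs comes entirely from the third penalty on input variation.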

4.   Result Summarization

A.      Setup:

A detailed simulation model of the full system is used instead of a real physical system. It is solved by OpenFOAM solver using finite volume discretization.

B.      Experiments:

Objective: The objective is to control the cylinder(s) such that

Four flow (laminar regime) control models with different complexity levels are considered:

i.            One cylinder: Flow around a single cylinder

1.       RNN prediction was evaluated on an exemplary control input sequence and was accurate for both lift and drag, except for a very short period at the start of the experiment.

2.       Tracking control was demonstrated successfully: a scheduled lift sequence was maintained for 20 sec with a bounded rotation control input.

3.       The Reynolds number (Re) is assumed to be 100.

4.       Training dataset:

a.       A random rotation between -2 and +2 is chosen every 0.5 sec. Thus, high input frequencies are avoided.

b.       Intermediate control inputs are computed by spline interpolation every 0.1 sec.

c.       A time series of 110 000 data points, corresponding to 11 000 sec, is used for RNN training.

ii.            Fluidic Pinball: Control the flow around three cylinders, two of which can be rotated while the third is fixed, as shown in Figure 2.

Figure 2: The system is controlled by rotating cylinders 1 and 2 at their respective angular velocities [1].

1.       Objective is to follow three given lift trajectories for each cylinder by rotating cylinders 1 and 2.

2.       One Reynolds number is considered as the base case; two other, chaotic cases at different Reynolds numbers are also analyzed.

3.       Training dataset:

a.       A random rotation between -2 and +2 is chosen for each cylinder every 0.5 sec.

b.       Intermediate control inputs are computed by spline interpolation every 0.005 sec.

c.       Time series with 150 000, 200 000 and 800 000 data points are used for the three Reynolds numbers, respectively.

4.       To improve performance for the two more chaotic cases, knowledge of the physical system characteristics is exploited by adding symmetric input data and the corresponding lift data, mirrored along the horizontal axis. This reduces the tracking error by 50%.

5.       Robustness is tested by performing five identical experiments using 10%, 15% and 100% of the symmetrized training data points. No trend is observed with respect to the amount of training data.

Figure 3: DeepMPC lift tracking performance for laminar flow around rotating cylinders [1].

Figure 4: Re = 100 with online update [1].

6.       Finally, online data are collected from the feedback loop at each time step, with new data gathered over 25 sec for each update. The 500 data points in each interval are used to further train the RNN surrogate model. This significantly improves the performance of DeepMPC compared with panel (a) of Figure 3 [1]. The online update of the RNN reduces both the tracking error and the control cost.
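
The training input described for the one-cylinder case (a random rotation in [-2, +2] every 0.5 sec, spline-interpolated to 0.1 sec) can be sketched as below. The review does not state which spline is used in [1]; a Catmull-Rom cubic spline is assumed here as one possible choice, and the function name and seed are hypothetical.

```python
import numpy as np

def random_control_sequence(duration_s, knot_dt=0.5, sample_dt=0.1, u_max=2.0, seed=0):
    """Random, smoothly varying control input sampled every sample_dt seconds."""
    rng = np.random.default_rng(seed)
    per_knot = int(round(knot_dt / sample_dt))        # samples per knot interval
    n_intervals = int(round(duration_s / knot_dt))
    knots = rng.uniform(-u_max, u_max, size=n_intervals + 1)
    slopes = np.gradient(knots)                       # Catmull-Rom tangents per interval

    k = np.arange(n_intervals * per_knot)             # sample index
    i = k // per_knot                                 # interval each sample falls in
    s = (k % per_knot) / per_knot                     # position inside the interval, in [0, 1)

    # Cubic Hermite basis functions.
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    u = h00 * knots[i] + h10 * slopes[i] + h01 * knots[i + 1] + h11 * slopes[i + 1]
    return k * sample_dt, u, knots

t, u, knots = random_control_sequence(duration_s=10.0)
print(len(u), "samples; first values:", np.round(u[:6], 3))
```

With duration_s=11000.0 the same call yields 110 000 samples, matching the data set size quoted above. A cubic spline can slightly overshoot the knot range, so the result may need clipping if the bound of ±2 must hold strictly.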


The surrogate RNN prediction model proposed for DeepMPC in this article can be applied to many practical engineering problems in which the complete system description is too complicated and makes the related control problems very difficult to solve. The method can be used for system modelling with targeted observable states that predominantly define the system behaviour. This improves the real-time implementation of MPC for complex nonlinear systems.


[1]    K. Bieker, S. Peitz, S. L. Brunton, J. N. Kutz, and M. Dellnitz, “Deep model predictive control with online learning for complex physical systems,” arXiv preprint arXiv:1905.10094, 2019.

[2]    I. Lenz, R. Knepper, and A. Saxena, “DeepMPC: Learning Deep Latent Features for Model Predictive Control,” in Robotics: Science and Systems XI, 2015.


<End of Review>


Written by,

Sayani Seal


REVIEW ON: Markov Chain Monte Carlo Simulation of Electric Vehicle Use for Network Integration Studies

  • Source: [1] Y. Wang, D. Infield, Markov Chain Monte Carlo simulation of electric vehicle use for network integration studies, International Journal of Electrical Power & Energy Systems, Vol.99, 2018, Pages 85-94

Literature Review by Q.Dang

Edited by D.Wu

1. Paper Motivation

As the penetration of electric vehicles (EVs) increases, their patterns of use need to be well understood for future system planning and operation. Using data with a 10-minute resolution, accurate driving patterns were generated by a Markov Chain Monte Carlo (MCMC) simulation.
However, previous MCMC simulation work was incomplete in that model results were not verified, and uncertainty analysis for practical network assessment was not undertaken. The present paper includes both of these important elements.

2. Methods

Method Name: Time-inhomogeneous Markov Chain Monte Carlo (MCMC) simulation

Description: EV movement was simulated using a discrete-state, discrete-time Markov chain that defines the state of every EV at each time step of T minutes. It was assumed that, at every time step, one and only one event from a finite set of events can occur for a given EV.

Four events were considered: {D, H, W, C}, corresponding to ‘driving’, ‘parking at home’, ‘parking at workplace’, and ‘parking at commercial areas’, respectively.

Proposed Markov Chain Diagram:

Fig. 1. Markov Chain diagram of possible vehicle state transitions at time t

From time step t-1 to t, a transition probability is given for each possible transition at that specific time step. For instance, P^t_{H→D} denotes the probability of the vehicle being in state ‘D’ at time t given that it was in state ‘H’ at time t-1.
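
A time-inhomogeneous chain of this kind can be sampled as in the sketch below. The transition matrices here are random placeholders, not values fitted to the TUS data, and the function name and array layout are assumptions made for illustration.

```python
import numpy as np

STATES = ["D", "H", "W", "C"]   # driving, home, workplace, commercial area

def simulate_day(transition, start_state, rng):
    """Sample one state path from a time-inhomogeneous Markov chain.

    transition[k, i, j] is the probability of moving from state i to state j
    over the k-th time step, so every row transition[k, i] sums to 1.
    """
    state = start_state
    path = [state]
    for P_t in transition:                        # one matrix per time step
        state = int(rng.choice(len(STATES), p=P_t[state]))
        path.append(state)
    return path

rng = np.random.default_rng(0)
n_steps = 144                                     # 24 h at a 10-minute resolution
# Placeholder matrices: random rows that sum to 1, NOT fitted to any survey data.
transition = rng.dirichlet(np.ones(len(STATES)), size=(n_steps, len(STATES)))
path = simulate_day(transition, start_state=STATES.index("H"), rng=rng)
print("".join(STATES[s] for s in path[:12]))
```

Repeating simulate_day for many vehicles and days gives the ensemble of use patterns from which load profiles can then be derived.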

3. Paper structure

1) Review of previous Markov chain simulations of electric vehicles

2) Introduce the survey data, the 2000 UK Time of Use Survey (TUS) data

3) A matrix representation of the transition diagram at time t, Tt, is shown by Eq. (1)

An example of the state transition matrix at 8:40 am (t = 29, t0=4am, 4am+29*10min=8:50am) is shown in Eq. (2),

Verification of proposed MCMC method by convergence analysis.

4) Distribution grid case study by OpenDSS software (Case 1 commercial, Case 2 residential).

Case 1: A university building at Strathclyde that accommodates up to 300 workers and has nominal parking for approximately 100 cars. The building is supplied by a dedicated 1000 kVA transformer.

Case 2: A low-voltage single-phase domestic network consisting of 17 households.

Fig. 2. Case 2 Single phase distribution network layout.

3. Paper Results

Results Description: 24-hour load (kVA) profile in the grids, before and after EVs are connected.

Fig. 3. (Upper) Aggregate demand of workplace EV charging. (Lower) Averaged voltage profile for Household 17 with 99% CI under full EV penetration.


Case 1: An office building with approximately 100 cars at a 100% EV penetration level, that is, 100 out of 100 cars are EVs. This building is supplied by a 1000 kVA transformer.

For Case 1, a 1000 kVA transformer would easily carry the extra EV load in both the standard and fast charging cases. A more typical transformer for this building, with a rating nearer 500 kVA, would however fail to supply the EV-related load in the fast charging scenario.

Case 2: A low-voltage single-phase community consisting of 17 households.

For Case 2, EV penetration causes a severe voltage violation on the network (with a specified tolerance of −6%/+10% of nominal, i.e. [0.94, 1.10] p.u.).

4. Summarization

1) Markov Chain Monte Carlo simulation, as a numerical approach, can be used to generate different electricity load profiles according to various EV charging schemes.

2) The impact of the additional EV charging loads on the local distribution network can be assessed by identifying the expected value and associated uncertainty, as measured by the standard deviation, for various grid operational metrics, such as thermal loading, voltage profiles, transformer loss of life, energy losses, and harmonic distortion levels.

3) Identifying the uncertainty of these metrics requires a large number of trials from the MCMC simulation to achieve convergence. These uncertainties could not be generated by sampling directly from the original TUS dataset because of its limited size.

4) The same MCMC steps described in this work can also be applied to new datasets to extract their own inherent statistical characteristics.
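
The expected value, standard deviation and convergence check mentioned in points 2) and 3) can be sketched with running statistics over the trials. The per-trial metric below is synthetic (a made-up peak load in kVA), and the stopping rule (running mean stable within 1% over the last 100 trials) is an assumption, not the criterion used in [1].

```python
import numpy as np

def running_stats(samples):
    """Running mean and (population) standard deviation after each trial."""
    samples = np.asarray(samples, dtype=float)
    n = np.arange(1, len(samples) + 1)
    mean = np.cumsum(samples) / n
    var = np.cumsum(samples ** 2) / n - mean ** 2
    return mean, np.sqrt(np.maximum(var, 0.0))

def trials_to_converge(samples, rel_tol=0.01, window=100):
    """First trial count at which the running mean has varied by at most
    rel_tol (relative) over the last `window` trials; None if never."""
    mean, _ = running_stats(samples)
    for k in range(window, len(mean)):
        recent = mean[k - window:k + 1]
        if recent.max() - recent.min() <= rel_tol * abs(mean[k]):
            return k + 1
    return None

rng = np.random.default_rng(0)
peak_kva = rng.normal(50.0, 10.0, size=5000)   # synthetic metric, one value per trial
mean, std = running_stats(peak_kva)
n_needed = trials_to_converge(peak_kva)
print(f"mean = {mean[-1]:.1f} kVA, std = {std[-1]:.1f} kVA, converged after {n_needed} trials")
```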


The EV movement was simulated using a discrete-state, discrete-time Markov chain for four events {D, H, W, C}, corresponding to ‘driving’, ‘parking at home’, ‘parking at workplace’, and ‘parking at commercial areas’, respectively.

The model can be extended to EV charging states, including V2G and G2V, and further applied to reinforcement learning problems.


  • Useful Datasets download link (1&2):

1. National Household Travel Survey :

2. The United Kingdom 2000 Time Use Survey. National Statistics Technical Report; 2003.


  • Review of previous Markov Chain × EV works by author

Table 1. Summary of relevant literature works.

This work
  • A: Fine data resolution (less or equal to 10 min per step).
  • B: Vehicle status definition.
  • C: Vehicle movement simulation.
  • D: Vehicle use pattern verification.
  • E: Detailed network impact analyses considering charging location.
  • F: Uncertainty analysis of detailed network impact.
  • ✓: model feature is included in a suitable manner.
  • ✗: model feature not included.
  • —: not relevant.


[1] T.-K. Lee, Z. Bareket, T. Gordon, Z.S. Filipi. Stochastic modeling for studies of real-world PHEV usage: driving schedule and daily temporal distributions. IEEE Trans Veh Technol, 61 (4) (May 2012), pp. 1493-1502

[2] F.J. Soares, J.P. Lopes, P.R. Almeida, C.L. Moreira, L. Seca. A stochastic model to simulate electric vehicles motion and quantify the energy required from the grid. PSCC, Stockholm, Sweden (2011)

[3] Iversen EB, Møller JK, Morales JM, Madsen H. Inhomogeneous Markov models for describing driving patterns. IEEE Trans Power Syst.

[4] A. Lojowska, D. Kurowicka, G. Papaefthymiou, L. van der Sluis. Stochastic modeling of power demand due to EVs using copula. IEEE Trans Power Syst, 27 (4) (2012), pp. 1960-1968

[5] A. Ashtari, E. Bibeau, S. Shahidinejad, T. Molinski. PEV charging profile prediction and analysis based on vehicle usage data. IEEE Trans Smart Grid, 3 (1) (2012), pp. 341-350

[6] A.D. Hilshey, P.D. Hines, P. Rezaei, J.R. Dowds. Estimating the impact of electric vehicle smart charging on distribution transformer aging. IEEE Trans Smart Grid, 4 (2) (2013), pp. 905-913

[7] F. Rassaei, W.S. Soh, K.C. Chua. Demand response for residential electric vehicles with random usage patterns in smart grids. IEEE Trans Sustain Energy, 6 (4) (2015), pp. 1367-1376

[8] Fluhr J, Ahlert KH, Weinhardt C. A stochastic model for simulating the availability of electric vehicles for services to the power grid. In: System Sciences (HICSS), 43rd Hawaii International Conference on. IEEE; 2010. p. 1–10.

[9] S. Shafiee, M. Fotuhi-Firuzabad, M. Rastegar. Investigating the impacts of plug-in hybrid electric vehicles on power distribution systems. IEEE Trans Smart Grid, 4 (3) (2013), pp. 1351-1360

[10] Wang Y, Huang S, Infield D. Investigation of the potential for electric vehicles to support the domestic peak load. In: Electric Vehicle Conference (IEVC), IEEE. Dec. 2014. p. 1–8.


<End of Review>


Written by,

Qiyun(Kevin) Dang

SUMMARY OF: DeepMPC: Learning Deep Latent Features for Model Predictive Control

  • DOI:10.15607/RSS.2015.XI.012
  • Source: Lenz, Ian et al. “DeepMPC: Learning Deep Latent Features for Model Predictive Control.” Robotics: Science and Systems (2015).

Literature Review by S.Seal

Edited by Q.Dang, D.Wu

1. Paper Motivation

Human intuition in solving a problem is hard to replicate in robotics. For complex non-linear dynamics such as robotic food cutting, controllers are difficult to design, particularly when the system dynamics vary over time as well as with the properties of the surrounding environment. In this article, the authors use deep learning to build a recurrent conditional deep predictive model for a model predictive controller (MPC) used in robotic food cutting [1].
While MPC has already proven effective for control problems in various fields, the difficulty lies mostly in its implementation: it requires rigorous optimization over predictions at each time step, with a considerably complex system model that adequately represents how the system state evolves over time in response to the control inputs. However, with rapid advances in machine learning, available system data can be exploited to design simpler yet accurate system models that approximate the system behaviour sufficiently well and generate reliable predictions for the MPC. In this article, the authors show that a deep architecture can help improve the performance of MPC and its real-time implementation.

2. Main Contributions

  1. DeepMPC: Online continuous-space real-time feedforward MPC using a novel deep architecture that models system dynamics conditioned on learned latent system properties.
  2. Novel multi-stage pre-training learning algorithm for recurrent networks that avoids the overfitting problem and the “exploding gradient” problem.
  3. Multiplicative conditional interactions and temporal recurrence are used to model inter-material and time-varying intra-material characteristics.
  4. Instead of using temporally local information, the model uses learned recurrent features to integrate long-term information and model unobserved system properties.
  5. Implementation for real-time application: fast inference with a prediction horizon of 1 s (100 samples) and gradient evaluation at 1.2 kHz.

3. Method

A.      Problem definition:

Figure 1: End-effector gripper with axes used in [1]

Figure 2: Block diagram of DeepMPC [1]

The objective is to cut the food items of different varieties, along Z direction using a force applied along the end-effector X axis.

B.      Modelling of time-varying nonlinear dynamics for the MPC prediction model with deep networks

i.            Dynamic response features:

1.       Basic input features for the deep predictive model incorporate both control inputs as well as system states (output for the prediction model).

2.       To capture higher-order and delayed responses, time blocks are used to train the model instead of single-timestep data.

ii.            Conditional dynamic responses: to incorporate both short-term and long-term information in modelling local system dynamics, three sets of features are considered:

1.       Current control inputs

2.       Past time block’s dynamic response

3.       Latent features modelling long-term observation.

iii.            Long-term recurrent latent features: transforming recurrent units (TRUs) are introduced, which retain state information from previous observations by using

1.       Outputs from previous TRU.

2.       Short-term response features from current and past time blocks.

C.       Learning and inference

i.            Three step learning:

1.       Phase 1: Unsupervised pre-training (similar to the sparse auto-encoder algorithm), to obtain a good initial estimate of the latent features and to train the non-recurrent parameters of the transforming recurrent unit (TRU).

2.       Phase 2: Single-step prediction training (second pre-training stage). The model is trained to predict a single timestep into the future, with the recurrent weights of the TRU set to zero. This minimizes the prediction error starting from the initial model parameters (weights) and yields the pre-trained set of initial parameter values.

3.       Phase 3: Warm-latent recurrent training. The parameters from Phase 2 are used to initialize the recurrent prediction system, which generates system state predictions. The system is then optimized to minimize the sum-squared prediction error over a finite time horizon using an algorithm similar to backpropagation-through-time.

In online use, the model is warm-started: the latent system states are propagated for a few time blocks without any optimization or prediction.

ii.            Inference: The trained model is then recurrently used to predict future system states for a finite time horizon by using predicted system states, latent states and control inputs for subsequent time blocks. No online optimization is necessary for inference.

D.      Online MPC

i.            Offline prediction process: As described earlier, model parameters from the deep predictive model are fed to the optimization process offline.

ii.            Control process:

1.       Calculates the end-effector (EE) pose using forward kinematics.

2.       Applies stiffness control for restoring forces along the axes not controlled by MPC.

3.       Applies the joint torques received from the shared memory space as optimized MPC control signals.

4.       Updates the EE pose in the shared memory space for use by the optimization process.

iii.            Optimization process:

1.       System model parameters: available offline

2.       MPC cost function parameters: adjustable online

a.       Penalizes the knife motion along the X and Z axes.

b.       Generates the gradient with respect to the states, which the model then uses to generate a gradient with respect to the control inputs, i.e. the forces, using backpropagation-through-time.

3.       The gradients with respect to the forces are then used by a gradient descent-based algorithm to optimize the control signal, which the control process reads from the shared memory space.
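
The optimization loop described above (cost gradient with respect to the states, propagated back through the model to the control inputs, followed by gradient descent) can be sketched with a much simpler model. The example below swaps the deep recurrent model of [1] for a small linear state-space model so that backpropagation-through-time can be written by hand; the matrices, cost weights and step size are hypothetical.

```python
import numpy as np

def rollout(A, B, x0, u):
    """Predict states x_1..x_N from x_0 and the control sequence u (shape N x m)."""
    xs = [x0]
    for u_t in u:
        xs.append(A @ xs[-1] + B @ u_t)
    return np.array(xs)

def cost_and_grad(A, B, x0, u, ref, w_u=0.01):
    """Tracking cost plus input penalty, and its gradient with respect to u."""
    xs = rollout(A, B, x0, u)
    err = xs[1:] - ref                         # err[t] = x_{t+1} - ref[t]
    cost = np.sum(err ** 2) + w_u * np.sum(u ** 2)
    grad = np.zeros_like(u)
    adj = np.zeros_like(x0)                    # dCost/dx for the state after step t
    for t in reversed(range(len(u))):          # backward pass through time
        adj = 2.0 * err[t] + A.T @ adj
        grad[t] = 2.0 * w_u * u[t] + B.T @ adj
    return cost, grad

def optimize_controls(A, B, x0, ref, n_iter=200, lr=0.1):
    """Plain gradient descent on the control sequence."""
    u = np.zeros((len(ref), B.shape[1]))
    for _ in range(n_iter):
        _, grad = cost_and_grad(A, B, x0, u, ref)
        u -= lr * grad
    return u

A = np.array([[1.0, 0.1], [0.0, 0.9]])         # hypothetical linear stand-in model
B = np.array([[0.0], [0.1]])
x0 = np.zeros(2)
ref = np.tile([1.0, 0.0], (10, 1))             # reference over a 10-step horizon
u_opt = optimize_controls(A, B, x0, ref)
print("cost before:", cost_and_grad(A, B, x0, np.zeros((10, 1)), ref)[0])
print("cost after: ", cost_and_grad(A, B, x0, u_opt, ref)[0])
```

With a learned recurrent model, the hand-written backward pass would typically be replaced by automatic differentiation; the structure of the loop stays the same.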

E.      Dataset

  i.            Large-scale dataset of 1488 material cuts for 20 different classes.

  ii.            Over 450 real-time robotic experiments.


A.      Prediction experiments:


1.       Linear state-space model, ARMAX model with weights on past states, K-nearest neighbour (5-NN) model.

2.       Also compared with a linear Gaussian mixture model (GMM), a Gaussian process (GP) model and the proposed model trained with GPML package (

As shown in Figure 3 [1], the proposed prediction model outperforms the baseline methods. The figure also gives the 95% confidence interval of the prediction error.

Figure 3 [1]: Prediction error: Mean L2 distance (in mm) from predicted to ground-truth trajectory from 0.01s to 0.5s in the future [1].

B.      Robotic experiments:


1.       Class-generic stiffness controller

2.       Class-specific stiffness controllers

3.       An algorithm presented in [2] where class-specific material properties are mapped to haptic clusters.

Figure 4 [1]: Mean cutting rates, with bars showing the normal standard deviation, for ten diverse materials. The red bar uses the same controller for all materials, the blue bar uses the same controller for each cluster given by [2], the purple bar uses a tuned stiffness controller for each material, and the green bar is the online MPC method proposed in [1].

This approach showed 46% improved accuracy compared with a standard recurrent deep network. Related experimental videos and discussion can be found in [1].

5. Suggested future work

  • The deep prediction model for MPC proposed in this article can be useful in other non-linear applications, for example building energy management, where implementing MPC requires a building-specific prediction model. With a deep predictive model for MPC, available seasonal forecast data, time-of-use data and control data from the existing control system can be used to model different types of buildings. Adaptive training of the deep predictive model can help generalize MPC design for the building sector.


[1] I. Lenz, R. Knepper, and A. Saxena, “DeepMPC: Learning Deep Latent Features for Model Predictive Control,” in Robotics: Science and Systems XI, 2015.
[2] M. C. Gemici and A. Saxena, “Learning haptic representation for manipulating deformable food objects,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 638–645.


<End of Review>


Written by,

Sayani Seal

Review on: A Hybrid Transfer Learning model for Short-term Electric Load Forecasting

  • DOI: 10.1007/s00202-020-00930-x
  • Source: [1] Xu, X., Meng, Z. A hybrid transfer learning model for short-term electric load forecasting. Electr Eng (2020).

Literature Review by Q.Dang

Edited by D.Wu

1. Paper Motivation

Ordinary transfer learning methods may introduce negative transfer into load forecasting, because time series prediction is not exactly the same as a traditional regression problem.

The paper therefore proposes a cross-location load prediction method based on transfer learning with seasonal decomposing of the time series.

  • By seasonal decomposing, the authors mean removing the seasonal fluctuation trend from the load profile before training the model, in order to improve model accuracy. Two different power load datasets were used in this paper: one from Australia and the other from the USA.

The Australian electricity market dataset covers five states of Australia and therefore spans a wide geographic area. In the GEFCom2012 dataset, the 20 zones in the USA appear to lie in a smaller geographic area, since the temperature patterns provided in the dataset are quite similar.
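To make the seasonal decomposing step mentioned above more concrete, here is a minimal sketch of a classical additive decomposition into trend, seasonal and irregular components. It is an illustration only: the function name `decompose_additive`, the moving-average approach and the toy data are assumptions, and the reviewed paper may use a different decomposition procedure.

```python
from statistics import mean


def decompose_additive(series, period):
    """Split a series into trend, seasonal and irregular parts (additive model).

    Assumption: classical moving-average decomposition, used here only for
    illustration; the reviewed paper may use a different procedure.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n  # the trend is undefined near both ends of the series
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        if period % 2 == 0:
            # Even period: centred average with half weight on the two end points.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
            trend[t] = total / period
        else:
            trend[t] = sum(window) / period

    # Seasonal profile: mean detrended value at each position within the cycle.
    profile = []
    for k in range(period):
        values = [
            series[t] - trend[t]
            for t in range(k, n, period)
            if trend[t] is not None
        ]
        profile.append(mean(values))
    offset = mean(profile)  # shift the profile so that it averages to zero
    profile = [p - offset for p in profile]

    seasonal = [profile[t % period] for t in range(n)]
    irregular = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, irregular


# Toy load series (made-up numbers): linear trend plus a repeating 4-step pattern.
pattern = [3.0, -1.0, -2.0, 0.0]
load = [100.0 + 2.0 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, irregular = decompose_additive(load, period=4)
```

On this toy series the irregular component is essentially zero, because the data contain only a trend and a perfectly repeating pattern. In the hybrid model described in this review, the irregular component is the part that would be passed to the transfer learning stage.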

2. Methods

Main Algorithm: The two-stage TrAdaBoost.R2 with seasonal decomposing of time series.

The authors’ contribution is combining seasonal decomposing with the two-stage TrAdaBoost.R2 algorithm (Pardoe, 2010).

  • The name two-stage TrAdaBoost.R2 can be decomposed into its predecessors. The algorithms emerged in the following sequence:
    • AdaBoost (also known as “adaptive boosting”) (Freund, 1995)
    • AdaBoost.R (Freund, 1996)
    • AdaBoost.R2 (Drucker, 1997)
    • TrAdaBoost (Dai, 2007)
    • Two-stage TrAdaBoost.R2 (Pardoe, 2010)

Comments: The “R” in AdaBoost.R stands for regression. AdaBoost was first used for classification problems (see Fig. 1) and had not originally been applied to regression problems. AdaBoost.R2 is an improved version of AdaBoost.R.

Fig. 1. AdaBoost was first used in classification problems.
It can now be applied to regression problems using AdaBoost.R2.
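As a hedged illustration of how AdaBoost.R2 adapts boosting to regression, the sketch below performs a single reweighting round using the linear loss as commonly described for Drucker's algorithm. The function name `adaboost_r2_round` and the toy numbers are assumptions made for this example; the weak learner itself is left out and only its predictions are used.

```python
def adaboost_r2_round(weights, y_true, y_pred):
    """One AdaBoost.R2 reweighting round with the linear loss.

    Returns the new normalized instance weights and the learner's beta
    (a smaller beta means a more reliable weak learner).
    """
    total = sum(weights)
    p = [w / total for w in weights]  # normalize weights to a distribution

    abs_err = [abs(a - b) for a, b in zip(y_true, y_pred)]
    max_err = max(abs_err)
    if max_err == 0:
        return p, 0.0  # perfect learner: nothing to reweight

    loss = [e / max_err for e in abs_err]  # linear loss scaled to [0, 1]
    avg_loss = sum(l * pi for l, pi in zip(loss, p))
    if avg_loss >= 0.5:
        # The boosting loop would normally stop here.
        raise ValueError("weak learner is too inaccurate (average loss >= 0.5)")

    beta = avg_loss / (1.0 - avg_loss)
    # Well-predicted instances (low loss) are scaled down by about beta,
    # while the worst-predicted instance (loss 1) keeps its weight.
    new_w = [pi * beta ** (1.0 - l) for pi, l in zip(p, loss)]
    z = sum(new_w)
    return [w / z for w in new_w], beta


# Toy round (made-up numbers): only the last instance is predicted badly.
new_weights, beta = adaboost_r2_round(
    weights=[1.0, 1.0, 1.0, 1.0],
    y_true=[1.0, 2.0, 3.0, 4.0],
    y_pred=[1.0, 2.0, 3.0, 6.0],
)
```

After this round the badly predicted instance carries half of the total weight, so the next weak learner concentrates on it. In the full algorithm, the final prediction is a weighted median of the weak learners' outputs.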

AdaBoost is one of the first boosting algorithms to be adopted in practice. AdaBoost combines multiple “weak classifiers” into a single “strong classifier”.

TrAdaBoost refers to AdaBoost with transfer learning (“Tr”). By extending boosting-based learning algorithms (Freund & Schapire, 1997) to transfer learning, TrAdaBoost lets users combine a small amount of newly labeled data with the old data to construct a high-quality classification model for the new data.
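The way TrAdaBoost treats old (source) and new (target) data differently can be sketched with its weight update, written here as it is commonly stated for the original algorithm (Dai, 2007). The function name `tradaboost_update` and the toy numbers are assumptions for illustration, and this is not the two-stage regression variant used in the reviewed paper.

```python
import math


def tradaboost_update(w_source, w_target, err_source, err_target, n_rounds):
    """One TrAdaBoost reweighting step.

    err_source and err_target hold |h(x) - y| in [0, 1] for each instance,
    where h is the current weak learner. Returns unnormalized weights; the
    algorithm renormalizes them at the start of the next round.
    """
    n = len(w_source)
    # Fixed factor for source data, determined by the number of rounds.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / n_rounds))

    # Weighted error measured on the target data only.
    eps = sum(w * e for w, e in zip(w_target, err_target)) / sum(w_target)
    if eps >= 0.5:
        raise ValueError("target error must be below 0.5")

    # Misclassified source instances lose weight: they look unlike the target.
    new_source = [w * beta_src ** e for w, e in zip(w_source, err_source)]

    if eps == 0:
        return new_source, list(w_target)  # no target errors, nothing to boost
    beta_t = eps / (1.0 - eps)
    # Misclassified target instances gain weight, as in ordinary AdaBoost.
    new_target = [w * beta_t ** (-e) for w, e in zip(w_target, err_target)]
    return new_source, new_target


# Toy step (made-up numbers): the first source and first target instance are wrong.
new_src, new_tgt = tradaboost_update(
    w_source=[0.25, 0.25],
    w_target=[0.1, 0.1, 0.1, 0.1],
    err_source=[1.0, 0.0],
    err_target=[1.0, 0.0, 0.0, 0.0],
    n_rounds=10,
)
```

The asymmetry is the key design choice: a source instance that the current model gets wrong is treated as unrepresentative of the target domain and is down-weighted, whereas a target instance that is wrong is up-weighted so that later learners focus on it.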

  • So, why involve transfer learning when building a load prediction model?

According to the authors, building a reliable electricity load forecasting model requires an enormous amount of data, which is not easy to acquire. With transfer learning, data from related locations can be applied to the target location.

3. Paper structure

Present the proposed improved transfer learning algorithm first ⇨ Trace the algorithm’s history (two-stage TrAdaBoost.R2 is based on AdaBoost.R2) ⇨ Introduce the new feature (add a seasonal trend decomposing function) ⇨ Compare time series prediction results with 4 other benchmark methods

The other 4 benchmark models are briefly described as follows:

[1] Xu, X., Meng, Z. A hybrid transfer learning model for short-term electric load forecasting. Electr Eng (2020).
  • Gradient boosting decision tree (GBDT): GBDT is a traditional machine learning technique, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
  • AdaBoost regression: the approach described in Algorithm 1 is applied to the original dataset directly, without transfer learning.
  • Two-stage TrAdaBoost.R2: the two-stage TrAdaBoost.R2 described in Algorithm 2 is applied to the original dataset directly, without time series seasonal decomposing.
  • GBDT with seasonal decomposing: unlike the proposed hybrid model in this study, after time series seasonal decomposing all three seasonal components are predicted by the GBDT model directly instead of by the transfer learning approach.

4. Summarization

We can roughly see that the red curve (the authors’ method) captures more details of the blue curve than the green curve does. Zone 20 is located in the USA.

Fig. 2. Timeline comparison of predictions made by the proposed model and two-stage TrAdaBoost.R2

Comment: The authors claim that the model is also better when measured by the mean absolute percentage error (MAPE), not only by visual inspection of the figure above. MAPE expresses the average absolute error as a percentage of the actual value and is used to evaluate the performance of the different models. In its standard form, MAPE is the mean of |actual − forecast| / |actual| over all forecast points, multiplied by 100%.
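For readers who want to reproduce the metric, a minimal sketch of the standard MAPE definition (mean of |actual − forecast| / |actual|, times 100) follows. The function name `mape` and the sample values are assumptions for illustration; the paper's own notation may differ.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is non-zero, which normally holds for
    electric load data.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


# Made-up loads: each forecast is off by 10% of the actual value.
example_mape = mape(actual=[100.0, 200.0], forecast=[110.0, 180.0])
```

Because each error is divided by the actual load, MAPE allows zones with very different load levels to be compared on the same scale, which suits a cross-location study like this one.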

Table 1 MAPE performance comparison of 4 benchmark transfer methods

Source: [1] Xu, X., Meng, Z. A hybrid transfer learning model for short-term electric load forecasting.Electr Eng (2020).
| Target zone ID (source zone ID) | GBDT (%) | AdaBoost Regressor (%) | Two-stage TrAdaBoost.R2 (%) | GBDT with seasonal decomposing (%) | Our method (%) |
|---|---|---|---|---|---|
| 11 (1) | 25.09 | 23.54 | 22.63 | 20.68 | 17.53 |
| 1 (11) | 25.09 | 24.86 | 24.43* | 21.11 | 18.01 |
| 11 (12) | 25.13 | 23.47 | 24.01* | 21.04 | 20.99 |
| 12 (11) | 25.02 | 24.34 | 25.41* | 23.03 | 19.80 |
| 12 (20) | 24.76 | 23.89 | 24.12* | 23.64 | 20.50 |
| 20 (12) | 19.44 | 20.82 | 19.35* | 16.56 | 16.52 |
| 12 (15) | 25.05 | 24.14 | 24.18* | 24.21 | 21.06 |
| 15 (12) | 22.12 | 22.02 | 22.74* | 19.01 | 18.96 |
| 11 (20) | 25.02 | 23.28 | 23.21 | 21.04 | 18.41 |
| 20 (11) | 19.43 | 20.64 | 20.08* | 18.23 | 16.67 |

  1. Negative transfers are marked with ‘*’ and best performances are highlighted with boldface font.
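As a quick sanity check on Table 1, the short sketch below stores the reported MAPE values and computes the best method per row and the mean MAPE per method. The numbers are copied from the table; the variable and function names are assumptions for illustration.

```python
# MAPE values (%) copied from Table 1; keys are "target (source)" zone IDs.
METHODS = [
    "GBDT",
    "AdaBoost Regressor",
    "Two-stage TrAdaBoost.R2",
    "GBDT with seasonal decomposing",
    "Our method",
]
TABLE_1 = {
    "11 (1)": [25.09, 23.54, 22.63, 20.68, 17.53],
    "1 (11)": [25.09, 24.86, 24.43, 21.11, 18.01],
    "11 (12)": [25.13, 23.47, 24.01, 21.04, 20.99],
    "12 (11)": [25.02, 24.34, 25.41, 23.03, 19.80],
    "12 (20)": [24.76, 23.89, 24.12, 23.64, 20.50],
    "20 (12)": [19.44, 20.82, 19.35, 16.56, 16.52],
    "12 (15)": [25.05, 24.14, 24.18, 24.21, 21.06],
    "15 (12)": [22.12, 22.02, 22.74, 19.01, 18.96],
    "11 (20)": [25.02, 23.28, 23.21, 21.04, 18.41],
    "20 (11)": [19.43, 20.64, 20.08, 18.23, 16.67],
}


def best_method(row):
    """Name of the method with the lowest MAPE in one table row."""
    return METHODS[row.index(min(row))]


def mean_mape_by_method(table):
    """Average MAPE of each method over all target/source pairs."""
    n_rows = len(table)
    return {
        name: sum(row[i] for row in table.values()) / n_rows
        for i, name in enumerate(METHODS)
    }


best = {pair: best_method(row) for pair, row in TABLE_1.items()}
means = mean_mape_by_method(TABLE_1)
```

With these values, the proposed method has the lowest MAPE in every row, although in some rows (for example 11 (12) and 20 (12)) its margin over GBDT with seasonal decomposing is only a few hundredths of a percentage point.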

The contribution of this paper is summarized as follows:

The paper developed a hybrid transfer learning method for cross-location short-term load forecasting, which integrates a time series decomposition technique and a two-stage transfer regression approach.

  • Notes:

(1) After decomposing the electric load data, the trend and seasonal components are handled by a standard machine learning approach, and the two-stage transfer regression model (two-stage TrAdaBoost.R2) is established on the irregular component.

(2) Negative transfer can be avoided effectively by the authors’ proposed hybrid model.

By negative transfer, the authors mainly mean the rows of Table 1 in which two-stage TrAdaBoost.R2 yields a worse (higher) MAPE value than other columns.

(3) The proposed model is evaluated on two real-world datasets.

One dataset includes electric load data from 20 zones of the USA. These 20 zones are relatively close to each other geographically. The other dataset includes electric load data from 5 states of Australia, which represents a wide area.

In summary, it is impressive that the proposed model achieves better prediction accuracy on both datasets (electric load in the USA and Australia). As the authors suggest, this demonstrates the scalability of the proposed model.

5. Suggested future work

  • Try a similar transfer learning approach that keeps the core algorithm, two-stage TrAdaBoost.R2 (Pardoe, 2010), but combines it with a different trend signal decomposing procedure (other than the seasonal trend). Such a new hybrid model may also produce good forecasting results.
  • Apply the algorithm proposed in the paper to other interesting problems that involve forecasting time series signals, such as real-time electricity price forecasting.


  • Useful Training Datasets download link (1&2):

(1 of 2) The Global Energy Forecasting Competition (GEFCom2012) datasets on Kaggle.

Fig. A.1. Screenshot of dataset 1

(2 of 2) Australian Energy Market Operator (AEMO) aggregated price and demand data.

Fig. A.2. Screenshot of dataset 2 archive


<End of Review>


Written by,

Qiyun(Kevin) Dang

RL showcase 2

Reinforcement learning refers to algorithms that are “goal-oriented.” They are able to learn how to attain a complex objective (a goal) by maximizing along a specific dimension over a number of iterations, for instance by maximizing the points obtained in a game over a number of moves. They can start from a blank slate, and under the right conditions they achieve extraordinary performance. These algorithms are penalized when they make the wrong decisions and rewarded when they make the right ones; this is how they engage the concept of reinforcement.