Review on: Deep Reinforcement Learning with POMDPs

SUMMARY OF: Review on: Deep Reinforcement Learning with POMDPs

  • Source: M. Egorov, “Deep reinforcement learning with POMDPs,” Stanford Univ., Stanford, CA, USA, Tech. Rep., 2015.

Literature Review by J. Samiuddin

Edited by Q. Dang, D. Wu

Disclaimer: This website contains copyrighted material, and its use is not always specifically authorized by the copyright owner. Take all necessary steps to ensure that the information you receive from the post is correct and verified.


<End of Review>


Written by,

Jilan Samiuddin

SUMMARY OF: DeepMPC: Learning Deep Latent Features for Model Predictive Control

  • DOI:10.15607/RSS.2015.XI.012
  • Source: Lenz, Ian et al. “DeepMPC: Learning Deep Latent Features for Model Predictive Control.” Robotics: Science and Systems (2015).

Literature Review by S. Seal

Edited by Q. Dang, D. Wu

1. Paper Motivation

Human intuition in solving a problem is hard to replicate in robotics. For complex non-linear dynamics such as robotic food cutting, designing controllers is difficult, particularly when the system dynamics vary over time as well as with the properties of the surrounding environment. In this article, the authors use deep learning to build a recurrent conditional deep predictive model for a model predictive controller (MPC) used in robotic food cutting [1].
While MPC has already proven effective for control problems in various fields, its implementation is difficult because it requires a rigorous prediction optimization at each time step, using a considerably complex system model that adequately represents the state transitions of the dynamic system over time in response to the control inputs. However, with rapid advances in machine learning, available system data can be exploited to design simpler yet accurate system models that adequately approximate the system behaviour and generate reliable predictions for the MPC. The authors show that a deep architecture can help improve the performance of MPC and its real-time implementation.

2. Main Contributions

  1. DeepMPC: Online, continuous-space, real-time feedforward MPC using a novel deep architecture that models system dynamics conditioned on learned latent system properties.
  2. A novel multi-stage pre-training learning algorithm for the recurrent network that avoids the overfitting problem and the “exploding gradient” problem.
  3. Multiplicative conditional interactions and temporal recurrence are used to model inter-material and time-varying intra-material characteristics.
  4. Instead of using temporally local information, the model uses learned recurrent features to integrate long-term information and model unobserved system properties.
  5. Implementation for real-time application: fast inference with a prediction horizon of 1 s (100 samples) and gradient evaluation at 1.2 kHz.
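
As a rough sanity check on the real-time figures in item 5, the arithmetic can be spelled out. The variable names are illustrative, and the number of gradient evaluations per sample is derived here from the two quoted figures; it is not a number stated in [1].

```python
# Figures quoted in contribution 5.
horizon_s = 1.0            # prediction horizon in seconds
horizon_samples = 100      # the same horizon expressed in samples
gradient_rate_hz = 1200.0  # gradient evaluations per second (1.2 kHz)

# Derived quantities (assumption: samples are evenly spaced over the horizon).
sample_rate_hz = horizon_samples / horizon_s          # implied sampling rate
sample_period_ms = 1000.0 / sample_rate_hz            # implied time per sample
grads_per_sample = gradient_rate_hz / sample_rate_hz  # gradient evaluations per sample period

print(sample_rate_hz, sample_period_ms, grads_per_sample)  # 100.0 10.0 12.0
```

Under that assumption, the optimizer has a budget of about a dozen gradient evaluations for every new sample.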

3. Method

A.      Problem definition:

Figure 1: End-effector gripper with axes used in [1]

Figure 2: Block diagram of DeepMPC [1]

The objective is to cut food items of different varieties along the Z direction, using a force applied along the end-effector X axis.

B.      Modelling of time-varying nonlinear dynamics for the MPC prediction model with deep networks

i.            Dynamic response features:

1.       Basic input features for the deep predictive model incorporate both the control inputs and the system states (the outputs of the prediction model).

2.       To capture higher-order and delayed responses, time blocks are used to train the model instead of single-timestep data.

ii.            Conditional dynamic responses: to incorporate both short-term and long-term information in modelling local system dynamics, three sets of features are considered:

1.       Current control inputs

2.       Past time block’s dynamic response

3.       Latent features modelling long-term observation.

iii.            Long-term recurrent latent features: transforming recurrent units (TRUs) are introduced, which retain state information from previous observations by using

1.       Outputs from previous TRU.

2.       Short-term response features from current and past time blocks.
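
The two ideas in this section, grouping samples into time blocks and combining feature sets multiplicatively, can be illustrated with a minimal sketch. It is not the architecture of [1]: the function names, the weight matrices, and the choice to multiply three projected feature vectors elementwise are assumptions made only for illustration.

```python
def to_time_blocks(samples, block_size):
    """Group consecutive per-timestep samples into non-overlapping time blocks.

    Trailing samples that do not fill a whole block are dropped.
    """
    n_blocks = len(samples) // block_size
    return [samples[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


def matvec(weights, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def multiplicative_features(control, past_response, latent, w_c, w_p, w_l):
    """Project each feature set into a shared space, then multiply elementwise."""
    f_c = matvec(w_c, control)
    f_p = matvec(w_p, past_response)
    f_l = matvec(w_l, latent)
    return [c * p * l for c, p, l in zip(f_c, f_p, f_l)]


# Toy data: seven scalar samples grouped into blocks of three.
blocks = to_time_blocks([0, 1, 2, 3, 4, 5, 6], block_size=3)

# Hypothetical weights; a trained model would learn these.
features = multiplicative_features(
    control=[1.0, 2.0],
    past_response=[3.0],
    latent=[0.5, 0.5],
    w_c=[[1.0, 0.0], [0.0, 1.0]],
    w_p=[[1.0], [2.0]],
    w_l=[[1.0, 1.0], [2.0, 0.0]],
)
print(blocks)    # [[0, 1, 2], [3, 4, 5]]
print(features)  # [3.0, 12.0]
```

Because the projected features are multiplied rather than added, each feature set scales the contribution of the others, which is one way for the effect of a control input to depend on the latent material properties.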

C.       Learning and inference

i.            Three step learning:

1.       Phase 1: Unsupervised pre-training (similar to the sparse auto-encoder algorithm) – obtains a good initial estimate of the latent features and trains the non-recurrent parameters of the transforming recurrent unit (TRU).

2.       Phase 2: Single-step prediction training (second pre-training stage) – trains the model to predict a single timestep into the future. The recurrent weights of the TRU are set to zero. Minimizes the prediction error over the initial selection of model parameters, i.e. the weights, and generates the pre-trained set of initial parameter values.

3.       Phase 3: Warm-latent recurrent training – the parameters from Phase 2 are used to initialize the recurrent prediction system, which generates system state predictions. The system is then optimized to minimize the sum-squared prediction error over a finite time horizon using an algorithm similar to backpropagation-through-time.

So that the model can be used online, it is trained for a warm start, in which the latent system states are propagated for a few time blocks without any optimization or prediction.

ii.            Inference: The trained model is then recurrently used to predict future system states for a finite time horizon by using predicted system states, latent states and control inputs for subsequent time blocks. No online optimization is necessary for inference.
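
The warm start and the recurrent inference loop described above can be sketched as follows. The two model functions are toy stand-ins for the trained network, and every name here is hypothetical; the sketch only shows how latent states are first propagated over observed blocks and how predicted states are then fed back as inputs.

```python
def warm_start(latent, observed_blocks, update_latent):
    """Propagate the latent state over observed blocks without predicting."""
    for state, control in observed_blocks:
        latent = update_latent(latent, state, control)
    return latent


def rollout(state, latent, controls, predict_state, update_latent):
    """Predict future states recurrently, feeding each prediction back in."""
    predictions = []
    for control in controls:
        next_state = predict_state(state, latent, control)
        latent = update_latent(latent, state, control)
        state = next_state
        predictions.append(state)
    return predictions


# Toy scalar stand-ins for the trained model (assumptions, not from [1]).
def toy_predict_state(state, latent, control):
    return state + latent * control


def toy_update_latent(latent, state, control):
    return 0.5 * latent + 0.5


# Warm start over two observed (state, control) blocks, then a two-block rollout.
latent = warm_start(0.0, [(0.0, 1.0), (0.1, 1.0)], toy_update_latent)
predicted = rollout(0.2, latent, [1.0, 1.0], toy_predict_state, toy_update_latent)
print(latent, predicted)
```

No optimization happens inside either loop, which matches the point that inference only requires forward passes through the trained model.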

D.      Online MPC

i.            Offline prediction process: As described earlier, model parameters from the deep predictive model are fed to the optimization process offline.

ii.            Control process:

1.       Calculates the end-effector (EE) pose using forward kinematics.

2.       Applies stiffness control for restoring forces along the axes not controlled by MPC.

3.       Implements the joint torques received from the shared memory space as optimized MPC control signals.

4.       Updates the EE pose in the shared memory space for use by the optimization process.

iii.            Optimization process:

1.       System model parameters: available offline

2.       MPC cost function parameters: adjustable online

a.       penalizes the knife motion along the X and Z axes.

b.       generates a gradient with respect to the states, which the model then uses to generate a gradient with respect to the control inputs (i.e. forces) using backpropagation through time.

3.       The gradients with respect to the forces are then used by a gradient-descent-based algorithm to generate the control signal, which the control process reads from the shared memory space.
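
The optimization loop can be illustrated with a minimal sketch in which a scalar linear model stands in for the deep predictive model. The dynamics x[t+1] = a*x[t] + b*u[t], the tracking cost, and all parameter values are assumptions chosen for illustration; [1] uses a learned recurrent network and its own cost function. What the sketch does share with the description above is the chain: cost gradient with respect to states, backpropagation through time to the control inputs, then gradient descent on those inputs.

```python
def simulate(x0, controls, a, b):
    """Roll the toy model forward: x[t+1] = a * x[t] + b * u[t]."""
    xs = [x0]
    for u in controls:
        xs.append(a * xs[-1] + b * u)
    return xs


def cost(xs, controls, target, r):
    """Squared tracking error on predicted states plus a control penalty."""
    return sum((x - target) ** 2 for x in xs[1:]) + r * sum(u * u for u in controls)


def control_gradient(xs, controls, target, a, b, r):
    """Backpropagate the cost through time to get d(cost)/d(u[t])."""
    grads = [0.0] * len(controls)
    g_next = 0.0  # total gradient w.r.t. the state one step later
    for t in reversed(range(len(controls))):
        g_state = 2.0 * (xs[t + 1] - target) + a * g_next
        grads[t] = b * g_state + 2.0 * r * controls[t]
        g_next = g_state
    return grads


def optimize_controls(x0, controls, target, a, b, r, step, iters):
    """Plain gradient descent on the control sequence."""
    controls = list(controls)
    for _ in range(iters):
        xs = simulate(x0, controls, a, b)
        grads = control_gradient(xs, controls, target, a, b, r)
        controls = [u - step * g for u, g in zip(controls, grads)]
    return controls


# Hypothetical parameters for the toy problem.
A, B, R, TARGET = 0.9, 0.5, 0.01, 1.0
u_init = [0.0] * 5
u_opt = optimize_controls(0.0, u_init, TARGET, A, B, R, step=0.05, iters=200)
cost_before = cost(simulate(0.0, u_init, A, B), u_init, TARGET, R)
cost_after = cost(simulate(0.0, u_opt, A, B), u_opt, TARGET, R)
print(cost_before, cost_after)
```

In a receding-horizon setting, only the first optimized control would be applied before the optimization is repeated from the newly measured state.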

E.      Dataset

  i.            Large-scale dataset of 1488 material cuts for 20 different classes.

  ii.            Over 450 real-time robotic experiments.


4. Experiments

A.      Prediction experiments:


1.       Baseline models: a linear state-space model, an ARMAX model with weights on past states, and a K-nearest-neighbour (5-NN) model.

2.       Also compared with a linear Gaussian mixture model (GMM), a Gaussian process (GP) model and the proposed model trained with GPML package (

As shown in Figure 3 [1], the proposed prediction model outperforms the baseline methods. The prediction errors are reported with 95% confidence intervals.

Figure 3 [1]: Prediction error: Mean L2 distance (in mm) from predicted to ground-truth trajectory from 0.01s to 0.5s in the future [1].
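
For readers who want to reproduce this kind of metric, a minimal sketch of a mean L2 distance between a predicted and a ground-truth trajectory is given below. The function name and the toy trajectories are hypothetical, and the exact averaging used in [1] (for example, over trajectories at each prediction horizon) may differ.

```python
import math


def mean_l2_error(predicted, ground_truth):
    """Average Euclidean distance between corresponding trajectory points."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Toy trajectories in mm: the second predicted point is 5 mm off.
predicted_traj = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
true_traj = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
error_mm = mean_l2_error(predicted_traj, true_traj)
print(error_mm)  # 2.5
```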

B.      Robotic experiments:


1.       Class-generic stiffness controller

2.       Class-specific stiffness controllers

3.       An algorithm presented in [2] where class-specific material properties are mapped to haptic clusters.

Figure 4 [1]: Mean cutting rates, with bars showing normal standard deviation, for ten diverse materials. The red bar uses the same controller for all materials, the blue bar uses the same controller for each cluster given by [2], the purple bar uses a tuned stiffness controller for each material, and the green bar is the online MPC method proposed in [1].

This approach showed a 46% improvement in accuracy compared to a standard recurrent deep network. Related experimental videos and discussion can be found in [1].

5. Suggested future work

  • The deep prediction model for MPC proposed in this article can be useful in other non-linear applications, for example in building energy management, where implementing MPC requires a building-specific prediction model. With a deep predictive model for MPC, available seasonal forecast data, time-of-use data and control data from the existing control system can be used to model different types of buildings. Adaptive training of the deep predictive model can help generalize MPC design for the building sector.


[1] I. Lenz, R. Knepper, and A. Saxena, “DeepMPC: Learning Deep Latent Features for Model Predictive Control,” in Robotics: Science and Systems XI, 2015.
[2] M. C. Gemici and A. Saxena, “Learning haptic representation for manipulating deformable food objects,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 638–645.


<End of Review>


Written by,

Sayani Seal