Review on: Deep Reinforcement Learning with POMDPs


  • Source: M. Egorov, “Deep reinforcement learning with POMDPs,” Stanford Univ., Stanford, CA, USA, Tech. Rep., 2015

Literature Review by J. Samiuddin

Edited by Q. Dang, D. Wu

Disclaimer: This website contains copyrighted material, and its use is not always specifically authorized by the copyright owner. Take all necessary steps to ensure that the information you receive from the post is correct and verified.


<End of Review>


Written by,

Jilan Samiuddin

SUMMARY OF: Deep Model Predictive Control with Online Learning for Complex Physical Systems

  • Source: arXiv:1905.10094v1 [cs.LG] (May 2019).

Literature Review by S. Seal

Edited by Q. Dang, D. Wu


1.   Motivation

Flow control is required in many application fields, such as energy, transportation, health and security. Although fluid flow has high-dimensional, multi-layer physics and nonlinear system characteristics, it can be approximated by a few dominant low-dimensional system features. Because the performance of a model predictive controller (MPC) depends significantly on the accuracy of its prediction model, intractably complex systems make it difficult to design such a controller, even though MPC would otherwise be efficient for the application. The article [1] presents a DeepMPC controller in which sensor-based, observable, low-rank system states are used to build a recurrent neural network (RNN) based, data-driven predictive model for real-time MPC in fluid flow control.

2.   Main Contributions

 i.            The DeepMPC architecture is implemented for a complex fluid flow system exhibiting broadband phenomena.

ii.            Instead of assuming knowledge of the full system state, the “surrogate” predictive model uses only observable system states for future prediction. Thus, the method achieves a trade-off between accuracy and efficiency in capturing the essential physical mechanisms.

iii.            The proposed learning approach for the RNN uses only limited past information from the sensors.

Figure 1: DeepMPC with surrogate RNN prediction model presented in [1].


3.   Method

A.      DeepMPC

i.            Finite-horizon open-loop control problem with quadratic cost. Penalties are assigned to the deviation from the reference trajectory, to the control input, and to any variation in the control input. The last of the three restricts sudden changes in the control input.

ii.            A surrogate state prediction model, based on a deep RNN architecture, is built from control-relevant, observable, sensor-based system states. For this flow control model, the states are lift and drag.

1.       RNN based predictive model design:

a.       Decoder:

i.        Performs actual prediction task

ii.        N cells, one for each of the N time steps in the prediction horizon

b.       Encoder:

i.        Predicts latent states and thereby accounts for long-term dynamics.

2.       The RNN-based MPC problem is solved using a gradient-based optimization method.

3.       The gradient information with respect to the control inputs is calculated using backpropagation-through-time.

 iii.            Training RNN:

1.       Offline three-stage training [2] with time-series data of observable system states.

2.       Training data, i.e., time series of the lift and the drag, are generated using a random but continuously varying control sequence for the rotation of the cylinder(s).
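The quadratic cost described in item A.i can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the formulation from [1]: the function name `mpc_cost` and the weights `w_track`, `w_input` and `w_rate` are hypothetical, and scalar outputs and inputs are assumed for brevity.

```python
def mpc_cost(y_pred, y_ref, u, u_prev, w_track=1.0, w_input=0.1, w_rate=0.5):
    """Quadratic MPC cost over a finite horizon (illustrative weights).

    y_pred, y_ref: predicted and reference outputs, one value per step.
    u:             candidate control inputs, one value per step.
    u_prev:        control input applied just before the horizon starts.
    """
    if not (len(y_pred) == len(y_ref) == len(u)):
        raise ValueError("all sequences must span the same horizon")
    cost = 0.0
    last_u = u_prev
    for y, r, u_k in zip(y_pred, y_ref, u):
        cost += w_track * (y - r) ** 2        # deviation from the reference
        cost += w_input * u_k ** 2            # control effort
        cost += w_rate * (u_k - last_u) ** 2  # change in the control input
        last_u = u_k
    return cost


# Two input sequences with the same effort; only the second one jumps around.
ref = [1.0, 1.0, 1.0, 1.0]
pred = [0.8, 0.9, 1.0, 1.0]
smooth = mpc_cost(pred, ref, [0.5, 0.5, 0.5, 0.5], u_prev=0.5)
jumpy = mpc_cost(pred, ref, [0.5, -0.5, 0.5, -0.5], u_prev=0.5)
```

With identical tracking error and control effort, the jumpy sequence is more expensive only because of the third term, which is the one that discourages sudden changes in the control input.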

4.   Result Summarization

A.      Setup:

A detailed simulation model of the full system is used instead of a real physical system. It is solved with the OpenFOAM solver using a finite-volume discretization.

B.      Experiments:

Objective: The objective is to control the cylinder(s) such that the lift follows a given reference trajectory.

Four flow control models (laminar regime) with different complexity levels are considered:

i.            One cylinder: Flow around a single cylinder

1.       The RNN prediction is evaluated on an exemplary control input sequence and is accurate for both lift and drag, except for a very short period at the start of the experiment.

2.       Tracking control is demonstrated successfully: a scheduled lift sequence is maintained for 20 sec with a bounded rotation control input.

3.       The Reynolds number (Re) is assumed to be 100.

4.       Training dataset:

a.       A random rotation between -2 and +2 is chosen every 0.5 sec, so that high input frequencies are avoided.

b.       Intermediate control inputs are computed using spline interpolation every 0.1 sec.

c.       A time series with 110 000 datapoints, corresponding to 11 000 sec, is used for RNN training.

ii.            Fluidic Pinball: Control the flow around three cylinders, two of which can be rotated while the third is fixed, as shown in Figure 2.

Figure 2: System is controlled by rotating cylinders 1 and 2 with their respective angular velocities [1].

1.       The objective is to follow three given lift trajectories, one for each cylinder, by rotating cylinders 1 and 2.

2.       One Reynolds number is considered as the base case; two further, chaotic cases at other Reynolds numbers are analyzed.

3.       Training dataset:

a.       A random rotation between -2 and +2 is chosen for each cylinder every 0.5 sec.

b.       Intermediate control inputs are computed using spline interpolation every 0.005 sec.

c.       Time series with 150 000, 200 000 and 800 000 datapoints are used for the base case and the two chaotic cases, respectively.

4.       To improve performance for the more chaotic systems, knowledge of the physical system is exploited by adding symmetric input data and the corresponding lift data, mirrored along the horizontal axis. This reduces the tracking error by 50%.

5.       Robustness of the system is tested by performing five identical experiments using 10%, 15% and 100% of the symmetrized training data points. No trend is observed with respect to the amount of training data.

Figure 3: DeepMPC lift tracking performance for laminar flow around rotating cylinders [1].

Figure 4: Re = 100 with online update [1].

6.       Finally, online data is collected from the feedback loop at each time step, with new data gathered over 25 sec for each update. The 500 datapoints within each interval are used to further train the RNN surrogate model. This significantly improves the performance of the DeepMPC compared to (a) [1] in Figure 3. Online updating of the RNN reduces both the tracking error and the control cost.
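The training-input recipe described for the single-cylinder case (a random rotation in [-2, +2] every 0.5 sec, interpolated down to 0.1 sec) can be sketched as follows. The review does not say which spline is used in [1], so a Catmull-Rom cubic is used here purely as a stand-in; the function names, the seed and the 10 sec duration are all hypothetical.

```python
import random


def random_setpoints(duration, hold=0.5, low=-2.0, high=2.0, seed=0):
    """Draw one random rotation value in [low, high] every `hold` seconds."""
    rng = random.Random(seed)
    n_knots = int(round(duration / hold)) + 1
    return [rng.uniform(low, high) for _ in range(n_knots)]


def catmull_rom(p0, p1, p2, p3, s):
    """Cubic interpolation that equals p1 at s=0 and p2 at s=1."""
    return 0.5 * (
        2.0 * p1
        + (p2 - p0) * s
        + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * s ** 2
        + (3.0 * p1 - p0 - 3.0 * p2 + p3) * s ** 3
    )


def control_sequence(knots, per_interval=5):
    """Interpolate so that each hold interval yields `per_interval` samples."""
    padded = [knots[0]] + list(knots) + [knots[-1]]  # repeat the end points
    signal = []
    for i in range(len(knots) - 1):
        p0, p1, p2, p3 = padded[i:i + 4]
        for j in range(per_interval):
            signal.append(catmull_rom(p0, p1, p2, p3, j / per_interval))
    signal.append(knots[-1])
    return signal


knots = random_setpoints(duration=10.0)      # 21 knots for 10 sec
u = control_sequence(knots, per_interval=5)  # 0.5 sec / 5 = 0.1 sec resolution
```

The interpolated signal passes through every random knot and varies smoothly in between. A cubic interpolant can overshoot the knot range slightly, so a real setup with hard input bounds would need clipping or a shape-preserving spline.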


The surrogate RNN prediction model proposed for DeepMPC in this article can be useful in many practical engineering problems where the complete system description is too complicated and makes the related control problems hard to solve. The method can be used for system modelling with targeted observable states that predominantly define the system behaviour. This improves the real-time implementation of MPC for complex nonlinear systems.


[1]    K. Bieker, S. Peitz, S. L. Brunton, J. N. Kutz, and M. Dellnitz, “Deep model predictive control with online learning for complex physical systems,” arXiv preprint arXiv:1905.10094, 2019.

[2]    I. Lenz, R. Knepper, and A. Saxena, “DeepMPC: Learning Deep Latent Features for Model Predictive Control,” in Robotics: Science and Systems XI, 2015.


<End of Review>


Written by,

Sayani Seal


REVIEW ON: Markov Chain Monte Carlo Simulation of Electric Vehicle Use for Network Integration Studies

  • Source: [1] Y. Wang, D. Infield, Markov Chain Monte Carlo simulation of electric vehicle use for network integration studies, International Journal of Electrical Power & Energy Systems, Vol.99, 2018, Pages 85-94

Literature Review by Q. Dang

Edited by D. Wu


1. Paper Motivation

As the penetration of electric vehicles (EVs) increases, their patterns of use need to be well understood for future system planning and operation. Using high-resolution data with a 10-minute time step, accurate driving patterns were generated by a Markov Chain Monte Carlo (MCMC) simulation.
However, previous MCMC simulation work was incomplete in that model results were not verified, and uncertainty analysis for practical network assessment was not undertaken. The present paper includes both of these elements.

2. Methods

Method Name: Time-inhomogeneous Markov Chain Monte Carlo (MCMC) simulation

Description: The EV movement was simulated using a discrete-state, discrete-time Markov chain that defines the states of all EVs at each time step of T minutes. It was assumed that, at every unit of time, one and only one event from a finite set of events can occur to a given EV.

Four events were considered: {D, H, W, C}, corresponding to ‘driving’, ‘parking at home’, ‘parking at workplace’, and ‘parking at commercial areas’, respectively.

Proposed Markov Chain Diagram:

Fig. 1. Markov Chain diagram of possible vehicle state transitions at time t

From time step t-1 to t, a transition probability is given for each possible transition at that specific time step. For instance, P^t_{H→D} denotes the probability of the vehicle being in state ‘D’ at time t given that it was in state ‘H’ at time t-1.
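A time-inhomogeneous chain of this kind can be sampled with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the probabilities in `toy_transitions` are made up rather than taken from the TUS data, and the toy structure in which parking states connect only through ‘D’ is an assumption of this example.

```python
import random

STATES = ("D", "H", "W", "C")  # driving, home, workplace, commercial area


def simulate_day(transition_at, n_steps, start="H", seed=None):
    """Sample one state trajectory from a time-inhomogeneous Markov chain.

    transition_at(t) returns a dict: state at t-1 -> {state at t: probability}.
    """
    rng = random.Random(seed)
    path = [start]
    for t in range(1, n_steps + 1):
        row = transition_at(t)[path[-1]]
        nxt = rng.choices(list(row), weights=list(row.values()), k=1)[0]
        path.append(nxt)
    return path


def toy_transitions(t):
    """Made-up probabilities: leaving home is likelier in one block of steps."""
    leave_home = 0.20 if 18 <= t <= 36 else 0.02
    return {
        "H": {"H": 1.0 - leave_home, "D": leave_home},
        "D": {"D": 0.50, "H": 0.20, "W": 0.20, "C": 0.10},
        "W": {"W": 0.97, "D": 0.03},
        "C": {"C": 0.90, "D": 0.10},
    }


# 144 steps of 10 minutes cover 24 hours.
path = simulate_day(toy_transitions, n_steps=144, seed=1)
```

Because the transition table is looked up again at every step, the same state can have different exit probabilities at different times of day, which is what distinguishes the time-inhomogeneous chain from a stationary one. Repeating `simulate_day` with different seeds gives the Monte Carlo ensemble.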

3. Paper structure

1) Review of previous Markov chain simulations of electric vehicles

2) Introduce the survey data, the 2000 UK Time of Use Survey (TUS) data

3) A matrix representation of the transition diagram at time t, Tt, is shown by Eq. (1)

An example of the state transition matrix for the step from 8:40 am to 8:50 am (t = 29, t0 = 4 am, 4 am + 29 × 10 min = 8:50 am) is shown in Eq. (2).

Verification of the proposed MCMC method by convergence analysis.

4) Distribution grid case study by OpenDSS software (Case 1 commercial, Case 2 residential).

Case 1: A university building at Strathclyde that accommodates up to 300 workers and has nominal parking availability for approximately 100 cars. This building is supplied by a dedicated 1000 kVA transformer.

Case 2: A low-voltage single-phase domestic network that consists of 17 households.

Fig. 2. Case 2 Single phase distribution network layout.

3. Paper Results

Results Description: 24-hour load (kVA) profiles in the grids, before and after EVs are connected.

Fig. 3. (Upper) Aggregate demand of workplace EV charging and (lower) averaged voltage profile for Household 17 with 99% CI under full EV penetration.


Case 1: An office building with approximately 100 cars and a 100% EV penetration level, that is, 100 out of 100 cars are EVs. This building is supplied by a 1000 kVA transformer.

For Case 1, a 1000 kVA transformer would easily withstand the extra EV load for both the standard and the fast charging cases. A more typical transformer for this building, with a rating nearer 500 kVA, would, however, fail to supply the EV-related load in the fast charging scenario.

Case 2: A low-voltage single-phase community consisting of 17 households.

For Case 2, EV penetration causes a severe voltage violation in the network (with a specified tolerance of −6%/+10% around nominal, i.e. 0.94 to 1.10 p.u.).
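A voltage-violation check of the kind implied for Case 2 can be sketched in a few lines. The sketch reads the stated tolerance as a band from 0.94 to 1.10 p.u., which is an assumption; the function name `voltage_violations` and the sample voltages are hypothetical and are not results from the paper.

```python
def voltage_violations(profile_pu, lower=0.94, upper=1.10):
    """Return (time index, voltage) pairs that fall outside [lower, upper] p.u."""
    return [(t, v) for t, v in enumerate(profile_pu) if v < lower or v > upper]


# Hypothetical per-unit voltages at one household over six time steps.
profile = [1.00, 0.98, 0.95, 0.93, 0.91, 0.96]
violations = voltage_violations(profile)
```

In a study like this one, the same check would be applied to the voltage profile of every Monte Carlo trial, so that the frequency of violations can be reported alongside the averaged profile.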

4. Summarization

1) Markov Chain Monte Carlo simulation, as a numerical approach, can be used to generate different electricity load profiles according to various EV charging schemes.

2) The impact of the additional EV charging loads on the local distribution network can be assessed by identifying the expected value and associated uncertainty, as measured by the standard deviation, for various grid operational metrics, such as thermal loading, voltage profiles, transformer loss of life, energy losses, and harmonic distortion levels.

3) Identifying the uncertainty of these different metrics requires a large number of trials from the MCMC simulation to achieve convergence. These uncertainties could not be generated directly by sampling from the original TUS dataset due to its limited size.

4) Also, the same steps of the MCMC approach, as described in this work, can be applied to new data sets to extract their own inherent statistical characteristics.
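The expected value, standard deviation and convergence behaviour mentioned in points 2) and 3) can be computed from per-trial results as sketched below. The data are synthetic stand-ins, the function names are hypothetical, and the normal-approximation interval for the mean is an assumption; the review does not state how the 99% CI in the paper was constructed.

```python
import math
import random
import statistics


def summarize_trials(samples, z=2.576):
    """Mean, sample standard deviation and normal-approximation CI of the mean.

    z = 2.576 corresponds to a two-sided 99% interval.
    """
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)
    half_width = z * std / math.sqrt(len(samples))
    return mean, std, (mean - half_width, mean + half_width)


def running_means(samples):
    """Mean after 1, 2, ..., n trials, useful for a convergence plot."""
    total, out = 0.0, []
    for n, x in enumerate(samples, start=1):
        total += x
        out.append(total / n)
    return out


# Stand-in for a per-trial grid metric (e.g. peak load in kVA); values are synthetic.
rng = random.Random(42)
peaks = [rng.gauss(300.0, 25.0) for _ in range(2000)]
mean, std, ci = summarize_trials(peaks)
trace = running_means(peaks)
```

Plotting `trace` against the trial count shows how many trials are needed before the estimate settles, which is the practical meaning of convergence here.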


The EV movement was simulated using a discrete-state, discrete-time Markov chain for four events {D, H, W, C}, corresponding to ‘driving’, ‘parking at home’, ‘parking at workplace’, and ‘parking at commercial areas’, respectively.

The model can be extended to EV charging states, including V2G and G2V, and further applied in reinforcement learning problems.


  • Useful datasets (1 & 2):

1. National Household Travel Survey

2. The United Kingdom 2000 Time Use Survey. National Statistics Technical Report; 2003.


  • Review of previous Markov Chain × EV works by author

Table 1. Summary of relevant literature works.

  • A: Fine data resolution (less than or equal to 10 min per step).
  • B: Vehicle status definition.
  • C: Vehicle movement simulation.
  • D: Vehicle use pattern verification.
  • E: Detailed network impact analyses considering charging location.
  • F: Uncertainty analysis of detailed network impact.
  • ✓: model feature is included in a suitable manner.
  • ✗: model feature not included.
  • —: not relevant.


[1] T.-K. Lee, Z. Bareket, T. Gordon, Z.S. Filipi. Stochastic modeling for studies of real-world PHEV usage: driving schedule and daily temporal distributions. IEEE Trans Veh Technol, 61 (4) (May 2012), pp. 1493-1502

[2] F.J. Soares, J.P. Lopes, P.R. Almeida, C.L. Moreira, L. Seca. A stochastic model to simulate electric vehicles motion and quantify the energy required from the grid. PSCC, Stockholm, Sweden (2011)

[3] E.B. Iversen, J.K. Møller, J.M. Morales, H. Madsen. Inhomogeneous Markov models for describing driving patterns. IEEE Trans Power Syst.

[4] A. Lojowska, D. Kurowicka, G. Papaefthymiou, L. van der Sluis. Stochastic modeling of power demand due to EVs using copula. IEEE Trans Power Syst, 27 (4) (2012), pp. 1960-1968

[5] A. Ashtari, E. Bibeau, S. Shahidinejad, T. Molinski. PEV charging profile prediction and analysis based on vehicle usage data. IEEE Trans Smart Grid, 3 (1) (2012), pp. 341-350

[6] A.D. Hilshey, P.D. Hines, P. Rezaei, J.R. Dowds. Estimating the impact of electric vehicle smart charging on distribution transformer aging. IEEE Trans Smart Grid, 4 (2) (2013), pp. 905-913

[7] F. Rassaei, W.S. Soh, K.C. Chua. Demand response for residential electric vehicles with random usage patterns in smart grids. IEEE Trans Sustain Energy, 6 (4) (2015), pp. 1367-1376

[8] J. Fluhr, K.H. Ahlert, C. Weinhardt. A stochastic model for simulating the availability of electric vehicles for services to the power grid. In: System Sciences (HICSS), 43rd Hawaii International Conference on. IEEE; 2010. p. 1–10.

[9] S. Shafiee, M. Fotuhi-Firuzabad, M. Rastegar. Investigating the impacts of plug-in hybrid electric vehicles on power distribution systems. IEEE Trans Smart Grid, 4 (3) (2013), pp. 1351-1360

[10] Y. Wang, S. Huang, D. Infield. Investigation of the potential for electric vehicles to support the domestic peak load. In: Electric Vehicle Conference (IEVC), IEEE. Dec. 2014. p. 1–8.


<End of Review>


Written by,

Qiyun (Kevin) Dang

SUMMARY OF: DeepMPC: Learning Deep Latent Features for Model Predictive Control


  • DOI:10.15607/RSS.2015.XI.012
  • Source: Lenz, Ian et al. “DeepMPC: Learning Deep Latent Features for Model Predictive Control.” Robotics: Science and Systems (2015).

Literature Review by S. Seal

Edited by Q. Dang, D. Wu


1. Paper Motivation

Human intuition in solving a problem is hard to replicate in robotics. For complex nonlinear dynamics such as robotic food cutting, controller design is difficult, particularly when the system dynamics vary over time as well as with the properties of the surrounding environment. In this article the authors use deep learning to build a recurrent conditional deep predictive model for a model predictive controller (MPC) used in robotic food cutting [1].
While MPC has already proven efficient in solving control problems in various fields, the difficulty lies mostly in its implementation, since it involves rigorous prediction and optimization at each time step with a considerably complex system model that adequately represents how the system state evolves over time in response to the control inputs. However, with rapid advances in machine learning, available system data can be exploited to design simpler yet accurate system models that sufficiently approximate the system behaviour and generate reliable predictions for the MPC. In this article, the authors show that a deep architecture can help improve the performance of MPC and its real-time implementation.

2. Main Contributions

  1. DeepMPC: Online continuous-space real-time feedforward MPC using a novel deep architecture which models system dynamics conditioned on learned latent system properties.
  2. Novel multi-stage pre-training learning algorithm for recurrent networks which avoids the overfitting problem and the “exploding gradient” problem.
  3. Multiplicative conditional interactions and temporal recurrence are used to model inter-material and time-varying intra-material characteristics.
  4. Instead of using temporally local information, this model uses learned recurrent features to integrate long-term information and model unobserved system properties.
  5. Implementation for real-time application. Fast inference with a prediction horizon of 1 s (100 samples) and gradient evaluation at 1.2 kHz.

3. Method

A.      Problem definition:


Figure 1: End-effector gripper with axes used in [1]

Figure 2: Block diagram of DeepMPC [1]

The objective is to cut food items of different varieties along the Z direction using a force applied along the end-effector X axis.

B.      Modelling of time-varying nonlinear dynamics for the MPC prediction model with deep networks

i.            Dynamic response features:

1.       Basic input features for the deep predictive model incorporate both the control inputs and the system states (the outputs of the prediction model).

2.       To capture higher-order and delayed responses, time blocks are used to train the model instead of single-timestep data.

ii.            Conditional dynamic responses: to incorporate both short-term and long-term information in modelling local system dynamics, three sets of features are considered:

1.       Current control inputs

2.       Past time block’s dynamic response

3.       Latent features modelling long-term observation.

iii.            Long-term recurrent latent features: transforming recurrent units (TRUs) are introduced that retain state information from previous observations by using

1.       Outputs from previous TRU.

2.       Short-term response features from current and past time blocks.
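The time-block idea from item B.i.2 can be shown with a small sketch. It is a minimal illustration under stated assumptions: the function names, the block length of 4 and the sample values are hypothetical, and non-overlapping blocks are assumed since the review does not specify how blocks are formed in [1].

```python
def to_time_blocks(series, block_len):
    """Split a sequence of samples into consecutive, non-overlapping blocks.

    Trailing samples that do not fill a whole block are dropped.
    """
    n_blocks = len(series) // block_len
    return [series[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]


def block_features(inputs, states, block_len):
    """Pair each block of control inputs with the matching block of states."""
    u_blocks = to_time_blocks(inputs, block_len)
    x_blocks = to_time_blocks(states, block_len)
    return list(zip(u_blocks, x_blocks))


forces = [0.1 * k for k in range(10)]      # hypothetical control inputs
positions = [0.01 * k for k in range(10)]  # hypothetical measured states
pairs = block_features(forces, positions, block_len=4)
```

Each pair holds several consecutive samples, so a model trained on blocks can see delayed and higher-order effects that a single timestep would hide.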

C.       Learning and inference

i.            Three step learning:

1.       Phase 1: Unsupervised pre-training (similar to the sparse auto-encoder algorithm), to obtain a good initial estimate of the latent features and to train the non-recurrent parameters of the transforming recurrent unit (TRU).

2.       Phase 2: Single-step prediction training (second pre-training stage), which trains the model to predict a single timestep into the future. Recurrent weights of the TRU are set to zero. This phase minimizes the prediction error starting from the initial model parameters (weights) and produces the pre-trained set of initial parameter values.

3.       Phase 3: Warm-latent recurrent training. The set of initial parameters from Phase 2 is used to initialize the recurrent prediction system, which generates system state predictions. The system is then optimized to minimize the sum-squared prediction error over a finite time horizon using an algorithm similar to backpropagation-through-time.

When running online, the model is warm-started: the latent system states are propagated for a few time blocks without any optimization or prediction.

ii.            Inference: The trained model is then recurrently used to predict future system states for a finite time horizon by using predicted system states, latent states and control inputs for subsequent time blocks. No online optimization is necessary for inference.
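The recurrent inference described in item C.ii amounts to feeding each prediction back into the model. The sketch below shows only that loop: `toy_step` is a hypothetical stand-in for the trained network, and its dynamics are invented for illustration.

```python
def rollout(step, state, latent, controls):
    """Predict future states by feeding each prediction back into the model.

    step(state, latent, u) -> (next_state, next_latent) stands in for the
    trained network; no optimization happens here, only forward evaluation.
    """
    predictions = []
    for u in controls:
        state, latent = step(state, latent, u)
        predictions.append(state)
    return predictions


def toy_step(state, latent, u):
    """Hypothetical dynamics: the latent value acts like a slowly varying stiffness."""
    next_latent = 0.9 * latent + 0.1 * abs(u)
    next_state = state + 0.1 * u / (1.0 + next_latent)
    return next_state, next_latent


future = rollout(toy_step, state=0.0, latent=0.0, controls=[1.0, 1.0, 1.0])
```

The latent value is carried from step to step alongside the predicted state, which is how long-term information reaches predictions further along the horizon.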

D.      Online MPC

i.            Offline prediction process: As described earlier, model parameters from the deep predictive model are fed to the optimization process offline.

ii.            Control process:

1.       Calculates the end-effector (EE) pose using forward kinematics.

2.       Applies stiffness control for restoring forces along the axes not controlled by MPC.

3.       Implements joint torques received from the shared memory space as optimized MPC control signals.

4.       Updates the EE pose in the shared memory space to be used by the optimization process.

iii.            Optimization process:

1.       System model parameters: available offline

2.       MPC cost function parameters: adjustable online

a.       Penalizes the knife motion along the X and Z axes.

b.       Generates a gradient with respect to the states, which is subsequently used by the model to generate a gradient with respect to the control inputs, i.e. the forces, using backpropagation-through-time.

3.       The gradients with respect to the forces are then used by a gradient-descent-based algorithm to generate the control signal, which the control process reads from the shared memory space.
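The optimization loop in item D.iii can be illustrated on a toy problem. In the sketch below a scalar linear model replaces the deep network, the cost is a plain tracking error, and the update is plain gradient descent; all of these are assumptions made for brevity, and the names and constants are hypothetical. Only the structure (forward rollout, backward pass through time, gradient step on the controls) mirrors the description.

```python
def simulate(x0, u, a=0.9, b=0.5):
    """Roll a toy linear model x[k+1] = a*x[k] + b*u[k] forward."""
    xs = [x0]
    for u_k in u:
        xs.append(a * xs[-1] + b * u_k)
    return xs


def cost_and_grad(x0, u, ref, a=0.9, b=0.5):
    """Tracking cost and its gradient w.r.t. every control input.

    The backward loop is backpropagation through time for this toy model.
    """
    xs = simulate(x0, u, a, b)
    cost = sum((x - r) ** 2 for x, r in zip(xs[1:], ref))
    grad = [0.0] * len(u)
    d_next = 0.0  # dCost/dx for the state one step later
    for k in reversed(range(len(u))):
        d_state = 2.0 * (xs[k + 1] - ref[k]) + a * d_next
        grad[k] = b * d_state  # u[k] enters x[k+1] through b
        d_next = d_state
    return cost, grad


def optimize_controls(x0, ref, steps=500, lr=0.1):
    """Plain gradient descent on the control sequence."""
    u = [0.0] * len(ref)
    for _ in range(steps):
        _, grad = cost_and_grad(x0, u, ref)
        u = [u_k - lr * g for u_k, g in zip(u, grad)]
    return u


ref = [1.0] * 5
u_opt = optimize_controls(0.0, ref)
start_cost, _ = cost_and_grad(0.0, [0.0] * 5, ref)
final_cost, _ = cost_and_grad(0.0, u_opt, ref)
```

The backward loop visits the horizon in reverse so that each control input collects the effect it has on every later state, which is the part that a learned recurrent model would compute through its own layers.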

E.      Dataset

  i.            Large-scale dataset of 1488 material cuts for 20 different classes.

  ii.            Over 450 real-time robotic experiments.


4. Results

A.      Prediction experiments:


1.       Linear state-space model, ARMAX model with weights on past states, K-nearest neighbour (5-NN) model.

2.       Also compared with a linear Gaussian mixture model (GMM) and a Gaussian process (GP) model trained with the GPML package.

As shown in Figure 3 [1], the proposed prediction model outperforms the baseline methods. The figure also gives the 95% confidence interval of the prediction error.

Figure 3 [1]: Prediction error: Mean L2 distance (in mm) from predicted to ground-truth trajectory from 0.01s to 0.5s in the future [1].
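The error metric named in the Figure 3 caption, the mean L2 distance between predicted and ground-truth trajectories, can be computed as sketched below. The function name and the sample points are hypothetical; the positions are not data from [1].

```python
import math


def mean_l2_error(predicted, ground_truth):
    """Mean Euclidean distance between matching points of two trajectories."""
    if len(predicted) != len(ground_truth):
        raise ValueError("trajectories must have the same number of points")
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Hypothetical (x, z) knife positions in mm at three future time steps.
pred = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
error = mean_l2_error(pred, truth)
```

Here the pointwise distances are 0, 5 and 10 mm, so the mean error is 5 mm.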

B.      Robotic experiments:


1.       Class-generic stiffness controller

2.       Class-specific stiffness controllers

3.       An algorithm presented in [2] where class-specific material properties are mapped to haptic clusters.

Figure 4 [1]: Mean cutting rates, with bars showing normal standard deviation, for ten diverse materials. The red bar uses the same controller for all materials, the blue bar uses the same controller for each cluster given by [2], the purple bar uses a tuned stiffness controller for each material, and the green bar is the online MPC method proposed in [1].

This approach showed 46% improved accuracy compared to a standard recurrent deep network. Related experimental videos and discussion can be found in [1].

5. Suggested future work

  • The deep prediction model for MPC proposed in this article can be useful for other nonlinear applications, for example building energy management, where implementing MPC requires a building-specific prediction model. With a deep predictive model for MPC, available seasonal forecast data, time-of-use data and control data from the existing control system can be used to model different types of buildings. Adaptive training of the deep predictive model can help generalize MPC design for the building sector.


[1] I. Lenz, R. Knepper, and A. Saxena, “DeepMPC: Learning Deep Latent Features for Model Predictive Control,” in Robotics: Science and Systems XI, 2015.
[2] M. C. Gemici and A. Saxena, “Learning haptic representation for manipulating deformable food objects,” in 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 638–645.


<End of Review>


Written by,

Sayani Seal