In the paper Learning to Predict by the Methods of Temporal Differences (p. 15), the weights in temporal-difference learning are updated according to the equation $$ \Delta w_t = \alpha \left(P_{t+1} - P_t\right) \sum_{k=1}^{t}{\lambda^{t-k} \nabla_w P_k} \tag{4} \,.$$ When $\lambda = 0$, as in TD(0), how does the method learn? As far as I can tell, with $\lambda = 0$ there will never be a change in the weights, and hence no learning.

Am I missing anything?

On page 16 of the same paper Learning to Predict by the Methods of Temporal Differences (1988), Sutton actually states that $\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t$ is the learning rule when $\lambda = 0$.
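The apparent paradox in the question can be checked numerically: with the usual convention $0^0 = 1$, setting $\lambda = 0$ in equation (4) zeroes every term of the sum except $k = t$, which reduces it to the rule quoted above. A minimal sketch, assuming a hypothetical linear predictor (the feature vectors, step count, and step size are made up for illustration):

```python
import numpy as np

# Sketch (not from the paper): linear predictor P_k = w . x_k, so grad_w P_k = x_k.
# With lambda = 0, the eligibility sum  sum_{k=1}^{t} lambda^{t-k} grad_w P_k
# keeps only the k = t term, because 0^0 = 1 while 0^(t-k) = 0 for k < t.
rng = np.random.default_rng(0)
alpha, lam, t = 0.1, 0.0, 4
x = rng.normal(size=(t + 1, 3))     # hypothetical feature vectors x_1 .. x_{t+1}
w = rng.normal(size=3)

P = x @ w                           # predictions P_1 .. P_{t+1} (0-based arrays)
td_error = P[t] - P[t - 1]          # the factor (P_{t+1} - P_t) in equation (4)

# Full eligibility sum from equation (4): sum_{k=1}^{t} lambda^{t-k} grad_w P_k
elig = sum(lam ** (t - 1 - k) * x[k] for k in range(t))
dw_full = alpha * td_error * elig

# TD(0) special case: only the most recent gradient survives
dw_td0 = alpha * td_error * x[t - 1]

print(np.allclose(dw_full, dw_td0))  # the two updates coincide when lambda = 0
```

So the weight change is generally nonzero under TD(0); the sum degenerates to $\nabla_w P_t$ rather than to zero.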

– nbro – 2019-06-01T16:51:36.850

He starts with the supervised setting and then derives the Widrow-Hoff (or delta) rule. The TD rule is then a special case of the delta rule, where the errors $z - P_t$ are replaced with a summation of the successive temporal-difference predictions. However, how is that specific 1-step TD learning rule exactly related to the usual learning rules of (tabular) temporal difference methods, where apparently no gradient is needed? – nbro – 2019-06-01T16:58:09.543

@nbro You can view tabular methods as methods using linear function "approximation", where there is a single binary feature for every possible state-action pair. Then there would be a gradient needed, but the gradient would simply be $1$ for the "binary feature" corresponding to the state-action pair, and $0$ everywhere else. – Dennis Soemers – 2019-06-01T17:33:29.070
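That correspondence between tabular methods and one-hot linear function approximation can be sketched in code. The gradient $\nabla_w P$ of $P = w \cdot \phi(s)$ is the one-hot feature vector $\phi(s)$ itself, so the "gradient" update touches only the entry for the visited state. Everything here (5 states, the transition, the reward) is a hypothetical example, and it uses the reward-based tabular TD(0) update for state values rather than the paper's reward-free prediction setting:

```python
import numpy as np

# Hypothetical 5-state example: a value table viewed as linear function
# approximation with one-hot features, as in Dennis Soemers' comment.
n_states = 5
w = np.zeros(n_states)          # one weight per state = the value table

def phi(s):
    """One-hot feature vector for state s."""
    f = np.zeros(n_states)
    f[s] = 1.0
    return f

alpha = 0.5
s, s_next, reward = 2, 3, 1.0   # hypothetical observed transition

# TD(0) with prediction P = w . phi(s); the gradient grad_w P is just phi(s)
td_error = reward + w @ phi(s_next) - w @ phi(s)
w += alpha * td_error * phi(s)  # only w[2] changes; all other entries stay 0

print(w)
```

The multiplication by $\phi(s)$ masks the update down to a single table cell, which is exactly why the familiar tabular rule looks gradient-free.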