Hi!

I am using Ray RLlib in combination with a traffic simulator in order to train an agent that controls traffic lights.

Let’s say the minimum green time of a traffic light phase is 5 s and the transition time (green -> yellow -> red) is 3 s. This means my agent is interacting with the traffic lights non-periodically (sometimes after 5 s and sometimes after 3 s + 5 s = 8 s), which differs from the standard Markov Decision Process framework.

In order to tackle this, I have to use the Semi-Markov Decision Process (SMDP) framework. For Q-learning this modifies the TD error as follows:

`r_t + gamma * r_{t+1} + ... + gamma^N * r_{t+N} + gamma^{N+1} * maxQ(s',a') - Q(s,a)`

where `r_t, r_{t+1}, ...` are equally spaced intermediate rewards, calculated once per second.

The calculation of `r_t + gamma * r_{t+1} + ... + gamma^N * r_{t+N}` is done in my custom environment.
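For illustration, here is a minimal sketch of how that aggregation could look inside the environment. The helper name `aggregate_rewards` and the per-second `gamma` value are my own assumptions, not anything RLlib-specific:

```python
GAMMA = 0.99  # assumed per-second discount factor


def aggregate_rewards(intermediate_rewards, gamma=GAMMA):
    """Fold the equally spaced 1 s rewards r_t ... r_{t+N} into a single
    discounted return, and also report gamma^(N+1), i.e. the factor the
    learner would need to apply to maxQ(s', a') for this interval.

    ``intermediate_rewards`` holds N+1 rewards, one per second of the
    decision interval (5 for a 5 s interval, 8 for an 8 s interval).
    """
    discounted_return = 0.0
    for k, r in enumerate(intermediate_rewards):
        discounted_return += (gamma ** k) * r
    # len(intermediate_rewards) == N + 1, so this is gamma^{N+1}
    bootstrap_discount = gamma ** len(intermediate_rewards)
    return discounted_return, bootstrap_discount
```

So a 5 s interval yields a bootstrap discount of `gamma^5` and an 8 s interval yields `gamma^8` — the question below is how to get that varying factor into the learner.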

But depending on the interval between two interaction points of my agent with the traffic lights (5 s or 8 s), the algorithm has to discount `maxQ(s',a')` with a different `N` in `gamma^{N+1}`. How can I tell the algorithm (DQN or APEX) which discount factor to use?

Thanks guys in advance for any suggestions!