Define the adjoint state
$$a(t) = \frac{\partial \mathrm{Loss}}{\partial z(t)} \tag{1}$$
From $t$ to $t+\epsilon$ (a small change $\epsilon$ in time) we have
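$$z(t+\epsilon) = \int_t^{t+\epsilon} f(z(t), t, \theta)\,dt + z(t) = T_\epsilon(z(t), t) \tag{2}$$
And because of the chain rule ($\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u}\frac{\partial u}{\partial x}$)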
$$a(t) = a(t+\epsilon)\,\frac{\partial T_\epsilon(z(t), t)}{\partial z(t)} \tag{3}$$
Take the definition of the derivative:
$$\frac{\partial a(t)}{\partial t} = \lim_{\epsilon \to 0} \frac{a(t+\epsilon) - a(t)}{\epsilon} \tag{4}$$
Substitute (3) into (4):
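$$\frac{\partial a(t)}{\partial t} = \lim_{\epsilon \to 0} \frac{a(t+\epsilon) - a(t+\epsilon)\,\frac{\partial T_\epsilon(z(t), t)}{\partial z(t)}}{\epsilon} \tag{5}$$
$$\frac{\partial a(t)}{\partial t} = \lim_{\epsilon \to 0} \frac{a(t+\epsilon) - a(t+\epsilon)\,\frac{\partial T_\epsilon(z(t), t)}{\partial z(t)}}{\epsilon} \tag{6}$$
Take the Taylor series around $z(t)$ in (6):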
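$$\frac{\partial a(t)}{\partial t} = \lim_{\epsilon \to 0} \frac{a(t+\epsilon) - a(t+\epsilon)\,\frac{\partial}{\partial z(t)}\left(z(t) + \epsilon f(z(t), t, \theta) + O(\epsilon^2)\right)}{\epsilon} \tag{7}$$
That is, $T_\epsilon(z(t), t)$ is replaced by $z(t) + \epsilon f(z(t), t, \theta) + O(\epsilon^2)$ as $\epsilon \to 0$.
When the change in time $\epsilon$ is small, the integral over an interval of length $\epsilon - 0 = \epsilon$ becomes approximately $\epsilon f(z(t), t, \theta)$. The $O(\epsilon^2)$ term at the end makes up for the approximation error (notice that it depends on $\epsilon$).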
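The claim that the remainder is $O(\epsilon^2)$ can be checked numerically. The following is a minimal sketch, assuming the hypothetical scalar dynamics $f(z) = -z$ (not from the derivation above), chosen only because its exact flow $T_\epsilon(z) = z e^{-\epsilon}$ is known in closed form. All function names are illustrative.

```python
import math

def f(z):
    # Hypothetical dynamics dz/dt = -z (an assumption for illustration).
    return -z

def T_exact(z, eps):
    # Exact solution of dz/dt = -z after a time step eps.
    return z * math.exp(-eps)

def T_first_order(z, eps):
    # First-order Taylor approximation z + eps * f(z).
    return z + eps * f(z)

def remainder(z, eps):
    # Size of the term that the derivation writes as O(eps^2).
    return abs(T_exact(z, eps) - T_first_order(z, eps))

z = 2.0
r1 = remainder(z, 1e-2)
r2 = remainder(z, 5e-3)
ratio = r1 / r2  # halving eps should shrink the remainder by about 4x
print(r1, r2, ratio)
```

If the remainder really scales like $\epsilon^2$, halving $\epsilon$ should reduce it by a factor close to 4, which is what the printed ratio is expected to show.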
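Expand (7):
$$\frac{\partial a(t)}{\partial t} = \lim_{\epsilon \to 0} \frac{a(t+\epsilon) - a(t+\epsilon)\left(\frac{\partial z(t)}{\partial z(t)} + \frac{\partial\,\epsilon f(z(t), t, \theta)}{\partial z(t)} + O(\epsilon^2)\right)}{\epsilon} \tag{8}$$
Since $\frac{\partial z(t)}{\partial z(t)} = I$: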
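$$\frac{\partial a(t)}{\partial t} = \lim_{\epsilon \to 0} \frac{a(t+\epsilon) - a(t+\epsilon)\left(I + \frac{\partial\,\epsilon f(z(t), t, \theta)}{\partial z(t)} + O(\epsilon^2)\right)}{\epsilon} \tag{9}$$
$$\frac{\partial a(t)}{\partial t} = \lim_{\epsilon \to 0} \frac{a(t+\epsilon) - a(t+\epsilon)\left(I + \epsilon\,\frac{\partial f(z(t), t, \theta)}{\partial z(t)} + O(\epsilon^2)\right)}{\epsilon} \tag{10}$$
$$\frac{\partial a(t)}{\partial t} = \lim_{\epsilon \to 0} \frac{-a(t+\epsilon)\,\epsilon\,\frac{\partial f(z(t), t, \theta)}{\partial z(t)} + O(\epsilon^2)}{\epsilon} \tag{11}$$
since $a(t+\epsilon) - a(t+\epsilon) I = 0$.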
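$$\frac{\partial a(t)}{\partial t} = \lim_{\epsilon \to 0} \left(-a(t+\epsilon)\,\frac{\partial f(z(t), t, \theta)}{\partial z(t)} + O(\epsilon)\right) \tag{12}$$
and, taking the limit $\epsilon \to 0$,
$$\frac{\partial a(t)}{\partial t} = -a(t)\,\frac{\partial f(z(t), t, \theta)}{\partial z(t)} \tag{13}$$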
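As a sanity check on the final adjoint equation, the sketch below integrates it backward in time and compares $a(0)$ against a finite-difference gradient of the loss with respect to $z(0)$. Everything specific here is an assumption made for illustration: the dynamics $f(z) = \tanh(Wz)$ with a made-up matrix `W` standing in for $\theta$ (and no explicit time dependence), the loss $\tfrac{1}{2}\lVert z(T)\rVert^2$, the fixed-step RK4 integrator, and all function names.

```python
import numpy as np

# Hypothetical parameters theta (an assumption, not from the derivation).
W = np.array([[0.0, 1.0], [-1.0, -0.5]])

def f(z):
    # Assumed dynamics dz/dt = tanh(W z).
    return np.tanh(W @ z)

def df_dz(z):
    # Jacobian of tanh(W z): row i of W scaled by 1 - tanh^2((W z)_i).
    return (1.0 - np.tanh(W @ z) ** 2)[:, None] * W

def rk4(g, y, t0, t1, steps):
    # Fixed-step RK4 for dy/dt = g(y); a negative step runs backward in time.
    h = (t1 - t0) / steps
    for _ in range(steps):
        k1 = g(y)
        k2 = g(y + 0.5 * h * k1)
        k3 = g(y + 0.5 * h * k2)
        k4 = g(y + h * k3)
        y = y + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return y

def loss(z0, T=1.0, steps=200):
    zT = rk4(f, z0, 0.0, T, steps)
    return 0.5 * np.sum(zT ** 2)

def adjoint_gradient(z0, T=1.0, steps=200):
    zT = rk4(f, z0, 0.0, T, steps)
    aT = zT  # dLoss/dz(T) for the assumed loss 0.5 * ||z(T)||^2

    def augmented(y):
        # Joint dynamics of (z, a): dz/dt = f(z), da/dt = -a df/dz.
        z, a = y[:2], y[2:]
        return np.concatenate([f(z), -a @ df_dz(z)])

    y0 = rk4(augmented, np.concatenate([zT, aT]), T, 0.0, steps)
    return y0[2:]  # a(0) = dLoss/dz(0)

def finite_difference_gradient(z0, eps=1e-6):
    grad = np.zeros_like(z0)
    for i in range(z0.size):
        e = np.zeros_like(z0)
        e[i] = eps
        grad[i] = (loss(z0 + e) - loss(z0 - e)) / (2 * eps)
    return grad

z0 = np.array([0.3, -0.7])
g_adj = adjoint_gradient(z0)
g_fd = finite_difference_gradient(z0)
print(g_adj, g_fd)
```

The two printed vectors should agree up to discretization and finite-difference error. Here $a$ is treated as a row vector, so the product `a @ df_dz(z)` matches the ordering $a(t)\,\frac{\partial f}{\partial z(t)}$ in the derivation.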