Adjoint ODE

Question: Minimize $F(x, p)$

$$F(x, p) = \int_{0}^{T} f(x, p, t)\, dt \tag{1}$$

Subject to

$$g(x_0, p) = x_0 - p = 0 \tag{2}$$

and to the ODE itself, written as an implicit constraint:

$$h(x_t, \dot{x}_t, p, t) = 0 \tag{3}$$

-> If given $x_0$, we can compute $p$ at $t = 0$ through $g(x_0, p)$, then substitute $x_t$ and $p$ into $h(x_t, \dot{x}_t, p, t)$ to compute $\dot{x}_t$.

Why is $\frac{\partial x}{\partial p}$ difficult to calculate? Because $f_n(x, p)$ is unknown in closed form, so we would need to solve the ODE for every required combination of $x$ and $p$.
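A minimal sketch of both points above, the forward solve and the cost of $\frac{\partial x}{\partial p}$. The dynamics `f_n`, the time horizon, the numbers, and the use of `scipy` are all illustrative assumptions, not from these notes:

```python
# Sketch only: f_n, T, and the numbers below are hypothetical.
import numpy as np
from scipy.integrate import solve_ivp

def f_n(t, x, p):
    # Hypothetical dynamics; in a Neural ODE this would be a learned network.
    return -p * x

x0 = np.array([1.0, 2.0])
p = x0.copy()  # from constraint (2): g(x0, p) = x0 - p = 0  =>  p = x0

def solve(p):
    """One full forward solve of x_dot = f_n(x, p, t) on [0, T], with x(0) = p."""
    return solve_ivp(f_n, (0.0, 5.0), p, args=(p,)).y[:, -1]

x_T = solve(p)  # forward pass: integrate the ODE to get x(T)

# Naive dx(T)/dp by finite differences: one EXTRA full ODE solve per
# parameter dimension -- this is the cost the adjoint method avoids.
eps = 1e-6
dxdp = np.stack([(solve(p + eps * e) - x_T) / eps for e in np.eye(p.size)], axis=1)
print(x_T)    # state at t = T
print(dxdp)   # Jacobian estimate, shape (dim x, dim p)
```

For a network with millions of parameters, one extra solve per parameter is infeasible, which motivates the Lagrangian construction below.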

Apply the Lagrangian function $\mathcal{L}(x, \lambda) = f(x) - \lambda\, g(x)$ and combine (1), (2), (3) into one loss function:

$$\text{Loss} = \int_{0}^{T}\left[f(x, p, t) + \lambda^T h(x_t, \dot{x}_t, p, t)\right] dt + u^T g(x_0, p) \tag{4}$$
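For concreteness (an assumption here, since the notes keep $h$ general): in the Neural ODE setting the constraint (3) is usually the dynamics in implicit form, so the $\lambda^T h$ term in (4) penalizes any violation of the ODE along the trajectory:

$$h(x_t, \dot{x}_t, p, t) = \dot{x}_t - f_n(x_t, p, t) = 0$$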

Substitute (2) and (3), both of which equal zero along any feasible trajectory.

Why does the integral equal the loss, and why is this a valid loss function to minimize? Because the constraint terms vanish on any feasible trajectory, the Lagrangian loss reduces to the original objective:

$$\text{Loss} = \int_{0}^{T}\left[f(x, p, t) + \lambda^T \cdot 0\right] dt + u^T \cdot 0 = \int_{0}^{T} f(x, p, t)\, dt = F(x, p) \tag{5}$$

$$\frac{\partial L}{\partial p} = \frac{\partial F}{\partial p} \tag{6}$$

Why calculate $\frac{\partial L}{\partial p}$ during backprop? To use Newton's method to approximate $f(x)$ at a given point.

$$\frac{\partial L}{\partial p} = \frac{\partial F}{\partial p} = \int_{0}^{T}\left[\frac{\partial f}{\partial x}\cdot\frac{\partial x}{\partial p} + \frac{\partial f}{\partial p} + \lambda^T\left(\frac{\partial h}{\partial x}\cdot\frac{\partial x}{\partial p} + \frac{\partial h}{\partial \dot{x}}\cdot\frac{\partial \dot{x}}{\partial p} + \frac{\partial h}{\partial p}\right)\right] dt + u^T\left(\frac{\partial g}{\partial x_0}\cdot\frac{\partial x_0}{\partial p} + \frac{\partial g}{\partial p}\right)$$

To avoid computing $\frac{\partial \dot{x}}{\partial p}$, we apply integration by parts: $\int u\, dv = uv - \int v\, du$
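Worked out for the only term that contains $\frac{\partial \dot{x}}{\partial p}$, taking the identity's $u = \lambda^T \frac{\partial h}{\partial \dot{x}}$ (not to be confused with the multiplier $u^T$ in (4)) and $dv = \frac{\partial \dot{x}}{\partial p}\, dt = d\!\left(\frac{\partial x}{\partial p}\right)$ (a standard step, spelled out here for clarity):

$$\int_{0}^{T} \lambda^T \frac{\partial h}{\partial \dot{x}} \cdot \frac{\partial \dot{x}}{\partial p}\, dt
= \left[\lambda^T \frac{\partial h}{\partial \dot{x}} \cdot \frac{\partial x}{\partial p}\right]_{0}^{T}
- \int_{0}^{T} \frac{d}{dt}\!\left(\lambda^T \frac{\partial h}{\partial \dot{x}}\right) \cdot \frac{\partial x}{\partial p}\, dt$$

After this substitution only $\frac{\partial x}{\partial p}$ appears, and it can be grouped with the other $\frac{\partial x}{\partial p}$ terms in the integrand.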

[Figure: integral equals loss]

[Figure: approximation using updated derivative]
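A hedged sketch of the concluding step, which the notes stop short of but which is the standard continuation and the source of this page's title: choose $\lambda(t)$ so that every coefficient of the unknown $\frac{\partial x}{\partial p}$ vanishes. That choice is the adjoint ODE, with a terminal condition at $t = T$, so it is solved backward in time:

$$\frac{d}{dt}\!\left(\lambda^T \frac{\partial h}{\partial \dot{x}}\right) = \frac{\partial f}{\partial x} + \lambda^T \frac{\partial h}{\partial x},
\qquad
\lambda^T \frac{\partial h}{\partial \dot{x}}\bigg|_{t=T} = 0$$

With $\lambda$ chosen this way, the only remaining $\frac{\partial x}{\partial p}$ is at $t = 0$, where it equals $\frac{\partial x_0}{\partial p}$ and is known from constraint (2), so the gradient can be evaluated without ever forming $\frac{\partial x}{\partial p}$ along the trajectory:

$$\frac{\partial L}{\partial p} = \int_{0}^{T}\left[\frac{\partial f}{\partial p} + \lambda^T \frac{\partial h}{\partial p}\right] dt
- \lambda^T \frac{\partial h}{\partial \dot{x}} \cdot \frac{\partial x_0}{\partial p}\bigg|_{t=0}
+ u^T\left(\frac{\partial g}{\partial x_0}\cdot\frac{\partial x_0}{\partial p} + \frac{\partial g}{\partial p}\right)$$

In practice the backward $\lambda$ solve is handled by the library. A minimal sketch assuming PyTorch and the torchdiffeq package (neither is mentioned in these notes): `odeint_adjoint` integrates the ODE forward, and `loss.backward()` integrates the adjoint ODE backward instead of backpropagating through stored solver steps.

```python
# Sketch only: the model, data, and loss are hypothetical.
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint  # adjoint-based gradients

class Dynamics(nn.Module):
    """Learnable dynamics x_dot = f_n(x, p, t); p are the network weights."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 2))

    def forward(self, t, x):
        return self.net(x)

func = Dynamics()
x0 = torch.tensor([[1.0, 0.0]])        # initial state x_0
t = torch.linspace(0.0, 1.0, 50)       # times at which x(t) is returned

x_t = odeint(func, x0, t)              # forward solve of the ODE
loss = x_t.pow(2).mean()               # stand-in for F(x, p)
loss.backward()                        # dLoss/dp via the adjoint ODE

# Every parameter p in func.parameters() now carries .grad for the update.
```

The practical payoff is that memory does not grow with the number of solver steps, at the price of a second ODE solve during the backward pass.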