13 Gradient Descent Functions and Hyperparameters
This section uses the following contracts:
accompanied? : (listof tensor?)
objective-fn? : (-> theta? tensor?), as defined in Loss Functions
id? : the contract of the identity function
inflator? : (-> tensor? accompanied?)
deflator? : (-> accompanied? tensor?)
updator? : (-> accompanied? tensor? accompanied?)
id-updator? : (-> tensor? tensor? tensor?)
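For instance, an accompanied parameter that carries one piece of extra state alongside the parameter tensor is simply a two-element list, so a matching inflator and deflator could look like this (a sketch; the names wrap and unwrap are hypothetical):

(define wrap     ; inflator?: pair p with zeroed extra state
  (λ (p)
    (list p (zeroes p))))

(define unwrap   ; deflator?: recover the parameter tensor
  (λ (pa)
    (ref pa 0)))

This is exactly the shape of the velocity-i and velocity-d definitions below.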
Hyperparameters can be given values using with-hypers as in Hyperparameters.
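For example, assuming alpha and revs have been declared as hyperparameters (as in Hyperparameters), a descent built with the gradient-descent constructor documented next can be run under particular values like this (a sketch; obj and θ stand for an objective function and a parameter list defined elsewhere):

(with-hypers ((alpha 0.01)
              (revs 1000))
  ((gradient-descent inflate deflate update) obj θ))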
procedure
(gradient-descent inflate deflate update)
 → (-> objective-fn? theta? theta?)
  inflate : id?
  deflate : id?
  update : id-updator?
(gradient-descent inflate deflate update)
 → (-> objective-fn? theta? theta?)
  inflate : inflator?
  deflate : deflator?
  update : updator?
inflate injects a parameter tensor into an accompanied parameter.
deflate projects a parameter tensor out of an accompanied parameter.
update produces a new accompanied? from a given accompanied parameter and a gradient tensor.
The generated gradient descent function accepts an objective function and a θ and returns a revised θ after revs revisions, using gradient descent.
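A sketch of how such a function can be assembled from its three pieces (close in spirit to the definition in The Little Learner; revise, which applies a function revs times to an argument, and the gradient function ∇ are assumed from earlier sections):

(define gradient-descent
  (λ (inflate deflate update)
    (λ (obj θ)
      (let ((f (λ (big-θ)
                 (map update
                      big-θ
                      (∇ obj (map deflate big-θ))))))
        (map deflate
             (revise f revs (map inflate θ)))))))

θ is inflated once up front, each of the revs revisions updates the accompanied parameters using the gradient of obj at the deflated θ, and the final result is deflated back to a plain parameter list.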
procedure
(naked-gradient-descent obj θ) → (listof tensor?)
  obj : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Constructed using gradient-descent, where inflate and deflate are the identity function and update is
(λ (pa g)
  (- pa (* alpha g)))
Here alpha is the learning rate hyperparameter.
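A hypothetical end-to-end use, minimizing a one-parameter quadratic (obj and the starting θ below are made up for illustration):

(define obj
  (λ (θ)
    (sqr (- (ref θ 0) 3.0))))

(with-hypers ((alpha 0.1)
              (revs 100))
  (naked-gradient-descent obj (list 0.0)))
;; ⇒ a θ whose single element is close to 3.0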
procedure
(velocity-gradient-descent obj θ) → (listof tensor?)
  obj : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Constructed using gradient-descent with the following inflate, deflate, and update functions:
(define velocity-i
  (λ (p)
    (list p (zeroes p))))

(define velocity-d
  (λ (pa)
    (ref pa 0)))

(define velocity-u
  (λ (pa g)
    (let ((v (- (* mu (ref pa 1)) (* alpha g))))
      (list (+ (ref pa 0) v) v))))
Here mu is the hyperparameter defining the fraction of the velocity from the past revision that is transferred to the current revision.
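Assuming the three definitions above, the variant itself is then presumably just the plug-in (a sketch):

(define velocity-gradient-descent
  (gradient-descent velocity-i velocity-d velocity-u))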
procedure
(rms-gradient-descent obj θ) → (listof tensor?)
  obj : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Constructed using gradient-descent with the following inflate, deflate, and update functions:
(define rms-i
  (λ (p)
    (list p (zeroes p))))

(define rms-d
  (λ (pa)
    (ref pa 0)))

(define rms-u
  (λ (pa g)
    (let ((r (smooth beta (ref pa 1) (sqr g))))
      (let ((alpha-hat (/ alpha (+ (sqrt r) epsilon))))
        (list (- (ref pa 0) (* alpha-hat g)) r)))))
Here beta is the hyperparameter defining the decay rate for smoothing the square of the gradients.
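As with the velocity variant, this presumably plugs straight into gradient-descent (a sketch); epsilon is a small constant that keeps alpha-hat finite when the smoothed square r is near zero:

(define rms-gradient-descent
  (gradient-descent rms-i rms-d rms-u))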
procedure
(adam-gradient-descent obj θ) → (listof tensor?)
  obj : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Constructed using gradient-descent with the following inflate, deflate, and update functions:
(define adam-i
  (λ (p)
    (let ((zeroed (zeroes p)))
      (list p zeroed zeroed))))

(define adam-d
  (λ (pa)
    (ref pa 0)))

(define adam-u
  (λ (pa g)
    (let ((r (smooth beta (ref pa 2) (sqr g))))
      (let ((alpha-hat (/ alpha (+ (sqrt r) epsilon)))
            (v (smooth mu (ref pa 1) g)))
        (list (- (ref pa 0) (* alpha-hat v)) v r)))))
Here beta and mu are the hyperparameters defining the decay rates for smoothing the square of the gradients and the gradient, respectively.
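A hypothetical invocation showing all four hyperparameters this variant reads (the values are illustrative only):

(with-hypers ((alpha 0.001)
              (beta 0.999)
              (mu 0.9)
              (revs 5000))
  (adam-gradient-descent obj θ))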
procedure
(smooth decay-rate average g) → tensor?
  decay-rate : scalar?
  average : tensor?
  g : tensor?
Returns the exponentially weighted moving average of average and the new value g, computed as
(+ (* decay-rate average)
   (* (- 1.0 decay-rate) g))
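For example, with a decay rate of 0.9, a running average of 1.0 moves only a tenth of the way toward a new value of 5.0:

(smooth 0.9 1.0 5.0)  ;; (+ (* 0.9 1.0) (* 0.1 5.0)) = 1.4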