Stabilizing Transformer Training by Preventing Attention Entropy Collapse


Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy of each attention head over the course of training, which serves as a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote this pathologically low attention entropy, corresponding to highly concentrated attention scores, as entropy collapse. As a remedy, we propose σReparam, a simple and efficient solution in which we reparameterize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that the proposed reparameterization successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound on the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing further motivation for our approach. We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, automatic speech recognition, and language modeling tasks, across Transformer architectures. We show that σReparam provides stability and robustness with respect to the choice of hyperparameters, going as far as enabling training of (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization, or adaptive optimizers; (b) deep architectures in machine translation; and (c) speech recognition to competitive performance without warmup and adaptive optimizers.
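The attention entropy that serves as the sharpness proxy above can be computed directly from the softmax attention weights of each head. Below is a minimal PyTorch sketch of such a per-head entropy diagnostic; the function name, tensor shapes, and logging convention are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch (assumed shapes and names): compute the average attention entropy
# per head from the softmax attention weights. Low values mean highly concentrated
# attention, i.e. the entropy collapse pattern described above.
import torch

def attention_entropy(attn_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """attn_probs: attention weights of shape (batch, num_heads, query_len, key_len),
    rows summing to 1. Returns one mean entropy value per head, shape (num_heads,)."""
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # (batch, heads, queries)
    return ent.mean(dim=(0, 2))

# Usage: given query/key activations of one attention layer (ViT-B-like sizes assumed)
q = torch.randn(8, 12, 197, 64)  # (batch, heads, tokens, head_dim)
k = torch.randn(8, 12, 197, 64)
probs = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1)
print(attention_entropy(probs))  # per-head entropies, e.g. logged once per training step
```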

Transformers are sensitive to hyperparameters: increasing the learning rate easily causes attention entropy collapse and training divergence. Left: baseline Vision Transformer with default hyperparameters; right: 2× learning rate (5×10⁻⁴ ↦ 1×10⁻³).
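As a concrete illustration of the remedy summarized in the abstract, the following PyTorch sketch shows a σReparam-style linear layer: the weight is rescaled by a learned scalar divided by an estimate of its spectral norm. The class name, initialization, and single-step power iteration are assumptions made for illustration, not the authors' reference implementation.

```python
# Minimal sketch of the sigma-reparameterized linear layer: W_hat = (gamma / sigma(W)) * W,
# where sigma(W) is the spectral norm estimated by one power-iteration step per forward pass
# (a common choice for spectral normalization) and gamma is a learned scalar initialized to 1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparamLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        self.gamma = nn.Parameter(torch.ones(1))  # learned scalar, init 1 (assumed)
        # left singular-vector estimate for power iteration (not trained)
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            v = F.normalize(self.weight.t() @ self.u, dim=0)
            if self.training:  # refresh the power-iteration vector only during training
                self.u = F.normalize(self.weight @ v, dim=0)
        # current spectral-norm estimate; gradients still flow through self.weight
        sigma = torch.einsum("i,ij,j->", self.u, self.weight, v)
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)
```

Because the rescaled weight keeps its spectral norm at roughly the learned value γ, the attention logits cannot grow unboundedly sharp, which is how the reparameterization counteracts the entropy collapse illustrated in the figure caption above.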
