
If we expand the update quantity Uₜ₊₁Vₜ₊₁ − UₜVₜ under the SGD scenario, the result is:

$$U_{t+1}V_{t+1} - U_t V_t = -\eta\left(\frac{\partial L}{\partial W_t}V_t^{\top}V_t + U_t U_t^{\top}\frac{\partial L}{\partial W_t}\right) + \eta^2\,\frac{\partial L}{\partial W_t}V_t^{\top}U_t^{\top}\frac{\partial L}{\partial W_t}$$

Treating the η² term as a negligible higher-order term, what remains is:

$$U_{t+1}V_{t+1} - U_t V_t \approx -\eta\left(\frac{\partial L}{\partial W_t}V_t^{\top}V_t + U_t U_t^{\top}\frac{\partial L}{\partial W_t}\right)$$

From this perspective, compared to full fine-tuning with SGD, LoRA replaces the full gradient ∂L/∂Wₜ with the expression inside the parentheses, namely (∂L/∂Wₜ)Vₜ^⊤Vₜ + UₜUₜ^⊤(∂L/∂Wₜ).
For simplicity, let’s focus on the case where r = 1. Note that in the formula above, the projection vectors Uₜ and Vₜ depend on the step t. What would happen if we replaced them with random vectors independent of t (randomly regenerated at each training step)? Taking u ∈ R^{m × 1} and v ∈ R^{1 × n} with i.i.d. N(0,1) entries, the update becomes:

$$W_{t+1} = W_t - \eta\left(\frac{\partial L}{\partial W_t}v^{\top}v + u\,u^{\top}\frac{\partial L}{\partial W_t}\right)$$

It can be proven that:

$$\mathbb{E}_u\left[u\,u^{\top}\right] = I_{m\times m},\qquad \mathbb{E}_v\left[v^{\top}v\right] = I_{n\times n}$$
Where I_{m × m} and I_{n × n} are identity matrices of size m × m and n × n respectively. Taking the expectation over u and v, the expected update is therefore −2η ∂L/∂Wₜ. Thus, akin to "zeroth-order gradients", this LoRA variant that reinitializes at every step is on average equivalent to full-rank SGD (with the factor of 2 absorbed into the learning rate). Implemented this way, however, it might even be slower than full-rank SGD, so its purpose is not acceleration but potentially mitigating catastrophic forgetting: by applying low-rank (rather than full-rank) updates on individual batches of samples, it limits the impact each step has on the model's weights as a whole. Of course, this is speculative, and the author has yet to test its actual performance.
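To illustrate the claim, here is a short Monte Carlo sketch in NumPy (an illustration added here, not the author's experiment): averaging the randomly reinitialized rank-1 update over many draws of u and v recovers twice the full gradient, with G again standing in for ∂L/∂Wₜ.

```python
import numpy as np

m, n, trials = 8, 6, 100_000
rng = np.random.default_rng(0)
G = rng.normal(size=(m, n))          # stands in for dL/dW_t

acc = np.zeros((m, n))
for _ in range(trials):
    u = rng.normal(size=(m, 1))      # redrawn at every "step"
    v = rng.normal(size=(1, n))
    acc += G @ v.T @ v + u @ u.T @ G

# Since E[v^T v] = I_n and E[u u^T] = I_m, the average approaches 2*G.
print(np.abs(acc / trials - 2 * G).max())   # shrinks as trials grows
```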
Firstly, considering the case where r = 1: LoRA essentially assumes Δw_{i,j} = uᵢvⱼ. Can we make other low-rank decomposition assumptions? For instance, Δw_{i,j} = uᵢ + vⱼ? In matrix form, with U ∈ R^{m × 1} and V ∈ R^{1 × n}, this is expressed as:

$$W = W_0 + U\,1_{1\times n} + 1_{m\times 1}\,V$$

Where 1_{1 × n} and 1_{m × 1} are matrices of size 1 × n and m × 1 filled with ones, respectively. The gradients are easily derived as:

$$\frac{\partial L}{\partial U} = \frac{\partial L}{\partial W}\,1_{n\times 1},\qquad \frac{\partial L}{\partial V} = 1_{1\times m}\,\frac{\partial L}{\partial W}$$

In other words, the gradient for U is the row-sum of ∂L/∂W and the gradient for V is its column-sum.
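To make the formulas concrete, here is a small PyTorch sketch (illustrative only; the shapes and the loss are arbitrary) that checks them against autograd:

```python
import torch

m, n = 8, 6
torch.manual_seed(0)
W0 = torch.randn(m, n)
U = torch.randn(m, 1, requires_grad=True)
V = torch.randn(1, n, requires_grad=True)

# Additive parameterization: W = W0 + U 1_{1xn} + 1_{mx1} V
W = W0 + U @ torch.ones(1, n) + torch.ones(m, 1) @ V
W.retain_grad()                      # keep dL/dW for the comparison below

loss = (W ** 2).sum()                # any scalar loss of W will do
loss.backward()

# dL/dU = (dL/dW) 1_{nx1}  (row sums);  dL/dV = 1_{1xm} (dL/dW)  (column sums)
print(torch.allclose(U.grad, W.grad @ torch.ones(n, 1)))   # True
print(torch.allclose(V.grad, torch.ones(1, m) @ W.grad))   # True
```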
Compared to the original LoRA, this additive decomposition has two advantages:
- Addition has a lower computational cost than multiplication, and its gradient form is simpler.
- The rank of UV is at most 1, whereas the rank of U 1_{1 × n} + 1_{m × 1} V can reach 2 (a quick numerical check follows this list). If rank is taken as a proxy for model capacity, then for the same number of parameters the expressive power of the additive form might be stronger. As for its actual performance, the author will conduct comparative experiments when using LoRA in the future.
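A tiny NumPy check of the rank claim (random U and V; illustrative only):

```python
import numpy as np

m, n = 8, 6
rng = np.random.default_rng(0)
U, V = rng.normal(size=(m, 1)), rng.normal(size=(1, n))

print(np.linalg.matrix_rank(U @ V))                                      # 1
print(np.linalg.matrix_rank(U @ np.ones((1, n)) + np.ones((m, 1)) @ V))  # 2 (generically)
```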
Can this additive decomposition be extended to the case where r > 1? Naturally it can, but with a slight twist. Assuming both m and n are divisible by r, we can modify the parameterization to:

$$W = W_0 + U\,I^{(r)}_{1\times n/r} + I^{(r)}_{m/r\times 1}\,V$$

where U ∈ R^{m × r} and V ∈ R^{r × n}.
Here, I^{(r)}_{1 × n/r} and I^{(r)}_{m/r × 1} are block matrices of size 1 × n/r and m/r × 1 respectively, where each block is an r × r identity matrix (so their overall shapes are r × n and m × r). In essence, this approach treats U and V as block matrices of size m/r × 1 and 1 × n/r, respectively, with r × r blocks, and then applies the r = 1 logic to them.
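The construction is perhaps easiest to see in code. Below is a NumPy sketch (illustrative; the shapes m, n, r are arbitrary) of the two block-identity matrices and the resulting update U I^{(r)}_{1×n/r} + I^{(r)}_{m/r×1} V:

```python
import numpy as np

m, n, r = 6, 9, 3            # assumes r divides both m and n
rng = np.random.default_rng(0)
U, V = rng.normal(size=(m, r)), rng.normal(size=(r, n))

# I^(r)_{1 x n/r}: one block-row of n/r identity blocks  -> shape (r, n)
I_right = np.tile(np.eye(r), (1, n // r))
# I^(r)_{m/r x 1}: one block-column of m/r identity blocks -> shape (m, r)
I_left = np.tile(np.eye(r), (m // r, 1))

delta_W = U @ I_right + I_left @ V       # additive update for general r
print(delta_W.shape)                     # (6, 9)
print(np.linalg.matrix_rank(delta_W))    # at most 2r, vs. at most r for UV
```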