Tiny Scale Is All I Can Spare To Play With Transformer
Signal: 35
Hype: 25
In three lines
An Indian student proposes merging the attention and FFN layers to reduce parameter count in small models (<10M) without performance loss.
The design replaces the static linear matrices of SwiGLU with dynamic attention.
Experiments are limited by resource constraints: a 0.8M model took 8-10 hours and a 4M model took 3-4 days on a personal PC.
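For context on why the FFN is a natural target for parameter savings, here is a rough count of the weights in one standard Transformer block with a SwiGLU feed-forward layer. This is a sketch of the conventional baseline only, not the author's merged design, which the summary does not specify. The sizes `d_model = 128` and `d_ff = 344` are hypothetical values chosen for illustration, and biases, norms, and embeddings are ignored.

```python
def attention_params(d_model: int) -> int:
    # Query, key, value and output projections, each d_model x d_model (no biases).
    return 4 * d_model * d_model


def swiglu_params(d_model: int, d_ff: int) -> int:
    # SwiGLU uses three static matrices: gate and up (d_model x d_ff each)
    # and down (d_ff x d_model).
    return 3 * d_model * d_ff


def block_params(d_model: int, d_ff: int) -> int:
    # Weights in one block, ignoring norms, biases and embeddings.
    return attention_params(d_model) + swiglu_params(d_model, d_ff)


if __name__ == "__main__":
    d_model, d_ff = 128, 344  # hypothetical sizes, not taken from the source
    total = block_params(d_model, d_ff)
    ffn_share = swiglu_params(d_model, d_ff) / total
    print(f"block weights: {total}, FFN share: {ffn_share:.0%}")
```

Under these assumed sizes the SwiGLU matrices hold roughly two thirds of the block's weights, which is the portion a merged attention-FFN design would aim to shrink.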
Summary generated by Claude — human-verified