Back to feed
Reddit r/LocalLLaMA·

Tiny Scale Is All I Can Spare To Play With Transformer

Signal
35
Hype
25
In three linesIndian student proposes merging Attention and FFN to reduce parameters (<10M) without performance loss. Replaces static SwiGLU linear matrices with dynamic attention. Limited experiments (0.8M in 8-10h, 4M in 3-4 days on personal PC) due to resource constraints.
Read source
Your take?
ReasoningPapersOpen source

Summary generated by Claude — human-verified