Reddit r/MachineLearning

Custom image encoder [P]

Signal: 35
Hype: 15
In three lines
A developer asks whether building a custom image encoder would work better than CLIP, SigLIP, or DINO for video frame classification. The pipeline samples 15 frames per 30 s of video, turns each frame into an embedding, and feeds the embeddings to a Transformer with 1.5-9M parameters. The constraints are speed (CLIP-S0 reaches 10 img/s on 4 vCPUs) and CPU-only deployment. The developer is considering a custom encoder trained on a proprietary dataset (millions of images, 4-5 labels).
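To make the speed constraint concrete, here is a rough back-of-the-envelope sketch using only the figures quoted above (15 frames per 30 s, 10 img/s on 4 vCPUs). It assumes each 30 s window is processed as one clip and that frame encoding dominates the cost, which the post does not confirm. The parameter estimate uses the common 12·d² per-layer approximation for a standard Transformer block (attention plus a 4x-wide MLP); the `d_model` and layer counts are hypothetical values chosen only to land near the stated 1.5-9M range, not the poster's actual configuration.

```python
def clip_encoding_seconds(frames_per_clip: int, images_per_second: float) -> float:
    """Time to embed every sampled frame of one clip."""
    return frames_per_clip / images_per_second


def realtime_factor(clip_seconds: float, encoding_seconds: float) -> float:
    """How many seconds of video are covered per second of encoder compute."""
    return clip_seconds / encoding_seconds


def transformer_params(d_model: int, n_layers: int) -> int:
    """Rough weight count per layer: 4*d^2 (attention) + 8*d^2 (4x-wide MLP).

    Ignores biases, layer norms, positional embeddings, and the classifier head.
    """
    return 12 * d_model * d_model * n_layers


# Figures taken from the summary above.
FRAMES_PER_CLIP = 15
CLIP_SECONDS = 30.0
ENCODER_IMG_PER_S = 10.0  # CLIP-S0 on 4 vCPUs

encode_s = clip_encoding_seconds(FRAMES_PER_CLIP, ENCODER_IMG_PER_S)
rt_factor = realtime_factor(CLIP_SECONDS, encode_s)

# Hypothetical temporal-Transformer sizes (assumptions, not from the post).
small = transformer_params(d_model=256, n_layers=2)
large = transformer_params(d_model=384, n_layers=5)

print(f"encoding time per clip: {encode_s:.1f} s")
print(f"real-time factor:       {rt_factor:.0f}x")
print(f"small Transformer:      ~{small / 1e6:.2f}M params")
print(f"large Transformer:      ~{large / 1e6:.2f}M params")
```

Under these assumptions the encoder needs about 1.5 s of compute per 30 s of video. Whether a custom encoder is worth building then depends largely on how many streams must share the same 4 vCPUs.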
Embeddings, Vision, Fine-tuning

Summary generated by Claude — human-verified