Event
How Far Can Vision-Language Models Take You? Zero-Shot Detection and Data-Efficient Species Classification in Videos
Description
Can pre-trained vision-language models replace task-specific training for species classification? How can we leverage the temporal nature of videos to improve performance? We explore these questions using CLIP/SigLIP architectures as both zero-shot classifiers and feature extractors. For detection, we show that prompt engineering alone achieves 99.1% accuracy without any labeled training data. For species classification, we combine frozen SigLIP embeddings with temporal feature aggregation and linear classifiers, achieving 96.8% macro F1 and outperforming a fine-tuned ResNet-50 while using less than one third of the training samples. We demonstrate the approach on underwater fish species classification from Icelandic river monitoring footage, but the method could generalize to a broader set of video classification tasks with limited labeled data.
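
To make the two ideas in the description concrete, here is a minimal sketch of how zero-shot scoring and a frozen-embedding classifier might be wired together. It is an illustration under stated assumptions, not the pipeline used in the talk. The per-frame and per-prompt embeddings are assumed to come from a frozen CLIP/SigLIP encoder and are replaced here by synthetic vectors. Mean pooling stands in for the unspecified temporal aggregation, a closed-form ridge classifier stands in for the linear classifier, and the `temperature` parameter and all function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_scores(frame_embeddings, prompt_embeddings, temperature=0.07):
    """Softmax over cosine similarities between frames and text prompts.

    Assumption: both inputs are precomputed embeddings from a frozen
    vision-language encoder; `temperature` is a hypothetical scaling value.
    """
    f = l2_normalize(frame_embeddings)
    p = l2_normalize(prompt_embeddings)
    logits = (f @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def aggregate_clip(frame_embeddings):
    """Mean-pool per-frame embeddings into one unit-length clip vector.

    Assumption: mean pooling is only one possible temporal aggregation.
    """
    return l2_normalize(l2_normalize(frame_embeddings).mean(axis=0))


def fit_linear_classifier(X, y, n_classes, l2=1e-3):
    """Fit a linear classifier by ridge regression onto one-hot targets."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    Y = np.eye(n_classes)[y]
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)


def predict(W, X):
    """Return the highest-scoring class index for each clip vector."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ W).argmax(axis=1)


def demo(seed=0, n_classes=3, clips_per_class=20, n_frames=8, dim=16):
    """Train on synthetic clips and return the training accuracy."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(n_classes, dim))  # stand-ins for species
    X, y = [], []
    for c in range(n_classes):
        for _ in range(clips_per_class):
            frames = centers[c] + 0.1 * rng.normal(size=(n_frames, dim))
            X.append(aggregate_clip(frames))
            y.append(c)
    X, y = np.array(X), np.array(y)
    W = fit_linear_classifier(X, y, n_classes)
    return float((predict(W, X) == y).mean())
```

In this sketch the encoder is never updated: only the small weight matrix returned by `fit_linear_classifier` is learned, which is one reason such a setup can work with few labeled clips. In practice the synthetic vectors would be replaced by real encoder outputs, and the ridge classifier by any linear model such as logistic regression.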