Model Zoo
---------

The following table provides an overview of the models available in DLFeat. Performance metrics (Acc., mAP, R@1, etc.) are typically reported on standard benchmarks (e.g., ImageNet, Kinetics-400, GLUE, MSCOCO, MSR-VTT). FLOPs and Speed values are indicative only and can vary significantly with hardware, batch size, input resolution, and the specific implementation. "SSL" denotes Self-Supervised Learning. "Multimodal" models are trained on multiple data types.

.. list-table:: DLFeat Model Zoo
   :widths: 12 28 8 20 8 10 12 12
   :header-rows: 1

   * - Modality
     - Model Name (Identifier)
     - Feat. Dim
     - Performance (Benchmark)
     - FLOPs (G) or Params
     - Speed
     - Supervision
     - Source
   * - **Image**
     -
     -
     -
     -
     -
     -
     -
   * - Image
     - ``resnet18``
     - 512
     - 69.76% (ImageNet Top-1)
     - 1.8
     - Fast
     - Supervised
     - torchvision
   * - Image
     - ``resnet34``
     - 512
     - 73.30% (ImageNet Top-1)
     - 3.6
     - Fast
     - Supervised
     - torchvision
   * - Image
     - ``resnet50``
     - 2048
     - 80.86% (ImageNet Top-1, tv)
     - 4.1
     - Medium
     - Supervised
     - torchvision_or_timm
   * - Image
     - ``resnet101``
     - 2048
     - 81.88% (ImageNet Top-1, tv)
     - 7.8
     - Medium
     - Supervised
     - torchvision_or_timm
   * - Image
     - ``resnet152``
     - 2048
     - 82.28% (ImageNet Top-1, tv)
     - 11.5
     - Slower
     - Supervised
     - torchvision_or_timm
   * - Image
     - ``efficientnet_b0``
     - 1280
     - 77.69% (ImageNet Top-1)
     - 0.39
     - Very Fast
     - Supervised
     - timm
   * - Image
     - ``efficientnet_b2``
     - 1408
     - 80.51% (ImageNet Top-1)
     - 1.0
     - Fast
     - Supervised
     - timm
   * - Image
     - ``efficientnet_b4``
     - 1792
     - 83.37% (ImageNet Top-1)
     - 4.4
     - Medium
     - Supervised
     - timm
   * - Image
     - ``mobilenet_v2``
     - 1280
     - 71.88% (ImageNet Top-1)
     - 0.3
     - Very Fast
     - Supervised
     - torchvision
   * - Image
     - ``mobilenet_v3_small``
     - 576
     - 67.67% (ImageNet Top-1)
     - 0.06
     - Very Fast
     - Supervised
     - torchvision
   * - Image
     - ``mobilenet_v3_large``
     - 960
     - 74.04% (ImageNet Top-1)
     - 0.22
     - Very Fast
     - Supervised
     - torchvision
   * - Image
     - ``vit_tiny_patch16_224``
     - 192
     - 75.4% (ImageNet Top-1, DeiT)
     - 1.3
     - Fast
     - Supervised (DeiT)
     - timm
   * - Image
     - ``vit_small_patch16_224``
     - 384
     - 81.2% (ImageNet Top-1, DeiT)
     - 4.6
     - Medium
     - Supervised (DeiT)
     - timm
   * - Image
     - ``vit_base_patch16_224``
     - 768
     - 85.2% (ImageNet Top-1, MAE FT)
     - 17.6
     - Medium
     - SSL (MAE)
     - timm
   * - Image
     - ``dinov2_base``
     - 768
     - 82.8% (ImageNet k-NN, ViT-B/14)
     - ~33 (ViT-B/14)
     - Medium
     - SSL (DINOv2)
     - Transformers
   * - **Video**
     -
     -
     -
     -
     -
     -
     -
   * - Video
     - ``r2plus1d_18``
     - 512
     - 65.2% (K400 Top-1)
     - 31.6 (16f)
     - Medium
     - Supervised
     - torchvision
   * - Video
     - ``video_swin_t``
     - 768
     - 78.8% (K400 Top-1)
     - 48 (32x224^2)
     - Medium
     - Supervised
     - torchvision
   * - Video
     - ``video_swin_s``
     - 768
     - 81.6% (K400 Top-1)
     - 92 (32x224^2)
     - Slower
     - Supervised
     - torchvision
   * - Video
     - ``video_swin_b``
     - 1024
     - 82.7% (K400 Top-1)
     - 199 (32x224^2)
     - Slower
     - Supervised
     - torchvision
   * - Video
     - ``videomae_base_k400_pt``
     - 768
     - 81.2% (K400 Top-1, ViT-B)
     - ~168 (16x224^2)
     - Medium
     - Supervised (PT+FT)
     - Transformers
   * - **Audio**
     -
     -
     -
     -
     -
     -
     -
   * - Audio
     - ``wav2vec2_base``
     - 768
     - ~6.9% (LibriSpeech WER, no LM)
     - 94.5M Params
     - Fast
     - SSL (Wav2Vec2)
     - Transformers
   * - Audio
     - ``ast_vit_base_patch16_224``
     - 768
     - 0.459 (AudioSet mAP)
     - 87M Params
     - Medium
     - Supervised
     - Transformers
   * - **Text**
     -
     -
     -
     -
     -
     -
     -
   * - Text
     - ``sentence-bert``
     - 384
     - 85.3 (STS-B Spearman)
     - N/A
     - Very Fast
     - SSL (SBERT)
     - sentence-transformers
   * - Text
     - ``bert_base_uncased``
     - 768
     - 79.6 (GLUE Avg.)
     - 110M Params
     - Medium
     - SSL (BERT)
     - Transformers
   * - **Multimodal**
     -
     -
     -
     -
     -
     -
     -
   * - Image-Text
     - ``clip_vit_b32``
     - 512
     - 63.3% (ImageNet zero-shot)
     - N/A
     - Fast
     - Multimodal SSL
     - Transformers
   * - Video-Text
     - ``xclip_base_patch16``
     - 512
     - 46.7% (MSR-VTT R@1)
     - N/A
     - Medium
     - Multimodal SSL
     - Transformers

*Note on FLOPs/Speed: These values are highly approximate and depend on input size, hardware, and batching. As a rough guide, "Fast" might mean more than 100 FPS for images on a modern GPU. "N/A" indicates that data was not readily found or is highly variable. Where the FLOPs column lists a parameter count instead, "M Params" refers to millions of parameters.*
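
Because the Speed labels above are only indicative, it can be worth measuring throughput on your own hardware. The following is a minimal sketch of one way to do that, not part of DLFeat: ``measure_throughput`` and the ``dummy_extract`` stand-in are hypothetical names introduced here for illustration. In practice, ``extract`` would be whatever callable runs one of the models above on a batch.

```python
import time
from typing import Any, Callable


def measure_throughput(
    extract: Callable[[Any], Any],
    batch: Any,
    batch_size: int,
    warmup: int = 3,
    iters: int = 10,
) -> float:
    """Return the number of items processed per second by `extract`.

    `extract` is any callable that consumes one batch (hypothetical
    stand-in for a feature extractor); `batch_size` is the number of
    items contained in `batch`.
    """
    if batch_size <= 0 or iters <= 0:
        raise ValueError("batch_size and iters must be positive")

    # Warm-up calls are excluded from timing (caches, lazy initialisation).
    for _ in range(warmup):
        extract(batch)

    start = time.perf_counter()
    for _ in range(iters):
        extract(batch)
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on very coarse clocks.
    return batch_size * iters / max(elapsed, 1e-12)


if __name__ == "__main__":
    # Stand-in workload: sums each row of a small batch of float lists.
    dummy_batch = [[0.5] * 1024 for _ in range(8)]

    def dummy_extract(rows):
        return [sum(row) for row in rows]

    fps = measure_throughput(dummy_extract, dummy_batch, batch_size=len(dummy_batch))
    print(f"{fps:.1f} items/s")
```

When timing a model on a GPU with PyTorch, the callable passed as ``extract`` should also wait for the device to finish (for example by calling ``torch.cuda.synchronize()``) before returning, otherwise asynchronous kernel launches make the measured time too short.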