Model Zoo
The following table provides an overview of the models available in DLFeat. Performance metrics (Acc., mAP, R@1, etc.) are typically reported on standard benchmarks (e.g., ImageNet, Kinetics-400, GLUE, MSCOCO, MSR-VTT). FLOPS and Speed are indicative and can vary significantly based on hardware, batch size, input resolution, and specific implementation. “SSL” denotes Self-Supervised Learning. “Multimodal” models are trained on multiple data types.
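As a worked illustration of the input-resolution dependence, the sketch below rescales a listed FLOPS figure to a different input size. It assumes the listed value was measured at 224x224 and that cost grows linearly with pixel count. This is only a rough approximation for convolutional backbones and is less accurate for transformers, whose attention cost does not scale linearly with the number of tokens. `scale_flops` is a hypothetical helper written for this example, not part of DLFeat.

```python
def scale_flops(base_gflops: float, base_side: int, new_side: int) -> float:
    """Rough FLOPs estimate for a backbone at a new square input size.

    Assumption: cost is proportional to the number of input pixels.
    This ignores fixed-cost layers and the non-linear cost of attention.
    """
    if base_side <= 0 or new_side <= 0:
        raise ValueError("input sides must be positive")
    return base_gflops * (new_side / base_side) ** 2


# resnet50 is listed at 4.1 GFLOPs; 224x224 is an assumed reference size.
estimate = scale_flops(4.1, base_side=224, new_side=448)
print(f"resnet50 at 448x448: roughly {estimate:.1f} GFLOPs")
```

Doubling the side length quadruples the pixel count, so the estimate is four times the listed figure. Treat it as an order-of-magnitude guide rather than a measurement.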

| Modality | Model Name (Identifier) | Feat. Dim | Performance (Benchmark) | FLOPS (G) | Speed | Supervision | Source |
|---|---|---|---|---|---|---|---|
| Image | | | | | | | |
| Image | resnet18 | 512 | 69.76% (ImageNet Top-1) | 1.8 | Fast | Supervised | torchvision |
| Image | resnet34 | 512 | 73.30% (ImageNet Top-1) | 3.6 | Fast | Supervised | torchvision |
| Image | resnet50 | 2048 | 80.86% (ImageNet Top-1, tv) | 4.1 | Medium | Supervised | torchvision_or_timm |
| Image | resnet101 | 2048 | 81.88% (ImageNet Top-1, tv) | 7.8 | Medium | Supervised | torchvision_or_timm |
| Image | resnet152 | 2048 | 82.28% (ImageNet Top-1, tv) | 11.5 | Slower | Supervised | torchvision_or_timm |
| Image | efficientnet_b0 | 1280 | 77.69% (ImageNet Top-1) | 0.39 | Very Fast | Supervised | timm |
| Image | efficientnet_b2 | 1408 | 80.51% (ImageNet Top-1) | 1.0 | Fast | Supervised | timm |
| Image | efficientnet_b4 | 1792 | 83.37% (ImageNet Top-1) | 4.4 | Medium | Supervised | timm |
| Image | mobilenet_v2 | 1280 | 71.88% (ImageNet Top-1) | 0.3 | Very Fast | Supervised | torchvision |
| Image | mobilenet_v3_small | 576 | 67.67% (ImageNet Top-1) | 0.06 | Very Fast | Supervised | torchvision |
| Image | mobilenet_v3_large | 960 | 74.04% (ImageNet Top-1) | 0.22 | Very Fast | Supervised | torchvision |
| Image | vit_tiny_patch16_224 | 192 | 75.4% (ImageNet Top-1, DeiT) | 1.3 | Fast | Supervised (DeiT) | timm |
| Image | vit_small_patch16_224 | 384 | 81.2% (ImageNet Top-1, DeiT) | 4.6 | Medium | Supervised (DeiT) | timm |
| Image | vit_base_patch16_224 | 768 | 85.2% (ImageNet Top-1, MAE FT) | 17.6 | Medium | SSL (MAE) | timm |
| Image | dinov2_base | 768 | 82.8% (ImageNet k-NN, ViT-B/14) | ~33 (ViT-B/14) | Medium | SSL (DINOv2) | Transformers |
| Video | | | | | | | |
| Video | r2plus1d_18 | 512 | 65.2% (K400 Top-1) | 31.6 (16f) | Medium | Supervised | torchvision |
| Video | video_swin_t | 768 | 78.8% (K400 Top-1) | 48 (32x224^2) | Medium | Supervised | torchvision |
| Video | video_swin_s | 768 | 81.6% (K400 Top-1) | 92 (32x224^2) | Slower | Supervised | torchvision |
| Video | video_swin_b | 1024 | 82.7% (K400 Top-1) | 199 (32x224^2) | Slower | Supervised | torchvision |
| Video | videomae_base_k400_pt | 768 | 81.2% (K400 Top-1, ViT-B) | ~168 (16x224^2) | Medium | Supervised (PT+FT) | transformers |
| Audio | | | | | | | |
| Audio | wav2vec2_base | 768 | ~6.9% (LibriSpeech WER, no LM) | 94.5M Params | Fast | SSL (Wav2Vec2) | Transformers |
| Audio | ast_vit_base_patch16_224 | 768 | 0.459 (AudioSet mAP) | 87M Params | Medium | Supervised | Transformers |
| Text | | | | | | | |
| Text | sentence-bert | 384 | 85.3 (STS-B Spearman) | N/A | Very Fast | SSL (SBERT) | sentence-transformers |
| Text | bert_base_uncased | 768 | 79.6 (GLUE Avg.) | 110M Params | Medium | SSL (BERT) | Transformers |
| Multimodal | | | | | | | |
| Image-Text | clip_vit_b32 | 512 | 63.3% (ImageNet zero-shot) | N/A | Fast | Multimodal SSL | Transformers |
| Video-Text | xclip_base_patch16 | 512 | 46.7% (MSR-VTT R@1) | N/A | Medium | Multimodal SSL | Transformers |

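
One way to use the table is to treat its rows as records and filter them against a feature-size and compute budget. The sketch below copies a handful of rows by hand; `ZooEntry`, `ZOO`, and `candidates` are hypothetical names for this illustration and are not part of the DLFeat API. Rows whose FLOPS cell holds a parameter count or "N/A" are stored with `gflops=None` and skipped by the filter.

```python
from typing import List, NamedTuple, Optional


class ZooEntry(NamedTuple):
    modality: str
    name: str
    feat_dim: int
    gflops: Optional[float]  # None where the table gives no FLOPs figure


# A few rows copied from the table; extend as needed.
ZOO = [
    ZooEntry("Image", "resnet18", 512, 1.8),
    ZooEntry("Image", "resnet50", 2048, 4.1),
    ZooEntry("Image", "efficientnet_b0", 1280, 0.39),
    ZooEntry("Image", "mobilenet_v3_small", 576, 0.06),
    ZooEntry("Image", "vit_base_patch16_224", 768, 17.6),
    ZooEntry("Text", "bert_base_uncased", 768, None),
    ZooEntry("Image-Text", "clip_vit_b32", 512, None),
]


def candidates(modality: str, max_feat_dim: int, max_gflops: float) -> List[str]:
    """Return identifiers that fit both budgets, cheapest first."""
    hits = [
        e for e in ZOO
        if e.modality == modality
        and e.feat_dim <= max_feat_dim
        and e.gflops is not None
        and e.gflops <= max_gflops
    ]
    return [e.name for e in sorted(hits, key=lambda e: e.gflops)]


print(candidates("Image", max_feat_dim=1280, max_gflops=2.0))
# → ['mobilenet_v3_small', 'efficientnet_b0', 'resnet18']
```

Sorting by FLOPS is only one possible ranking; accuracy on the target benchmark is usually the other axis to weigh.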
Note on FLOPS/Speed: These values are highly approximate and depend on input size, hardware, and batching. As a rough guide, “Fast” might mean more than 100 FPS for images on a modern GPU. “N/A” indicates that a figure was not readily available or varies too much to quote. Where the FLOPS column shows a value such as “94.5M Params”, it is a parameter count in millions rather than a FLOP count.
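
Because the speed labels depend so heavily on the setup, it can be worth timing a model on your own hardware. The following is a minimal, framework-agnostic sketch of such a measurement. `measure_fps` and `dummy_batch` are hypothetical names, and the dummy workload merely stands in for a real forward pass, so the printed number says nothing about any model in the table.

```python
import time
from typing import Callable


def measure_fps(run_batch: Callable[[], None], batch_size: int,
                warmup: int = 3, iters: int = 10) -> float:
    """Return items processed per second.

    run_batch must process exactly one batch of `batch_size` items and
    block until the work is done. On a GPU that means synchronizing
    before returning; otherwise only the kernel launch is timed.
    """
    if batch_size <= 0 or iters <= 0:
        raise ValueError("batch_size and iters must be positive")
    for _ in range(warmup):  # untimed runs to warm caches
        run_batch()
    start = time.perf_counter()
    for _ in range(iters):
        run_batch()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


def dummy_batch() -> None:
    """Placeholder workload standing in for a real forward pass."""
    sum(i * i for i in range(10_000))


fps = measure_fps(dummy_batch, batch_size=32)
print(f"{fps:.0f} items/s; above 100 FPS: {fps > 100}")
```

To time a real model, replace `dummy_batch` with a function that runs one inference batch, and repeat the measurement at the batch size and input resolution you intend to deploy.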