Model Zoo

The following table lists the models available in DLFeat. Performance metrics (accuracy, mAP, R@1, etc.) are typically reported on standard benchmarks (e.g., ImageNet, Kinetics-400, GLUE, MSCOCO, MSR-VTT); the benchmark is named in parentheses next to each value, with “K400” abbreviating Kinetics-400. FLOPS and Speed are indicative and can vary significantly with hardware, batch size, input resolution, and implementation. “SSL” denotes Self-Supervised Learning. “Multimodal” models are trained on more than one data type, such as paired images and text.

DLFeat Model Zoo

| Modality | Model Name (Identifier) | Feat. Dim | Performance (Benchmark) | FLOPS (G) | Speed | Supervision | Source |
|---|---|---|---|---|---|---|---|
| Image | resnet18 | 512 | 69.76% (ImageNet Top-1) | 1.8 | Fast | Supervised | torchvision |
| Image | resnet34 | 512 | 73.30% (ImageNet Top-1) | 3.6 | Fast | Supervised | torchvision |
| Image | resnet50 | 2048 | 80.86% (ImageNet Top-1, tv) | 4.1 | Medium | Supervised | torchvision_or_timm |
| Image | resnet101 | 2048 | 81.88% (ImageNet Top-1, tv) | 7.8 | Medium | Supervised | torchvision_or_timm |
| Image | resnet152 | 2048 | 82.28% (ImageNet Top-1, tv) | 11.5 | Slower | Supervised | torchvision_or_timm |
| Image | efficientnet_b0 | 1280 | 77.69% (ImageNet Top-1) | 0.39 | Very Fast | Supervised | timm |
| Image | efficientnet_b2 | 1408 | 80.51% (ImageNet Top-1) | 1.0 | Fast | Supervised | timm |
| Image | efficientnet_b4 | 1792 | 83.37% (ImageNet Top-1) | 4.4 | Medium | Supervised | timm |
| Image | mobilenet_v2 | 1280 | 71.88% (ImageNet Top-1) | 0.3 | Very Fast | Supervised | torchvision |
| Image | mobilenet_v3_small | 576 | 67.67% (ImageNet Top-1) | 0.06 | Very Fast | Supervised | torchvision |
| Image | mobilenet_v3_large | 960 | 74.04% (ImageNet Top-1) | 0.22 | Very Fast | Supervised | torchvision |
| Image | vit_tiny_patch16_224 | 192 | 75.4% (ImageNet Top-1, DeiT) | 1.3 | Fast | Supervised (DeiT) | timm |
| Image | vit_small_patch16_224 | 384 | 81.2% (ImageNet Top-1, DeiT) | 4.6 | Medium | Supervised (DeiT) | timm |
| Image | vit_base_patch16_224 | 768 | 85.2% (ImageNet Top-1, MAE FT) | 17.6 | Medium | SSL (MAE) | timm |
| Image | dinov2_base | 768 | 82.8% (ImageNet k-NN, ViT-B/14) | ~33 (ViT-B/14) | Medium | SSL (DINOv2) | Transformers |
| Video | r2plus1d_18 | 512 | 65.2% (K400 Top-1) | 31.6 (16f) | Medium | Supervised | torchvision |
| Video | video_swin_t | 768 | 78.8% (K400 Top-1) | 48 (32x224^2) | Medium | Supervised | torchvision |
| Video | video_swin_s | 768 | 81.6% (K400 Top-1) | 92 (32x224^2) | Slower | Supervised | torchvision |
| Video | video_swin_b | 1024 | 82.7% (K400 Top-1) | 199 (32x224^2) | Slower | Supervised | torchvision |
| Video | videomae_base_k400_pt | 768 | 81.2% (K400 Top-1, ViT-B) | ~168 (16x224^2) | Medium | Supervised (PT+FT) | transformers |
| Audio | wav2vec2_base | 768 | ~6.9% (LibriSpeech WER, no LM) | 94.5M Params | Fast | SSL (Wav2Vec2) | Transformers |
| Audio | ast_vit_base_patch16_224 | 768 | 0.459 (AudioSet mAP) | 87M Params | Medium | Supervised | Transformers |
| Text | sentence-bert | 384 | 85.3 (STS-B Spearman) | N/A | Very Fast | SSL (SBERT) | sentence-transformers |
| Text | bert_base_uncased | 768 | 79.6 (GLUE Avg.) | 110M Params | Medium | SSL (BERT) | Transformers |
| Multimodal (Image-Text) | clip_vit_b32 | 512 | 63.3% (ImageNet zero-shot) | N/A | Fast | Multimodal SSL | Transformers |
| Multimodal (Video-Text) | xclip_base_patch16 | 512 | 46.7% (MSR-VTT R@1) | N/A | Medium | Multimodal SSL | Transformers |
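
When choosing a model, it can help to treat the table as data and filter it by modality, feature dimension, and speed label. The sketch below is not part of DLFeat: `ZooEntry`, `ZOO`, `SPEED_ORDER`, and `pick_models` are hypothetical names introduced for illustration, and `ZOO` holds only a handful of rows copied by hand from the table.

```python
from dataclasses import dataclass


# Illustrative only: ZooEntry, ZOO and pick_models are hypothetical names,
# not DLFeat API. The rows are copied by hand from the model zoo table.
@dataclass(frozen=True)
class ZooEntry:
    modality: str
    name: str
    feat_dim: int
    speed: str
    source: str


ZOO = [
    ZooEntry("Image", "resnet18", 512, "Fast", "torchvision"),
    ZooEntry("Image", "resnet50", 2048, "Medium", "torchvision_or_timm"),
    ZooEntry("Image", "efficientnet_b0", 1280, "Very Fast", "timm"),
    ZooEntry("Image", "mobilenet_v3_small", 576, "Very Fast", "torchvision"),
    ZooEntry("Video", "r2plus1d_18", 512, "Medium", "torchvision"),
    ZooEntry("Text", "sentence-bert", 384, "Very Fast", "sentence-transformers"),
]

# Speed labels used in the table, ordered from fastest to slowest.
SPEED_ORDER = ["Very Fast", "Fast", "Medium", "Slower"]


def pick_models(modality, max_feat_dim=None, slowest_allowed="Slower"):
    """Return the names of entries matching the modality and constraints."""
    limit = SPEED_ORDER.index(slowest_allowed)
    return [
        entry.name
        for entry in ZOO
        if entry.modality == modality
        and (max_feat_dim is None or entry.feat_dim <= max_feat_dim)
        and SPEED_ORDER.index(entry.speed) <= limit
    ]


if __name__ == "__main__":
    # Image models with at most 1280 feature dimensions, labelled Fast or better.
    print(pick_models("Image", max_feat_dim=1280, slowest_allowed="Fast"))
```

Because the speed labels are coarse, a filter like this is best used to shortlist candidates rather than to rank them.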

Note on FLOPS/Speed: These values are approximate and depend on input size, hardware, and batching. As a rough guide, “Fast” might mean more than 100 FPS for images on a modern GPU. “N/A” indicates that the figure was not readily available or is highly variable. Where the FLOPS column lists a parameter count instead, “M Params” means millions of parameters.
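
Since the speed labels depend so heavily on the setup, it may be worth measuring throughput on your own hardware. The following is a minimal sketch using only the Python standard library; `measure_throughput` and `dummy_extract` are hypothetical names, and `dummy_extract` is a stand-in for whatever feature-extraction call you actually use.

```python
import time


def measure_throughput(fn, batch_size, warmup=3, iters=20):
    """Return items processed per second, where fn() handles one batch."""
    # Warm-up calls are excluded from timing (caches, lazy initialisation).
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


def dummy_extract():
    # Hypothetical stand-in for a real feature-extraction call on one batch.
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    rate = measure_throughput(dummy_extract, batch_size=32)
    print(f"{rate:.1f} items/s")
```

When timing a model that runs on a GPU through PyTorch, the work is queued asynchronously, so the timed callable should wait for the device to finish (for example by calling `torch.cuda.synchronize()`) before returning; otherwise the measured rate will be too optimistic.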