Image Classification Using Google Vision Transformer (ViT)

Overview: This demo classifies an image with the Vision Transformer (ViT) model google/vit-base-patch16-224. The model was pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224 and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at the same resolution. Source: https://huggingface.co/google/vit-base-patch16-224
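
A minimal sketch of how such a classifier could be run, assuming the Hugging Face `transformers` library (with `torch` and `Pillow`) is installed. The demo's actual code is not shown in the source, so the function names `classify` and `format_predictions` and the image path in the usage note are hypothetical; only the checkpoint name comes from the overview.

```python
# Assumption: the demo's real implementation is unknown; this is one plausible
# way to query the checkpoint through the transformers pipeline API.
MODEL_ID = "google/vit-base-patch16-224"  # checkpoint named in the overview


def classify(image_path, top_k=5):
    """Return the top_k predictions as a list of {"label", "score"} dicts."""
    # Imported lazily so this module loads even where transformers is absent.
    from transformers import pipeline

    classifier = pipeline("image-classification", model=MODEL_ID)
    return classifier(image_path, top_k=top_k)


def format_predictions(predictions):
    """Render predictions as numbered 'label: score' lines."""
    return [
        f"{rank}. {p['label']}: {p['score']:.3f}"
        for rank, p in enumerate(predictions, start=1)
    ]
```

Calling `format_predictions(classify("dog.jpg"))` (with a hypothetical local file `dog.jpg`) would download the checkpoint on first use and return one line per predicted class.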

Input Image

Prediction Output

  1. miniature poodle: 0.318
  2. toy poodle: 0.309
  3. Maltese dog, Maltese terrier, Maltese: 0.104
  4. Lhasa, Lhasa apso: 0.039
  5. dumbbell: 0.012
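
The five scores listed sum to 0.782. If they are softmax probabilities over the 1,000 ImageNet classes (an assumption; the source does not say how the scores are computed), the remaining 0.218 is spread across the other 995 classes. The sketch below shows how a top-5 list of this kind can be derived from raw model logits; the helper names `softmax` and `top_k` are illustrative, not taken from the demo.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

Two closely related classes with similar logits, such as miniature poodle and toy poodle here, end up with nearly equal probabilities, which is why neither score is close to 1.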