Image Classification Using Google Vision Transformer (ViT)

Overview: The model is a Vision Transformer (ViT) pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224 and then fine-tuned on ImageNet 2012 (1 million images, 1,000 classes), also at resolution 224x224. Source: https://huggingface.co/google/vit-base-patch16-224

Input Image

Prediction Output

  1. fountain (0.186)
  2. bell cote, bell cot (0.131)
  3. cinema, movie theater, movie theatre, movie house, picture palace (0.071)
  4. planetarium (0.059)
  5. monitor (0.058)
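
A ranked list like the one above could be produced roughly as follows. This is a minimal sketch, not the code behind this page: it assumes the scores are softmax probabilities over the 1,000 ImageNet classes, that the checkpoint is loaded through the Hugging Face transformers library, and that Pillow and PyTorch are installed. The helper names softmax, top_k, and classify are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, labels, k=5):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return [(label, round(p, 3)) for label, p in ranked[:k]]


def classify(image_path, k=5, checkpoint="google/vit-base-patch16-224"):
    """Classify one image file (hypothetical helper; downloads the checkpoint)."""
    # Heavy dependencies are imported lazily so the helpers above work without them.
    import torch
    from PIL import Image
    from transformers import ViTForImageClassification, ViTImageProcessor

    processor = ViTImageProcessor.from_pretrained(checkpoint)
    model = ViTForImageClassification.from_pretrained(checkpoint)
    model.eval()

    # The processor resizes and normalizes the image to the 224x224 input.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    # id2label maps each class index to its ImageNet label string.
    labels = [model.config.id2label[i] for i in range(len(logits))]
    return top_k(logits, labels, k)
```

Calling classify("photo.jpg") with a local image path (the file name here is a placeholder) would return five (label, score) pairs in descending order. Because the probabilities are spread over 1,000 classes, the top five scores need not sum to 1, which matches the output shown above.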