Backbones: In deep learning, a backbone is the network used as a feature extractor for a larger model. It is often a convolutional neural network (CNN), but it can also be a vision transformer such as the Swin Transformer referenced in the models below. Backbones are typically pre-trained and then reused across computer vision tasks.

Training Details:

donut-base:

Training Resources: 64 A100 GPUs were used for training, and it took approximately 2.5 days.
Architecture: The encoder has four blocks (stages) with {2,2,14,2} layers, respectively. The decoder has 4 layers.
Input Size: The model is designed to handle images of size 2560x1920.
Swin Window Size: The Swin Transformer window size is set to 10.
Datasets: The model was trained on IIT-CDIP (11 million images) and on SynthDoG data in four languages (English, Chinese, Japanese, and Korean; 0.5 million images each).
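
The input size and window size listed above are related: window attention needs each stage's token grid to split into whole windows. The sketch below checks this for donut-base. It assumes the standard Swin layout (a patch embedding with stride 4, then 2x downsampling between stages), which these notes do not state, and the function names are hypothetical.

```python
def swin_stage_grids(input_size, patch_size=4, num_stages=4):
    """Return the token grid (dim0, dim1) seen by each encoder stage.

    Assumption: standard Swin layout, i.e. a patch embedding with stride
    `patch_size`, then 2x downsampling between consecutive stages.
    """
    dim0, dim1 = input_size
    grids = []
    stride = patch_size
    for _ in range(num_stages):
        grids.append((dim0 // stride, dim1 // stride))
        stride *= 2
    return grids


def windows_tile_evenly(grids, window_size):
    """True if every stage grid splits into whole window_size x window_size windows."""
    return all(a % window_size == 0 and b % window_size == 0 for a, b in grids)


# donut-base settings from the list above: input 2560x1920, window size 10
base_grids = swin_stage_grids((2560, 1920))
print(base_grids)
print(windows_tile_evenly(base_grids, window_size=10))
```

Under these assumptions the final stage works on an 80x60 grid, which a window size of 10 divides evenly.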

donut-proto:

Training Resources: 8 V100 GPUs were used for training, and it took approximately 5 days.
Architecture: The encoder has four blocks (stages) with {2,2,18,2} layers, respectively. The decoder has 4 layers.
Input Size: The model is designed to handle images of size 2048x1536.
Swin Window Size: The Swin Transformer window size is set to 8.
Datasets: The model was trained on SynthDoG data in three languages (English, Japanese, and Korean; 0.4 million images each).
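
To put the two training runs side by side, the figures above can be collected into a small record and a few totals derived from them. This is only an illustrative sketch: the `TrainingRun` class and its field names are hypothetical, and GPU-days across A100 and V100 hardware are not directly comparable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    """Hypothetical container for the figures listed in these notes."""
    name: str
    gpus: int
    days: float
    encoder_layers: tuple
    decoder_layers: int
    dataset_images_millions: tuple  # one entry per dataset or language split

    @property
    def gpu_days(self):
        return self.gpus * self.days

    @property
    def total_images_millions(self):
        return sum(self.dataset_images_millions)


# IIT-CDIP (11M) plus four SynthDoG languages at 0.5M each
donut_base = TrainingRun("donut-base", 64, 2.5, (2, 2, 14, 2), 4,
                         (11.0, 0.5, 0.5, 0.5, 0.5))
# three SynthDoG languages at 0.4M each
donut_proto = TrainingRun("donut-proto", 8, 5.0, (2, 2, 18, 2), 4,
                          (0.4, 0.4, 0.4))

for run in (donut_base, donut_proto):
    print(run.name, run.gpu_days, sum(run.encoder_layers),
          round(run.total_images_millions, 1))
```

With these numbers, donut-base comes to 160 GPU-days on about 13 million images, while donut-proto comes to 40 GPU-days on about 1.2 million images.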

Additional Notes:

donut-proto: Described as a preliminary model, which suggests it is an early prototype rather than the final release.