chendelong/DirectSAM-1800px-0424

Direct Segment Anything Model (DirectSAM), introduced in the paper "Subobject-level Image Tokenization" by Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, and Pascale Fung.

Model. We use a SegFormer backbone with a total of 84.6M parameters. We replace the final multi-way classifier with a one-way classifier (a single output channel) and perform full-parameter fine-tuning.
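As a rough illustration of the one-way classifier, the sketch below builds a SegFormer with a single output channel using the Hugging Face `transformers` library. It is not the released checkpoint: it assumes the default `SegformerConfig` (a small variant with random weights, far fewer than 84.6M parameters), an arbitrary 64x64 dummy input, and a sigmoid over the upsampled logits to read them as per-pixel boundary probabilities.

```python
import torch
from transformers import SegformerConfig, SegformerForSemanticSegmentation

# Assumption: default SegformerConfig (a small variant) with random weights,
# used only to show the one-channel head; the released model is much larger.
config = SegformerConfig(num_labels=1)
model = SegformerForSemanticSegmentation(config)
model.eval()

# Dummy RGB batch; the 64x64 size is arbitrary and chosen to keep this fast.
pixel_values = torch.randn(1, 3, 64, 64)

with torch.no_grad():
    logits = model(pixel_values=pixel_values).logits  # (batch, 1, H/4, W/4)

# Upsample to the input size and squash to per-pixel boundary probabilities.
logits = torch.nn.functional.interpolate(
    logits, size=pixel_values.shape[-2:], mode="bilinear", align_corners=False
)
boundary_prob = torch.sigmoid(logits)[:, 0]  # (batch, H, W)
```

With `num_labels=1` the decode head ends in a 1x1 convolution with one output channel, so every pixel gets a single boundary score instead of a distribution over classes.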
Data. We train DirectSAM on the SA-1B dataset. The mask annotations are converted to boundaries by running OpenCV contour detection on each mask and drawing the extracted contours with a line width of 3. Random Gaussian blur is applied with a probability of 0.25.
Training. We train DirectSAM on the SA-1B dataset with a single-node 8xNVIDIA A100 (80GB) server. We first train it with an input resolution of 1024x1024 for one epoch, then for another 0.6 epoch with 1800x1800 resolution (the maximum resolution for data parallel training on 80GB GPUs). For the first stage at 1024x1024, we use a per-GPU batch size of 4, 4 gradient accumulation steps, and a learning rate of 4e-4. For the second stage at 1800x1800, we use a per-GPU batch size of 1, 8 gradient accumulation steps, and a learning rate of 2e-4. The two stages take around 15 days and 20 days, respectively.
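The effective batch size per optimizer step follows from these settings. The short calculation below assumes standard data-parallel training, where each of the 8 GPUs processes its own per-GPU batch and gradients are accumulated before each update; the function name is only illustrative.

```python
def effective_batch_size(num_gpus, per_gpu_batch, grad_accum_steps):
    """Samples contributing to one optimizer step under data parallelism."""
    return num_gpus * per_gpu_batch * grad_accum_steps


# Stage 1: 1024x1024 resolution.
stage_1 = effective_batch_size(num_gpus=8, per_gpu_batch=4, grad_accum_steps=4)
# Stage 2: 1800x1800 resolution.
stage_2 = effective_batch_size(num_gpus=8, per_gpu_batch=1, grad_accum_steps=8)

print(stage_1, stage_2)  # 128 64
```

Under this assumption, the higher-resolution stage halves both the effective batch size (128 to 64) and the learning rate (4e-4 to 2e-4).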

Please see our GitHub repo for more information.