Very interesting work! But as you point out, the authors of the transformer model did not choose an appropriate validation metric. So what do you think the outcome of your study would be if the models were compared using a more appropriate metric, such as the area under the precision-recall curve or the F1-score?
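To make the suggestion concrete, here is a minimal sketch of the two metrics in plain Python. The labels, scores, and the 0.5 threshold are made-up toy values, not data from the study, and `average_precision` is only the common step-wise approximation of the area under the precision-recall curve (it ignores tied scores).

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1), from hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Averages the precision measured at the rank of each positive example,
    with examples sorted by decreasing score. Tied scores are not handled.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Hypothetical imbalanced toy data: 2 positives out of 10 samples.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.01]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("F1:", f1_score(y_true, y_pred))
print("Average precision:", average_precision(y_true, scores))
```

Both metrics focus on the positive class, which is why I would expect them to be more informative than a metric dominated by the majority class, assuming the dataset is imbalanced.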
Also, the sample size seems pretty low to justify using a transformer model. I guess deep learning for tabular data might only start to shine and show its full potential on much bigger datasets.