Training BertForMaskedLM

I have a question.
When training with BertForMaskedLM, is the training data below correct?

  • token2idx: `<pad>`: 0, `<mask>`: 1, `<cls>`: 2, `<sep>`: 3
  • max len: 8
  • input tokens: `<cls> hello i <mask> cats <sep>`
  • input_ids: `[2, 34, 45, 1, 56, 3, 0, 0]`
  • attention_mask: `[1, 1, 1, 1, 1, 1, 0, 0]`
  • labels: `[-100, -100, -100, 64, -100, -100, -100, -100]`

I wonder if I should also assign -100 to the labels for the padding tokens.
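For what it's worth, the layout above can be reproduced with a few lines of plain Python (using the example's hypothetical vocabulary, where `64` is assumed to be the original id of the masked word):

```python
# Hypothetical special-token ids from the example above.
PAD, MASK, CLS, SEP = 0, 1, 2, 3
IGNORE = -100  # PyTorch's CrossEntropyLoss skips positions labeled -100

input_ids = [2, 34, 45, 1, 56, 3, 0, 0]  # <cls> hello i <mask> cats <sep> <pad> <pad>

# Attention mask: 1 for real tokens, 0 for padding.
attention_mask = [1 if t != PAD else 0 for t in input_ids]

# Labels: the original token id at each masked position, -100 everywhere else.
# Position 3 held the masked word, whose original id is 64 in the example.
masked_originals = {3: 64}  # position -> original token id (hypothetical)
labels = [masked_originals.get(i, IGNORE) for i in range(len(input_ids))]

print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
print(labels)          # [-100, -100, -100, 64, -100, -100, -100, -100]
```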

Hi,
Were you able to figure it out? I’m also trying to do the same thing.

Thanks,
Ayala

You should replace all tokens (including padding) in `labels` with -100 except the masked tokens, so the loss is calculated only for the masked tokens.
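A minimal sketch of that rule, given both the masked and the unmasked sequence (the token ids are the question's hypothetical ones; in practice, `DataCollatorForLanguageModeling` from `transformers` applies this same -100 convention for you when `mlm=True`):

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss

def make_mlm_labels(input_ids, original_ids, mask_token_id=1):
    """Keep the original id only at masked positions; everything else
    (real tokens, special tokens, and padding) becomes -100."""
    return [orig if tok == mask_token_id else IGNORE_INDEX
            for tok, orig in zip(input_ids, original_ids)]

# Example from the question: "hello i <mask> cats", original word id 64.
masked   = [2, 34, 45, 1, 56, 3, 0, 0]
original = [2, 34, 45, 64, 56, 3, 0, 0]
print(make_mlm_labels(masked, original))
# [-100, -100, -100, 64, -100, -100, -100, -100]
```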