When we load a model in 4-bit, the linear layers are replaced with 4-bit linear layers, and these layers report half the number of parameters. I am still not clear on how the number of parameters becomes half.
This is expected: a torch.int4 dtype is not supported in PyTorch, so instead we pack two 4-bit values into each element of a torch.int8 tensor. That is why the reported number of parameters is divided by 2 when we quantize in 4-bit!
From @marcsun13
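To make the packing idea concrete, here is a minimal sketch of how two 4-bit values can share one 8-bit element. This is an illustration of the general nibble-packing technique, not the actual bitsandbytes implementation (which also stores quantization state such as scales); the helper names `pack_4bit` and `unpack_4bit` are hypothetical:

```python
import torch

def pack_4bit(values: torch.Tensor) -> torch.Tensor:
    """Pack pairs of 4-bit values (0..15) into single uint8 elements."""
    assert values.numel() % 2 == 0, "need an even number of 4-bit values"
    values = values.to(torch.uint8).flatten()
    # First value of each pair goes in the high nibble, second in the low nibble.
    return (values[0::2] << 4) | values[1::2]

def unpack_4bit(packed: torch.Tensor) -> torch.Tensor:
    """Recover the original 4-bit values from the packed uint8 tensor."""
    high = (packed >> 4) & 0x0F
    low = packed & 0x0F
    # Interleave high/low nibbles back into the original order.
    return torch.stack([high, low], dim=-1).flatten()

# Eight 4-bit values fit in four uint8 elements once packed, so the
# storage tensor (and hence the reported parameter count) is halved.
vals = torch.tensor([1, 7, 15, 0, 3, 9, 12, 5])
packed = pack_4bit(vals)
print(packed.numel())                                          # 4
print(torch.equal(unpack_4bit(packed), vals.to(torch.uint8)))  # True
```

Since `nn.Parameter.numel()` only sees the packed int8 storage, a weight matrix with N logical 4-bit values shows up as N/2 parameters.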