Is there significance in the fact that a value is missing?
“Missing” means what …
Most learning algorithms can handle missing values, but they may make different assumptions about them.
What to do with missing values
- Omit instances where the attribute value is missing?
- Treat “missing” as a separate possible value?
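The two options above can be sketched in a few lines of plain Python. This is only an illustration: the toy dataset, the function names `omit_incomplete` and `missing_as_value`, and the `"?missing"` label are assumptions made for the example, and `None` is assumed to mark a missing value.

```python
# Toy dataset (hypothetical); None marks a missing value.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": None, "windy": "yes"},
    {"outlook": "rainy", "windy": None},
]

def omit_incomplete(rows):
    """Option 1: keep only instances with no missing attribute values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def missing_as_value(rows, label="?missing"):
    """Option 2: recode each missing value as its own explicit category."""
    return [
        {attr: (label if value is None else value) for attr, value in r.items()}
        for r in rows
    ]

complete = omit_incomplete(rows)    # only the first instance survives
recoded = missing_as_value(rows)    # all instances kept, gaps labelled
```

Omitting instances shrinks the dataset, while recoding keeps every instance and lets a learner pick up on the missingness itself if it turns out to be informative.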
Remove any attribute in which 33% or more of the values are missing, provided the fact that they are missing is not itself significant.
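As a rough sketch of the 33% rule, the function below drops every attribute whose fraction of missing values reaches the threshold. The name `drop_sparse_attributes`, the toy data, and the use of `None` for missing values are assumptions for illustration. The code cannot judge whether missingness is significant, so that decision is left to the analyst before calling it.

```python
def drop_sparse_attributes(rows, threshold=0.33):
    """Remove every attribute whose fraction of missing values is >= threshold."""
    if not rows:
        return []
    n = len(rows)
    keep = [
        attr for attr in rows[0]
        if sum(r[attr] is None for r in rows) / n < threshold
    ]
    return [{attr: r[attr] for attr in keep} for r in rows]

# Hypothetical data: "age" is missing in 1 of 4 rows, "income" in 2 of 4.
rows = [
    {"age": 31, "income": None},
    {"age": None, "income": None},
    {"age": 45, "income": 52000},
    {"age": 28, "income": 61000},
]
reduced = drop_sparse_attributes(rows)  # "income" is dropped, "age" is kept
```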
In general, it's better to replace missing values rather than delete them entirely, since in many cases these attributes will contribute some useful information.
In Weka, the ReplaceMissingValues filter replaces missing values in numeric attributes with the mean, and missing values in nominal attributes with the mode, i.e., the most frequent value. When the filter is applied to the whole dataset before cross-validation, the means and modes are calculated over all instances. Thus, for each fold of the cross-validation, some of the attribute values in the training set have been contaminated with information from the test set (although the effect is probably very small). This could produce results that differ slightly from those obtained on a completely independent test set in which missing values are replaced by means/modes from that test set.
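One way to avoid the contamination described above is to compute the means and modes from the training portion only and then apply them to the held-out portion. The sketch below illustrates that idea in plain Python. It is not Weka's implementation; `fit_imputer`, `apply_imputer`, and the toy data are hypothetical names chosen for the example, and `None` is assumed to mark a missing value.

```python
from statistics import mean, mode

def fit_imputer(train_rows):
    """Learn one fill value per attribute from the training rows only:
    the mean for numeric attributes, the mode for nominal ones."""
    fill = {}
    for attr in train_rows[0]:
        observed = [r[attr] for r in train_rows if r[attr] is not None]
        if not observed:
            fill[attr] = None  # nothing observed, so nothing to impute with
        elif all(isinstance(v, (int, float)) for v in observed):
            fill[attr] = mean(observed)
        else:
            fill[attr] = mode(observed)
    return fill

def apply_imputer(rows, fill):
    """Replace missing values using fill values learned elsewhere."""
    return [
        {attr: (fill[attr] if value is None else value) for attr, value in r.items()}
        for r in rows
    ]

# Hypothetical split: statistics come from train, never from test.
train = [
    {"temp": 20.0, "outlook": "sunny"},
    {"temp": None, "outlook": "sunny"},
    {"temp": 30.0, "outlook": "rainy"},
]
test = [{"temp": None, "outlook": None}]

fill = fit_imputer(train)
imputed_train = apply_imputer(train, fill)
imputed_test = apply_imputer(test, fill)
```

In a cross-validation, `fit_imputer` would be called once per fold on that fold's training instances, so no statistic is ever computed from the instances used for testing.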