Imputation Framework for Missing Values

K. Raja; G. Tholkappia Arasu; Chitra. S. Nair

doi:https://doi.org/10.14445/22312803/IJCTT-V3I2P104

Research Article | Open Access

Volume 3 | Issue 2 | Year 2012 | Article Id. IJCTT-V3I2P104

Citation :

K. Raja, G. Tholkappia Arasu, Chitra. S. Nair, "Imputation Framework for Missing Values," International Journal of Computer Trends and Technology (IJCTT), vol. 3, no. 2, pp. 215-219, 2012. Crossref, https://doi.org/10.14445/22312803/IJCTT-V3I2P104

Abstract

Missing values are a common problem in statistical analysis and degrade the quality of data. They arise for several reasons: malfunctioning measurement equipment, changes in experimental design during data collection, collation of several similar but not identical datasets, and survey respondents who decline to answer certain questions, such as age or income. This paper first analyzes widely used methods for treating missing values, whether continuous or discrete. It then advocates an estimator that imputes both continuous and discrete missing target values. The proposed method is evaluated, and the evaluation indicates that it outperforms existing methods in terms of classification accuracy.
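
The abstract does not spell out the estimator, so the following is only an illustrative sketch of the general idea of imputing a missing target from records with mixed continuous and discrete attributes. It assumes a product kernel (Gaussian on continuous attributes, a simple match/mismatch weight on discrete ones), a kernel-weighted mean for continuous targets, and a kernel-weighted vote for discrete targets. The function names, the bandwidth `h`, and the mismatch weight `lam` are hypothetical choices made for this example and are not taken from the paper.

```python
import math
from collections import defaultdict


def mixed_kernel_weight(x, xi, continuous_idx, discrete_idx, h=1.0, lam=0.2):
    """Similarity of two records under an assumed product kernel.

    Continuous attributes use a Gaussian kernel with bandwidth h;
    discrete attributes contribute 1.0 on a match and lam on a mismatch.
    """
    w = 1.0
    for j in continuous_idx:
        u = (x[j] - xi[j]) / h
        w *= math.exp(-0.5 * u * u)
    for j in discrete_idx:
        w *= 1.0 if x[j] == xi[j] else lam
    return w


def impute_target(x, complete_rows, continuous_idx, discrete_idx,
                  target_is_discrete, h=1.0, lam=0.2):
    """Impute the missing target of record x.

    complete_rows is a list of (features, target) pairs whose targets
    are observed. Returns a weighted vote for discrete targets and a
    weighted mean for continuous targets.
    """
    weights = [
        mixed_kernel_weight(x, xi, continuous_idx, discrete_idx, h, lam)
        for xi, _ in complete_rows
    ]
    if target_is_discrete:
        votes = defaultdict(float)
        for w, (_, y) in zip(weights, complete_rows):
            votes[y] += w
        return max(votes, key=votes.get)
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the unweighted mean.
        return sum(y for _, y in complete_rows) / len(complete_rows)
    return sum(w * y for w, (_, y) in zip(weights, complete_rows)) / total


# Hypothetical usage: attribute 0 is continuous, attribute 1 is discrete.
observed = [((30.0, "urban"), 52.0), ((31.0, "urban"), 55.0), ((60.0, "rural"), 20.0)]
estimate = impute_target((30.5, "urban"), observed, [0], [1], target_is_discrete=False)
```

In this sketch the estimate is pulled toward the targets of records that are close on the continuous attribute and agree on the discrete one; how `h` and `lam` should be chosen is left open here.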

Keywords

Classification, data mining, methodologies.
