Data Quality Framework for Large-Scale Enterprise Data and ML Systems

© 2024 by IJCTT Journal
Volume-72 Issue-2
Year of Publication : 2024
Authors : Mitesh Mangaonkar
DOI : 10.14445/22312803/IJCTT-V72I2P116
How to Cite?
Mitesh Mangaonkar, "Data Quality Framework for Large-Scale Enterprise Data and ML Systems," International Journal of Computer Trends and Technology, vol. 72, no. 2, pp. 92-98, 2024. Crossref, https://doi.org/10.14445/22312803/IJCTT-V72I2P116
Abstract
In the ever-changing landscape of enterprise data administration, maintaining a competitive advantage and making informed decisions demands high-quality data. This research presents a robust Data Quality Framework customized for ML systems and massive volumes of business data. The framework's ability to accommodate diverse data types and align with enterprise-level analytics rests on a fusion of sophisticated data governance standards, comprehensive quality metrics, and supporting technology. The article illustrates the framework in operation through real-world examples from diverse sectors, and it shows how combining AI and ML techniques with conventional data management processes marks a paradigm shift. The report concludes by predicting forthcoming developments in data quality management and offering strategic recommendations for organizations seeking to enhance data fidelity.
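As a minimal sketch of how rule-based quality metrics and an ML-assisted check might be combined in the spirit of the framework summarized above (this is an illustration, not the paper's implementation; the metric definitions, the pandas/scikit-learn stack, and the sample records are assumptions introduced here):

# Illustrative sketch only: the paper does not publish code, so the metric
# names, thresholds, and the pandas/scikit-learn stack are assumptions.
import pandas as pd
from sklearn.ensemble import IsolationForest

def quality_metrics(df: pd.DataFrame) -> dict:
    """Compute simple rule-based data quality metrics for a tabular dataset."""
    total_cells = df.size
    return {
        # Completeness: share of non-null cells across the whole table.
        "completeness": 1.0 - df.isna().sum().sum() / total_cells,
        # Uniqueness: share of rows that are not exact duplicates.
        "uniqueness": 1.0 - df.duplicated().mean(),
    }

def ml_anomaly_rate(df: pd.DataFrame, contamination: float = 0.05) -> float:
    """ML-assisted check: estimate the share of numeric rows that look anomalous."""
    numeric = df.select_dtypes("number").dropna()
    if numeric.empty:
        return 0.0
    model = IsolationForest(contamination=contamination, random_state=0)
    labels = model.fit_predict(numeric)  # -1 marks suspected anomalies
    return float((labels == -1).mean())

if __name__ == "__main__":
    # Hypothetical sample records standing in for an enterprise data feed.
    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 4, 5],
        "order_value": [120.0, 85.5, 85.5, None, 99000.0],
    })
    scores = quality_metrics(df)
    scores["anomaly_rate"] = ml_anomaly_rate(df)
    print(scores)

In an enterprise deployment, scores of this kind would typically feed governance dashboards and alerting thresholds rather than standard output.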
Keywords
Data quality framework, Machine Learning, Data governance, Data management.