Enhancing Machine Learning Life Cycle through Advanced Data Engineering

Deepak Jayabalan; Shantanu Indra

doi:https://doi.org/10.14445/22312803/IJCTT-V72I4P117

Research Article | Open Access

Volume 72 | Issue 4 | Year 2024 | Article ID: IJCTT-V72I4P117 | DOI: https://doi.org/10.14445/22312803/IJCTT-V72I4P117

Received: 25 Feb 2024 | Revised: 31 Mar 2024 | Accepted: 19 Apr 2024 | Published: 30 Apr 2024

Citation:

Deepak Jayabalan, Shantanu Indra, "Enhancing Machine Learning Life Cycle through Advanced Data Engineering," International Journal of Computer Trends and Technology (IJCTT), vol. 72, no. 4, pp. 136-139, 2024. Crossref, https://doi.org/10.14445/22312803/IJCTT-V72I4P117

Abstract

This research paper examines how advanced data engineering techniques can be integrated to optimize the Machine Learning (ML) lifecycle. In today's data-driven landscape, organizations increasingly rely on ML models for decision-making and predictive analytics. However, the development and deployment of ML models involve multiple complex stages, each presenting its own set of challenges. These stages include exploratory data analysis, data preparation and feature engineering, model training and tuning, model review and governance, offline evaluation, online experimentation, and deployment. Implementing changes within the ML lifecycle can be time-consuming and resource-intensive due to dependencies, complexities, and iterative experimentation cycles. This study explores how advanced data engineering strategies can address these challenges and streamline the ML lifecycle. By synthesizing existing literature and analyzing industry case studies, the research examines the impact of data engineering interventions at each stage of the ML process.
Furthermore, the study introduces key metrics, such as time to harvest, which measure the efficiency of the ML lifecycle from data collection to model deployment. It demonstrates how employing data engineering techniques can significantly reduce the time to harvest, improving operational efficiency by up to 20%. Through a comprehensive analysis, this paper provides valuable insights into the practical implications of integrating data engineering within the ML lifecycle, highlighting opportunities for innovation and optimization in ML-driven decision-making processes.
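
As a rough illustration of the time-to-harvest metric described above, the sketch below treats it as the elapsed time between the start of data collection and model deployment, and computes the relative reduction between two runs. The abstract does not give the paper's exact formula, so the function names, the day-based unit, and the sample dates are all assumptions made for illustration, not values from the study.

```python
from datetime import datetime


def time_to_harvest(collection_start: datetime, deployed_at: datetime) -> float:
    """Elapsed days from the start of data collection to model deployment.

    Assumed definition for illustration; the paper may define it differently.
    """
    if deployed_at < collection_start:
        raise ValueError("deployment cannot precede data collection")
    return (deployed_at - collection_start).total_seconds() / 86400.0


def relative_reduction(baseline_days: float, improved_days: float) -> float:
    """Fraction by which the improved run shortens the baseline run."""
    if baseline_days <= 0:
        raise ValueError("baseline must be positive")
    return (baseline_days - improved_days) / baseline_days


# Hypothetical dates, chosen only to show the arithmetic.
baseline = time_to_harvest(datetime(2024, 1, 1), datetime(2024, 2, 20))
improved = time_to_harvest(datetime(2024, 1, 1), datetime(2024, 2, 10))
reduction = relative_reduction(baseline, improved)
print(f"baseline={baseline:.0f} days, improved={improved:.0f} days, "
      f"reduction={reduction:.0%}")
```

With these made-up dates, a 50-day cycle shortened to 40 days corresponds to a 20% reduction, which is the scale of improvement the abstract reports as an upper bound.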

Keywords

Machine Learning, Data Engineering, Lifecycle Optimization, Data Preparation, Feature Engineering, Model Training.
