Intelligent anomaly detection in unbalanced industrial data using the XGBoost model and genetic algorithm (GA) to optimize performance in identifying defective products in the production line

Document Type : Original Article

Authors

1 Ph.D. student - Industrial Management, Operational Research - Azad University, North Tehran branch

2 Associate Professor, Faculty of Engineering, Islamic Azad University, South Tehran Branch

3 Assistant Professor, Islamic Azad University, South Tehran Branch

4 Assistant Professor, Department of Industrial Management - Islamic Azad University, North Tehran Branch

10.48308/jimp.2025.235743.1564

Abstract

The production line process and its sequencing are among the basic considerations in planning the mass production of industrial goods. Poor line planning, and a lack of suitable solutions for optimizing production and assembly systems, lengthens production time and increases machine downtime. The consequences are a lower output and production rate, inefficient use of allocated and available resources, and higher system costs, all of which ultimately lead to low productivity and wasted resources. The main goal of this research is to identify anomalies in the semiconductor wafer production process using machine learning methods. The data comprise various features of production wafers collected from a large semiconductor manufacturer and describe the state of the wafers during production. To improve model performance and reduce the negative effects of outliers, winsorization was used to adjust values that lie very far from the mean in some features.

Methods. In this study, data preprocessing methods and simulation in Python were used to increase the accuracy of the model in identifying anomalies. The first step was to prepare the data and remove or adjust outliers. Because some features contained a wide spread of values and a large number of outliers that could bias the model, winsorization was used. Winsorization limits the very large and very small values of each feature to fixed thresholds, reducing their impact on the model's performance. Another key step was to reduce the dimensionality of the data. The dataset contains 1558 features, so processing and analyzing all of them requires significant computational resources and may make the model more complex than necessary. Linear Discriminant Analysis (LDA) was therefore used to project the data into a lower-dimensional space that better separates the normal and abnormal classes. This dimensionality reduction helps the model classify the data more accurately and simplifies computation. The study shows that appropriate data preprocessing techniques combined with machine learning models can successfully identify production anomalies and prevent defective products from entering the market.
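The two preprocessing steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the 1st and 99th percentile cutoffs, the function names `winsorize` and `fisher_lda_direction`, the ridge term, and the synthetic data are all assumptions made for illustration, and the LDA shown is the textbook two-class Fisher form.

```python
import numpy as np


def winsorize(X, lower_pct=1.0, upper_pct=99.0):
    """Clip each feature (column) to its own lower/upper percentile.

    The 1st/99th percentile cutoffs are illustrative assumptions.
    """
    lo = np.percentile(X, lower_pct, axis=0)
    hi = np.percentile(X, upper_pct, axis=0)
    return np.clip(X, lo, hi)


def fisher_lda_direction(X, y, ridge=1e-6):
    """Two-class Fisher LDA: w is proportional to Sw^{-1} (mu1 - mu0)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix, summed over both classes.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    # A small ridge term keeps Sw invertible when features are collinear.
    Sw += ridge * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)


# Synthetic, imbalanced stand-in data (not the paper's wafer dataset).
rng = np.random.default_rng(0)
n_normal, n_defect, n_features = 950, 50, 20
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_normal, n_features)),
    rng.normal(0.8, 1.0, size=(n_defect, n_features)),
])
y = np.concatenate([np.zeros(n_normal, dtype=int), np.ones(n_defect, dtype=int)])
X[0, 0] = 1e3  # inject one extreme outlier into the first feature

X_w = winsorize(X)                 # outlier is pulled back to the 99th percentile
w = fisher_lda_direction(X_w, y)   # direction that best separates the two classes
z = X_w @ w                        # one-dimensional projection for a downstream classifier
```

In a real pipeline the percentile thresholds and the LDA direction would be estimated on the training split only and then applied to the test split, so that no information leaks from the evaluation data.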

Findings. After data preparation, the standard table of orthogonal arrays in the Taguchi method was used to standardize the data. The L9(3^4) orthogonal array was selected as the most appropriate design for models three to six. The research data were then used to identify anomalies with the XGBoost model and the genetic algorithm, and the two models were compared. Model performance was evaluated using the confusion matrix, the ROC curve, and the efficiency of the genetic algorithm. The results showed that the model has a high ability to identify anomalies, with an area under the ROC curve (AUC) of 0.97. Next, to optimize further and to manage the challenge of data imbalance, a Genetic Algorithm (GA) was used as an evolutionary approach to adjust the feature weights and the classification threshold. These results indicate that the model can distinguish healthy from defective samples with high accuracy. The research shows that appropriate data preprocessing techniques combined with machine learning models can successfully identify manufacturing anomalies and defective parts.
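The threshold-tuning idea can be sketched as a small genetic algorithm with a single real-valued gene. This is a hypothetical illustration, not the authors' implementation: the F1 fitness function, the tournament selection, arithmetic crossover, Gaussian mutation and elitism operators, and all parameter values are assumptions. The synthetic `scores` stand in for the predicted probabilities of a trained classifier such as XGBoost, and the feature-weight part of the GA is not covered.

```python
import numpy as np


def f1_at_threshold(scores, y_true, threshold):
    """F1 score of the rule 'predict defective if score >= threshold'."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def ga_tune_threshold(scores, y_true, pop_size=30, generations=40,
                      mutation_scale=0.05, seed=0):
    """Hypothetical GA: evolve one real-valued gene, the decision threshold."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 1.0, size=pop_size)
    best_history = []
    for _ in range(generations):
        fitness = np.array([f1_at_threshold(scores, y_true, t) for t in pop])
        elite = pop[np.argmax(fitness)]
        best_history.append(fitness.max())
        # Tournament selection: each parent is the fitter of two random individuals.
        a, b = rng.integers(0, pop_size, size=(2, pop_size))
        parents = np.where(fitness[a] >= fitness[b], pop[a], pop[b])
        # Arithmetic crossover between each parent and a shuffled partner.
        partners = rng.permutation(parents)
        alpha = rng.uniform(0.0, 1.0, size=pop_size)
        children = alpha * parents + (1.0 - alpha) * partners
        # Gaussian mutation, clipped back into the valid range [0, 1].
        noise = rng.normal(0.0, mutation_scale, size=pop_size)
        children = np.clip(children + noise, 0.0, 1.0)
        children[0] = elite  # elitism: the best individual always survives
        pop = children
    fitness = np.array([f1_at_threshold(scores, y_true, t) for t in pop])
    return pop[np.argmax(fitness)], best_history


# Synthetic classifier scores on imbalanced labels (about 7% defective).
rng = np.random.default_rng(1)
y_true = (rng.uniform(size=2000) < 0.07).astype(int)
scores = np.where(y_true == 1,
                  rng.normal(0.6, 0.15, size=2000),
                  rng.normal(0.2, 0.10, size=2000))
scores = np.clip(scores, 0.0, 1.0)

best_threshold, history = ga_tune_threshold(scores, y_true)
best_f1 = f1_at_threshold(scores, y_true, best_threshold)
```

Because the best individual is copied unchanged into each new generation, the best fitness recorded in `history` never decreases, which is one simple way to obtain stable convergence across generations.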

Conclusion. The results of this research showed that the XGBoost method has a high ability to detect anomalies. The genetic algorithm also improved performance metrics such as precision (0.924), recall (0.924), and score (0.913), and converged stably over successive generations. Combining XGBoost with the Genetic Algorithm (GA) allows anomalies to be identified more accurately and shows that this approach can serve as a practical framework for improving quality control, reducing waste, and increasing the efficiency of production lines.
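For reference, the metrics named above have the following standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN). Reading the unnamed "score" as the F1-score is an assumption; the abstract does not say which score is meant.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{align}
```

On imbalanced data these measures are more informative than plain accuracy, because a classifier that labels every wafer as normal can still reach high accuracy while its recall on defective wafers is zero.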

Keywords

Main Subjects