Categorical Feature Encoding Techniques for Improved Classifier Performance when Dealing with Imbalanced Data of Fraudulent Transactions


  • Dalia Breskuvienė Data Science and Digital Technologies Institute, Vilnius University, Lithuania
  • Gintautas Dzemyda Data Science and Digital Technologies Institute, Vilnius University, Lithuania



Keywords: imbalanced data, classifier, feature encoding, high-cardinality, fraud detection


Fraudulent transaction data tend to have several categorical features with high cardinality. This complicates data preprocessing when the categories in such features have no natural order or meaningful mapping to numerical values. Although many encoding techniques exist, their impact on massive, highly imbalanced data sets has not been thoroughly evaluated.

Our study used two transaction datasets in which fraudulent transactions account for less than 1% of the records. Six encoding methods were employed, each belonging to either the target-agnostic or the target-based group. The experiments covered several machine-learning techniques, including ensemble learning as well as linear and non-linear learning approaches.

Our study emphasizes the significance of carefully selecting an encoding approach that suits both the imbalanced dataset and the machine learning algorithm. Target-based encoding techniques can enhance model performance significantly. Among the encoding methods assessed, the James-Stein and Weight of Evidence (WOE) encoders were the most effective, whereas the CatBoost encoder may not be optimal for imbalanced datasets. Moreover, it is crucial to bear in mind the curse of dimensionality when employing encoding techniques such as hashing and One-Hot encoding.
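
To illustrate the contrast between a target-based and a target-agnostic encoder, the following is a minimal sketch rather than the implementation used in the study. It assumes pandas and NumPy, a toy data set with hypothetical column names (`merchant`, `is_fraud`), and one common additive-smoothing variant of Weight of Evidence; the `woe_encode` helper and its `smoothing` parameter are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd


def woe_encode(categories: pd.Series, target: pd.Series, smoothing: float = 0.5) -> pd.Series:
    """Replace each category with ln(share of frauds / share of non-frauds).

    Assumption: additive smoothing is one of several ways to keep the ratio
    finite for categories that contain no frauds (or no legitimate records).
    """
    stats = (
        pd.DataFrame({"cat": categories, "y": target})
        .groupby("cat")["y"]
        .agg(["sum", "count"])
    )
    events = stats["sum"]                      # frauds per category
    non_events = stats["count"] - stats["sum"]  # legitimate records per category
    n_categories = len(stats)

    event_share = (events + smoothing) / (events.sum() + smoothing * n_categories)
    non_event_share = (non_events + smoothing) / (non_events.sum() + smoothing * n_categories)

    woe = np.log(event_share / non_event_share)
    return categories.map(woe)


# Toy, made-up data: 10 transactions, 2 of them fraudulent.
df = pd.DataFrame({
    "merchant": ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"],
    "is_fraud": [0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
})

# Target-based: a single numeric column, whatever the cardinality.
df["merchant_woe"] = woe_encode(df["merchant"], df["is_fraud"])

# Target-agnostic: one column per distinct category.
one_hot = pd.get_dummies(df["merchant"])

print(df.groupby("merchant")["merchant_woe"].first())
print("one-hot columns:", one_hot.shape[1])
```

In this sketch the WOE encoding always yields one column, while the One-Hot encoding yields one column per category, which is where the dimensionality concern arises for high-cardinality features. In practice, a target-based encoder should be fitted on the training split only and then applied to validation and test data, so that target information does not leak into the evaluation.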


