ADVANCED MULTIVARIATE STATISTICAL METHODS FOR HIGH-DIMENSIONAL DATA MODELING, PREDICTION, AND INTERPRETATION
Keywords:
Classification, Data Mining, Financial Crimes, Tax Evasion, Transparency
Abstract
Tax evasion and financial crimes are persistent threats to economic stability and to public trust in fiscal systems, yet current detection techniques often lack the analytical depth needed to uncover hidden, convoluted patterns in large-scale financial data. Despite global progress in forensic accounting and regulatory practice, a clear gap remains in integrating sophisticated data mining methods to identify anomalies effectively across diverse datasets. This paper aimed to address that gap by developing and implementing a multi-layered analytical model that combines survival analysis, penalized regression, machine learning classification, and multivariate diagnostics to identify tax evasion and other financial crimes in the United States. Three heterogeneous datasets were analyzed: genomic-style high-dimensional financial records (n=200, p=5000), institutional transaction data (n=5200, p=300), and survey-based socioeconomic indicators (n=2500, p=220). Missing data were addressed with multiple imputation, multicollinearity was handled with variance inflation factor thresholds and principal component reduction, and robust statistical analyses were performed, including Cox regression, Elastic Net regression, Random Forest classification, and MANOVA. Findings showed that the genomic-style dataset yielded 12 significant predictors of fraudulent patterns (HR=1.51, 95% CI: 1.22-1.88, p=0.011), whereas on the financial dataset Elastic Net achieved stronger predictive performance (RMSE=2.78%) than the baseline OLS (RMSE=3.45%). On the survey-based data, Random Forest achieved an AUC of 0.83, outperforming the other classifiers considered. Clinical-style covariates were integrated to verify the independent contributions of fraud-related variables (C-index=0.72).
These results underscore the capacity of state-of-the-art data mining to speed up financial crime detection, reduce false positives, and inform regulatory policy. The research contributes a repeatable framework that strengthens methodological rigor and enriches the literature on evidence-based detection of financial crimes.
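To make the penalized-regression step concrete, the following is a minimal sketch of Elastic Net fitted by cyclic coordinate descent, with an RMSE helper of the kind used to compare it against an unpenalized (OLS-like) fit. It is an illustration only, not the paper's implementation: the objective follows the common convention (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1 - l1_ratio)*||w||_2^2, and the function names, the penalty values, the omission of an intercept, and the toy data are all assumptions.

```python
import math


def soft_threshold(rho, lam):
    """Soft-thresholding operator arising from the L1 part of the penalty."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Fit Elastic Net weights (no intercept) by cyclic coordinate descent.

    X is a list of rows, y a list of targets. Setting alpha=0 removes the
    penalty, so the result approaches the least-squares solution.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_i = sum(X[i][k] * w[k] for k in range(p))
                # Residual with feature j's own contribution added back.
                r = y[i] - pred_i + X[i][j] * w[j]
                rho += X[i][j] * r
                z += X[i][j] ** 2
            rho /= n
            z /= n
            denom = z + alpha * (1.0 - l1_ratio)
            if denom == 0.0:
                w[j] = 0.0  # all-zero feature with no ridge term
                continue
            w[j] = soft_threshold(rho, alpha * l1_ratio) / denom
    return w


def rmse(X, y, w):
    """Root mean squared error of the linear predictions X @ w."""
    sq = 0.0
    for row, target in zip(X, y):
        pred = sum(x * wk for x, wk in zip(row, w))
        sq += (target - pred) ** 2
    return math.sqrt(sq / len(y))


# Toy data (hypothetical): y depends only on the first feature.
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y_toy = [2.0, 0.0, 2.0, 4.0]

w_unpenalized = elastic_net(X_toy, y_toy, alpha=0.0)
w_penalized = elastic_net(X_toy, y_toy, alpha=1.0, l1_ratio=0.5)
print("unpenalized:", w_unpenalized, "RMSE:", rmse(X_toy, y_toy, w_unpenalized))
print("penalized:  ", w_penalized, "RMSE:", rmse(X_toy, y_toy, w_penalized))
```

On noiseless toy data like this the penalty can only raise in-sample error. An out-of-sample advantage of the kind reported above would be expected mainly when predictors are numerous or correlated and the comparison is made on held-out data.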