As a member of the IEF’s Financial Innovation and Data Analytics research group, Professor Adrian Gepp presented a collaborative research paper on fraud analytics at the 43rd Eurasia Business and Economics Society (EBES) conference in Spain on April 12, 2023. The paper, titled "Using data analytics to distinguish legitimate and illegitimate shell companies", was judged by peer review to be the Best Paper at the conference.
He presented a novel data-driven model to detect shell companies that are being used for money laundering, a global problem with an estimated annual cost in the trillions. Shell companies have multiple legitimate uses, but they can also be used to facilitate money laundering, so a data-driven model that quickly distinguishes legitimate from illegitimate shell companies is extremely valuable. Beneficiaries of such a model include government officials and compliance professionals, particularly accountants, tax officials and anti-corruption agencies.
The detection model created by Prof Gepp’s team used a hybrid data analytics approach trained on UK data using a matched sample design. The first stage involved pooling data from numerous sources into the graph database platform Neo4j. In addition to providing a simple visual representation of all the data, a graph platform was chosen because it enables the identification of hidden links within a network of illicit shell companies, such as common addresses and joint ownership. Graph analytics was then used to calculate quantitative scores for each node (shell company) that encapsulated information such as importance and influence within the graph network, similarity with other nodes and the presence of similar communities (smaller sub-networks).
These quantitative scores were then fed into a second stage to train a supervised learning detection model. Three modern statistical learning approaches were trialled and evaluated: a single decision tree, a random forest and a boosted tree network.
These three models were chosen in part because they are non-parametric, which matters because an appropriate model structure is not known in advance for complicated money laundering cases. Furthermore, all three techniques are tree-based, which has benefits in handling outliers, interaction effects and collinearity issues.
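To make the first, graph-based stage more concrete, the following is a minimal sketch in plain Python rather than Neo4j. It is not the team's implementation: the company records are hypothetical toy data, degree centrality is used as a simple stand-in for the importance scores, and connected components stand in for community detection. It only illustrates how shared addresses and owners can become links, and how those links can become numeric scores per company.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (company, registered address, owner).
records = [
    ("A Ltd", "1 High St", "owner_1"),
    ("B Ltd", "1 High St", "owner_2"),
    ("C Ltd", "9 Dock Rd", "owner_2"),
    ("D Ltd", "5 Mill Ln", "owner_3"),
]


def build_links(records):
    """Link two companies when they share an address or an owner."""
    by_attribute = defaultdict(set)
    for company, address, owner in records:
        by_attribute[("address", address)].add(company)
        by_attribute[("owner", owner)].add(company)
    links = {company: set() for company, _, _ in records}
    for companies in by_attribute.values():
        for a, b in combinations(sorted(companies), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def degree_centrality(links):
    """Fraction of the other companies each company is directly linked to."""
    n = len(links)
    return {c: (len(nbrs) / (n - 1) if n > 1 else 0.0) for c, nbrs in links.items()}


def communities(links):
    """Connected components, a simple stand-in for community detection."""
    seen, groups = set(), []
    for start in links:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(links[node] - group)
        seen |= group
        groups.append(group)
    return groups


links = build_links(records)
centrality = degree_centrality(links)
groups = communities(links)
```

In this toy data, "B Ltd" bridges a shared address and a shared owner, so it receives the highest centrality score, while "D Ltd" has no links and forms its own group. Scores of this kind are what would be passed to the supervised models in the second stage.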
A cornerstone of applied data analytics research is rigorous evaluation of model performance. This model was evaluated using standard train-test data partitioning and a variety of performance metrics. The accuracy of the best models is impressive, at well above 90%. However, it is important to consider more than simple classification accuracy, as it can be a misleading measure in the presence of unbalanced data. In this research, classification accuracy, area under the Receiver Operating Characteristic (ROC) curve, precision, recall and the F-value were all investigated, with the top model performing well across all of these metrics.
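The point about accuracy being misleading on unbalanced data can be illustrated with a short sketch. The labels below are hypothetical and are not drawn from the research; the example assumes 1 marks an illicit shell company and 0 a legitimate one, and it computes the F-value as the usual F1 score. The ROC curve is omitted because it requires predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical unbalanced sample: 95 legitimate companies, 5 illicit ones.
y_true = [0] * 95 + [1] * 5

# A useless "model" that labels every company as legitimate.
always_legitimate = [0] * 100
naive = classification_metrics(y_true, always_legitimate)

# A model that finds 4 of the 5 illicit companies and raises 2 false alarms.
better_pred = [0] * 93 + [1] * 2 + [1] * 4 + [0] * 1
better = classification_metrics(y_true, better_pred)
```

Here the naive model reaches 95% accuracy while detecting no illicit companies at all, so its recall and F1 are both zero. The second model has only slightly higher accuracy but far better recall, which is why the additional metrics matter.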