Submitted Abstract
In many real projects that rely on data analytics, the most time-consuming problems are not algorithms and models but data processing. The data is often not only dirty (e.g., missing entries, inconsistent formats), but also lacks labels, is imbalanced, and exhibits unknown shifts and drifts. Data wrangling and cleaning is a major step in any data science application, and data availability and quality are far more of a limiting constraint for data analysis than the machine learning techniques at our disposal.

In this project we develop approaches and algorithms to help our partner, LOGOS, overcome these limitations by augmenting the available data and by developing cost-sensitive methods to obtain labels and train more impactful models from the available samples, automatically identifying which samples will be most important for decision making. LOGOS provides a product for fraud analytics, such as know-your-customer (KYC) and anti-money-laundering (AML) checks, for banks. In this application, hand-crafted rules to detect suspicious activities still dominate because the data quality is insufficient for supervised learning approaches; unsupervised methods are the common way to supplement the carefully crafted rules.

In this project, we will first leverage recent advances in unsupervised, generative representation learning to improve data quality. These methods extract a latent space from the input that can be navigated and sampled. This not only provides compact descriptions of the input that replace manual feature engineering, but can also be used for realistic imputation and synthetic oversampling (sketched below).

Secondly, we will supplement these methods with active learning to selectively obtain labeled data and maximize the usefulness of the available samples. To minimize the labeling effort, we will develop a cost-loss model for the decisions to be made and optimize the active learning queries to maximize impact while minimizing cost (sketched below), accounting for all stakeholders: the bank as a customer of LOGOS, the bank's customers, and LOGOS as the service provider.

Finally, by combining these representations and imputations with the ability to selectively obtain new data, we can drastically improve the utilization of the available data and thus improve the models and algorithms learned to detect suspicious transactions.
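The following is a minimal sketch of the first idea, latent-space imputation and synthetic oversampling. PCA stands in here for the generative representation learner (e.g., a variational autoencoder) that the project would actually use, and all data, dimensions, and noise scales are illustrative assumptions.

```python
# Sketch: latent-space imputation and synthetic oversampling.
# PCA is a stand-in for a learned generative latent space (e.g. a VAE);
# the toy data and parameters below are made up for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Toy transaction features with a rare "suspicious" label and missing entries.
X = rng.normal(size=(500, 8))
y = (X[:, 0] > 1.5).astype(int)                  # imbalanced label
X[rng.random(X.shape) < 0.05] = np.nan           # 5% of entries missing at random

# 1) Rough fill, then fit a latent space on the filled data.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
latent = PCA(n_components=4).fit(X_filled)

# 2) Imputation: project into the latent space and reconstruct;
#    keep observed values, replace only the missing ones.
X_recon = latent.inverse_transform(latent.transform(X_filled))
X_imputed = np.where(np.isnan(X), X_recon, X)

# 3) Synthetic oversampling: perturb minority samples in latent space
#    ("navigate and sample") and decode back to the input space.
minority = X_imputed[y == 1]
Z = latent.transform(minority)
Z_new = Z + rng.normal(scale=0.1, size=Z.shape)
X_synth = latent.inverse_transform(Z_new)

print(X_imputed.shape, X_synth.shape)
```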
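The second sketch illustrates cost-aware active learning query selection. The cost figures, the scoring rule, and the budget are illustrative assumptions, not the project's actual cost-loss model, and the classifier is a simple placeholder.

```python
# Sketch: cost-aware selection of samples to label (active learning).
# Costs, budget, and scoring rule are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Toy pool: a few labeled samples, many unlabeled ones.
X_lab = rng.normal(size=(40, 5))
y_lab = (X_lab[:, 0] > 0.5).astype(int)
X_pool = rng.normal(size=(1000, 5))

# Asymmetric decision costs (in practice, per stakeholder; aggregated here).
COST_FALSE_NEGATIVE = 50.0   # missed suspicious transaction
COST_FALSE_POSITIVE = 2.0    # analyst reviews a benign transaction
COST_PER_LABEL = 1.0         # effort to obtain one label

clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
p = clf.predict_proba(X_pool)[:, 1]              # P(suspicious) per pool sample

# Expected cost of the best decision the current model can make per sample:
# flag the transaction, or let it pass.
cost_if_flag = (1 - p) * COST_FALSE_POSITIVE
cost_if_pass = p * COST_FALSE_NEGATIVE
expected_cost = np.minimum(cost_if_flag, cost_if_pass)

# Query the samples where a label buys the most, net of labeling effort,
# within a fixed budget of queries.
gain = expected_cost - COST_PER_LABEL
budget = 20
query_idx = np.argsort(-gain)[:budget]
print("samples to send for labeling:", query_idx[:10])
```

A ranking of this kind lets the labeling effort concentrate on transactions where any decision under the current model is expensive, which is the intuition behind optimizing queries for impact per unit cost.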