The following topics are offered to CentraleSupélec third-year students for the 2021-2022 period.

The project aims to compare the performance of hybrid networks with traditional approaches: GBM (XGBoost, LightGBM), stochastic processes (Hawkes), interval prediction [4] and deep neural networks (1D-CNN, LSTM, …). Comparisons will be made along several criteria, including the type of data (tabular or sequential) and the evaluation metric: cross-entropy for trend prediction, RMSE for regression, or domain-related metrics (Sortino ratio, Omega ratio).
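As an illustration of the metrics named above, here is a minimal sketch of RMSE and of the Sortino and Omega ratios computed from a series of per-period returns. It is not the provided toolkit: the function names, the zero target return and the absence of annualization are assumptions made for the example.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of equal length."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def sortino_ratio(returns, target=0.0):
    """Mean excess return over `target`, divided by the downside deviation.

    Only returns below the target contribute to the denominator.
    Not annualized (assumption for this sketch).
    """
    excess = np.asarray(returns, dtype=float) - target
    downside = np.minimum(excess, 0.0)
    downside_dev = np.sqrt(np.mean(downside ** 2))
    if downside_dev == 0.0:
        return float("inf")  # no return fell below the target
    return float(excess.mean() / downside_dev)


def omega_ratio(returns, threshold=0.0):
    """Sum of gains above `threshold` divided by sum of losses below it."""
    excess = np.asarray(returns, dtype=float) - threshold
    gains = np.maximum(excess, 0.0).sum()
    losses = np.maximum(-excess, 0.0).sum()
    if losses == 0.0:
        return float("inf")  # no return fell below the threshold
    return float(gains / losses)


# Hypothetical per-period returns of a strategy.
example_returns = [0.02, -0.01, 0.03, -0.02]
sortino = sortino_ratio(example_returns)
omega = omega_ratio(example_returns)
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Unlike cross-entropy or RMSE, both ratios are computed on the returns produced by acting on the predictions, so they require a backtest rather than a simple comparison with labels.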

In this project, you will:

- study the new models [1], [2], [3] and [4],
- implement the techniques proposed in the papers and set up a testbench using the provided toolkit (utilities, metrics, data processing).

[1] Discrete Event, Continuous Time RNNs; Mozer, 2017
[2] Transformer Hawkes Process; Zuo, 2021
[3] The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process; Mei & Eisner, 2017
[4] High-Quality Prediction Intervals for Deep Learning: A Distribution-Free, Ensembled Approach; Pearce, 2018

The project aims to analyze the variations in the environment that lead an RL agent to act. The approach is similar to explainability work in classification, and relies on literature specific to reinforcement learning [1] [2], which remains to be built up. Pre-trained models of some algorithms can be provided so that the work can focus on analysis. The comparison with manual management will focus on the types of movements and their effects, especially in technical analysis using TA-Lib [3].

During this work, you will:

- study explainability techniques in RL and survey the state of the art in techniques for analyzing the strategy of an RL agent,
- implement techniques and set up a testbench using the toolkit provided (utilities, metrics, data processing, backtesting).
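To give an idea of what such an analysis can look like, here is a minimal sketch of a perturbation-based saliency score in the spirit of [1], adapted to a vector of features instead of an image. Each feature is replaced in turn by a baseline value and the change in the policy output is measured. The names `perturbation_saliency` and `toy_policy`, the zero baseline and the toy weights are all assumptions made for the example; a pre-trained agent would take the place of `toy_policy`.

```python
import numpy as np


def perturbation_saliency(policy_fn, state, baseline):
    """Score each feature by the change in policy output when it is perturbed.

    The score is half the squared L2 distance between the policy output on
    the original state and on the state with feature i set to baseline[i].
    """
    state = np.asarray(state, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    reference = np.asarray(policy_fn(state), dtype=float)
    scores = np.zeros(state.shape[0])
    for i in range(state.shape[0]):
        perturbed = state.copy()
        perturbed[i] = baseline[i]
        shifted = np.asarray(policy_fn(perturbed), dtype=float)
        scores[i] = 0.5 * np.sum((reference - shifted) ** 2)
    return scores


def toy_policy(state):
    """Hypothetical stand-in for a pre-trained agent: softmax over 3 actions."""
    # Rows are features, columns are actions; feature 1 has no influence.
    weights = np.array([[2.0, 0.0, -2.0],
                        [0.0, 0.0, 0.0],
                        [0.5, -0.5, 0.0]])
    logits = state @ weights
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


state = np.array([1.0, 1.0, 1.0])
scores = perturbation_saliency(toy_policy, state, baseline=np.zeros(3))
```

In a trading setting the features could be technical indicators, and the choice of baseline (zero, a rolling mean, a neutral indicator value) is itself a design decision that affects the scores.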

[1] Visualizing and Understanding Atari Agents; Greydanus, 2018
[2] Uncovering Surprising Behaviors In Reinforcement Learning Via Worst-case Analysis; Ruderman, 2019
[3] TA-Lib, Python wrapper

Recent publications seem to challenge the hegemony of Gradient Boosted Trees methods (e.g. XGBoost) on tabular data, in favor of Deep Learning methods [1] [2] [3]. While Deep Learning methods have proven highly effective on images and text, their ability to correctly process more standard data formats such as tabular data is still a very active research topic. The objective of the project is to apply these different methods to fraud data, synthetic at first, then real data provided by LUSIS, in order to evaluate their effectiveness. Although the methods proposed in [1] [2] [3] have been validated on balanced data, the question of their performance on imbalanced data is still open.

During this work, you will analyze the impact of data imbalance on model performance (Deep Learning vs. Gradient Boosted Trees) on synthetic data. The advantage of synthetic data here is that the fraudulent payment rate can be varied in order to analyze its impact. For each of the fraud rates considered, you will:

- implement a baseline for shallow models (XGBoost and CatBoost),
- implement a baseline for deep models (ResNet as proposed in [2], and an MLP of the form Linear-ReLU-Dropout),
- implement at least one of the three approaches proposed in the bibliography,
- optimize hyperparameters,
- compare performance.

Afterwards, the performance of the different models can also be analyzed on real data provided by LUSIS.
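Varying the fraud rate on synthetic data can be sketched as follows. This is a minimal example under stated assumptions: the data are two Gaussian clusters generated on the spot, the helper `resample_to_fraud_rate` is hypothetical, and the rate is lowered by subsampling fraudulent rows while keeping every legitimate row.

```python
import numpy as np


def resample_to_fraud_rate(X, y, fraud_rate, rng):
    """Return a subset of (X, y) whose share of frauds is close to `fraud_rate`.

    All legitimate rows (y == 0) are kept; fraudulent rows (y == 1) are
    subsampled without replacement.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    fraud_idx = np.flatnonzero(y == 1)
    legit_idx = np.flatnonzero(y == 0)
    # Solve n_fraud / (n_fraud + n_legit) = fraud_rate for n_fraud.
    n_fraud = int(round(fraud_rate * len(legit_idx) / (1.0 - fraud_rate)))
    if n_fraud > len(fraud_idx):
        raise ValueError("not enough fraudulent rows for the requested rate")
    keep_fraud = rng.choice(fraud_idx, size=n_fraud, replace=False)
    keep = rng.permutation(np.concatenate([legit_idx, keep_fraud]))
    return X[keep], y[keep]


rng = np.random.default_rng(0)
# Toy synthetic payments: legitimate rows around 0, fraudulent rows around 1.5.
X_legit = rng.normal(0.0, 1.0, size=(10_000, 4))
X_fraud = rng.normal(1.5, 1.0, size=(2_000, 4))
X_all = np.vstack([X_legit, X_fraud])
y_all = np.concatenate([np.zeros(10_000, dtype=int), np.ones(2_000, dtype=int)])

observed = {}
for rate in (0.1, 0.01, 0.001):
    X_r, y_r = resample_to_fraud_rate(X_all, y_all, rate, rng)
    observed[rate] = y_r.mean()  # each model would be trained and scored here
```

At the lowest rates, accuracy becomes uninformative, so the comparison between models would typically rely on metrics that focus on the minority class.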

[1] A. Kadra, M. Lindauer, F. Hutter, and J. Grabocka. Regularization is all you need: Simple neural nets can excel on tabular data, 2021.
[2] Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko. Revisiting deep learning models for tabular data, 2021.
[3] G. Somepalli, M. Goldblum, A. Schwarzschild, C. B. Bruss, and T. Goldstein. SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training, 2021.

Ex-post methods that seek to explain the prediction of a black-box model, such as LIME [1] or Anchors [2], cannot be used effectively on fraud data. By construction, the data is extremely imbalanced, with a large majority of non-fraudulent payments. This makes the sampling methods used by these techniques ineffective and the relevance of their explanations very uncertain. Other methods from the literature can therefore be tested in order to evaluate their relevance on fraud data. In particular, the method proposed in [3] provides a set of rules obtained by jointly maximizing accuracy (the predictive capacity of the model) and interpretability. Using the metrics proposed in that paper, the rules obtained can be compared with those of the model proposed in [4], which produces a decision list, that is, an ordered set of rules.

We propose to implement the models presented in [3] and [4], then to modify the model in [3] to include the notion of diversity in the values taken by the categorical features (e.g. a merchant code belonging to a set of values rather than equal to a single value). Finally, the results obtained will be compared using relevant metrics found in the literature.
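To make the two notions concrete, here is a minimal sketch of an ordered rule list whose conditions accept a set of categorical values instead of a single one. It only illustrates how such rules are applied to a payment, not how [3] or [4] learn them; the `Rule` class, the feature names and the merchant codes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A conjunction of conditions, each allowing a set of categorical values."""
    conditions: dict  # feature name -> set of allowed values
    label: int        # predicted class when the rule fires (1 = fraud)

    def matches(self, row):
        # Every listed feature must take one of its allowed values.
        return all(row.get(name) in allowed
                   for name, allowed in self.conditions.items())


def predict_rule_list(rules, row, default_label=0):
    """Apply rules in order and return the label of the first one that matches."""
    for rule in rules:
        if rule.matches(row):
            return rule.label
    return default_label


# Hypothetical rules; the merchant codes are made up for the example.
rules = [
    Rule({"merchant_code": {"M001", "M017"}, "channel": {"online"}}, label=1),
    Rule({"country": {"XX"}, "amount_band": {"high"}}, label=1),
]

payment_a = {"merchant_code": "M017", "channel": "online", "country": "FR"}
payment_b = {"merchant_code": "M002", "channel": "online", "country": "FR"}
pred_a = predict_rule_list(rules, payment_a)
pred_b = predict_rule_list(rules, payment_b)
```

In a decision set, by contrast, rules are unordered and each applies independently, so overlapping rules with different labels have to be handled explicitly.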

[1] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016 (pp. 1135–1144).
[2] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “Anchors: High-Precision Model-Agnostic Explanations.” In AAAI Conference on Artificial Intelligence (AAAI), 2018.
[3] Himabindu Lakkaraju, Stephen H. Bach, and Jure Leskovec. “Interpretable Decision Sets: A Joint Framework for Description and Prediction.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1675–1684). Association for Computing Machinery, 2016.
[4] Aoga, Pierre. “Finding Probabilistic Rule Lists using the Minimum Description Length Principle.” In Discovery Science (pp. 66–82). Springer International Publishing, 2018.