Profile

I am a data mining and machine learning researcher in the Computational Biology group of the Janssen Pharmaceutical Companies of Johnson & Johnson. I study and develop algorithms for drug discovery to treat diseases.

I was a postdoctoral researcher in INRIA, MAGNET group in Lille, France, during the period from March 2017 to mid of June 2018, under the supervision of Jan Ramon. There, I was investigating on decentralised and privacy-aware learning from mobility data.

I received my PhD from the Machine Learning group, DTAI lab, Department of Computer Science, KU Leuven. My PhD promotors were Prof.Luc De Raedt, Prof.Siegfried Nijssen and Prof.Kathleen Marchal. I obtained MSc of ICT from Asian Institute of Techonology (AIT), Thailand, where I was supervised by Prof.Peter Haddawy.

Research expertises and interests

Within machine learning, I have expertise in Probabilistic Graphical Models, Bayesian modelling, Deep Learning. I have applied Deep Neural Networks and Bayesian Matrix Factorization to predict drug-target interactions within the European Research Project named ExCAPE (http://www.excape-h2020.eu/).

Within data mining, I have expertise in explorative data mining using principles of optimization such as Constraint Programming, Integer Programming, Matrix Factorization, and model selection using Minimum Description Length. Typical work in this direction is the introduction of a general framework called Semiring Rank Matrix Factorization (sRMF) for mining different type of patterns in ranked data. We successfully applied sRMF to integrate heterogeneous data, including mutation data, CNV data, gene expression and prior knowledge encoded in protein-protein interaction networks to simultaneously discover cancer subtypes and subtype features.

I am currently interested in declarative data mining and probabilistic machine learning algorithms to discover (relational) patterns from streaming data.

My goal is to develop machine learning/data mining algorithms to address problems that matter. Main applications will possibly be in the medical domains, for example, integrating heterogeneous data, omics data and imaging data, to study cancer subtypes and cancer evolution. Other domains might include traffic and econmic networks.

News

18.10.2016 The website is online.

Recent Works

We develop matrix factorisation methods for pattern set mining in rank data, i.e., data consists of ranking of items, and apply those methods in discovering cancer subtypes and subtype specific features simultaneously.

Rank Matrix Factorisation (RMF) for pattern set mining in rank data

We introduce a generic rank matrix factorisation framework based on semiring theory for pattern set mining in rank data. In this type of data, each row is a complete or partial ranking of the columns.

We succesfully apply the RMF framework for discovering different types of patterns in rank matrices, e.g., Sparse RMF and and ranked tiling, and for cancer subtyping.

References: Le Van et al., Semiring rank matrix factorisation; Le Van et al., Rank matrix factorisation, PAKDD (2015).

Publications

1. Semiring rank matrix factorisation

Thanh Le Van, Siegfried Nijssen, Matthijs van Leeuwen and Luc De Raedt

Journal article IEEE Transactions on Knowledge and Data Engineering, DOI: 10.1109/TKDE.2017.2688374.

Abstract Bibtex

Rank data, in which each row is a complete or partial ranking of available items (columns), is ubiquitous. It has been used to represent, for instance, preferences of users, the levels of gene expression, and the outcomes of sports events. While rank data has been analysed in data mining, pattern mining in such data has so far not received much attention. To alleviate this state of affairs, we introduce a novel pattern set mining approach. It is based on a semiring matrix factorisation framework. Rather than using the traditional linear algebra for matrix factorisation we employ the max-product semiring. Our semiring matrix factorisation framework is then applied to two tasks: sparse rank matrix factorisation and ranked tiling. Algorithms for these tasks are proposed, and their effectiveness is demonstrated on extensive experiments using both synthetic and real data.

To be added.

2. Simultaneous discovery of cancer subtypes and subtype features by molecular data integration

Thanh Le Van, Matthijs van Leeuwen, Ana Carolina Fierro, Dries De Maeyer, Jimmy Van den Eynden, Lieven Verbeke, Luc De Raedt, Kathleen Marchal, and Siegfried Nijssen

Journal article Bioinformatics (2016) 32 (17): i445-i454 doi:10.1093/bioinformatics/btw434

Abstract Bibtex Pdf Code

Subtyping cancer is key to an improved and more personalized prognosis/treatment. The increasing availability of tumor related molecular data provides the opportunity to identify molecular subtypes in a data-driven way. Molecular subtypes are defined as groups of samples that have a similar molecular mechanism at the origin of the carcinogenesis. The molecular mechanisms are reflected by subtype-specific mutational and expression features.

Data-driven subtyping is a complex problem as subtyping and identifying the molecular mechanisms that drive carcinogenesis are confounded problems. Many current integrative subtyping methods use global mutational and/or expression tumor profiles to group tumor samples in subtypes but do not explicitly extract the subtype-specific features. We therefore present a method that solves both tasks of subtyping and identification of subtype-specific features simultaneously. Hereto our method integrates mutational and expression data while taking into account the clonal properties of carcinogenesis. Key to our method is a formalisation of the problem as a \emph{rank matrix factorisation} of \emph{ranked data} that approaches the subtyping problem as \emph{multi-view bi-clustering}.

To be added.

3. Rank matrix factorisation

Thanh Le Van, Matthijs van Leeuwen, Siegfried Nijssen and Luc De Raedt

Conference paper The 19th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2015). HoChiMinh City, VietNam, 19-22 May 2015 (pp. 734-746) Springer.

Abstract Bibtex Pdf Code

We introduce the problem of \emph{rank matrix factorisation} (RMF). That is, we consider the decomposition of a rank matrix, in which each row is a (partial or complete) ranking of all columns. Rank matrices naturally appear in many applications of interest, such as sports competitions. Summarising such a rank matrix by two smaller matrices, in which one contains partial rankings that can be interpreted as local patterns, is therefore an important problem.

After introducing the general problem, we consider a specific instance called Sparse RMF, in which we enforce the rank profiles to be sparse, i.e., to contain many zeroes. We propose a greedy algorithm for this problem based on integer linear programming. Experiments on both synthetic and real data demonstrate the potential of rank matrix factorisation.

To be added.

4. Ranked tiling

Thanh Le Van, Matthijs van Leeuwen, Siegfried Nijssen,Ana Carolina Fierro, Kathleen Marchal and Luc De Raedt

Conference paper The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2014 (ECML/PKDD 2014), Nancy, France, 15-19 September 2014, (pp. 98-113) Springer.

Abstract Bibtex Pdf Code

Tiling is a well-known pattern mining technique. Traditionally, it discovers large areas of ones in binary databases or matrices, where an area is defined by a set of rows and a set of columns. %However, it cannot be applied on data that is not binary. In this paper, we introduce the novel problem of {\em ranked} tiling, which is concerned with finding interesting areas in ranked data. In this data, each transaction defines a complete ranking of the columns. Ranked data occurs naturally in applications like sports or other competitions. It is also a useful abstraction when dealing with numeric data in which the rows are incomparable.% due do incompatible scales.

We introduce a scoring function for ranked tiling, as well as an algorithm using constraint programming and optimization principles. We empirically evaluate the approach on both synthetic and real-life datasets, and demonstrate the applicability of the framework in several case studies. One case study involves a heterogeneous dataset concerning the discovery of biomarkers for different subtypes of breast cancer patients. An analysis of the tiles by a domain expert shows that our approach can lead to the discovery of novel insights.

To be added.

5. Mining local staircase patterns from noisy data

Thanh Le Van, Ana Carolina Fierro, Tias Guns, Matthijs van Leeuwen, Siegfried Nijssen, Luc De Raedt and Kathleen Marchal

Workshop paper International workshop on Co-Clustering and Applications (CoClus'12) in conjunction with IEEE ICDM 2012. Brussels, 10 December 2012 (pp. 139-146) IEEE Computer Society.

Abstract Bibtex Pdf Code

Most traditional biclustering algorithms identify biclusters with no or little overlap. In this paper, we introduce the problem of identifying staircases of biclusters. Such staircases may be indicative for causal relationships between columns and can not easily be identified by existing biclustering algorithms. Our formalization relies on a scoring function based on the Minimum Description Length principle. Furthermore, we propose a first algorithm for identifying staircase biclusters, based on a combination of local search and constraint programming. Experiments show that the approach is promising.

To be added.

6. Deriving financial aid optimization models from admissions data

Thanh Le Van and Peter Haddawy

Conference paper 37th Annual Frontiers In Education Conference - Global Engineering: Knowledge Without Borders, Opportunities Without Passports, 2007 (pp. F2A-7-F2A-12), IEEE Computer Society.

Abstract Bibtex Pdf

This paper presents a novel approach to deriving probabilistic models that predict enrollment given applicant background and the amount of financial aid offered. Our Bayesian network models can be used to optimize various enrollment objectives. We present a novel efficient optimization algorithm that uses the models to maximize expected tuition revenue under capacity constraints including student-faculty ratio and accommodation. We demonstrate and evaluate our approach using four years of graduate admissions data from the Asian Institute of Technology, consisting of 7,788 applicants from 84 different countries. This data set is particularly challenging since reliable family income data is not available for students from most of these countries. Evaluating the Bayesian network model with 10-fold cross validation yields an ROC Az value of 0.8451, with a predictive accuracy of 82.70% at a threshold of 0.5. Comparing the results of the tuition revenue optimization model to the institute's current financial aid allocation practice shows that if single-term tuition revenue is the sole optimization criterion, the institute can achieve its current enrollment numbers while realizing significant savings in its financial aid budget. The prediction and optimization software is currently being incorporated into the institute's online admissions processing system.

@INPROCEEDINGS{levan2007, author={L. V. Thanh and P. Haddawy}, booktitle={2007 37th Annual Frontiers In Education Conference - Global Engineering: Knowledge Without Borders, Opportunities Without Passports}, title={Deriving financial aid optimization models from admissions data}, year={2007}, pages={F2A-7-F2A-12}, doi={10.1109/FIE.2007.4417959}, ISSN={0190-5848}, month={Oct}}

Software

Ranked tiling.

Ranked tiling discovers a set of ranked tiles, which are data regions having high ranks, in rank matrices. There are two models for ranked tiling. The first model uses Constraint Programming (CP) and the model developed later uses Semiring Rank Matrix Factorisation. They are both implemented in Scala using OscaR library. The CP implementation uses OscaR CP solver and the other implementation has Gurobi as the back-end solver.

Coming soon

Sparse RMF.

Sparse RMF finds a set of partial rankings repeatedly occurring in a rank matrix. Such sets of partial rankings can be used to succinctly summarise the given rank matrix and show the main categories of the rankings. Sparse RMF is implemented in Scala. It uses OscaR library to write the optimisation model for Sparse RMF, which are Integer Linear Programs, and uses Gurobi as the back-end solver.

Sourcecode

SRF - Subtyping through Ranked Factors.

SRF integrates Boolean mutation data, numeric gene expression data (microarray or RNASeq), and prior knowledge encoded in biological networks to discover cancer subtypes and subtypes features. SRF is implemented in Scala using OscaR library and Gurobi solver.

Github

Contact

MAGNET group
INRIA in Lille - France

thanh.le-van@inria.fr

thanhlv@gmail.com

+33769094168

Profile

Research expertises and interests

News

Recent Works

Rank Matrix Factorisation (RMF) for pattern set mining in rank data

Sparse RMF

Sparse RMF experiments with the European Song Festival Contests (2010-2013)

Sparse RMF experiments with the European Song Festival Contests (2010-2013)

Ranked tiling

Ranked tiling with heterogeneous TCGA breast cancer

SRF - Subtyping through Ranked Factors

Ranked tiling with heterogeneous TCGA breast cancer

Publications

1. Semiring rank matrix factorisation

2. Simultaneous discovery of cancer subtypes and subtype features by molecular data integration

3. Rank matrix factorisation

4. Ranked tiling

5. Mining local staircase patterns from noisy data

6. Deriving financial aid optimization models from admissions data

Software

Ranked tiling.

Sparse RMF.

SRF - Subtyping through Ranked Factors.

Contact