This event has passed.
Thesis Defence: Similarity Maximization and Shrinkage Approach in Kernel Metric Learning for Clustering Mixed-Type Data
May 28, 2024 at 10:00 am - 2:00 pm
Jesse Ghashti, supervised by Dr. John Thompson, will defend their thesis titled “Similarity Maximization and Shrinkage Approach in Kernel Metric Learning for Clustering Mixed-Type Data” in partial fulfillment of the requirements for the degree of Master of Science in Mathematics.
An abstract for Jesse Ghashti’s thesis is included below.
Defences are open to all members of the campus community as well as the general public. Registration is not required for in-person defences.
ABSTRACT
This thesis introduces a new kernel-based shrinkage approach to distance metric learning for mixed-type datasets, which combine continuous, nominal, and ordinal variables. Mixed-type data is common across many research fields and is used extensively in machine learning tasks that require comparisons between data points, such as clustering and classification based on distance metrics. However, traditional methods for handling mixed-type data often rely on distance metrics that inadequately measure and weigh different variable types, and therefore fail to capture the complex nature of the data. This can reduce accuracy and precision in clustering and classification tasks and yield suboptimal outcomes.
To improve metric calculations, a distance metric learning approach is proposed that uses kernel functions as similarity measures, together with a new optimal bandwidth selection methodology called Maximum Similarity Cross-Validation (MSCV). We demonstrate that MSCV smooths out irrelevant variables and emphasizes variables relevant for calculating distance within mixed-type datasets. Additionally, the kernel distance is positioned as a shrinkage methodology that balances the similarity calculation between maximum and uniform similarity. Further, this approach makes no assumptions about the shape of the metric, mitigating user-specification bias.
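As a rough illustration of how a bandwidth can shrink a variable's contribution toward uniform similarity, the following Python sketch builds a product kernel over one continuous and one nominal variable. It is not the thesis's implementation: the choice of a Gaussian kernel and an Aitchison-Aitken kernel, the function names, and the kernel-induced distance are all assumptions made for illustration, and the bandwidths are set by hand rather than selected by MSCV.

```python
import numpy as np

def gaussian_kernel(x, y, h):
    """Continuous kernel (assumed), scaled so identical values give 1."""
    return np.exp(-0.5 * ((x - y) / h) ** 2)

def aitchison_aitken_kernel(x, y, lam, n_levels):
    """Nominal kernel (assumed): 1 - lam on a match, lam / (c - 1) otherwise.

    lam = 0 gives an exact-match indicator; lam = (c - 1) / c gives the
    same value 1 / c for every pair, i.e. uniform similarity.
    """
    return np.where(x == y, 1.0 - lam, lam / (n_levels - 1))

def mixed_similarity(a, b, h, lam, n_levels):
    """Product-kernel similarity for observations (continuous, nominal code)."""
    cont = gaussian_kernel(a[0], b[0], h)
    nom = aitchison_aitken_kernel(a[1], b[1], lam, n_levels)
    return float(cont * nom)

def kernel_distance(a, b, h, lam, n_levels):
    """Kernel-induced squared distance: k(a,a) + k(b,b) - 2 k(a,b)."""
    k = lambda u, v: mixed_similarity(u, v, h, lam, n_levels)
    return k(a, a) + k(b, b) - 2.0 * k(a, b)

# Hypothetical observations: (continuous value, nominal category code).
a = (0.0, 0)
b = (0.1, 0)   # same category as a
c = (0.1, 2)   # different category from a

# Small nominal bandwidth: the category strongly affects the distance.
d_ab_sharp = kernel_distance(a, b, h=1.0, lam=0.0, n_levels=3)
d_ac_sharp = kernel_distance(a, c, h=1.0, lam=0.0, n_levels=3)

# Maximal nominal bandwidth: the category is smoothed out entirely.
d_ab_flat = kernel_distance(a, b, h=1.0, lam=2 / 3, n_levels=3)
d_ac_flat = kernel_distance(a, c, h=1.0, lam=2 / 3, n_levels=3)
```

With the nominal bandwidth at its upper bound, the distances from `a` to `b` and from `a` to `c` coincide, so the nominal variable no longer influences the metric. A data-driven bandwidth selector would be expected to push irrelevant variables toward this end of the range.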
Analysis of simulated data demonstrates that the kernel metric adequately captures intricate and complex data relationships where other methodologies are limited, including scenarios where variables are irrelevant for distinguishing meaningful groupings within the data. It is shown that as the number of observations increases, the precision of the kernel metric stabilizes, as do the kernel bandwidths. Additionally, applications to real data confirm the metric’s advantage in improving accuracy across various data structures. The proposed metric consistently outperforms competing distance metrics in various clustering methodologies, showing that this flexible kernel metric provides a new and effective alternative for similarity-based learning algorithms.