Determining automatic number of classes in hierarchical clustering by approaching experts criteria through machine learning

Bibliographic details
Title: Determining automatic number of classes in hierarchical clustering by approaching experts criteria through machine learning
Authors: Suman, Shikha
Contributors: Gibert, Karina, Universitat Politècnica de Catalunya. Departament d'Estadística i Investigació Operativa
Source: UPCommons. Portal del coneixement obert de la UPC
Universitat Politècnica de Catalunya (UPC)
Publication info: Universitat Politècnica de Catalunya, 2021.
Publication year: 2021
Subject terms: Cluster analysis, Calinski-Harabasz index, Dendrogram, cluster-validity-index, Informàtica::Intel·ligència artificial [Àrees temàtiques de la UPC], Hierarchical clustering, Anàlisi de conglomerats
Description: Clustering is one of the most popular artificial intelligence techniques; it aims at identifying groups of similar objects or patterns in the data. While multiple clustering techniques are available in the literature, hierarchical clustering remains one of the most powerful and preferred choices for unveiling the internal structure of the data in the form of a tree. Hierarchical clustering produces a dendrogram as its main output, which shows the inner similarity structure of the dataset. Deciding the correct number of clusters emerging from the dendrogram is, however, still an open problem, and it is left to human expertise in assessing the dendrogram. It is often impractical to assume that a human expert is available with sufficient domain knowledge or technical skill to determine the right number of clusters. This dependency on a human expert also limits the automatic use of hierarchical clustering in real-time scenarios, where the objective is to capture the true nature of the data, something other clustering schemes often fail to do. Human judgment of a dendrogram is inherently variable and may differ from situation to situation; in general, however, expert assessment introduces considerations that go beyond the strict evaluation of a utility function and that are worth capturing. Hence, programming a method that deduces the number of clusters from a dendrogram as experts do is difficult. This research investigates how a human expert decides the best cut of the tree and determines the right number of clusters in a hierarchical clustering setting, and proposes a new criterion that captures the hidden human criteria in this task.
The research draws a hundred samples from real-time industrial data and relies on human experts to determine the number of clusters in each, generating a ground truth for the thesis experiments. Throughout the research, the Calinski-Harabasz index is used as the baseline cluster validity index (CVI), being the most suitable metric when hierarchical clustering is used with Ward's linkage criterion and Euclidean distance. Five new criteria are investigated in the thesis and evaluated on the testing dataset. The proposed criterion, based on the heights of the dendrogram nodes, shows an excellent match with the number of clusters chosen by human experts. The proposed cluster validity index not only outperforms other CVIs in the literature at determining the number of clusters, but also reduces the computational cost by using intrinsic information from the dendrogram itself instead of repeatedly evaluating a cluster validity index such as Calinski-Harabasz. The proposed method also fits well into the wider research of the frame project by making hierarchical clustering suitable even for large datasets.
File description: application/pdf
Language: English
Access URL: https://explore.openaire.eu/search/publication?articleId=dedup_wf_001::07c9d5dc2a567164e6d0cf08f7881390
http://hdl.handle.net/2117/347991
Rights: OPEN
Accession number: edsair.dedup.wf.001..07c9d5dc2a567164e6d0cf08f7881390
Database: OpenAIRE
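
Illustrative sketch: the baseline named in the description (Ward linkage with Euclidean distance, scored by the Calinski-Harabasz index at each candidate cut) can be outlined in Python as below. This is a minimal sketch assuming NumPy and SciPy are available. The function names and the synthetic data are hypothetical and do not come from the thesis. The height-gap rule is a generic heuristic, included only to show how the number of clusters might be read from the dendrogram's merge heights without re-scoring every cut. It is not the criterion proposed in the thesis, which this record does not specify.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def calinski_harabasz(X, labels):
    """Between-cluster over within-cluster dispersion, each divided by its degrees of freedom."""
    n = X.shape[0]
    ids = np.unique(labels)
    k = len(ids)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in ids:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))


def best_k_by_ch(X, Z, k_max=10):
    """Baseline: cut the tree at k = 2..k_max and keep the k with the highest index."""
    scores = {}
    for k in range(2, k_max + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(np.unique(labels)) == k:  # skip cuts that tied heights make impossible
            scores[k] = calinski_harabasz(X, labels)
    return max(scores, key=scores.get), scores


def best_k_by_height_gap(Z, k_max=10):
    """Generic heuristic (an assumption, not the thesis criterion): cut in the largest height gap."""
    heights = Z[:, 2]                 # merge heights, non-decreasing for Ward linkage
    n = len(heights) + 1              # number of observations
    gaps = np.diff(heights)           # gaps[i] lies between merge i and merge i + 1
    ks = n - 1 - np.arange(len(gaps)) # clusters left when cutting inside gaps[i]
    mask = (ks >= 2) & (ks <= k_max)
    return int(ks[mask][np.argmax(gaps[mask])])


# Synthetic demo data (hypothetical): three well-separated 2-D groups of 30 points each.
rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])

Z = linkage(X, method="ward")         # Ward linkage on Euclidean distances
k_ch, ch_scores = best_k_by_ch(X, Z)
k_gap = best_k_by_height_gap(Z)
print("Calinski-Harabasz choice:", k_ch)
print("Height-gap choice:", k_gap)
```

The baseline relabels and re-scores the data once per candidate cut, whereas the height-gap rule reads only the third column of the linkage matrix. That difference is the kind of saving the description attributes to using intrinsic dendrogram information.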