Department of Computer Science Faculty Scholarship and Creative Works

XML Clustering By Principal Component Analysis

Jianghui Liu, New Jersey Institute of Technology
Jason T.L. Wang, New Jersey Institute of Technology
Wynne Hsu, National University of Singapore
Katherine Herbert, Montclair State University

Document Type

Conference Proceeding

Publication Date

12-1-2004

Abstract

XML is increasingly important in data exchange and information management. Considerable effort has been spent on developing efficient techniques for storing, querying, indexing and accessing XML documents. In this paper we propose a new approach to clustering XML data. In contrast to previous work, which focused on documents defined by different DTDs, the proposed method works for documents with the same DTD. Our approach is to extract features from documents, modeled as ordered labeled trees, and to transform the documents into vectors in a high-dimensional Euclidean space based on the occurrences of the features in the documents. We then reduce the dimensionality of the vectors by principal component analysis (PCA) and cluster the vectors in the reduced-dimensional space. PCA enables one to identify vectors with co-occurring features, thereby enhancing the accuracy of the clustering. Experimental results based on documents obtained from Wisconsin's XML data bank show the effectiveness and good performance of the proposed techniques.
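The pipeline outlined in the abstract (features from trees, occurrence vectors, PCA, clustering in the reduced space) can be illustrated with a minimal sketch. It is not the authors' implementation. The following choices are assumptions made only for illustration: parent/child tag pairs stand in for the paper's tree features, only the first principal component is kept (found by power iteration), and a two-cluster one-dimensional k-means stands in for the clustering step. The sample documents and all function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def edge_features(xml_text):
    """Count parent/child tag pairs in one document (an assumed feature type)."""
    counts = Counter()
    for parent in ET.fromstring(xml_text).iter():
        for child in parent:
            counts[(parent.tag, child.tag)] += 1
    return counts


def to_vectors(docs):
    """Map each document to an occurrence-count vector over all seen features."""
    counted = [edge_features(d) for d in docs]
    vocab = sorted(set().union(*counted))
    return [[float(c[f]) for f in vocab] for c in counted], vocab


def top_component(vectors, iters=100):
    """Centre the data and estimate the first principal direction by power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    w = [float(j + 1) for j in range(d)]  # arbitrary non-uniform starting vector
    for _ in range(iters):
        # Multiply w by the unnormalised covariance matrix: C w = sum_x (x . w) x
        scores = [sum(a * b for a, b in zip(x, w)) for x in centered]
        cw = [sum(s * x[j] for s, x in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in cw))
        if norm == 0.0:  # all documents identical: no direction to find
            break
        w = [c / norm for c in cw]
    return centered, w


def project(centered, w):
    """Coordinate of each centred vector along direction w."""
    return [sum(a * b for a, b in zip(x, w)) for x in centered]


def two_means_1d(values, iters=20):
    """Two-cluster k-means on scalars, seeded at the minimum and maximum."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        left = [v for v, lab in zip(values, labels) if lab == 0]
        right = [v for v, lab in zip(values, labels) if lab == 1]
        if left:
            lo = sum(left) / len(left)
        if right:
            hi = sum(right) / len(right)
    return labels


# Hypothetical documents sharing one structure but differing in element counts.
docs = [
    "<book><author/><author/><author/><chapter/></book>",
    "<book><author/><author/><chapter/></book>",
    "<book><author/><chapter/><chapter/><chapter/></book>",
    "<book><author/><chapter/><chapter/></book>",
]
vectors, vocab = to_vectors(docs)
centered, w = top_component(vectors)
labels = two_means_1d(project(centered, w))
print(vocab)
print(labels)
```

In this toy input the author-heavy and chapter-heavy documents end up in different clusters. Which cluster receives label 0 depends on the sign of the estimated component, so only the grouping is meaningful. A realistic setting would keep several components and use a general-purpose clustering routine.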

DOI

10.1109/ICTAI.2004.122

Montclair State University Digital Commons Citation

Liu, Jianghui; Wang, Jason T.L.; Hsu, Wynne; and Herbert, Katherine, "XML Clustering By Principal Component Analysis" (2004). Department of Computer Science Faculty Scholarship and Creative Works. 633.
https://digitalcommons.montclair.edu/compusci-facpubs/633

This document is currently not available here.
