MU Digital Repository

Improving quality of data clusters for correlated feature subset selection using multiple linear regression

Ahamed, Shafeeq B M (2015) Improving quality of data clusters for correlated feature subset selection using multiple linear regression. In: International Conference on Recent Trends and Advancement in Information and Communication Engineering, March 27, 2015, Nehru College of Engineering, Coimbatore.

Abstract

Clustering is the task of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups. It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. The main requirement of a cluster analysis outcome is quality clusters. There is usually a trade-off between the speed of a clustering algorithm and the quality of the clusters it produces. A suitable clustering algorithm for an application must satisfy both the quality and the speed requirements. Often, the size of the data being clustered is an important factor in the running time of a clustering algorithm. In this paper, the cluster quality and the execution time are improved for a census data set. The GDP of each district is taken as the dependent variable, and the other related attributes are taken as independent variables. Multiple linear regression is used to identify the attributes most closely related to the dependent variable. Only the qualifying features are retained for clustering. The K-Means clustering algorithm is used to cluster the data. The quality of the clusters is checked for both raw and preprocessed data. The results show that cluster quality and execution time can be improved by removing less related features. The quality of the clusters is evaluated using the silhouette coefficient, a technique that provides a succinct graphical representation of how well each object lies within its cluster.
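
As a rough illustration of the evaluation measure named in the abstract, the sketch below computes the silhouette coefficient in plain Python. It follows the standard textbook definition: for each object, a is its mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b). This is not the authors' code. The Euclidean distance, the function names, and the four toy points are all assumptions made for the example, and the abstract's census data and K-Means step are not reproduced.

```python
from math import dist
from statistics import mean


def silhouette_values(points, labels):
    """Return the silhouette value of every point (Euclidean distance assumed)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a point alone in its cluster scores 0.
            values.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = mean(dist(points[i], points[j]) for j in own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points; closer to 1 means tighter clusters."""
    return mean(silhouette_values(points, labels))


# Hypothetical toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_labels = [0, 0, 1, 1]  # each pair forms its own cluster
bad_labels = [0, 1, 0, 1]   # each cluster mixes the two pairs

good_score = silhouette_coefficient(points, good_labels)
bad_score = silhouette_coefficient(points, bad_labels)
print(f"good grouping: {good_score:.3f}, bad grouping: {bad_score:.3f}")
```

Under this definition, a comparison like the one the abstract describes would compute the coefficient once for the clustering of the raw attributes and once for the clustering of the retained attributes. The higher value would indicate the better-separated clusters.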

Item Type: Conference or Workshop Item (Paper)
Uncontrolled Keywords: Clustering, Feature selection, multiple linear regression
Subjects: Engineering > MIT Manipal > Computer Science and Engineering
Depositing User: MIT Library
Date Deposited: 29 Jan 2016 14:22
Last Modified: 29 Jan 2016 14:22
URI: http://eprints.manipal.edu/id/eprint/145142
