Cluster Analysis
Author: student48488 • March 19, 2014 • Essay
When conducting cluster analysis on the numerical variables, the distance between two cases is measured using the Euclidean distance. Euclidean distance is highly sensitive to the scale of each variable: variables with larger scales (such as Market_Cap) have a stronger effect on the total distance. This issue is addressed by normalizing the continuous measurements before computing the Euclidean distance, which converts all the measurements to the same scale. Unequal weighting should be considered if the clusters are meant to depend more on certain measurements and less on others. Unequal weighting does not seem necessary for the Pharmaceutical Industry data because all of the numerical variables play a similar role, each measuring profitability, growth, or risk.
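The effect of scale on Euclidean distance can be illustrated with a minimal sketch. The three records below are made-up toy values, not the pharmaceutical data; the second column (called Beta here) is a hypothetical small-scale variable, and z-score normalization is assumed as the normalization step, since the essay does not say which one was used.

```python
import math
import statistics

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def zscore_columns(rows):
    """Normalize each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stdevs = [statistics.stdev(c) for c in cols]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Hypothetical toy records: (Market_Cap, Beta). Values are invented for illustration.
rows = [
    (100.0, 0.5),  # record 0
    (10.0, 0.6),   # record 1: differs from record 0 mainly in Market_Cap
    (90.0, 1.5),   # record 2: differs from record 0 mainly in Beta
]

# On the raw scale, the Market_Cap difference dominates the distance.
raw_d01 = euclidean(rows[0], rows[1])
raw_d02 = euclidean(rows[0], rows[2])

# After normalization, both variables contribute on a comparable scale.
norm = zscore_columns(rows)
norm_d01 = euclidean(norm[0], norm[1])
norm_d02 = euclidean(norm[0], norm[2])

print(f"raw:        d(0,1)={raw_d01:.2f}  d(0,2)={raw_d02:.2f}")
print(f"normalized: d(0,1)={norm_d01:.2f}  d(0,2)={norm_d02:.2f}")
```

On the raw values, record 1 looks far more distant from record 0 than record 2 does, purely because Market_Cap is measured on a larger scale. After normalization the two distances are of similar size.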
Both the hierarchical and non-hierarchical algorithms were used in this problem. The hierarchical method does not require one to specify the number of clusters, and it allows one to represent the clustering process through dendrograms; this makes it easier to see how the groups are clustered. However, the hierarchical method makes only one pass through the data, which means that records allocated incorrectly early on cannot be reallocated later. The method is also sensitive to the choice of distance metric: changing the metric could lead to completely different clusters. The hierarchical method is also sensitive to outliers. The non-hierarchical method is good if the goal is to form a pre-specified number of clusters, k. If the number of clusters is not pre-determined, a dendrogram from the hierarchical method can be used first to suggest a value of k, and the K-means method can then be used for further analysis. The K-means algorithm aims to minimize within-cluster dispersion, which can be checked through the average distance in cluster. Low values of the average distance in cluster indicate tight clusters; for normalized measurements, values near 1.0 mean the clustering is not effective.
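The K-means step and the average distance in cluster can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure or software used for the essay: the points are invented and treated as already normalized, k is assumed to have been chosen beforehand (for example from a dendrogram), and the first k points serve as initial centroids only to keep the example deterministic. The function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(points, k, max_iter=100):
    """Basic K-means; returns (labels, centroids)."""
    # Assumption: naive deterministic start from the first k points.
    centroids = list(points[:k])
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(mean_point(members) if members else centroids[j])
        if new_centroids == centroids:  # no change, so the algorithm has converged
            break
        centroids = new_centroids
    return labels, centroids

def avg_distance_in_cluster(points, labels, centroids):
    """Average distance from each cluster's members to its centroid."""
    result = []
    for j, c in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        result.append(
            sum(euclidean(p, c) for p in members) / len(members) if members else 0.0
        )
    return result

# Hypothetical toy points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

labels, centroids = kmeans(points, k=2)
dispersion = avg_distance_in_cluster(points, labels, centroids)

print("labels:", labels)
print("average distance in cluster:", [round(d, 3) for d in dispersion])
```

Unlike the single pass of the hierarchical method, the assignment step is repeated on every iteration, so a point placed in the wrong cluster early on can move to a better one later.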
Below is the dendrogram for the data in the pharmaceutical set. Ward’s method was used for this problem because it considers the loss
...