Unlocking Social Media Insights: Hierarchical Clustering for Image Classification

By Yasmin M. Mohialden, Salah T. Allawi, and Nadia M. Hussien
Department of Computer Science, College of Science, Mustansiriyah University, Baghdad, Iraq

Every second, thousands of images are shared across social media platforms. From selfies and celebrity photos to memes and infographics, this digital flood represents a powerful lens into human behavior and culture. Yet, the very scale and diversity of these images make analysis nearly impossible without intelligent computational tools.

AI visualization of social media images grouped into clusters using hierarchical clustering, PCA, and t-SNE.

Our recent research, published in the Journal of Prospective Researches (Vol. 42, No. 4, 2024), introduces a Secure Hierarchical Agglomerative Clustering (HAC) approach that classifies and organizes social media images. By combining textual and visual features, and leveraging advanced techniques such as TF-IDF vectorization, Principal Component Analysis (PCA), and t-distributed Stochastic Neighbor Embedding (t-SNE), we created a model that reveals hidden structures within large image datasets.

Why Image Clustering Matters

Most social media analysis has focused on text—hashtags, comments, or captions—yet visual content dominates online sharing. Images communicate emotion, identity, and culture in ways that text alone cannot.

Clustering allows us to group similar images automatically, which supports:

Trend detection – spotting what is popular and emerging,
Community analysis – identifying groups and influencers,
Content personalization – enhancing recommendations,
Cultural insights – understanding visual narratives across regions.

Our Methodology

We employed Hierarchical Agglomerative Clustering because it offers a layered, tree-like view of data relationships. Unlike K-Means, which produces flat groups, HAC uncovers multi-level patterns.
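To make the contrast with flat clustering concrete, here is a minimal sketch, assuming SciPy is available and using synthetic 2-D points rather than data from the study. It builds one Ward merge tree and then cuts it at two levels of granularity. The variable names and the toy data are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy 2-D points forming four small groups (illustrative, not from the study).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [0, 3], [10, 0], [10, 3]], dtype=float)
points = np.vstack([c + 0.2 * rng.standard_normal((5, 2)) for c in centers])

# Build the full merge tree once, using Ward's linkage.
tree = linkage(points, method="ward")

# Cut the same tree at two levels of granularity.
coarse = fcluster(tree, t=2, criterion="maxclust")
fine = fcluster(tree, t=4, criterion="maxclust")

# Report which coarse cluster each fine cluster belongs to.
for label in np.unique(fine):
    parents = np.unique(coarse[fine == label])
    print(f"fine cluster {label} -> coarse cluster(s) {parents.tolist()}")
```

Because both cuts come from the same tree, every fine cluster sits inside a single coarse cluster. A flat method such as K-Means would need to be rerun for each cluster count, with no guarantee that the results nest.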

The approach follows six structured steps (see Table 1):

Step	Description
Image Preprocessing	Images from the folder path are preprocessed to ensure uniform size and format: each image is resized to 100×100 pixels and its pixel values are normalized to the range 0–1.
Textual Feature Extraction	TF-IDF vectorization extracts textual information from the image filenames. This captures each image's distinguishing textual features so that the clustering can take text into account.
Hierarchical Agglomerative Clustering	Agglomerative clustering is applied to the TF-IDF vectors from the previous step. Using a linkage criterion such as Ward's, it repeatedly merges the two most similar clusters until the target number of clusters is reached. Because the merges form a hierarchy, clusters can be identified at several levels of granularity.
Dimensionality Reduction for Visualization	PCA and t-SNE are applied to the TF-IDF vectors to display the clustering results in two dimensions. Both map the high-dimensional TF-IDF space to a lower-dimensional one while retaining as much of the data's structure as possible, which makes the clusters easier to read.
Visualization and Interpretation	Scatter plots show the clustering results, with each image colored by its cluster assignment. This view makes cluster boundaries, separability, and the distribution of images easier to understand. The plots are also saved as image files for further study and interpretation.
Documentation and Analysis	Text files record the cluster assignment of each image, the clustering parameters, and a description of the technique. This supports reproducibility and comparison across datasets or experimental settings.

Table 1 - Proposed Method for Clustering Social Media Images

Preprocessing – resizing images to 100×100 pixels and normalizing pixel values.
Feature Extraction – applying TF-IDF to filenames to capture textual meaning.
Clustering – merging clusters using Ward’s linkage method.
Dimensionality Reduction – PCA and t-SNE compress high-dimensional data into 2D space.
Visualization – scatter plots showing cluster separability (Figures 1 & 2).
Documentation – saving results, cluster assignments, and parameters for reproducibility.
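The feature extraction and clustering steps above can be sketched in a few lines, assuming scikit-learn is available. This is an illustration rather than the code used in the study: the filenames are hypothetical, `filename_to_text` is a helper introduced here, and the choice of three clusters only suits the toy data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical filenames standing in for the dataset's image files.
filenames = [
    "singer_concert_stage_01.jpg",
    "singer_concert_stage_02.jpg",
    "actor_red_carpet_01.jpg",
    "actor_red_carpet_02.jpg",
    "cartoon_character_smile_01.png",
    "cartoon_character_smile_02.png",
]


def filename_to_text(name: str) -> str:
    """Drop the extension and split on separators so TF-IDF sees words."""
    stem = name.rsplit(".", 1)[0]
    return stem.replace("_", " ").replace("-", " ")


texts = [filename_to_text(name) for name in filenames]

# TF-IDF turns each filename into a weighted term vector.
# AgglomerativeClustering expects a dense array, hence .toarray().
tfidf = TfidfVectorizer().fit_transform(texts).toarray()

# Ward's linkage merges clusters until the target number remains.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(tfidf)

for name, label in zip(filenames, labels):
    print(label, name)
```

In this toy case, filenames that share most of their words end up in the same cluster. With real data, the number of clusters and the way filenames are tokenized would both need tuning.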

Figure 1 - Principal Component Analysis (PCA)

Figure 2 - t-distributed Stochastic Neighbor Embedding (t-SNE)

Figure 3 illustrates this workflow.

Figure 3 - Block diagram of the proposed method's steps
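As a rough illustration of the preprocessing block in this workflow, the sketch below resizes an image array to 100×100 pixels and scales its values to the range 0–1. It is not the authors' code. The function name `preprocess` and the synthetic input are assumptions, and nearest-neighbour sampling in NumPy stands in for whatever image-library resize the study used.

```python
import numpy as np


def preprocess(image: np.ndarray, size: int = 100) -> np.ndarray:
    """Resize an 8-bit image to size x size and scale values to [0, 1].

    Uses nearest-neighbour sampling; an image library would normally
    interpolate instead.
    """
    height, width = image.shape[:2]
    rows = np.arange(size) * height // size  # source row for each output row
    cols = np.arange(size) * width // size   # source column for each output column
    resized = image[rows][:, cols]
    return resized.astype(np.float32) / 255.0


# Synthetic 640x480 RGB image used as a stand-in for a real photo.
rng = np.random.default_rng(0)
fake_photo = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)

prepared = preprocess(fake_photo)
print(prepared.shape, prepared.dtype)
```

Bringing every image to the same shape and value range keeps later steps from being influenced by differences in resolution or bit depth.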

The Dataset

For experimentation, we used YasminNadiaArabcSocialMediaImages from Kaggle—a dataset of Arabic social media celebrities. While region-specific, it provides a rich test case for clustering visual and textual signals.

The clustering produced five main groups (Clusters 0–4). For example:

Cluster 0 contained a wide range of similar images,
Cluster 3 focused on specific stylistic features,
Cluster 4 grouped images by distinct visual cues.

Table 7 of the paper outlines the assignments, while Figures 1 and 2 show PCA and t-SNE scatter plots, highlighting separations between clusters.
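For readers who want to produce similar projections, here is a minimal sketch assuming scikit-learn. The feature matrix is synthetic and merely stands in for a dense TF-IDF matrix. The perplexity value and random seed are illustrative choices, not the settings used in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for a dense TF-IDF matrix:
# 30 samples, 20 features, drawn around three different offsets.
rng = np.random.default_rng(0)
features = np.vstack(
    [rng.normal(loc=offset, scale=0.3, size=(10, 20)) for offset in (0.0, 2.0, 4.0)]
)

# PCA: linear projection onto the two directions of largest variance.
pca_coords = PCA(n_components=2).fit_transform(features)

# t-SNE: non-linear embedding that tries to keep near neighbours close.
# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5, init="pca", random_state=0)
tsne_coords = tsne.fit_transform(features)

print(pca_coords.shape, tsne_coords.shape)
```

Either array can then be drawn as a scatter plot with one color per cluster label. Distances in a t-SNE plot are best read locally, since the method does not preserve global distances the way a linear projection does.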

Results & Insights

The findings suggest that combining textual features (filenames) with visual analysis improves clustering quality. Key observations include:

Effective grouping – cartoon and celebrity-style images clustered well.
Complementary visualizations – PCA captured global patterns, while t-SNE revealed local details.
Cluster analysis – using Table 2, we compared widths, heights, and pixel counts, confirming that results were not random.
Outlier detection – unusual images were correctly flagged, enhancing robustness.

Aspect	Description
Cluster Size and Image Properties	Check for patterns in image properties within each cluster. Compare the widths, heights, or pixel counts of the images in a cluster to determine whether the clustering is driven by size or resolution.
Cluster Distribution	Analyze how image properties are distributed among clusters. Identify whether certain clusters contain mostly one image format or more colorful images.
Using Extreme Values in Image Properties	Identify outliers in the clustering results, that is, images whose attributes differ significantly from the others in their cluster.
Visual Inspection	Visually inspect the images within each cluster to identify common visual characteristics. Verify that the clustering algorithm groups comparable images together.

Table 2 - An Analysis of Clustering Results and Image Properties
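One simple way to carry out the property checks in Table 2 is sketched below using only the Python standard library. The records, the function name `pixel_outliers`, and the rule "flag an image whose pixel count is more than four times, or less than a quarter of, its cluster's median" are all assumptions made for illustration. The source does not state which outlier rule was used.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (filename, cluster, width, height).
records = [
    ("a.jpg", 0, 640, 480),
    ("b.jpg", 0, 640, 480),
    ("c.jpg", 0, 800, 600),
    ("d.jpg", 0, 4000, 3000),  # far larger than the rest of cluster 0
    ("e.png", 1, 300, 300),
    ("f.png", 1, 320, 320),
    ("g.png", 1, 310, 310),
]


def pixel_outliers(rows, ratio=4.0):
    """Return (cluster, filename) pairs whose pixel count is far from the cluster median."""
    by_cluster = defaultdict(list)
    for name, cluster, width, height in rows:
        by_cluster[cluster].append((name, width * height))

    flagged = []
    for cluster, items in by_cluster.items():
        typical = median(count for _, count in items)
        for name, count in items:
            # Flag images much larger or much smaller than the cluster median.
            if count > ratio * typical or count * ratio < typical:
                flagged.append((cluster, name))
    return flagged


print(pixel_outliers(records))
```

A median-based rule is used here because clusters can be small, and a single very large image would distort a mean and standard deviation.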

Challenges and Future Work

While effective, our approach highlighted several challenges:

Evaluation Metrics – Future studies should integrate silhouette scores and adjusted Rand index (ARI) for objective performance measures.
Dataset Scope – Expansion beyond Arabic social media celebrities will test generalizability.
Security Risks – HAC must address privacy concerns and vulnerability to adversarial attacks.
Scalability – With O(n³) time complexity in its naive form, HAC is computationally intensive on large image collections; solutions may involve parallelization or approximation.
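The two evaluation metrics named above are available in scikit-learn. The following is a minimal sketch on synthetic data, assuming reference labels exist for the ARI. The `reference` array here is hypothetical, since the source does not describe ground-truth labels for the dataset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic features: three well-separated groups of 10 samples each.
rng = np.random.default_rng(0)
features = np.vstack(
    [rng.normal(loc=center, scale=0.3, size=(10, 5)) for center in (0.0, 3.0, 6.0)]
)
reference = np.repeat([0, 1, 2], 10)  # hypothetical ground-truth labels

predicted = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(features)

# Silhouette needs only the features and predicted labels (range -1 to 1).
sil = silhouette_score(features, predicted)

# ARI compares predicted labels with reference labels and ignores label naming.
ari = adjusted_rand_score(reference, predicted)

print(f"silhouette={sil:.3f}  ARI={ari:.3f}")
```

The silhouette score can be computed without any labels, which makes it usable on unlabeled social media images. The ARI requires a labeled subset to compare against.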

Broader Impact

The potential applications extend well beyond celebrity image clustering. Our framework could be used for:

Sentiment analysis in memes and visual content,
Cultural studies tracking trends across nations,
Healthcare imaging for medical clustering tasks,
Digital archiving in museums and libraries.

Future research will explore deep learning integration, interactive visualization dashboards, and domain-specific extensions such as character identification or stylistic classification.

This research demonstrates how hierarchical agglomerative clustering, supported by TF-IDF, PCA, and t-SNE, can unlock insights from the overwhelming visual content of social media. By uncovering meaningful groupings and trends, our work contributes to the growing field of AI-driven social media analytics.

As social media continues to grow as a mirror of society, developing tools that not only classify images but also reveal hidden cultural and behavioral narratives is both timely and necessary. Our study is a step toward that future.
