Unlocking Social Media Insights: Hierarchical Clustering for Image Classification
By Yasmin M. Mohialden, Salah T. Allawi, and Nadia M. Hussien
Department of Computer Science, College of Science, Mustansiriyah University, Baghdad, Iraq
Every second, thousands of images are shared across social
media platforms. From selfies and celebrity photos to memes and infographics,
this digital flood represents a powerful lens into human behavior and culture.
Yet, the very scale and diversity of these images make analysis nearly
impossible without intelligent computational tools.
Our recent research, published in the Journal of Prospective Researches (Vol. 42, No. 4, 2024), introduces a Secure Hierarchical Agglomerative Clustering (HAC) approach that classifies and organizes social media images. By combining textual and visual features, and leveraging advanced techniques such as TF-IDF vectorization, Principal Component Analysis (PCA), and t-distributed Stochastic Neighbor Embedding (t-SNE), we created a model that reveals hidden structures within large image datasets.
Why Image Clustering Matters
Most social media analysis has focused on text—hashtags,
comments, or captions—yet visual content dominates online sharing.
Images communicate emotion, identity, and culture in ways that text alone
cannot.
Clustering allows us to group similar images automatically,
which supports:
- Trend detection – spotting what is popular and emerging,
- Community analysis – identifying groups and influencers,
- Content personalization – enhancing recommendations,
- Cultural insights – understanding visual narratives across regions.
Our Methodology
We employed Hierarchical Agglomerative Clustering
because it offers a layered, tree-like view of data relationships. Unlike
K-Means, which produces flat groups, HAC uncovers multi-level patterns.
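As a minimal sketch of that difference, a single Ward merge tree can be cut at more than one level, and the finer clusters nest inside the coarser ones. The use of SciPy and the toy 2-D points (standing in for real image features) are assumptions made for illustration, not the setup used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy 2-D points standing in for image feature vectors (hypothetical data):
# four tight blobs, arranged as two well-separated pairs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [0.0, 5.0], [20.0, 0.0], [20.0, 5.0]])
points = np.vstack([c + 0.1 * rng.standard_normal((5, 2)) for c in centers])

# Build the full merge tree once with Ward's linkage.
tree = linkage(points, method="ward")

# Cut the same tree at two levels of granularity.
coarse = fcluster(tree, t=2, criterion="maxclust")
fine = fcluster(tree, t=4, criterion="maxclust")

# Check that every fine cluster lies inside exactly one coarse cluster.
nested = all(len(set(coarse[fine == k])) == 1 for k in np.unique(fine))
print(len(set(coarse)), len(set(fine)), nested)
```

K-Means would need a separate run for each number of clusters, with no guarantee that the resulting groups nest.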
The approach follows six structured steps (see Table 1):
| Step | Description |
| --- | --- |
| Image Preprocessing | Images from the input folder path are preprocessed to guarantee uniform size and format: each image is resized to 100×100 pixels and its pixel values are normalized to the range 0–1. |
| Textual Feature Extraction | TF-IDF vectorization extracts textual information from the image filenames. This captures each image's distinguishing textual features, so that clustering can take text into account. |
| Hierarchical Agglomerative Clustering | Hierarchical agglomerative clustering is applied to the TF-IDF vectors from the previous step. It repeatedly merges the two most similar clusters, using a linkage criterion such as Ward's, until the target number of clusters is reached. Because the merges form a hierarchy, clusters can be identified at several levels of granularity. |
| Dimensionality Reduction for Visualization | PCA and t-SNE are applied to the TF-IDF vectors to display the clustering results in two dimensions. Both map the high-dimensional TF-IDF space to a lower-dimensional space while retaining the data's structure, which makes the clusters easier to inspect. |
| Visualization and Interpretation | Scatter plots show the clustering results, with each image drawn as a point colored by its cluster assignment. This view makes cluster boundaries, separability, and the distribution of images easier to understand. The plots are also saved as image files for further research and interpretation. |
| Documentation and Analysis | Text files are saved with the cluster assignment of each image, the clustering parameters, and a description of the technique. This permits repeatability and comparison across datasets or experimental settings. |
Table 1 - Proposed Method for Clustering Social Media Images
- Preprocessing – resizing images to 100×100 pixels and normalizing pixel values.
- Feature Extraction – applying TF-IDF to filenames to capture textual meaning.
- Clustering – merging clusters using Ward’s linkage method.
- Dimensionality Reduction – PCA and t-SNE compress high-dimensional data into 2D space.
- Visualization – scatter plots showing cluster separability (Figures 1 & 2).
- Documentation – saving results, cluster assignments, and parameters for reproducibility.
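The feature extraction and clustering steps above can be sketched with scikit-learn. This is an illustrative sketch, not the paper's code: the filenames are hypothetical, the token pattern is an assumption about how filenames are split into words, and three clusters are chosen only to match the toy data.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical filenames; the real dataset's naming scheme is not shown here.
filenames = [
    "singer_concert_01.jpg", "singer_concert_02.jpg", "singer_stage_03.jpg",
    "actor_premiere_01.jpg", "actor_premiere_02.jpg", "actor_redcarpet_03.jpg",
    "cartoon_meme_01.png", "cartoon_meme_02.png", "cartoon_sticker_03.png",
]

# Assumed tokenization: runs of letters only, so "singer_concert_01.jpg"
# yields the tokens "singer", "concert", and "jpg".
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z]+")
tfidf = vectorizer.fit_transform(filenames)

# Ward linkage in scikit-learn works on dense arrays with Euclidean distances,
# so the sparse TF-IDF matrix is converted first.
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(tfidf.toarray())

for name, label in zip(filenames, labels):
    print(label, name)
```

On this toy input the filenames that share a leading word end up in the same cluster, which is the behavior the filename-based features are meant to capture.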
Figure 3 illustrates this workflow.
Figure 3 - The proposed method steps are shown in this block diagram
The Dataset
For experimentation, we used YasminNadiaArabcSocialMediaImages
from Kaggle—a dataset of Arabic social media celebrities. While
region-specific, it provides a rich test case for clustering visual and
textual signals.
The clustering produced five main groups (Clusters 0–4).
For example:
- Cluster 0 contained a wide range of similar images,
- Cluster 3 focused on specific stylistic features,
- Cluster 4 grouped images by distinct visual cues.
Table 7 of the paper outlines the assignments, while Figures
1 and 2 show PCA and t-SNE scatter plots, highlighting separations between
clusters.
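A minimal sketch of how such 2-D projections can be computed with scikit-learn. The random matrix is a stand-in for the dense TF-IDF matrix, and the perplexity value is an assumption chosen to suit the small sample size; neither comes from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for a dense TF-IDF matrix: 30 "images", 20 features,
# drawn around three loose group centers (hypothetical data).
rng = np.random.default_rng(0)
group_centers = rng.random((3, 20))
features = np.vstack(
    [c + 0.05 * rng.standard_normal((10, 20)) for c in group_centers]
)

# PCA: linear projection onto the two directions of largest variance.
pca_2d = PCA(n_components=2).fit_transform(features)

# t-SNE: nonlinear embedding that tries to preserve local neighborhoods.
# Perplexity must be smaller than the number of samples.
tsne_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(features)

print(pca_2d.shape, tsne_2d.shape)
```

Each row of `pca_2d` and `tsne_2d` is the 2-D position of one image, ready to be drawn as a scatter plot colored by cluster label.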
Results & Insights
The findings confirm that combining textual features
(filenames) with visual analysis improves clustering accuracy. Key
observations include:
- Effective grouping – cartoon and celebrity-style images clustered well.
- Complementary visualizations – PCA captured global patterns, while t-SNE revealed local details.
- Cluster analysis – using Table 2, we compared widths, heights, and pixel counts, confirming that results were not random.
- Outlier detection – unusual images were correctly flagged, enhancing robustness.
| Aspect | Description |
| --- | --- |
| Cluster Size and Image Properties | Check for image property patterns within each cluster. Compare widths, heights, or pixel counts of cluster images to determine if clustering is based on size or resolution. |
| Cluster Distribution | Analyze the distribution of image properties among clusters. Identify if certain groups contain mostly a particular image format or if certain clusters have more colourful images. |
| Using Extreme Values in Image Properties | Identify outliers in clustering results. Outliers are images with attributes differing significantly from others in their cluster. |
| Visual Inspection | Visually inspect images within each cluster to identify common visual characteristics. Verify the clustering algorithm's capability to group comparable images. |
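One way the "extreme values" check could be carried out is sketched below. The image sizes and labels are hypothetical, and the outlier rule (a modified z-score based on the median absolute deviation, with a cutoff of 3.5) is an assumption chosen for illustration; the source does not specify which rule was used.

```python
import numpy as np

# Hypothetical (width, height) pairs and cluster labels for nine images.
sizes = np.array([
    [640, 480], [650, 480], [630, 470], [645, 485], [4000, 3000],
    [200, 200], [210, 205], [190, 195], [205, 200],
])
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
pixels = sizes[:, 0] * sizes[:, 1]


def flag_outliers(values, labels, threshold=3.5):
    """Flag values far from their own cluster's median (modified z-score)."""
    flags = np.zeros(len(values), dtype=bool)
    for k in np.unique(labels):
        mask = labels == k
        group = values[mask].astype(float)
        median = np.median(group)
        mad = np.median(np.abs(group - median))  # median absolute deviation
        if mad == 0:
            continue  # all values (nearly) identical: nothing to flag
        flags[mask] = 0.6745 * np.abs(group - median) / mad > threshold
    return flags


flags = flag_outliers(pixels, labels)
for k in np.unique(labels):
    print(f"cluster {k}: median pixel count = {np.median(pixels[labels == k]):.0f}")
print("flagged image indices:", np.flatnonzero(flags).tolist())
```

A median-based rule is used here because a single very large image would inflate a mean and standard deviation enough to hide itself.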
Challenges and Future Work
While effective, our approach highlighted several
challenges:
- Evaluation Metrics – Future studies should integrate silhouette scores and the adjusted Rand index (ARI) as objective performance measures.
- Dataset Scope – Expansion beyond Arabic social media celebrities will test generalizability.
- Security Risks – HAC must address privacy concerns and vulnerability to adversarial attacks.
- Scalability – With O(n³) time complexity in its naive form, HAC is computationally intensive; solutions may involve parallelization or approximation.
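The two metrics named above are available in scikit-learn. The sketch below shows how they could be computed, assuming synthetic 2-D features and hypothetical ground-truth labels; it is not an evaluation of the paper's results.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in data: three well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centers])
true_labels = np.repeat([0, 1, 2], 20)  # hypothetical ground truth

predicted = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(features)

# Silhouette: internal measure in [-1, 1]; needs no ground truth.
sil = silhouette_score(features, predicted)

# ARI: agreement with ground-truth labels, corrected for chance;
# it ignores how the cluster ids happen to be numbered.
ari = adjusted_rand_score(true_labels, predicted)

print(f"silhouette = {sil:.3f}, ARI = {ari:.3f}")
```

The silhouette score can be computed on any clustering, while ARI requires labeled data, which a dataset such as the one used here would need to provide.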
Broader Impact
The potential applications extend well beyond celebrity
image clustering. Our framework could be used for:
- Sentiment analysis in memes and visual content,
- Cultural studies tracking trends across nations,
- Healthcare imaging for medical clustering tasks,
- Digital archiving in museums and libraries.
Future research will explore deep learning integration,
interactive visualization dashboards, and domain-specific extensions
such as character identification or stylistic classification.
As social media continues to grow as a mirror of society,
developing tools that not only classify images but also reveal hidden cultural
and behavioral narratives is both timely and necessary. Our study is a step
toward that future.