Semantic Analysis of Donald Trump’s Posts
Executive summary
This analysis examines Donald Trump’s social media posts on Twitter (X) and Truth Social between 2009 and 2026, with the aim of identifying temporal, textual, and thematic patterns in his digital discourse. Processing more than 90,000 posts, it reveals an evolution in communication characterized by spikes in activity during electoral periods and a marked preference for nationalist and confrontational narratives.
The data cleaning and preprocessing phase retained 92.6% of the texts as valid inputs for analysis. Temporal results show a significant intensification of activity in 2020, while textual analysis indicates predominantly concise posts, with recurrent use of vocabulary centered on gratitude and patriotism. Semantic modeling identifies 146 main topics, dominated by political polls and endorsement messages. The positive-sentiment share is 47.6%, suggesting a rhetorical strategy oriented toward optimism in the most recent period, between 2024 and 2026.
About the dataset
The analysis is conducted in Python, using specialized libraries for temporal manipulation and text processing. A dataset of 90,554 posts is imported, with a minimal incidence of missing values in the main text (below 0.01%) and a high rate of missing values in secondary fields such as hashtags (91.5%). This pattern suggests that Trump’s discourse prioritizes plain-text messages over the use of tags or mentions.
After removing 8,423 duplicate posts, a consolidated corpus of 82,131 unique entries spanning 6,092 days is obtained. This reflects a sustained communication strategy over time, with seemingly deliberate repetitions to reinforce key messages. An average activity of about 12.5 posts per day is identified, a figure that matches the 76,047 posts retained after preprocessing divided by 6,092 days. Peaks are associated with reactive responses to current events, evidencing the adaptability of Trump’s communication style to news and electoral cycles.
Overall, the high post-cleaning data retention rate of 92.6% establishes a robust analytical base, enabling the inference of authentic patterns in the evolution of his political narrative and interaction levels over time.
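The counts quoted in this section can be checked with simple arithmetic. The sketch below uses only figures stated in the report (the 76,047 valid posts come from the Summary of Results); the variable names are illustrative and not part of the original pipeline.

```python
# Figures quoted in the report; variable names are illustrative only.
imported_posts = 90_554
duplicates_removed = 8_423
valid_after_preprocessing = 76_047  # reported in the Summary of Results

# Unique entries left after deduplication.
unique_posts = imported_posts - duplicates_removed

# Share of unique posts that survive cleaning and preprocessing.
retention_rate = valid_after_preprocessing / unique_posts

print(unique_posts)
print(f"{retention_rate:.1%}")
```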
Exploratory Data Analysis
Temporal analysis reveals sustained growth from 2017, reaching a maximum in 2020 with more than 10,000 posts. This peak coincides with periods of high political and social tension, such as the impeachment process, the electoral cycle, and the COVID-19 pandemic, indicating a reactive use of social media as a mechanism to control and contest the public narrative during crises.
On a monthly level, October concentrates more than 8,000 posts, suggesting strategic alignment with year-end periods and moments of high informational and electoral intensity. In the weekly distribution, each weekday from Monday through Friday exceeds 10,000 posts, reflecting a strategy aimed at influencing the media agenda during the workweek. Activity is concentrated in daytime hours, with a peak close to 5,000 posts around 3:00 p.m., indicating optimization of reach during working hours and maximization of engagement both nationally and internationally.
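The monthly, weekday, and hourly distributions described above can be obtained by counting posts along each calendar dimension. The following is a minimal sketch using only the standard library; the function name and the toy timestamps are assumptions, and the hourly profile depends on the time zone in which timestamps are expressed, which the report does not state.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def activity_profile(timestamps):
    """Count posts per calendar month, weekday, and hour of day."""
    return {
        "month": Counter(ts.month for ts in timestamps),
        "weekday": Counter(WEEKDAYS[ts.weekday()] for ts in timestamps),
        "hour": Counter(ts.hour for ts in timestamps),
    }

# Toy timestamps for illustration; not taken from the dataset.
sample_timestamps = [
    datetime(2020, 10, 5, 15, 2),   # a Monday
    datetime(2020, 10, 6, 15, 40),  # a Tuesday
    datetime(2020, 11, 7, 9, 15),   # a Saturday
]
profile = activity_profile(sample_timestamps)
```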

The daily series exhibits high volatility and pronounced high-frequency noise. The daily standard deviation (approximately 14.3) exceeds the daily mean (around 12.5), so the variance is many times larger than the mean, a clear indication of overdispersion relative to a Poisson count model. When aggregating the data to weekly frequency, the series smooths considerably; however, the standard deviation remains high, around 72.7, reflecting the presence of large aggregated shocks. After applying a three-month moving average, long-range cycles and structural level changes become more apparent.
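Overdispersion is usually judged by the variance-to-mean ratio, which is close to 1 for Poisson-like counts. The sketch below computes that ratio and a trailing moving average; the function names and the toy series are assumptions, while the mean and standard deviation are the figures quoted above.

```python
import statistics

def dispersion_index(counts):
    """Variance-to-mean ratio; values well above 1 indicate overdispersion."""
    return statistics.pvariance(counts) / statistics.mean(counts)

def moving_average(values, window):
    """Trailing moving average; the first window-1 positions are skipped."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]

# Ratio implied by the reported daily mean (12.5) and standard deviation (14.3).
reported_ratio = 14.3 ** 2 / 12.5

# Toy daily counts for illustration; not taken from the dataset.
daily_counts = [0, 3, 40, 5, 1, 22, 8]
toy_ratio = dispersion_index(daily_counts)
smoothed = moving_average(daily_counts, window=3)
```

With the reported figures the ratio is roughly 16, far above the value of 1 expected under a Poisson model.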

The figure classifies the corpus into five categories according to post length. The modal category is “Medium” (21–50 words) with 28,021 posts, followed by “Short” (11–20 words) with 26,233 and “Very short” (up to 10 words) with 15,067 posts. Proportionally, more than half of the corpus consists of texts of 20 words or fewer.
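The length categories can be expressed as a simple binning rule on word counts. In the sketch below, the first three bins follow the boundaries given in the text; the labels and the cut-off for the two remaining bins are not specified in the report and are assumptions. The share of posts of 20 words or fewer is computed against both the unique and the valid post counts, since the report does not say which denominator the figure uses.

```python
def length_category(text):
    """Assign a post to a length bin by whitespace-delimited word count."""
    n_words = len(text.split())
    if n_words <= 10:
        return "Very short"
    if n_words <= 20:
        return "Short"
    if n_words <= 50:
        return "Medium"
    # Assumption: the report does not name or delimit the last two bins.
    if n_words <= 100:
        return "Long"
    return "Very long"

# "Short" plus "Very short" counts quoted in the text.
short_or_less = 26_233 + 15_067
share_of_unique = short_or_less / 82_131  # denominator: unique posts
share_of_valid = short_or_less / 76_047   # denominator: valid posts
```

Under either denominator the share is above one half, in line with the statement above.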

Length distributions, both in characters and words, show marked positive skewness. The median word count is 20 and the mean is 25–26; in characters, the median is 136 and the mean 169, confirming right skewness toward longer values. The high standard deviation indicates considerable dispersion. Boxplot analysis reveals numerous outliers, with texts reaching nearly 3,000 characters, responsible for the long tail of the distribution.
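A mean above the median, together with upper-tail outliers, is the usual signature of right skew. The following sketch illustrates both checks on toy data, using Pearson's second skewness coefficient and the 1.5 × IQR boxplot rule; these specific measures are assumptions, since the report does not state which ones were used.

```python
import statistics

def skew_summary(lengths):
    """Return mean, median, and Pearson's second skewness coefficient."""
    mean = statistics.mean(lengths)
    median = statistics.median(lengths)
    sd = statistics.pstdev(lengths)
    skew = 3 * (mean - median) / sd if sd else 0.0
    return mean, median, skew

def iqr_outliers(lengths):
    """Values outside the 1.5 * IQR whiskers of a standard boxplot."""
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    spread = 1.5 * (q3 - q1)
    return [x for x in lengths if x < q1 - spread or x > q3 + spread]

# Toy character lengths for illustration; not taken from the dataset.
toy_lengths = [5, 8, 10, 12, 15, 18, 20, 25, 30, 300]
toy_mean, toy_median, toy_skew = skew_summary(toy_lengths)
toy_outliers = iqr_outliers(toy_lengths)
```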

The word cloud reflects a typical Zipfian distribution of natural language, where a few terms account for much of the frequency while a long tail contains many infrequent terms. The prominence of words such as “people,” “america,” “thank,” “biden,” and “country” highlights semantic cores linked to national identity, direct audience appeal, and electoral context. From a statistical perspective, the high concentration at the head of the distribution indicates that even after removing stopwords, there remain “political function words” that act as thematic markers and contribute to the cohesion of certain clusters.
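The concentration at the head of the distribution can be quantified as the share of all token occurrences covered by the most frequent terms. A minimal sketch follows; the tokenizer, the tiny stopword list, and the toy posts are assumptions made for illustration.

```python
import re
from collections import Counter

# A deliberately small stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is", "in", "for", "our", "you"}

def term_frequencies(posts):
    """Count lowercase word tokens across posts, skipping stopwords."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z']+", post.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts

def head_share(counts, k):
    """Fraction of all token occurrences covered by the k most frequent terms."""
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(k)) / total if total else 0.0

# Toy posts for illustration; not taken from the dataset.
toy_posts = ["Thank you America", "The people of America", "Thank you to the people"]
toy_counts = term_frequencies(toy_posts)
```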

Each subseries presents the annual relative frequency of terms, allowing comparison while controlling for the volume of posts per year. This highlights notable breaks or jumps, as in the case of “biden,” and shows upward trends for terms such as “people” and “america” in years of greater exposure. In terms of dispersion, many series exhibit increasing variance, where the relative importance of certain terms becomes more volatile during periods of high activity.
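One way to control for yearly volume is to divide the number of posts mentioning a term by the total number of posts in that year. The sketch below assumes this post-level normalization; the report does not specify the exact denominator, and the function name and toy data are illustrative.

```python
import re
from collections import defaultdict

def annual_relative_frequency(posts, term):
    """Share of each year's posts that mention `term` (case-insensitive).

    `posts` is an iterable of (year, text) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, text in posts:
        totals[year] += 1
        if term in set(re.findall(r"[a-z']+", text.lower())):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Toy posts for illustration; not taken from the dataset.
toy_posts_by_year = [
    (2019, "Great people"),
    (2019, "Thank you"),
    (2020, "Biden is wrong"),
    (2020, "America first"),
]
biden_share = annual_relative_frequency(toy_posts_by_year, "biden")
```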

Semantic Model
BERTopic was applied to the cleaned text after selecting the highest-quality observations and drawing a stratified sample by year. Embeddings were generated with a pretrained model, their dimensionality was reduced with UMAP, and clusters were formed with HDBSCAN; an n-gram vectorizer then supplied the topic representations. The resulting model contains 146 topics.
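The pipeline described above can be sketched as follows. This is an illustration under stated assumptions, not the configuration actually used: the equal per-year allocation, the embedding model name, and every hyperparameter value are assumptions, since the report does not give them. The topic-model builder needs the bertopic, umap-learn, hdbscan, sentence-transformers, and scikit-learn packages, which are imported only when it is called.

```python
import random
from collections import defaultdict

def stratified_sample_by_year(posts, per_year, seed=42):
    """Draw up to `per_year` posts from each year; `posts` holds (year, text) pairs."""
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for year, text in posts:
        by_year[year].append(text)
    sample = []
    for year in sorted(by_year):
        texts = by_year[year]
        k = min(per_year, len(texts))
        sample.extend((year, t) for t in rng.sample(texts, k))
    return sample

def build_topic_model():
    """Assemble a BERTopic pipeline; all settings below are assumed values."""
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),  # assumed model
        umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                        metric="cosine", random_state=42),
        hdbscan_model=HDBSCAN(min_cluster_size=15, metric="euclidean",
                              cluster_selection_method="eom",
                              prediction_data=True),
        vectorizer_model=CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    )

# Toy corpus for illustration; not taken from the dataset.
toy_corpus = [(2019, f"post {i}") for i in range(5)] + [(2020, "single post")]
toy_sample = stratified_sample_by_year(toy_corpus, per_year=2)
```

Given a sample, the model would be fitted with `topics, probs = build_topic_model().fit_transform([text for _, text in toy_sample])`, although a corpus this small would not yield meaningful topics.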
The outlier percentage, at 43.8%, is high but common in short texts and highly variable discourse. Topic analysis reveals a complex semantic structure combining elements of political communication, institutional confrontation, and self-promotion, providing a comprehensive view of the discursive strategies present in the corpus.
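BERTopic labels documents that HDBSCAN leaves unclustered with the topic id -1, so the outlier percentage is the share of that label among the assigned topics. A minimal sketch, with hypothetical function names and a toy assignment list:

```python
from collections import Counter

def outlier_share(topic_ids):
    """Fraction of documents labelled -1, the id BERTopic gives to outliers."""
    return topic_ids.count(-1) / len(topic_ids) if topic_ids else 0.0

def top_topics(topic_ids, k=10):
    """The k largest topics by document count, excluding the outlier label."""
    return Counter(t for t in topic_ids if t != -1).most_common(k)

# Toy topic assignments for illustration; not taken from the model.
toy_topic_ids = [-1, 0, 0, 1, -1, 0, 2, -1]
toy_outlier_share = outlier_share(toy_topic_ids)
toy_largest = top_topics(toy_topic_ids, k=1)
```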
The bar chart of the top ten topics confirms a long-tail distribution in cluster size, with the topic related to polls leading with approximately 994 documents in the sample. The pie chart highlights that the mass of “other topics” is substantial and that outliers represent a large fraction, which is typical when average text length is short and thematic variability is high. The topic-by-year heatmap shows temporal heterogeneity: some topics concentrate activity in specific periods, indicating political seasonality and shifts in thematic focus associated with changes in topic mixtures. Finally, the word cloud of the main topic validates the labels, showing coherent internal co-occurrences consistent with thematic interpretation.

The two-dimensional projection of the topic space reveals clusters and empty zones. Clusters indicate areas of high semantic density, while overall dispersion reflects the thematic diversity of the corpus. Bubble size is proportional to the number of documents, confirming that a few topics account for most of the volume. Separation between clusters suggests low average similarity between thematic families, consistent with the existence of multiple discursive axes such as electoral issues, judicial/media conflict, immigration, and personal branding.

The similarity matrix presents a dominant diagonal and off-diagonal blocks, indicating groups of topics more closely related to one another. Most intensities fall in medium-to-low ranges, with islands of high similarity that could justify mergers or meta-topics if taxonomy compaction were desired. The presence of very low or even negative similarities suggests the coexistence of orthogonal or opposing themes in the vocabulary.
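Assuming the matrix holds cosine similarities between topic vectors (the report does not name the metric), candidate mergers can be listed by thresholding the pairwise values. The sketch below uses toy two-dimensional vectors and a hypothetical threshold of 0.8; it also shows that cosine similarity can be negative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_candidates(topic_vectors, threshold=0.8):
    """Pairs of topic labels whose vectors are at least `threshold` similar."""
    labels = sorted(topic_vectors)
    return [(a, b)
            for i, a in enumerate(labels)
            for b in labels[i + 1:]
            if cosine_similarity(topic_vectors[a], topic_vectors[b]) >= threshold]

# Toy topic vectors for illustration; real embeddings have many more dimensions.
toy_vectors = {
    "polls": [1.0, 0.1],
    "endorsements": [0.9, 0.2],
    "golf": [-1.0, 0.5],
}
toy_pairs = merge_candidates(toy_vectors)
```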

Topic time series display abrupt spikes interspersed with periods of relative inactivity. The endorsements topic reaches the highest peak, with more than 100 posts in late 2025, reflecting a concentrated and likely short-lived discursive phase. In terms of variance, topics differ markedly in temporal volatility: some show low frequency but high-intensity spikes, while others maintain a more constant baseline activity. This heterogeneity indicates that the weekly proportion of posts by topic is a strong descriptor of system state in temporal analysis models and captures differentiated dynamics across topics.
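The weekly proportion of posts by topic can be derived by grouping posts by ISO week and normalizing the counts within each week. The following is a minimal sketch; the function name, the use of ISO weeks, and the toy records are assumptions.

```python
from collections import Counter, defaultdict
from datetime import date

def weekly_topic_shares(records):
    """Map (ISO year, ISO week) to {topic: share of that week's posts}.

    `records` is an iterable of (date, topic) pairs.
    """
    weekly = defaultdict(Counter)
    for day, topic in records:
        iso = day.isocalendar()
        weekly[(iso[0], iso[1])][topic] += 1
    return {
        week: {topic: n / sum(counts.values()) for topic, n in counts.items()}
        for week, counts in weekly.items()
    }

# Toy records for illustration; not taken from the dataset.
toy_records = [
    (date(2025, 12, 1), "endorsements"),
    (date(2025, 12, 2), "endorsements"),
    (date(2025, 12, 3), "polls"),
    (date(2025, 12, 8), "polls"),
]
toy_shares = weekly_topic_shares(toy_records)
```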

The weekly sentiment series shows a positive mean of approximately 0.20, with confidence intervals rarely crossing zero, indicating a systematically favorable tone during the analyzed period. The series is heteroskedastic: uncertainty increases in weeks with smaller sample sizes, as evidenced by wider confidence bands.
The weekly volume chart confirms overdispersion, with weeks of highly disparate counts. The scatter plot relating volume and sentiment reveals a low correlation (~0.12), suggesting little linear association between posting intensity and emotional tone. The sentiment score histogram shows apparent bimodality, with a prominent mode at high values, consistent with the positive bias of the discourse and/or of the classifier in social media domains.
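A correlation of about 0.12 means that a linear fit on volume would account for only around 1.4% of the variance in sentiment. The sketch below shows the Pearson coefficient and that squared value; the coefficient type is an assumption, since the report does not name it.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0

# Share of variance explained by a linear fit at the reported correlation.
shared_variance = 0.12 ** 2
```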

Summary of Results
The analyzed corpus spans from May 4, 2009, to January 8, 2026, totaling 82,131 posts, of which 76,047 are considered valid after preprocessing. This process entailed an average reduction of 19.1% in text length, eliminating noise without substantive loss of informational content. The series exhibits markedly non-stationary behavior, with volume peaks concentrated in 2015–2016 and especially in 2020, as well as sustained reactivation from 2024 onward.
The resulting thematic architecture identifies 146 topics organized into five high-traction axes: polls and electoral momentum; institutional and media conflict, with references to the FBI and the Mueller report; endorsements and coalition-building, including references to the Second Amendment; immigration and the border wall; and personal brand content, such as hotels and golf. Distance and similarity metrics confirm the existence of clearly differentiated semantic families alongside internally cohesive blocks. Monthly evolution shows clear transitions in discursive focus, such as the endorsements cycle in late 2025, while poll-related content maintains intermittent presence over time.
Conclusions
The analysis demonstrates that a massive and heterogeneous historical body of posts can be transformed into actionable communication intelligence through a coherent analytical architecture. The weekly combination of topic proportions, sentiment, and volume enables segmentation of discourse into operational regimes, early detection of narrative shifts, and contextualization of activity spikes. At the same time, evidence of a moderately positive recent tone decoupled from volume guards against decisions based on a false equivalence between posting quantity and reputational improvement, and points toward separate metrics for intensity and narrative content.
The practical impact is immediate: the resulting indicators support the construction of a dashboard with early warnings for thematic changes, facilitate prioritization of qualitative analysis in critical weeks, and enable evaluation of the effects of messages or milestones. The strength of the approach lies in its replicability, as the same methodological framework can be applied to other actors or periods to establish baselines and comparative analyses.
Sources:
- X: https://x.com/
- Truth Social: https://truthsocial.com/





