Leveraging Large Language Models for Loneliness Detection and Analysis

Michelle Kim

Abstract

This research investigates the application of Large Language Models (LLMs) to measuring and analyzing loneliness in caregiver and non-caregiver populations, with the goal of building diverse social media datasets for studying loneliness across the two populations and better understanding their experiences of it. First, we apply GPT-4o, GPT-5-nano, and GPT-5 to evaluate and detect high-quality Reddit posts from 15 subreddits. We then introduce an expert-developed framework to measure loneliness and an expert-informed typology of causes of loneliness to identify and categorize causes across populations. The resulting data processing pipeline, validated against human annotation, judges a given post’s relevance, measures the author’s loneliness, extracts and categorizes the cause of the author’s loneliness, and extracts demographic information. We find that LLMs can measure loneliness via a psychologically grounded framework, achieving average accuracies of 76.09% and 79.78% in the caregiver and non-caregiver populations, respectively. We also find that LLMs can effectively apply the cause categorization framework to high-quality Reddit posts, achieving micro-F1 scores of 0.825 and 0.8 in the caregiver and non-caregiver populations, respectively. The distribution of cause categories differs strongly between the two populations, suggesting that our dataset and framework capture differences between them. In particular, the perceived causes of caregivers’ loneliness predominantly originate from their role as caregivers, demonstrating that the loneliness experiences of the two populations are distinct. By applying these validated frameworks, we created a dataset of high-quality posts for both populations.
Through demographic data extraction, we find that Reddit data is viable for building a diverse dataset across six demographic categories in the caregiver population. This work contributes to understanding caregiver and non-caregiver loneliness by establishing an LLM-based data processing pipeline for sourcing high-quality, diverse social media data and by demonstrating that LLMs can be applied to analyze differences in loneliness between the two populations.

Term
Fall 2025
Date
October 24, 2025
Time
3:00 - 4:00 PM
Location
White Hall 100