clav: R package and Shiny application for cluster analysis validation

R
Talk
I will be giving a talk at the Joint Statistical Meetings.
Author

Jason Bryer

Published

August 5, 2025

Cluster analysis is a statistical procedure for grouping observations using an observation-centered approach, in contrast to variable-centered approaches such as PCA or factor analysis. Whether used as a preprocessing step for predictive modeling or as the primary analysis, validation is critical for determining whether a cluster solution generalizes across datasets. Theodoridis and Koutroumbas (2008) identified three broad types of validation for cluster analysis: 1) internal cluster validation, 2) relative cluster validation, and 3) external cluster validation. Strategies for the first two are well established; however, because cluster analysis is typically an unsupervised learning method with no observed outcome, external validation is more difficult. Ullman et al. (2021) proposed an approach to validating a cluster solution by visually inspecting the cluster solutions across a training and a validation dataset. This talk introduces the clav R package, which implements and expands on this approach by generating multiple random samples (using either simple random splits or bootstrap samples). Visualizations of both the cluster profiles and the distributions of the cluster means are provided, along with a Shiny application to assist the researcher.
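
To make the idea concrete, here is a minimal base-R sketch of the resampling strategy the package automates: repeatedly split the data, cluster the training half, and examine how stable the cluster means are across samples. The dataset, the direct use of `kmeans()`, and the size-based label alignment are illustrative assumptions here, not the clav API.

```r
# A minimal illustration of the validation idea clav automates;
# this uses base R k-means directly and is NOT the clav API.
set.seed(2025)
X <- scale(iris[, 1:4])  # stand-in dataset, standardized

k <- 3
n_samples <- 100  # number of random splits (clav also supports bootstrap samples)

# For each random split, cluster the training half and record the cluster means
profiles <- lapply(seq_len(n_samples), function(i) {
  train <- sample(nrow(X), nrow(X) / 2)
  fit <- kmeans(X[train, ], centers = k, nstart = 25)
  # Crude heuristic: order clusters by size so labels are comparable across samples
  fit$centers[order(fit$size, decreasing = TRUE), ]
})

# Distribution of the largest cluster's mean on each variable across samples;
# tight distributions suggest the profile replicates across random samples
first_cluster <- t(sapply(profiles, function(p) p[1, ]))
summary(first_cluster)
```

The package wraps this kind of loop and produces the profile and distribution visualizations described above, so the researcher can judge replicability visually rather than from raw summaries.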

For more information about the project, visit: https://github.com/jbryer/clav

A student will also be presenting "AI-Generated Text Detection in the Context of Domain- and Prompt-Specific Essays":

The widespread adoption of Large Language Models has made distinguishing between human- and AI-generated essays more challenging. This study explores AI detection methods for domain- and prompt-specific essays within the Diagnostic Assessment and Achievement of College Skills (DAACS) framework, applying both random forest and fine-tuned ModernBERT classifiers. Our approach incorporates pre-ChatGPT essays, which are likely human-generated, alongside synthetic datasets of essays generated and modified by AI. The random forest classifier was trained with open-source embeddings such as MiniLM and RoBERTa, as well as a low-cost OpenAI model, using a one-versus-one strategy. The ModernBERT method employed a novel two-level fine-tuning strategy that incorporates essay-level and sentence-pair classifications, combining global text features with detailed sentence transitions through coherence scoring and style-consistency detection. Together, these methods effectively identify whether essays have been altered by AI. Our approach provides a cost-effective solution for specific domains and serves as a robust alternative to generic AI detection tools, all while enabling local execution on consumer-grade hardware.
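
As a rough illustration of the one-versus-one random forest component (assuming nothing about the study's actual code or data), the sketch below trains a binary random forest for each pair of classes on simulated embedding vectors and combines them by majority vote. All names, dimensions, and data here are stand-ins.

```r
# A hypothetical one-versus-one random forest sketch on simulated
# embeddings; NOT the study's code or data.
library(randomForest)

set.seed(42)
classes <- c("human", "ai_generated", "ai_modified")
n_per_class <- 100
embed_dim <- 50  # stand-in for an embedding dimension (e.g., MiniLM-sized)

# Simulated essay embeddings with small class-specific mean shifts
X <- do.call(rbind, lapply(seq_along(classes), function(i) {
  matrix(rnorm(n_per_class * embed_dim, mean = i * 0.3), ncol = embed_dim)
}))
y <- factor(rep(classes, each = n_per_class))

train <- sample(nrow(X), 0.8 * nrow(X))

# One-versus-one: fit a binary random forest for each pair of classes
pairs <- combn(classes, 2, simplify = FALSE)
models <- lapply(pairs, function(p) {
  idx <- train[y[train] %in% p]
  randomForest(x = X[idx, ], y = droplevels(y[idx]))
})

# Combine the pairwise classifiers by majority vote on held-out essays
votes <- sapply(models, function(m) as.character(predict(m, X[-train, ])))
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == y[-train])  # held-out accuracy
```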

To register for the conference, go to https://ww2.amstat.org/meetings/jsm/2025/