PART: Pre-trained Authorship Representation Transformer


Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. Using stylometric representations is more suitable, but learning such representations is itself an open research challenge. In this paper, we propose PART, a contrastively trained model fit to learn authorship embeddings instead of semantics. We train our model on approximately 1.5M texts belonging to 1,162 literature authors, 17,287 blog post authors, and 135 corporate email accounts; a heterogeneous set with identifiable writing styles. We evaluate the model on current challenges, achieving competitive performance. We also evaluate our model on test splits of the datasets, achieving zero-shot 72.39% accuracy when bound to 250 authors, which is 54% and 56% higher than RoBERTa embeddings. We qualitatively assess the representations through different data visualizations of the available datasets, observing features such as gender, age, or occupation of the author.
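The abstract describes a contrastive objective that pulls embeddings of texts by the same author together and pushes those of different authors apart. The paper's exact loss is not specified here, so the sketch below is only illustrative: a supervised, InfoNCE-style contrastive loss in NumPy, where `supervised_contrastive_loss` and the toy embeddings are hypothetical names, not the authors' implementation.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, author_ids, temperature=0.1):
    """Illustrative supervised contrastive loss over author labels.

    Texts by the same author act as positive pairs; all other texts in
    the batch act as negatives. This is a sketch of the general idea,
    not PART's actual training objective.
    """
    # L2-normalize so similarity is cosine similarity
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = len(author_ids)
    self_mask = np.eye(n, dtype=bool)
    # Exclude each text's similarity with itself from the softmax
    sim = np.where(self_mask, -np.inf, sim)

    # Log-softmax over each row (all other texts in the batch)
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    labels = np.array(author_ids)
    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_counts = positives.sum(axis=1)

    # Average negative log-probability over positive pairs per anchor,
    # skipping anchors that have no positive in the batch
    has_pos = pos_counts > 0
    per_anchor = (np.where(positives, log_prob, 0.0).sum(axis=1)[has_pos]
                  / pos_counts[has_pos])
    return -per_anchor.mean()
```

With embeddings that already cluster by author, this loss is low; with mismatched labels it grows, which is the gradient signal that shapes the representation space during training.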
