SPC: A Semantic Pleonasm Corpus

The Semantic Pleonasm Corpus (SPC) is a collection of three thousand sentences. Each sentence features a pair of potentially semantically related words (chosen by a heuristic); human annotators determine whether either (or both) of the words is redundant. The corpus offers two improvements over current resources:

  1. The corpus is filtered for grammatical sentences, so that the question of redundancy is separated from that of grammaticality.
  2. The corpus is filtered for a balanced set of positive examples and negative examples (i.e., sentences with no redundancy).

The negative examples may make useful benchmark data – because they all contain a pair of words that are deemed to be semantically related, a successful system cannot rely on simple heuristics, such as semantic distances, for discrimination.
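The point about simple heuristics can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-dimensional vectors, the labels, the 0.8 threshold, and the function names are hypothetical and are not taken from the corpus or from any particular embedding model. It shows a cosine-similarity threshold baseline applied to a balanced toy set in which every word pair is closely related.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def threshold_baseline(pairs, threshold):
    """Flag a pair as redundant when its similarity reaches the threshold."""
    return [cosine_similarity(u, v) >= threshold for u, v, _ in pairs]


def accuracy(predictions, pairs):
    """Fraction of predictions that match the gold labels."""
    correct = sum(pred == label for pred, (_, _, label) in zip(predictions, pairs))
    return correct / len(pairs)


# Hypothetical toy data, NOT drawn from the corpus:
# (vector for word A, vector for word B, is_redundant).
# Every pair is semantically close, mirroring the corpus design,
# but only half of the pairs are labeled redundant.
TOY_PAIRS = [
    ((1.0, 0.0), (0.9, 0.1), True),
    ((0.0, 1.0), (0.1, 0.9), True),
    ((1.0, 1.0), (0.9, 1.0), False),
    ((0.5, 1.0), (0.6, 1.0), False),
]

predictions = threshold_baseline(TOY_PAIRS, threshold=0.8)
baseline_accuracy = accuracy(predictions, TOY_PAIRS)
print(predictions)
print(baseline_accuracy)
```

Because all four toy pairs score above the threshold, the baseline flags every pair as redundant and is right only on the positive half. On a balanced set of related pairs, a distance threshold alone therefore behaves much like a constant guess.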

The corpus is made available under the terms of the GNU General Public License and is distributed without any warranty.

To access the Pleonasm corpus, please fill out the following form. We respect your privacy and will not use your information for any purpose other than to assess interest in the resource.

For questions regarding the corpus, please reach out to kashefi@cs.pitt.edu.