Hate Speech Datasets

This page catalogues datasets annotated for hate speech, online abuse, and offensive language. They may be useful, for example, for training a natural language processing system to detect such language.

In most cases, data included on the list should be accessible either from the authors or by direct download, though other significant work may also be listed.

The list is maintained by Leon Derczynski and Bertie Vidgen.

Please make contributions via pull request or email. Accompanying data statements are preferred for all corpora.

Arabic

  1. L-HSAB

Danish

  1. DKhate
    • Annotation type: Offensive speech, target, and grade
    • Annotation level: Document
    • Text genre: Twitter, Reddit, News comments
    • Size: 3600
    • Data link: to appear 2019
    • Reference: Cross-lingual Multi-Platform Hate Speech Detection (to appear)

English

  1. Davidson et al.
  2. Wikipedia Detox
  3. Waseem & Hovy
  4. Imperium
  5. OffensEval 2019
  6. Liu et al.
  7. StormfrontWS
  8. Toxic Comment Classification Challenge
  9. hatEval
  10. Founta et al.

German

  1. IWG_hatespeech_public
  2. GermEval 2018
  3. GermEval 2019

Indonesian

  1. Ibrohim & Budi

Italian

  1. HSC
  2. CREEP EIT

Polish

  1. PolEval 2019

Portuguese

  1. Fortuna et al.
  2. OffComBR

Spanish

  1. hatEval

Lists of abusive keywords

  1. Hatebase
    • “Researchers are encouraged to take advantage of Hatebase’s vocabulary dataset, which is a valuable lexicon for searching other data repositories such as public forums, as well as Hatebase’s sightings dataset, which is useful for trending analysis”
    • Data link: hatebase.org/academia
  2. Hurtlex
  3. Gorrell et al.
  4. Wiegand et al.
  5. Chandrasekharan et al.
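
As a rough illustration of how a keyword list like those above might be used to search another text collection, here is a minimal Python sketch. The lexicon entries, the sample posts, and the helper names `build_matcher` and `find_matches` are all hypothetical placeholders, not part of any listed resource or its API. Keyword matching of this kind only surfaces candidate texts; it does not by itself decide whether a text is abusive.

```python
import re
from typing import Iterable, List, Tuple


def build_matcher(lexicon: Iterable[str]) -> "re.Pattern[str]":
    """Compile a case-insensitive, whole-word pattern from lexicon terms."""
    # Normalise terms and put longer ones first so that multi-word
    # entries are tried before shorter entries that share a prefix.
    terms = sorted(
        {t.strip().lower() for t in lexicon if t.strip()},
        key=len,
        reverse=True,
    )
    if not terms:
        raise ValueError("lexicon is empty")
    alternation = "|".join(re.escape(t) for t in terms)
    # The lookarounds require that a match is not embedded in a longer word.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def find_matches(
    texts: Iterable[str], matcher: "re.Pattern[str]"
) -> List[Tuple[int, List[str]]]:
    """Return (index, sorted matched terms) for each text with a match."""
    hits = []
    for i, text in enumerate(texts):
        found = sorted({m.group(0).lower() for m in matcher.finditer(text)})
        if found:
            hits.append((i, found))
    return hits


# Hypothetical placeholder terms; a real lexicon would be loaded from a file.
PLACEHOLDER_LEXICON = ["slur_a", "slur b"]

# Hypothetical sample posts standing in for a searched repository.
POSTS = [
    "Nothing to see here.",
    "This post contains SLUR_A twice: slur_a.",
    "A multi-word entry, slur b, also matches.",
    "slur_ab should not match as a whole word.",
]

matcher = build_matcher(PLACEHOLDER_LEXICON)
hits = find_matches(POSTS, matcher)
for index, terms in hits:
    print(index, terms)
```

Whole-word matching is a deliberate choice in this sketch: plain substring search would also flag harmless words that happen to contain a lexicon entry.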

This page is http://hatespeechdata.com/.