Hate speech data

This page catalogues datasets annotated for hate speech, online abuse, and offensive language. They may be useful, for example, for training a natural language processing system to detect such language.

The list is maintained by Leon Derczynski and Bertie Vidgen.

Please contribute via pull request or email. Accompanying data statements are preferred for all corpora.

If you use these resources, please cite (and read!) our paper: Directions in Abusive Language Training Data: Garbage In, Garbage Out. And if you would like to find other resources for researching online hate, visit The Alan Turing Institute’s Online Hate Research Hub or read The Alan Turing Institute’s Reading List on Online Hate and Abuse Research.

If you’re looking for a good paper on online hate training datasets (beyond our paper, of course!) then have a look at ‘Resources and benchmark corpora for hate speech detection: a systematic review’ by Poletto et al. in Language Resources and Evaluation.

List of datasets

Arabic

1. Are They our Brothers? Analysis and Detection of Religious Hate Speech in the Arabic Twittersphere

2. Multilingual and Multi-Aspect Hate Speech Analysis (Arabic)

3. L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language

4. Abusive Language Detection on Arabic Social Media (Twitter)

5. Abusive Language Detection on Arabic Social Media (Al Jazeera)

6. Dataset Construction for the Detection of Anti-Social Behaviour in Online Communication in Arabic

Croatian

7. Datasets of Slovene and Croatian Moderated News Comments

Danish

8. Offensive Language and Hate Speech Detection for Danish

English

9. Automated Hate Speech Detection and the Problem of Offensive Language

10. Hate Speech Dataset from a White Supremacy Forum

11. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter

12. Detecting Online Hate Speech Using Context Aware Models

13. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter

14. When Does a Compliment Become Sexist? Analysis and Classification of Ambivalent Sexism Using Twitter Data

15. Overview of the Task on Automatic Misogyny Identification at IberEval 2018 (English)

16. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (English)

17. Characterizing and Detecting Hateful Users on Twitter

18. A Benchmark Dataset for Learning to Intervene in Online Hate Speech (Gab)

19. A Benchmark Dataset for Learning to Intervene in Online Hate Speech (Reddit)

20. Multilingual and Multi-Aspect Hate Speech Analysis (English)

21. Exploring Hate Speech Detection in Multimodal Publications

22. Predicting the Type and Target of Offensive Posts in Social Media

23. hatEval, SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (English)

24. Peer to Peer Hate: Hate Speech Instigators and Their Targets

25. Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages

26. Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior

27. A Large Labeled Corpus for Online Harassment Research

28. Ex Machina: Personal Attacks Seen at Scale, Personal attacks

29. Ex Machina: Personal Attacks Seen at Scale, Toxicity

30. Detecting cyberbullying in online communities (World of Warcraft)

31. Detecting cyberbullying in online communities (League of Legends)

32. A Quality Type-aware Annotated Corpus and Lexicon for Harassment Research

33. Ex Machina: Personal Attacks Seen at Scale, Aggression and Friendliness

French

34. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (French)

35. Multilingual and Multi-Aspect Hate Speech Analysis (French)

German

36. Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis

37. Detecting Offensive Statements Towards Foreigners in Social Media

38. GermEval 2018

39. Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages

Greek

40. Deep Learning for User Comment Moderation, Flagged Comments

41. Deep Learning for User Comment Moderation, Moderated Comments

42. Offensive Language Identification in Greek

Hindi-English

43. Aggression-annotated Corpus of Hindi-English Code-mixed Data

44. Aggression-annotated Corpus of Hindi-English Code-mixed Data

45. Did You Offend Me? Classification of Offensive Tweets in Hinglish Language

46. A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection

47. Overview of the HASOC track at FIRE 2019: Hate Speech and Offensive Content Identification in Indo-European Languages

Indonesian

48. Hate Speech Detection in the Indonesian Language: A Dataset and Preliminary Study

49. Multi-Label Hate Speech and Abusive Language Detection in Indonesian Twitter

50. A Dataset and Preliminaries Study for Abusive Language Detection in Indonesian Social Media

Italian

51. An Italian Twitter Corpus of Hate Speech against Immigrants

52. Overview of the EVALITA 2018 Hate Speech Detection Task (Facebook)

53. Overview of the EVALITA 2018 Hate Speech Detection Task (Twitter)

54. CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech (Italian)

55. Creating a WhatsApp Dataset to Study Pre-teen Cyberbullying

Polish

56. Results of the PolEval 2019 Shared Task 6: First Dataset and Open Shared Task for Automatic Cyberbullying Detection in Polish Twitter

Portuguese

57. A Hierarchically-Labeled Portuguese Hate Speech Dataset

58. Offensive Comments in the Brazilian Web: A Dataset and Baseline Results

Slovene

59. Datasets of Slovene and Croatian Moderated News Comments

Spanish

60. Overview of MEX-A3T at IberEval 2018: Authorship and Aggressiveness Analysis in Mexican Spanish Tweets

61. Overview of the Task on Automatic Misogyny Identification at IberEval 2018 (Spanish)

62. hatEval, SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (Spanish)

Turkish

63. A Corpus of Turkish Offensive Language on Social Media


Lists of abusive keywords

  1. Hatebase
    • “Researchers are encouraged to take advantage of Hatebase’s vocabulary dataset, which is a valuable lexicon for searching other data repositories such as public forums, as well as Hatebase’s sightings dataset, which is useful for trending analysis”
    • Data link: hatebase.org/academia
  2. Hurtlex
  3. Gorrell et al.
  4. Wiegand et al.
  5. Chandrasekharan et al.
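Keyword lists like these are typically used to search other text collections for candidate posts. The following is a minimal Python sketch of that idea, not the method of any resource listed above. The function names and the placeholder terms are hypothetical, and it assumes the lexicon has already been loaded as a plain list of strings. A keyword match only surfaces candidates for annotation or review; it is not a hate speech classifier.

```python
from __future__ import annotations

import re
from typing import Iterable


def build_matcher(lexicon: Iterable[str]) -> re.Pattern:
    """Compile a case-insensitive, whole-term pattern from lexicon entries."""
    # Deduplicate, drop empty entries, and put longer terms first so that
    # multi-word entries are preferred over shorter terms they contain.
    terms = sorted({t.strip() for t in lexicon if t.strip()}, key=len, reverse=True)
    if not terms:
        raise ValueError("lexicon is empty")
    alternation = "|".join(re.escape(t) for t in terms)
    # Lookarounds prevent matches inside longer words.
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def find_terms(text: str, matcher: re.Pattern) -> list[str]:
    """Return the lexicon terms found in text, lowercased, in order of appearance."""
    return [m.group(0).lower() for m in matcher.finditer(text)]


# Hypothetical placeholder entries; a real lexicon would be loaded from a file.
PLACEHOLDER_LEXICON = ["slurword", "bad phrase"]
matcher = build_matcher(PLACEHOLDER_LEXICON)

posts = [
    "A SlurWord here and a bad phrase.",
    "Nothing to flag in this post.",
]
# Keep only posts that contain at least one lexicon term.
flagged = [(post, hits) for post in posts if (hits := find_terms(post, matcher))]
```

Whole-term matching is a deliberate choice here: plain substring search would also flag unrelated longer words that happen to contain a lexicon entry.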

This page is http://hatespeechdata.com/.