PropagandaDetectionNLP

SEMEVAL 2020 TASK 11 “DETECTION OF PROPAGANDA TECHNIQUES IN NEWS ARTICLES”

Propaganda is commonly defined as information of a biased or misleading nature, possibly purposefully shaped, to promote an agenda or a cause. In this project we build a machine learning system for the detection of propaganda techniques in news articles. The project covers two subtasks: Span Identification (SI) and Technique Classification (TC). We secured position 17 on the leaderboard in the SI task and position 20 in the TC task.

The propaganda detection pipeline includes two subtasks:

Task-1 - Span Identification (SI): Given a plain-text document, identify those specific fragments which are propagandistic.
Task-2 - Technique Classification (TC): Given a text fragment identified as propaganda and its document context, identify the propaganda technique applied in the fragment. This is a 14-class classification task.

14 class distribution

Task-2 is a 14-class classification task. The distribution among the classes is shown below. The dataset is highly imbalanced.
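One common way to compensate for an imbalanced label distribution is to weight each class inversely to its frequency during training. The sketch below is only an illustration of that idea and is not confirmed to be what this project does; the function name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    (Hypothetical helper; not taken from this repository.)
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy labels for illustration only; these are NOT the real class names or counts.
labels = ["technique_a"] * 6 + ["technique_b"] * 3 + ["technique_c"] * 1
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

A dictionary like this can typically be passed to a training routine that accepts per-class weights, after mapping the labels to integer class ids.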

Word Cloud – Propaganda span from training dataset

Many propaganda spans include words like god, church and Muslim. This suggests that religious themes are commonly used in propaganda.

Baseline Architecture

Task-1 – Span Identification Task: For the baseline architecture we created P/NP tags and trained a bidirectional LSTM model. Please check here for more details
Task-2 – Technique Classification Task: For the baseline architecture we used as features the context, the span present in the context, and the ratio between the context length and the span length. Please check here for more details
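The two baseline inputs described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: it assumes P/NP means propaganda / non-propaganda, that tokens are split on whitespace, that spans are given as character offsets, and that the length ratio is context length divided by span length. The function names are hypothetical.

```python
import re


def pnp_tags(text, spans):
    """Tag each whitespace token 'P' if it overlaps any (start, end)
    character span, otherwise 'NP'. End offsets are exclusive."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        inside = any(tok_start < end and start < tok_end for start, end in spans)
        tagged.append((match.group(), "P" if inside else "NP"))
    return tagged


def baseline_tc_features(context, span_start, span_end):
    """Build the three baseline features: context, span, and length ratio.

    Assumption: ratio = len(context) / len(span), measured in characters.
    """
    span = context[span_start:span_end]
    if not span:
        raise ValueError("span must not be empty")
    return {
        "context": context,
        "span": span,
        "length_ratio": len(context) / len(span),
    }


# Toy example, not taken from the dataset.
text = "They are the enemy of the people"
start = text.index("enemy")
end = len(text)

tags = pnp_tags(text, [(start, end)])
features = baseline_tc_features(text, start, end)

print(tags)
print(features["span"], round(features["length_ratio"], 3))
```

The tag sequence would then be the per-token target for the sequence model, and the feature dictionary the input to the technique classifier.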

Final Architecture

Task1 - Span Identification Task: For span identification we use the state-of-the-art language model BERT for token-level classification, with a BIOE tagging scheme. Please check here for more details
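To illustrate the BIOE scheme mentioned above, the sketch below converts per-token span membership into B (begin), I (inside), O (outside) and E (end) tags. It is an assumption-laden example rather than the project's implementation: the function name `to_bioe` is hypothetical, and tagging a single-token span as B is a choice made here, since the source does not say how that case is handled.

```python
def to_bioe(in_span):
    """Convert a list of booleans (token is inside a propaganda span)
    into BIOE tags.

    Assumption: a span consisting of one token is tagged 'B'.
    """
    tags = []
    n = len(in_span)
    for i, flag in enumerate(in_span):
        if not flag:
            tags.append("O")
            continue
        starts_span = i == 0 or not in_span[i - 1]
        ends_span = i == n - 1 or not in_span[i + 1]
        if starts_span:
            tags.append("B")
        elif ends_span:
            tags.append("E")
        else:
            tags.append("I")
    return tags


# Toy example: the last four tokens form one span.
tokens = ["They", "are", "the", "enemy", "of", "the", "people"]
membership = [False, False, False, True, True, True, True]
bioe = to_bioe(membership)

print(list(zip(tokens, bioe)))
```

In a BERT setup these word-level tags would still need to be aligned with subword tokens, which is not shown here.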

Task2 - Technique Classification Task: For the technique classification model, we use the BERT language model to get a contextual sequence representation of the propaganda span and its context, which is then used for classification. Please check here for more details

Leaderboard Results

Team Information

SI task

TC task

Tools and Technologies

Python
scikit-learn
TensorFlow Keras
spaCy and NLTK
Hugging Face - BERT
Jupyter Notebook

How to use this repository

Please check Readme.pdf here

Dataset Detail

Please check Readme.pdf here

References

Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101.
Lance Ramshaw and Mitch Marcus. 1995. Text chunking using transformation-based learning. In Third Workshop on Very Large Corpora.
Li, W., Li, S., Liu, C. et al. Span identification and technique classification of propaganda in news articles. Complex Intell. Syst. (2021). https://doi.org/10.1007/s40747-021-00393-y
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Giovanni Da San Martino, Seunghak Yu, Alberto Barrón-Cedeño, Rostislav Petrov, and Preslav Nakov. 2019. Fine-grained analysis of propaganda in news articles. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November.
Giovanni Da San Martino, Alberto Barrón-Cedeño, Henning Wachsmuth, Rostislav Petrov, and Preslav Nakov. 2020. SemEval-2020 task 11: Detection of propaganda techniques in news articles. In Proceedings of the 14th International Workshop on Semantic Evaluation, SemEval 2020, Barcelona, Spain, September.