Paraphrase detection systems uncover the relationship between two text fragments and classify them as paraphrased when they convey the same idea, otherwise non-paraphrased. Previously, researchers have mainly focused on developing English-language resources for paraphrase detection, and there have been very few efforts for South Asian languages. No research has been conducted on sentence-level paraphrase detection in Urdu, a low-resourced language, mainly due to the unavailability of corpora that focus on the sentence level; the available related studies on Urdu only address text reuse detection at the passage and document levels. Therefore, this study aims to develop a large-scale, manually annotated benchmark Urdu paraphrase detection corpus at the sentence level, based on real cases from journalism. The proposed Urdu Sentential Paraphrases (USP) corpus contains 4,900 sentences (2,941 paraphrased and 1,959 non-paraphrased), manually collected from Urdu newspapers. As a secondary contribution, several techniques were proposed, developed, and compared, including Word Embedding (WE), Sentence Transformers (ST), and feature-fusion techniques, with N-gram treated as the baseline. The experimental results indicate that the proposed feature-fusion technique is the most suitable for the Urdu paraphrase detection task: performance increases when features of the proposed (ST) and baseline (N-gram) techniques are combined, and the best result was obtained using feature fusion (F1 = 0.855). The proposed techniques were also applied to the UPPC corpus to check their performance at the document level. Our corpus is available as a free download for research purposes.
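The feature-fusion idea described above can be sketched as follows: sparse character n-gram features (the baseline) are concatenated with dense sentence-embedding features and fed to a single classifier. This is a minimal illustration, not the paper's implementation; the toy English pairs stand in for Urdu examples, and the random-projection vectors stand in for real Sentence Transformer embeddings, which would come from a pretrained model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy sentence pairs (English stand-ins for Urdu examples):
# label 1 = paraphrased, 0 = non-paraphrased.
pairs = [
    ("the minister announced new taxes", "new taxes were announced by the minister", 1),
    ("the team won the final match", "heavy rain flooded the city streets", 0),
    ("prices of fuel rose sharply", "fuel prices increased steeply", 1),
    ("schools will reopen on monday", "the stock market closed lower today", 0),
]
texts = [a + " " + b for a, b, _ in pairs]
labels = np.array([y for _, _, y in pairs])

# Baseline features: character n-gram TF-IDF over each concatenated pair.
ngram = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
X_ngram = ngram.fit_transform(texts).toarray()

# Stand-in for dense sentence-embedding (ST) features.
rng = np.random.default_rng(0)
X_embed = rng.normal(size=(len(texts), 16))

# Feature fusion: concatenate both feature sets into one matrix.
X_fused = np.hstack([X_ngram, X_embed])

# A single classifier is trained on the fused representation.
clf = LogisticRegression(max_iter=1000).fit(X_fused, labels)
print(X_fused.shape)
```

The fusion step itself is just column-wise concatenation; the gain reported in the paper comes from the two feature families capturing complementary evidence (surface overlap versus semantic similarity).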
Text reuse is becoming a serious issue in many fields, and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold-standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage-level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English, whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories: near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (F1 = 0.732 for binary, F1 = 0.552 for ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.
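The Translation plus Mono-lingual Analysis method evaluated above can be sketched in two steps: the Urdu side is first machine-translated into English, after which an ordinary mono-lingual similarity measure decides whether the pair is reused. In this illustration, `translate_ur_to_en` is a hypothetical stub standing in for a real MT system, and word-overlap (Jaccard) similarity stands in for whichever mono-lingual detector is actually used.

```python
def translate_ur_to_en(text: str) -> str:
    """Placeholder for an MT system; a real pipeline would call one here."""
    lookup = {"وزیر نے نئے ٹیکس کا اعلان کیا": "the minister announced new taxes"}
    return lookup.get(text, text)

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two English sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# An English source and a derived Urdu sentence (toy example).
source_en = "the minister announced new taxes on fuel"
derived_ur = "وزیر نے نئے ٹیکس کا اعلان کیا"

# Step 1: translate the Urdu text; step 2: mono-lingual comparison.
score = jaccard(source_en, translate_ur_to_en(derived_ur))
label = "reused" if score >= 0.5 else "independent"
print(round(score, 2), label)
```

The threshold and the similarity function are illustrative; the point is only the pipeline shape: once translation maps both texts into one language, any mono-lingual reuse detector applies unchanged.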