DataSelection-NMT

Repository for the experiments in my paper accepted to the CLIN Journal: "Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts"


Data Selection in NMT

Welcome to the repository, designed according to the FAIR principles, for the experiments described in “Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts”.

The paper was accepted on Dec 6, 2021 and published in Feb 2022.

You can read the paper on arXiv, ResearchGate, or the publisher’s website.

Abstract

Continuously-growing data volumes lead to larger generic models. Specific use-cases are usually left out, since generic models tend to perform poorly in domain-specific cases. Our work addresses this gap with a method for selecting in-domain data from generic-domain (parallel text) corpora, for the task of machine translation. The proposed method ranks sentences in parallel general-domain data according to their cosine similarity with a monolingual domain-specific data set. We then select the top K sentences with the highest similarity score to train a new machine translation system tuned to the specific in-domain data. Our experimental results show that models trained on this in-domain data outperform models trained on generic or a mixture of generic and domain data. That is, our method selects high-quality domain-specific training instances at low computational cost and data size.
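
The ranking step described in the abstract can be sketched as follows. This is a minimal illustration, not the code used in the paper: it assumes sentence embeddings have already been computed by some encoder, and it follows one plausible reading of the selection rule, namely that for each in-domain sentence the K most similar generic sentences are kept. The names `cosine` and `select_top_k` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def select_top_k(in_domain_vecs, generic_vecs, k):
    """Return sorted indices of generic sentences that are among the k most
    similar to at least one in-domain sentence (assumed selection rule)."""
    selected = set()
    for query in in_domain_vecs:
        ranked = sorted(
            range(len(generic_vecs)),
            key=lambda i: cosine(query, generic_vecs[i]),
            reverse=True,
        )
        selected.update(ranked[:k])
    return sorted(selected)


# Toy 2-dimensional "embeddings" (hypothetical values, for illustration only).
in_domain = [[1.0, 0.0]]
generic = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(select_top_k(in_domain, generic, k=2))  # prints [0, 2]
```

The returned indices would then be used to pick the corresponding source and target lines from the parallel generic corpus.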

Data Selection Tool

We also developed a Python tool that streamlines the process of selecting domain-specific data from generic corpora and training a domain-specific machine translation model. Our tool is particularly useful in scenarios where there is a dearth of domain-specific data or only monolingual data is available. Moreover, our tool is flexible and can handle varying sizes of domain-specific data. To learn more about this tool, please visit our GitHub repository at https://github.com/JoyeBright/DataSelection-NMT/tree/main/Tools_DS.

Our Pre-trained models on Hugging Face

| System | Link | System | Link |
|:---:|:---:|:---:|:---:|
| Top1 | Download | Top1 | Download |
| Top2+Top1 | Download | Top2 | Download |
| Top3+Top2+… | Download | Top3 | Download |
| Top4+Top3+… | Download | Top4 | Download |
| Top5+Top4+… | Download | Top5 | Download |
| Top6+Top5+… | Download | Top6 | Download |

Note: Git LFS bandwidth for a personal account is limited to 1 GB/month. If you’re unable to download the models, follow this link.

How to use

Note: we ported the best checkpoints of the trained models to Hugging Face (HF). Since our models were trained with OpenNMT-py, it was not possible to use them directly for inference on HF. To work around this, we use CTranslate2, an inference engine for Transformer models.

Follow the steps below to translate your sentences:

1. Install the Python package:

pip install --upgrade pip
pip install ctranslate2

2. Download the models from our HF repository. You can do this manually or use the following Python script:

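import requests

url = "Download Link"      # replace with the model's download URL
model_path = "Model Path"  # replace with the local path to save the model to

# Follow redirects and stop on an HTTP error instead of saving an error page.
r = requests.get(url, allow_redirects=True)
r.raise_for_status()

# The context manager closes the file once the model has been written.
with open(model_path, "wb") as f:
    f.write(r.content)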

3. Convert the downloaded model:

ct2-opennmt-py-converter --model_path model_path --output_dir output_directory

4. Translate tokenized inputs:

Note: the inputs should be tokenized with SentencePiece. You can also use the tokenized version of the IWSLT test sets.

import ctranslate2
translator = ctranslate2.Translator("output_directory/")
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])
print(results[0].hypotheses[0])  # best hypothesis, as a list of tokens

or

import ctranslate2
translator = ctranslate2.Translator("output_directory/")
# input_file and output_file are paths to tokenized text files.
translator.translate_file(input_file, output_file, batch_type="examples")  # batch_type is "examples" or "tokens"

To customize the CTranslate2 calls, see the CTranslate2 API documentation.
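
When no custom tokenization function is passed, `translate_file` expects pre-tokenized text: one sentence per line, with tokens separated by spaces. The helpers below are a small sketch for writing and reading files in that layout. The function names and the file name are hypothetical, and the example only round-trips the data without calling CTranslate2.

```python
import os
import tempfile


def write_tokenized(path, sentences):
    """Write one sentence per line, tokens joined by single spaces."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            f.write(" ".join(tokens) + "\n")


def read_tokenized(path):
    """Read a space-separated token file back into lists of tokens."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]


sentences = [["▁H", "ello", "▁world", "!"], ["▁Good", "▁morning"]]

with tempfile.TemporaryDirectory() as tmp:
    input_file = os.path.join(tmp, "input.tok.txt")  # hypothetical file name
    write_tokenized(input_file, sentences)
    round_trip = read_tokenized(input_file)

print(round_trip == sentences)
```

The same `read_tokenized` helper can be applied to the output file to get the translated tokens back as lists.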

5. Detokenize the outputs:

Note: detokenize the output with the same SentencePiece model that was used in step 4.

tools/detokenize.perl -no-escape -l fr \
< output_file \
> output_file.detok
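
If the output lines still contain SentencePiece pieces (tokens that start with `▁`), they need to be merged back into plain text. With the `sentencepiece` package and the model from step 4, this would normally be left to the library's decoder. The sketch below only mimics that behaviour by hand for illustration: it concatenates the pieces and turns the `▁` marker back into a space. The function name `merge_pieces` is hypothetical.

```python
def merge_pieces(pieces):
    """Join SentencePiece pieces and restore spaces from the '▁' (U+2581) marker."""
    return "".join(pieces).replace("\u2581", " ").strip()


line = "▁H ello ▁world !"
print(merge_pieces(line.split()))  # prints: Hello world!
```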

6. Remove the @@ tokens:

cat output_file.detok | sed -E 's/(@@)|(@@ )|(@@ ?$)//g' \
> output_file.detok.postprocessed

Use grep to check that the @@ tokens were removed successfully (no output means none remain):

grep @@ output_file.detok.postprocessed
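
The same cleanup can be sketched in Python, which may be handy when post-processing is already done in a script. This is an illustrative equivalent of the sed command, not part of the repository's tooling; the function name `remove_bpe_markers` is hypothetical. It deletes every `@@` marker together with the single space that follows it, if there is one.

```python
import re


def remove_bpe_markers(line):
    """Delete each '@@' marker and the space directly after it, if present."""
    return re.sub(r"@@ ?", "", line)


example = "trans@@ la@@ tion is fun@@"
cleaned = remove_bpe_markers(example)
print(cleaned)  # prints: translation is fun
```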

Authors

Cite the paper

If you find this repository helpful, feel free to cite our publication:

@article{PourmostafaRoshanSharami_Shterionov_Spronck_2021,
title={Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts},
volume={11},
url={https://www.clinjournal.org/clinj/article/view/137},
journal={Computational Linguistics in the Netherlands Journal},
author={Pourmostafa Roshan Sharami, Javad and Shterionov, Dimitar and Spronck, Pieter},
year={2021},
month={Dec.},
pages={213--230}
}