This repo provides the code for reproducing RewriteNAT, proposed in our EMNLP 2021 paper "Learning to Rewrite for Non-Autoregressive Neural Machine Translation". RewriteNAT is an iterative NAT model that uses a locator component to explicitly learn to rewrite the erroneous translation pieces during iterative decoding.
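To make the locate-then-rewrite loop concrete, here is a toy, runnable sketch of the control flow. The real model uses learned neural components; the `locate_errors` and `revise` functions below are hypothetical stand-ins, not code from this repository.

```python
# Illustrative sketch only: a toy version of RewriteNAT-style iterative
# decoding. The "locator" and "revisor" here are stand-in functions so
# the control flow is runnable; names are hypothetical.

def locate_errors(hypothesis, target_vocab):
    """Toy locator: flag tokens not in the target-side vocabulary."""
    return [tok not in target_vocab for tok in hypothesis]

def revise(hypothesis, error_mask, fallback="<unk>"):
    """Toy revisor: rewrite only the positions the locator flagged."""
    return [fallback if bad else tok for tok, bad in zip(hypothesis, error_mask)]

def iterative_rewrite(hypothesis, target_vocab, max_iter=10):
    for _ in range(max_iter):
        mask = locate_errors(hypothesis, target_vocab)
        if not any(mask):  # locator finds nothing to fix: stop early
            break
        hypothesis = revise(hypothesis, mask)
    return hypothesis

vocab = {"the", "cat", "sat", "<unk>"}
print(iterative_rewrite(["the", "katz", "sat"], vocab))
# ['the', '<unk>', 'sat']
```

The early-exit when the locator flags nothing mirrors how iterative decoding can terminate before `max_iter` passes.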
All the datasets are tokenized using the scripts from Moses, except for Chinese, which is tokenized with Jieba, and split into subword units using BPE. The tokenized datasets are binarized using the script binaried.sh as follows:
```shell
python preprocess.py \
    --source-lang ${src} --target-lang ${tgt} \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/${dataset} --thresholdtgt 0 --thresholdsrc 0 \
    --workers 64 --joined-dictionary
```
All the models are trained on 8 Tesla V100 GPUs for 300,000 updates with an effective batch size of 128,000 tokens, except for En→Fr, where we train for 500,000 updates to account for the larger dataset. To obtain better performance, a pretrained CMLM is used to initialize the parameters of our proposed RewriteNAT. The training script train.rewrite.nat.sh is configured as follows:
```shell
python train.py \
    data-bin/${dataset} \
    --source-lang ${src} --target-lang ${tgt} \
    --save-dir ${save_dir} \
    --ddp-backend=no_c10d \
    --task translation_lev \
    --criterion rewrite_nat_loss \
    --arch rewrite_nonautoregressive_transformer \
    --noise full_mask \
    ${share_all_embeddings} \
    --optimizer adam --adam-betas '(0.9,0.98)' \
    --lr 0.0005 --lr-scheduler inverse_sqrt \
    --min-lr '1e-09' --warmup-updates 10000 \
    --warmup-init-lr '1e-07' --label-smoothing 0.1 \
    --dropout 0.3 --weight-decay 0.01 \
    --decoder-learned-pos \
    --encoder-learned-pos \
    --length-loss-factor 0.1 \
    --apply-bert-init \
    --log-format 'simple' --log-interval 100 \
    --fixed-validation-seed 7 \
    --max-tokens 4000 \
    --save-interval-updates 10000 \
    --max-update ${step} \
    --update-freq 4 \
    --fp16 \
    --save-interval ${save_interval} \
    --discriminator-layers 6 \
    --train-max-iter ${max_iter} \
    --roll-in-g sample \
    --roll-in-d oracle \
    --imitation-g \
    --imitation-d \
    --discriminator-loss-factor ${discriminator_weight} \
    --no-share-discriminator \
    --generator-scale ${generator_scale} \
    --discriminator-scale ${discriminator_scale} \
    --restore-file cmlm_big_128k_300k/${dataset}/checkpoint_cmlm_128k.pt \
    --reset-optimizer \
    --reset-meters \
    --reset-dataloader \
    --reset-lr-scheduler
```
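The flags above are consistent with the stated effective batch size, assuming the usual fairseq relationship tokens-per-update = max-tokens × number of GPUs × update-freq:

```python
# Sanity check of the effective batch size quoted above, assuming the
# standard fairseq relationship between per-GPU batch, GPU count, and
# gradient accumulation.
max_tokens = 4000   # --max-tokens (per GPU)
num_gpus = 8        # 8 Tesla V100 GPUs
update_freq = 4     # --update-freq (gradient accumulation steps)

effective_batch = max_tokens * num_gpus * update_freq
print(effective_batch)  # 128000, matching the stated 128,000 tokens
```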
We evaluate performance with BLEU for all language pairs, except for En→Zh, where we use SacreBLEU. The testing script test.rewrite.nat.sh generates the translations as follows:
```shell
python generate.py \
    data-bin/${dataset} \
    --source-lang ${src} --target-lang ${tgt} \
    --gen-subset ${subset} \
    --task translation_lev \
    --path ${save_dir}/${dataset}/checkpoint_average_${suffix}.pt \
    --iter-decode-max-iter ${max_iter} \
    --iter-decode-with-beam ${beam} \
    --iter-decode-p ${iter_p} \
    --beam 1 --remove-bpe \
    --batch-size 50 \
    --print-step \
    --quiet
```
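The checkpoint name `checkpoint_average_${suffix}.pt` indicates an averaged checkpoint; fairseq ships `scripts/average_checkpoints.py` for this. The core operation is element-wise averaging of parameters across saved checkpoints, sketched here with plain dicts of floats in place of torch tensors:

```python
# Illustrative sketch of checkpoint averaging, assuming all checkpoints
# share the same parameter names. Real checkpoints hold torch tensors;
# plain floats are used here so the idea is runnable anywhere.

def average_checkpoints(state_dicts):
    """Element-wise mean over a list of parameter dicts."""
    n = len(state_dicts)
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}

ckpt_a = {"encoder.w": 1.0, "decoder.b": 0.0}
ckpt_b = {"encoder.w": 3.0, "decoder.b": 2.0}
print(average_checkpoints([ckpt_a, ckpt_b]))
# {'encoder.w': 2.0, 'decoder.b': 1.0}
```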
Please cite as:
```bibtex
@inproceedings{geng-etal-2021-learning,
    title = "Learning to Rewrite for Non-Autoregressive Neural Machine Translation",
    author = "Geng, Xinwei and Feng, Xiaocheng and Qin, Bing",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.265",
    pages = "3297--3308",
}
```