ACL 2023 (Findings) | arXiv | BibTeX | English version
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
ERNIE-Code is a unified code language model (Code LLM) spanning multiple natural and programming languages: it supports 116 natural languages and 6+ programming languages. It employs two methods for cross-lingual pre-training: span-corruption language modeling, which learns patterns from monolingual NL or PL data, and pivot-based translation language modeling, which relies on parallel NL–PL data.
ERNIE-Code outperforms previous multilingual code and text models (e.g. mT5 and CodeT5) on a wide range of downstream code-intelligence tasks, including code-to-text, text-to-code, code-to-code, and text-to-text (documentation translation) generation, and it shows strong zero-shot prompting ability on multilingual code summarization and documentation translation.
This project is the PaddlePaddle implementation of ERNIE-Code, covering model inference and weight conversion. A brief directory layout:
├── README.md # documentation
├── predict.py # forward-inference example
├── convert.py # weight-conversion script
This project provides a simple demo of multilingual code/text generation. Launch it with:
python predict.py \
--input 'BadZipFileのAliasは、古い Python バージョンとの互換性のために。' \
--target_lang 'code' \
--source_prefix 'translate Japanese to Python: \n' \
--max_length 1024 \
--num_beams 3 \
--device 'gpu'
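The `--num_beams` flag above controls beam-search decoding. As a self-contained toy illustration of what the beam width means (this is not the model's actual decoder; the per-step log-probabilities here are made up):

```python
def beam_search(step_scores, num_beams):
    """Toy beam search. step_scores[t][token] is the log-probability of
    emitting `token` at step t. At every step, only the `num_beams`
    highest-scoring partial sequences are kept, which is exactly what
    the --num_beams flag bounds in real seq2seq decoding."""
    beams = [([], 0.0)]  # (token sequence, accumulated log-prob)
    for scores in step_scores:
        # Expand every surviving beam by every possible next token.
        candidates = [
            (seq + [tok], logp + scores[tok])
            for seq, logp in beams
            for tok in range(len(scores))
        ]
        # Keep only the top `num_beams` candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

# Two decoding steps over a 3-token vocabulary, beam width 3:
best_seq, best_logp = beam_search(
    [[-0.1, -2.0, -3.0], [-1.0, -0.2, -2.5]], num_beams=3
)[0]
```

A wider beam explores more candidate translations at each step at the cost of slower decoding; `num_beams=1` degenerates to greedy search.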
Explanation of the command-line arguments:

- input: the input text sequence.
- target_lang: the target language; either 'text' or 'code'.
- source_prefix: the prompt prepended to the input.
- max_length: the maximum length of the input/output text.
- num_beams: the beam size kept at each decoding step (for beam search).
- device: the device to run on; either 'cpu' or 'gpu'.

@inproceedings{chai-etal-2023-ernie,
title = "{ERNIE}-Code: Beyond {E}nglish-Centric Cross-lingual Pretraining for Programming Languages",
author = "Chai, Yekun and
Wang, Shuohuan and
Pang, Chao and
Sun, Yu and
Tian, Hao and
Wu, Hua",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.676",
pages = "10628--10650",
abstract = "Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.",
}