ACL 2023 (Findings) | arXiv | BibTeX | English version
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
ERNIE-Code is a unified code language model (Code LLM) spanning multiple natural and programming languages: it supports 116 natural languages and 6+ programming languages. It employs two methods for cross-lingual pre-training: span-corruption language modeling, which learns patterns from monolingual NL or PL data, and pivot-based translation language modeling, which relies on parallel NL–PL data.
ERNIE-Code outperforms previous multilingual code and text models (e.g. mT5 and CodeT5) on a wide range of downstream code-intelligence tasks, including code-to-text, text-to-code, code-to-code, and text-to-text (documentation translation) generation, and it shows strong zero-shot prompting ability on multilingual code summarization and documentation translation.
This project is the PaddlePaddle implementation of ERNIE-Code, covering model inference and weight conversion. A brief directory layout:
├── README.md # documentation
├── predict.py # forward-inference example
├── convert.py # weight-conversion script
This project provides a simple demo of multilingual code/text generation. Launch it with:
python predict.py \
--input 'BadZipFileのAliasは、古い Python バージョンとの互換性のために。' \
--target_lang 'code' \
--source_prefix 'translate Japanese to Python: \n' \
--max_length 1024 \
--num_beams 3 \
--device 'gpu'
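The `--num_beams` flag above controls beam-search decoding. As a self-contained toy illustration of what the beam width means (this is not the model's actual decoder; the per-step log-probabilities here are made up):

```python
def beam_search(step_scores, num_beams):
    """Toy beam search. step_scores[t][token] is the log-probability of
    emitting `token` at step t. At every step, only the `num_beams`
    highest-scoring partial sequences are kept, which is exactly what
    the --num_beams flag bounds in real seq2seq decoding."""
    beams = [([], 0.0)]  # (token sequence, accumulated log-prob)
    for scores in step_scores:
        # Expand every surviving beam by every possible next token.
        candidates = [
            (seq + [tok], logp + scores[tok])
            for seq, logp in beams
            for tok in range(len(scores))
        ]
        # Keep only the top `num_beams` candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

# Two decoding steps over a 3-token vocabulary, beam width 3:
best_seq, best_logp = beam_search(
    [[-0.1, -2.0, -3.0], [-1.0, -0.2, -2.5]], num_beams=3
)[0]
```

A wider beam explores more candidate translations at each step at the cost of slower decoding; `num_beams=1` degenerates to greedy search.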
Explanation of the command-line arguments:

- input: the input text sequence.
- target_lang: the target language; either 'text' or 'code'.
- source_prefix: the prompt prepended to the input.
- max_length: the maximum length of the input/output text.
- num_beams: the beam size kept at each decoding step (for beam search).
- device: the device to run on; either 'cpu' or 'gpu'.

@inproceedings{chai-etal-2023-ernie,
title = "{ERNIE}-Code: Beyond {E}nglish-Centric Cross-lingual Pretraining for Programming Languages",
author = "Chai, Yekun and
Wang, Shuohuan and
Pang, Chao and
Sun, Yu and
Tian, Hao and
Wu, Hua",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-acl.676",
pages = "10628--10650",
abstract = "Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.",
}