# BioSyn

Biomedical Entity Representations with Synonym Marginalization

This repository contains the code for learning biomedical entity representations, together with the original project code and the project paper. It provides the MindSpore training and evaluation code for BioSyn, along with the related environment configuration.
- python 3.7.5
- numpy
- tqdm
- scikit-learn
- pytorch 1.12.0
- transformers 4.11.3
- mindspore-gpu 1.8.1
Note that the PyTorch version must match your CUDA version; this project was trained and tested on a V100 GPU with CUDA 11.1.

The BERT model used in this project is built on Cybertron (a MindSpore implementation of Transformers). Because parts of the source code were modified, the Cybertron code is copied directly into this project directory; see the link for Cybertron installation and usage.
The datasets consist of two parts: queries (train, dev, test, and traindev) and dictionaries (train_dictionary, dev_dictionary, and test_dictionary). The dictionaries differ in coverage: test_dictionary additionally contains the entity mentions from the train and dev sets, and dev_dictionary contains the mentions from the train set, to increase coverage. Query preprocessing includes lowercasing, punctuation removal, composite-mention resolution, and abbreviation resolution (with Ab3P). Dictionary preprocessing includes lowercasing and punctuation removal.
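The lowercasing and punctuation-removal steps applied to both queries and dictionaries can be sketched as follows (`basic_preprocess` is a hypothetical helper for illustration, not a function from the BioSyn codebase):

```python
import string

def basic_preprocess(text: str) -> str:
    """Lowercase a mention and strip punctuation (illustrative sketch,
    not the project's actual preprocessing code)."""
    text = text.lower()
    # Replace each punctuation character with a space, then collapse whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.translate(table).split())

print(basic_preprocess("Ataxia-Telangiectasia, variant"))
# ataxia telangiectasia variant
```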
The development set (dev) is used to search for hyperparameters; the final model is then trained and evaluated on the traindev (train + dev) split. The following datasets are used for training and evaluation:

The three datasets above can also be obtained directly from this project's dataset page.

The following command fine-tunes BioBERTv1.1 on the NCBI-Disease dataset (train + dev):
```shell
MODEL_NAME_OR_PATH=dmis-lab/biobert-base-cased-v1.1
OUTPUT_DIR=./tmp/biosyn-biobert-ncbi-disease
DATA_DIR=./datasets/ncbi-disease

CUDA_VISIBLE_DEVICES=0 python train_ms.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --train_dictionary_path ${DATA_DIR}/train_dictionary.txt \
    --train_dir ${DATA_DIR}/processed_traindev \
    --output_dir ${OUTPUT_DIR} \
    --use_cuda \
    --topk 20 \
    --epoch 10 \
    --train_batch_size 16 \
    --learning_rate 1e-5 \
    --max_length 25
```
You can train on 'processed_train' and evaluate on 'processed_dev' to search for hyperparameters (the '--save_checkpoint_all' flag may be helpful).

The following script evaluates the trained model on the NCBI-Disease test set:
```shell
MODEL_NAME_OR_PATH=./tmp/biosyn-biobert-ncbi-disease-th
OUTPUT_DIR=./tmp/biosyn-biobert-ncbi-disease-th
DATA_DIR=./datasets/ncbi-disease

python eval_ms.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --dictionary_path ${DATA_DIR}/test_dictionary.txt \
    --data_dir ${DATA_DIR}/processed_test \
    --output_dir ${OUTPUT_DIR} \
    --use_cuda \
    --topk 20 \
    --max_length 25 \
    --save_predictions
```
The model's predictions are saved in 'predictions_eval.json', which records the entity mentions, the candidate entities, and the model's accuracy over the whole dataset (the '--save_predictions' flag must be enabled).

Here is an example:
```json
{
"queries": [
{
"mentions": [
{
"mention": "ataxia telangiectasia",
"golden_cui": "D001260",
"candidates": [
{
"name": "ataxia telangiectasia",
"cui": "D001260|208900",
"label": 1
},
{
"name": "ataxia telangiectasia syndrome",
"cui": "D001260|208900",
"label": 1
},
{
"name": "ataxia telangiectasia variant",
"cui": "C566865",
"label": 0
},
{
"name": "syndrome ataxia telangiectasia",
"cui": "D001260|208900",
"label": 1
},
{
"name": "telangiectasia",
"cui": "D013684",
"label": 0
}]
}]
},
...
],
"acc1": 0.9114583333333334,
"acc5": 0.9385416666666667
}
```
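Given output of this shape, Acc@1 and Acc@5 can be recomputed from the candidate labels: a mention counts as correct at rank k if any of its top-k candidates has `label == 1`. A minimal sketch, using only the field names shown in the example above:

```python
def topk_accuracy(queries: list, k: int) -> float:
    """Fraction of mentions whose top-k candidates contain a
    correct entity (label == 1). Field names follow the
    predictions-file example above."""
    hits, total = 0, 0
    for query in queries:
        for mention in query["mentions"]:
            total += 1
            if any(c["label"] == 1 for c in mention["candidates"][:k]):
                hits += 1
    return hits / total if total else 0.0

# Toy input: one mention whose correct candidate is ranked second.
example = [{"mentions": [{"candidates": [
    {"label": 0}, {"label": 1}, {"label": 0}]}]}]
print(topk_accuracy(example, 1))  # 0.0
print(topk_accuracy(example, 5))  # 1.0
```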
Model | Dataset | Acc@1/Acc@5 | Framework
---|---|---|---
biosyn-biobert-ncbi-disease-th | NCBI-disease | 90.2/94.1 | PyTorch
biosyn-biobert-ncbi-disease-ms | NCBI-disease | 92.3/94.6 | MindSpore
biosyn-biobert-bc5cdr-disease-th | bc5cdr-disease | 92.6/96.2 | PyTorch
biosyn-biobert-bc5cdr-disease-ms | bc5cdr-disease | 93.5/96.4 | MindSpore
biosyn-biobert-bc5cdr-chemical-th | bc5cdr-chemical | 90.1/94.1 | PyTorch