BioERP: a biomedical heterogeneous network-based self-supervised representation learning approach for entity relationship predictions.
BioERP is tested to work under:
To set up BioERP:

- Download the source code of BERT.
- Manually replace run_pretraining.py and run_classifier.py in BERT with the files provided in this repository, as explained below.
The network representation model and training regime in BioERP are similar to the original implementation described in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". The network-representation code of BioERP can therefore be downloaded from https://github.com/google-research/bert. BERT pre-trains with a combination of two tasks, i.e., masked language modeling and consecutive-sentence classification. However, unlike sentences in natural language, meta paths have no consecutive relationship, so BioERP drops the consecutive-sentence training and keeps only the masked objective. To run BioERP, please manually replace run_pretraining.py and run_classifier.py in BERT with these files.
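Because only the masked objective remains, each meta path can simply be serialized as one "sentence" per line for BERT's preprocessing. A minimal sketch of a plausible metapath.txt layout (the node IDs and the one-document-per-path choice are illustrative assumptions, not the authors' exact format):

```python
# Hypothetical example of metapath.txt: each meta-path sample is written as
# one space-separated sequence of node tokens per line, matching BERT's
# "one sentence per line" input convention. Node IDs are placeholders.
metapaths = [
    ["drug_1", "protein_3", "disease_2", "protein_5"],
    ["drug_2", "protein_1", "drug_4"],
]

with open("metapath.txt", "w") as f:
    for path in metapaths:
        f.write(" ".join(path) + "\n")
        # Blank line: BERT's preprocessing treats blank lines as document
        # boundaries; since BioERP drops the sentence-pair task, each meta
        # path can be its own "document".
        f.write("\n")
```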
Download the BERT-Base, Uncased model: 12-layer, 768-hidden, 12-heads.
You can construct a vocab file (vocab.txt) of nodes and modify the config file (bert_config.json), which specifies the hyperparameters of the model.
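As a sketch of what these two files might look like, the snippet below writes a vocab.txt containing BERT's special tokens plus one node ID per line, and a bert_config.json with the BERT-Base hyperparameters. The node names are placeholders, and you may want to shrink the model for small networks:

```python
import json

# Hypothetical node identifiers; use the real entity IDs of your network.
nodes = ["drug_1", "drug_2", "protein_1", "protein_2", "disease_1"]

# BERT expects these special tokens to be present in vocab.txt.
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

with open("vocab.txt", "w") as f:
    f.write("\n".join(special_tokens + nodes) + "\n")

# BERT-Base hyperparameters; vocab_size must match the vocab file, and
# max_position_embeddings must be >= the --max_seq_length you train with.
config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": len(special_tokens) + len(nodes),
}
with open("bert_config.json", "w") as f:
    json.dump(config, f, indent=2)
```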
Run create_pretraining_data.py to mask the meta-path samples:
```
python create_pretraining_data.py \
  --input_file=~path/metapath.txt \
  --output_file=~path/tf_examples.tfrecord \
  --vocab_file=~path/uncased_L-12_H-768_A-12/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
```
max_predictions_per_seq is the maximum number of masked meta-path predictions per path sample, and masked_lm_prob is the probability that each token is masked.
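As a rough illustration of what these two flags control, BERT-style masking selects about masked_lm_prob of the tokens in each sample (capped at max_predictions_per_seq); of the selected positions, 80% become [MASK], 10% are swapped for a random token, and 10% are left unchanged. A simplified sketch of that rule, not the actual create_pretraining_data.py code:

```python
import random

def mask_metapath(tokens, vocab, masked_lm_prob=0.15,
                  max_predictions_per_seq=20, seed=12345):
    """Simplified BERT-style masking for one meta-path token sequence."""
    rng = random.Random(seed)
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    positions = sorted(rng.sample(range(len(tokens)), num_to_predict))
    masked = list(tokens)
    labels = [tokens[p] for p in positions]  # original tokens to predict
    for p in positions:
        r = rng.random()
        if r < 0.8:                       # 80%: replace with [MASK]
            masked[p] = "[MASK]"
        elif r < 0.9:                     # 10%: replace with a random node
            masked[p] = rng.choice(vocab)
        # remaining 10%: keep the original token
    return masked, positions, labels

# Example: mask a 5-node path; about 15% of positions (at least one,
# at most 20) become prediction targets.
print(mask_metapath(["drug_1", "protein_3", "disease_2", "protein_5", "drug_4"],
                    vocab=["drug_1", "protein_3", "disease_2"]))
```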
Run run_pretraining.py to train the local representation learning model:

```
python run_pretraining.py \
  --input_file=~path/tf_examples.tfrecord \
  --output_dir=~path/Local_RLearing_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=~path/uncased_L-12_H-768_A-12/bert_config.json \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20000 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
```
Run run_classifier.py to train the global representation learning model:

```
python run_classifier.py \
  --task_name=CoLA \
  --do_train=true \
  --do_eval=true \
  --data_dir=~path/all_path \
  --vocab_file=~path/vocab.txt \
  --bert_config_file=~path/bert_config.json \
  --max_seq_length=128 \
  --train_batch_size=256 \
  --learning_rate=2e-5 \
  --num_train_epochs=10 \
  --output_dir=~path/Global_RLearing_output
```
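--task_name=CoLA selects the stock ColaProcessor, which reads tab-separated train.tsv and dev.tsv files from --data_dir, taking the label from the second column and the text from the fourth. Since BioERP ships its own run_classifier.py, the replacement may expect something different; the sketch below only illustrates the stock CoLA layout, with hypothetical labeled path samples:

```python
import csv
import os

# Hypothetical labeled path samples: label 1 = the entity relationship
# holds, 0 = it does not. Replace with real positive/negative samples.
samples = [
    ("drug_1 protein_3 disease_2", 1),
    ("drug_2 protein_1 disease_5", 0),
]

os.makedirs("all_path", exist_ok=True)
with open("all_path/train.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for i, (path, label) in enumerate(samples):
        # The stock ColaProcessor reads line[1] as the label and line[3]
        # as the text; columns 0 and 2 are otherwise unused.
        writer.writerow([f"sample-{i}", label, "a", path])
```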
Run extract_features.py to extract the learned node representations:

```
python extract_features.py \
  --input_file=~path/node.txt \
  --output_file=~path/output.jsonl \
  --vocab_file=~path/uncased_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=~path/uncased_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=~path/Local_RLearing_output/model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=7 \
  --batch_size=8
```

Use --init_checkpoint=~path/Global_RLearing_output/model.ckpt instead to extract the global representations.
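Assuming the standard output schema of BERT's extract_features.py (one JSON record per input line, with per-token "layers" entries), the snippet below is one way, not necessarily the authors', to turn output.jsonl into a single vector per node by averaging the four extracted layers:

```python
import json
import numpy as np

# Collect one embedding per node by averaging the four extracted layers.
embeddings = {}
with open("output.jsonl") as f:
    for line in f:
        record = json.loads(line)
        for feat in record["features"]:
            token = feat["token"]
            if token in ("[CLS]", "[SEP]"):
                continue  # skip BERT's special tokens
            layer_vecs = [np.array(layer["values"]) for layer in feat["layers"]]
            embeddings[token] = np.mean(layer_vecs, axis=0)

print(f"extracted {len(embeddings)} node vectors of size "
      f"{next(iter(embeddings.values())).shape[0]}")
```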
Finally, run the entity relationship prediction script:

```
python TDI_NeoDTI.py
```
```
@article{BioERP2021,
  title   = {BioERP: biomedical heterogeneous network-based self-supervised representation learning approach for entity relationship predictions},
  author  = {Wang, Xiaoqi and Yang, Yaning and Li, Kenli and Li, Wentao and Li, Fei and Peng, Shaoliang},
  journal = {Bioinformatics},
  year    = {2021},
  doi     = {10.1093/bioinformatics/btab565}
}
```
If you have any questions or comments, please feel free to email: xqw@hnu.edu.cn.