DeepWalk Example

Paper link: here
Other implementations: gensim, deepwalk-c

The implementation includes multi-process training on CPU and mixed training on CPU and multiple GPUs.

Dependencies

PyTorch 1.5.0+

Tested version

PyTorch 1.5.0
DGL 0.5.0

Input data

Currently, we support two built-in datasets: youtube and blog. Use --data_file youtube to select the youtube dataset and --data_file blog to select the blog dataset.
The data is available at https://data.dgl.ai/dataset/DeepWalk/youtube.zip and https://data.dgl.ai/dataset/DeepWalk/blog.zip
youtube.zip contains youtube-net.txt, youtube-vocab.txt and youtube-label.txt; blog.zip contains blog-net.txt, blog-vocab.txt and blog-label.txt.

For other datasets, pass the full path of the network file to the trainer through --data_file. The network file should list one edge per line as a pair of node ids:

1(node id) 2(node id)
1 3
1 4
2 4
...
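As an illustration of the format above, here is a minimal sketch of a reader for such a network file. It assumes node ids are integers separated by whitespace, as in the sample; the function name read_edge_list is hypothetical and is not part of this example's code.

```python
import io


def read_edge_list(lines):
    """Parse whitespace-separated 'src dst' pairs, one edge per line."""
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        edges.append((int(parts[0]), int(parts[1])))
    return edges


# The sample edges from the format description, held in memory instead of a file.
sample = io.StringIO("1 2\n1 3\n1 4\n2 4\n")
edges = read_edge_list(sample)
nodes = sorted({node for edge in edges for node in edge})

print(edges)
print(nodes)
```

With a real file, the same function can be called on an open file handle, for example `read_edge_list(open(path))`.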

How to run the code

To run the code:

python3 deepwalk.py --data_file youtube --output_emb_file emb.txt --mix --lr 0.2 --gpus 0 1 2 3 --batch_size 100 --negative 5

How to save the embedding

By default, the trained embedding is saved to the file given by --output_emb_file FILE_NAME as a NumPy object.
To save the trained embedding in plain-text (txt) format, use the --save_in_txt argument.
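A minimal sketch of reading such a file back, assuming the default output is a single array written with numpy.save (this is an assumption about the saved format, not something stated above). The array contents and the file name emb.npy are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Stand-in for a trained embedding: 4 nodes, 8 dimensions (made-up values).
embedding = np.random.rand(4, 8).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "emb.npy")
    np.save(path, embedding)  # write in NumPy's binary format
    loaded = np.load(path)    # read back as an array of shape (num_nodes, dim)

print(loaded.shape)
```

Note that numpy.save appends a .npy suffix when the given file name does not already end with one, so the file on disk may be named differently from the argument passed on the command line.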

Evaluation

To evaluate the embedding on multi-label classification, please refer to here

YouTube (1M nodes).

Implementation	Macro-F1 (%) 1% 3% 5% 7% 9%	Micro-F1 (%) 1% 3% 5% 7% 9%
gensim.word2vec(hs)	28.73 32.51 33.67 34.28 34.79	35.73 38.34 39.37 40.08 40.77
gensim.word2vec(ns)	28.18 32.25 33.56 34.60 35.22	35.35 37.69 38.08 40.24 41.09
ours	24.58 31.23 33.97 35.41 36.48	38.93 43.17 44.73 45.42 45.92

The running-time comparison is shown below; the numbers in parentheses denote the time spent on random walks.

Implementation	gensim.word2vec(hs)	gensim.word2vec(ns)	Ours
Time (s)	27119.6(1759.8)	10580.3(1704.3)	428.89

Parameters.

walk_length = 80, number_walks = 10, window_size = 5
Ours: 4 GPUs (Tesla V100), lr = 0.2, batch_size = 128, neg_weight = 5, negative = 1, num_thread = 4
Others: workers = 8, negative = 5
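To make the roles of walk_length, number_walks and window_size concrete, here is a small sketch. It assumes uniform random walks over an adjacency list and a symmetric context window, which is the usual DeepWalk setup; the function names are hypothetical and this is not the code in deepwalk.py, whose exact conventions may differ.

```python
import random


def random_walk(adj, start, walk_length, rng):
    """Uniform random walk visiting up to walk_length nodes."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adj[walk[-1]]
        if not neighbors:
            break  # dead end: stop early
        walk.append(rng.choice(neighbors))
    return walk


def skipgram_pairs(walk, window_size):
    """(center, context) pairs within window_size steps on either side."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# Toy undirected graph (assumed for illustration).
adj = {1: [2, 3, 4], 2: [1, 4], 3: [1], 4: [1, 2]}
walk_length, number_walks, window_size = 80, 10, 5

rng = random.Random(0)
# number_walks walks are started from every node.
walks = [
    random_walk(adj, node, walk_length, rng)
    for _ in range(number_walks)
    for node in adj
]
pairs = skipgram_pairs(walks[0], window_size)

print(len(walks), len(walks[0]), len(pairs))
```

The pairs produced this way are the positive training examples of a skip-gram model; negative sampling, which the negative and neg_weight settings refer to, is not shown.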

Speedup with mixed CPU & multi-GPU training. The parameters are the same as above.

#GPUs	1	2	4
Time (s)	1419.64	952.04	428.89
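The timings above can be turned into speedup and parallel-efficiency figures with a few lines of arithmetic. The sketch below uses only the numbers from the table; efficiency is defined here as speedup divided by the number of GPUs.

```python
# Running time in seconds per number of GPUs, copied from the table above.
times = {1: 1419.64, 2: 952.04, 4: 428.89}

baseline = times[1]
speedup = {gpus: baseline / t for gpus, t in times.items()}
efficiency = {gpus: speedup[gpus] / gpus for gpus in times}

for gpus in sorted(times):
    print(
        f"{gpus} GPU(s): speedup {speedup[gpus]:.2f}x, "
        f"efficiency {efficiency[gpus]:.0%}"
    )
```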

OGB Dataset

How to load ogb data

You can run the code directly with:

python3 deepwalk.py --ogbl_name xxx --load_from_ogbl

However, ogb.linkproppred might not be compatible with mixed training on multiple GPUs. If you want to use mixed training with the command above, please use no more than one GPU.

Evaluation

For evaluation, we follow the mlp.py code provided by ogb here.
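The results further down are reported as Hits@K and MRR. The sketch below shows one common way to compute these metrics from model scores, and it reflects our understanding of how link prediction is usually evaluated rather than the exact ogb evaluator code. The function names and the toy scores are made up for illustration.

```python
def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positive edges scored above the k-th highest negative score."""
    if len(neg_scores) < k:
        return 1.0  # fewer than k negatives: every positive counts as a hit
    kth_neg = sorted(neg_scores, reverse=True)[k - 1]
    return sum(score > kth_neg for score in pos_scores) / len(pos_scores)


def mean_reciprocal_rank(pos_scores, neg_scores_per_pos):
    """Average of 1 / rank of each positive among its own negatives."""
    total = 0.0
    for pos, negs in zip(pos_scores, neg_scores_per_pos):
        rank = 1 + sum(neg >= pos for neg in negs)  # ties count against the positive
        total += 1.0 / rank
    return total / len(pos_scores)


# Toy scores (hypothetical): higher means "more likely to be an edge".
pos = [0.9, 0.6, 0.3]
neg = [0.8, 0.5, 0.4, 0.1]
negs_per_pos = [[0.8, 0.1], [0.8, 0.5], [0.8, 0.5]]

hits1 = hits_at_k(pos, neg, 1)
hits2 = hits_at_k(pos, neg, 2)
mrr = mean_reciprocal_rank(pos, negs_per_pos)
print(hits1, hits2, mrr)
```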

Used config

ogbl-collab

python3 deepwalk.py --ogbl_name ogbl-collab --load_from_ogbl --save_in_pt --output_emb_file collab-embedding.pt --num_walks 50 --window_size 2 --walk_length 40 --lr 0.1 --negative 1 --neg_weight 1 --lap_norm 0.01 --mix --gpus 0 --num_threads 4 --print_interval 2000 --print_loss --batch_size 128 --use_context_weight
cd ./ogb/blob/master/examples/linkproppred/collab/
cp embedding_pt_file_path ./
python3 mlp.py --device 0 --runs 10 --use_node_embedding

ogbl-ddi

python3 deepwalk.py --ogbl_name ogbl-ddi --load_from_ogbl --save_in_pt --output_emb_file ddi-embedding.pt --num_walks 50 --window_size 2 --walk_length 80 --lr 0.1 --negative 1 --neg_weight 1 --lap_norm 0.05 --only_gpu --gpus 0 --num_threads 4 --print_interval 2000 --print_loss --batch_size 16 --use_context_weight
cd ./ogb/blob/master/examples/linkproppred/ddi/
cp embedding_pt_file_path ./
python3 mlp.py --device 0 --runs 10 --epochs 100

ogbl-ppa

python3 deepwalk.py --ogbl_name ogbl-ppa --load_from_ogbl --save_in_pt --output_emb_file ppa-embedding.pt --negative 1 --neg_weight 1 --batch_size 64 --print_interval 2000 --print_loss --window_size 1 --num_walks 30 --walk_length 80 --lr 0.1 --lap_norm 0.02 --mix --gpus 0 --num_threads 4
cp embedding_pt_file_path ./
python3 mlp.py --device 2 --runs 10

ogbl-citation

python3 deepwalk.py --ogbl_name ogbl-citation --load_from_ogbl --save_in_pt --output_emb_file embedding.pt --window_size 2 --num_walks 10 --negative 1 --neg_weight 1 --walk_length 80 --batch_size 128 --print_loss --print_interval 1000 --mix --gpus 0 --use_context_weight --num_threads 4 --lap_norm 0.01 --lr 0.1
cp embedding_pt_file_path ./
python3 mlp.py --device 2 --runs 10 --use_node_embedding

OGBL Results

ogbl-collab

#params: 61258346(model) + 131841(mlp) = 61390187

Hits@10

Highest Train: 74.83 ± 4.79

Highest Valid: 40.03 ± 2.98

Final Train: 74.51 ± 4.92

Final Test: 31.13 ± 2.47

Hits@50

Highest Train: 98.83 ± 0.15

Highest Valid: 60.61 ± 0.32

Final Train: 98.74 ± 0.17

Final Test: 50.37 ± 0.34

Hits@100

Highest Train: 99.86 ± 0.04

Highest Valid: 66.64 ± 0.32

Final Train: 99.84 ± 0.06

Final Test: 56.88 ± 0.37

ogbl-ddi

#params: 1444840(model) + 99073(mlp) = 1543913

Hits@10

Highest Train: 33.91 ± 2.01

Highest Valid: 30.96 ± 1.89

Final Train: 33.90 ± 2.00

Final Test: 15.16 ± 4.28

Hits@20

Highest Train: 44.64 ± 1.71

Highest Valid: 41.32 ± 1.69

Final Train: 44.62 ± 1.69

Final Test: 26.42 ± 6.10

Hits@30

Highest Train: 51.01 ± 1.72

Highest Valid: 47.64 ± 1.71

Final Train: 50.99 ± 1.72

Final Test: 33.56 ± 3.95

ogbl-ppa

#params: 150024820(model) + 113921(mlp) = 150138741

Hits@10

Highest Train: 4.78 ± 0.73

Highest Valid: 4.30 ± 0.68

Final Train: 4.77 ± 0.73

Final Test: 2.67 ± 0.42

Hits@50

Highest Train: 18.82 ± 1.07

Highest Valid: 17.26 ± 1.01

Final Train: 18.82 ± 1.07

Final Test: 17.34 ± 2.09

Hits@100

Highest Train: 31.29 ± 2.11

Highest Valid: 28.97 ± 1.92

Final Train: 31.28 ± 2.12

Final Test: 28.88 ± 1.53

ogbl-citation

#params: 757811178(model) + 131841(mlp) = 757943019

MRR

Highest Train: 0.9381 ± 0.0003

Highest Valid: 0.8469 ± 0.0003

Final Train: 0.9377 ± 0.0004

Final Test: 0.8479 ± 0.0003

Notes

Multi-GPU issues

For efficiency, the results on ogbl-collab, ogbl-ppa and ogbl-ddi are obtained with multi-GPU training. Since ogb is somehow incompatible with our multi-GPU implementation, some preprocessing is needed first. The command is:

python3 load_dataset.py --name dataset_name

It writes a data file to local disk. For example, if dataset_name is ogbl-collab, a file named ogbl-collab-net.txt is generated. Then run

python3 deepwalk.py --data_file data_file_path

where the other parameters are the same as in the used configs, except that --load_from_ogbl and --ogbl_name are omitted.
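The flag substitution described above can be sketched as a small helper that rewrites a used-config command into its --data_file form. The function name and the shortened example command are hypothetical; only the flags themselves come from this README.

```python
import shlex


def to_data_file_command(command, data_file):
    """Drop --load_from_ogbl and --ogbl_name NAME, then append --data_file."""
    kept = []
    skip_next = False
    for arg in shlex.split(command):
        if skip_next:
            skip_next = False  # this is the value that followed --ogbl_name
            continue
        if arg == "--load_from_ogbl":
            continue
        if arg == "--ogbl_name":
            skip_next = True
            continue
        kept.append(arg)
    kept += ["--data_file", data_file]
    return " ".join(shlex.quote(arg) for arg in kept)


# Shortened example command, for illustration only.
ogb_command = (
    "python3 deepwalk.py --ogbl_name ogbl-collab --load_from_ogbl "
    "--save_in_pt --output_emb_file collab-embedding.pt --mix --gpus 0 1"
)
rewritten = to_data_file_command(ogb_command, "ogbl-collab-net.txt")
print(rewritten)
```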

Others

The performance on ogbl-ddi and ogbl-ppa can be unstable.
