|
|
@@ -0,0 +1,504 @@ |
|
|
|
{ |
|
|
|
"cells": [ |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"id": "9a583d55", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"# 飞桨端到端FAQ智能问答系统\n", |
|
|
|
"文档:https://openi.pcl.ac.cn/PaddlePaddle/PaddleNLP/src/branch/develop/pipelines/examples/FAQ" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"id": "ed49d413", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"## 系统特色\n", |
|
|
|
"\n", |
|
|
|
"* 端到端\n", |
|
|
|
"\n", |
|
|
|
"提供包括数据建库、模型服务部署、WebUI 可视化一整套端到端FAQ智能问答系统能力\n", |
|
|
|
"\n", |
|
|
|
"多源数据支持: 支持对 Txt、Word、PDF、Image 多源数据进行解析、识别并写入 ANN 数据库\n", |
|
|
|
"* 效果好\n", |
|
|
|
"\n", |
|
|
|
"依托百度领先的NLP技术,包括ERNIE语义理解技术与RocketQA开放域问答技术\n", |
|
|
|
"\n", |
|
|
|
"预置领先的深度学习模型\n", |
|
|
|
"\n", |
|
|
|
"## 首先环境配置\n", |
|
|
|
"\n", |
|
|
|
"镜像已经安装好飞桨和paddleNLP:192.168.242.22:443/default-workspace/fccb038c23234b9e80105d4ccd152117/image:xmm\n", |
|
|
|
"\n", |
|
|
|
"若没有安装好环境,可以参考下面步骤:\n", |
|
|
|
"\n", |
|
|
|
"升级飞桨到2.5.1" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "b2678514", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"# !python -m pip install paddlepaddle-gpu==2.5.1.post102 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html\n", |
|
|
|
"!pip uninstall paddlepaddle-gpu -y\n", |
|
|
|
"!python -m pip install paddlepaddle-gpu==2.5.1.post117 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "ead664f5", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"# 如果没有git,就要安装git \n", |
|
|
|
"!apt update\n", |
|
|
|
"!apt install git -y" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"id": "3c2223e6", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"### 安装pipelines非常容易卡住,建议分步分库安装" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "08a82b58", |
|
|
|
"metadata": { |
|
|
|
"scrolled": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"# 下载PaddleNLP库文件\n", |
|
|
|
"!git clone https://openi.pcl.ac.cn/PaddlePaddle/PaddleNLP.git\n", |
|
|
|
"# !pip uninstall paddlenlp paddle-pipelines -y\n", |
|
|
|
"%cd /code/PaddleNLP/\n", |
|
|
|
"!pip install -r requirements.txt -i https://mirror.baidu.com/pypi/simple -q\n", |
|
|
|
"!python setup.py install " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "f8e6a204", |
|
|
|
"metadata": { |
|
|
|
"scrolled": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"%cd /code/PaddleNLP/pipelines\n", |
|
|
|
"!pip install -r requirements.txt -i https://mirror.baidu.com/pypi/simple \n", |
|
|
|
"!python setup.py install " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"id": "9675e387", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"### 将PaddleNLP/pipelines/requirements.txt 文件拆分成多个文件安装" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "642ea577", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!pip install -r /code/work/rq1.txt -i https://mirror.baidu.com/pypi/simple " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "0f834717", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!pip install preshed " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "3f0a2b2b", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!pip install -r /code/work/rq2.txt -i https://mirror.baidu.com/pypi/simple " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "719231ae", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!pip install -r /code/work/rq3.txt -i https://mirror.baidu.com/pypi/simple " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "0209100c", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!pip install fastapi uvicorn markdown numba -i https://mirror.baidu.com/pypi/simple " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "d2b8eb6c", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!pip install pymilvus>=2.1 wordcloud==1.8.2.2 boilerpy3 events -i https://mirror.baidu.com/pypi/simple \n" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "29ccb336", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!pip install sseclient-py==1.7.2 -i https://mirror.baidu.com/pypi/simple " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "096cbed5", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!pip install typing_extensions==4.5 -i https://mirror.baidu.com/pypi/simple " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "10ee5ed0", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!pip install spacy -i https://mirror.baidu.com/pypi/simple " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "ea4fcede", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"现在看看飞桨相关库是否安装好\n", |
|
|
|
"\n", |
|
|
|
"有时候需要重启内核" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "9a1af1de", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!pip list |grep paddle" |
|
|
|
] |
|
|
|
}, |
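{ |
"cell_type": "markdown", |
"id": "sketch-pkg-versions", |
"metadata": {}, |
"source": [ |
"The `pip list` check above can also be run from Python. The following is a minimal sketch, assuming Python 3.8 or later (for `importlib.metadata`); `installed_version` is a hypothetical helper name, and the package names in the loop are the ones the install cells in this notebook refer to.\n", |
"\n", |
"```python\n", |
"from importlib.metadata import PackageNotFoundError, version\n", |
"\n", |
"\n", |
"def installed_version(package):\n", |
"    # Return the installed version string, or None when the package is missing.\n", |
"    try:\n", |
"        return version(package)\n", |
"    except PackageNotFoundError:\n", |
"        return None\n", |
"\n", |
"\n", |
"# Package names taken from the install cells in this notebook.\n", |
"for name in ('paddlepaddle-gpu', 'paddlenlp', 'paddle-pipelines', 'typing_extensions'):\n", |
"    print(name, installed_version(name) or 'not installed')\n", |
"```\n", |
"\n", |
"A missing entry here is a hint to rerun the matching install cell and then restart the kernel." |
] |
}, |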
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "2d04df77", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"import paddle" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "3458b6a2", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"paddle.randn??" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"id": "1af08e2a", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"## 端到端FAQ智能问答系统一键启动\n", |
|
|
|
"若能启动,证明整个系统环境正常,就可以后面的学习实践了。" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "95866015", |
|
|
|
"metadata": { |
|
|
|
"scrolled": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!cd /code/PaddleNLP/pipelines && python examples/FAQ/dense_faq_example.py --device gpu\n" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"id": "bba09de2", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"## 3.4 构建 Web 可视化FAQ智能问答\n", |
|
|
|
"\n", |
|
|
|
"整个 Web 可视化FAQ智能问答主要包含 3 大组件: 1. 基于 ElasticSearch 的 ANN 服务 2. 基于 RestAPI 构建模型服务 3. 基于 Streamlit 构建 WebUI,接下来我们依次搭建这 3 个服务并最终形成可视化的FAQ智能问答。\n", |
|
|
|
"\n", |
|
|
|
"3.4.1 启动 ANN 服务\n", |
|
|
|
"\n", |
|
|
|
"参考官方文档下载安装 elasticsearch-8.3.2 并解压。\n", |
|
|
|
"启动 ES 服务\n", |
|
|
|
"首先修改config/elasticsearch.yml的配置:\n", |
|
|
|
"xpack.security.enabled: false\n", |
|
|
|
"\n", |
|
|
|
"已下载并安装在work/elasticsearch-8.8.2目录\n", |
|
|
|
"\n", |
|
|
|
"然后启动:\n", |
|
|
|
"\n", |
|
|
|
"./bin/elasticsearch\n", |
|
|
|
"\n", |
|
|
|
"到安装目录执行上面命令" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "4553fbe6", |
|
|
|
"metadata": { |
|
|
|
"scrolled": true, |
|
|
|
"tags": [] |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"# 检查确保 ES 服务启动成功\n", |
|
|
|
"!curl http://localhost:9200/_aliases?pretty=true" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"id": "13769e5e", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"## 3.4.2 文档数据写入 ANN 索引库\n", |
|
|
|
"\n", |
|
|
|
"以保险数据集为例建立 ANN 索引库\n", |
|
|
|
"\n", |
|
|
|
"python utils/offline_ann.py --index_name insurance \\\n", |
|
|
|
" --doc_dir data/insurance \\\n", |
|
|
|
" --split_answers \\\n", |
|
|
|
" --delete_index\n", |
|
|
|
"\n", |
|
|
|
"参数含义说明\n", |
|
|
|
"* \n", |
|
|
|
"* index_name: 索引的名称\n", |
|
|
|
"* doc_dir: txt文本数据的路径\n", |
|
|
|
"* host: Elasticsearch的IP地址\n", |
|
|
|
"* port: Elasticsearch的端口号\n", |
|
|
|
"* split_answers: 是否切分每一行的数据为query和answer两部分\n", |
|
|
|
"* delete_index: 是否删除现有的索引和数据,用于清空es的数据,默认为false\n", |
|
|
|
"\n", |
|
|
|
"\n", |
|
|
|
"打印几条数据\n", |
|
|
|
"curl http://localhost:9200/insurance/_search\n", |
|
|
|
"会输出如下的示例结果:\n", |
|
|
|
"\n", |
|
|
|
"{\"took\":2,\"timed_out\":false,\"_shards\":{\"total\":1,\"successful\":1,\"ski" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "03f238fd", |
|
|
|
"metadata": { |
|
|
|
"scrolled": true, |
|
|
|
"tags": [] |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"# 以保险数据集为例建立 ANN 索引库\n", |
|
|
|
"\n", |
|
|
|
"# !cd ~/PaddleNLP/pipelines && python utils/offline_ann.py --index_name insurance \\\n", |
|
|
|
"# --doc_dir data/insurance \\\n", |
|
|
|
"# --split_answers \\\n", |
|
|
|
"# --delete_index" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "673c8f59", |
|
|
|
"metadata": { |
|
|
|
"scrolled": true, |
|
|
|
"tags": [] |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!curl http://localhost:9200/insurance/_search" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"id": "7f390f9e", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"## 3.4.3 启动 RestAPI 模型服务\n", |
|
|
|
"\n", |
|
|
|
" 指定FAQ智能问答系统的Yaml配置文件\n", |
|
|
|
"\n", |
|
|
|
"export PIPELINE_YAML_PATH=rest_api/pipeline/dense_faq.yaml\n", |
|
|
|
"\n", |
|
|
|
"使用端口号 8891 启动模型服务\n", |
|
|
|
"\n", |
|
|
|
"python rest_api/application.py 8891\n", |
|
|
|
"Linux 用户推荐采用 Shell 脚本来启动服务:\n", |
|
|
|
"\n", |
|
|
|
"sh examples/FAQ/run_faq_server.sh\n", |
|
|
|
"启动后可以使用curl命令验证是否成功运行:\n", |
|
|
|
"\n", |
|
|
|
"curl -X POST -k http://localhost:8891/query -H 'Content-Type: application/json' -d '{\"query\": \"企业如何办理养老保险?\",\"params\": {\"Retriever\": {\"top_k\": 5}, \"Ranker\":{\"top_k\": 5}}}'3.4.3 " |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "bd90a0f4", |
|
|
|
"metadata": { |
|
|
|
"scrolled": true, |
|
|
|
"tags": [] |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"# !export PIPELINE_YAML_PATH=rest_api/pipeline/dense_faq.yaml\n", |
|
|
|
"# # 使用端口号 8891 启动模型服务\n", |
|
|
|
"# !cd ~/PaddleNLP/pipelines && python rest_api/application.py 8891" |
|
|
|
] |
|
|
|
}, |
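{ |
"cell_type": "markdown", |
"id": "sketch-launch-rest-api", |
"metadata": {}, |
"source": [ |
"In Jupyter, each `!` command runs in its own subshell, so a variable set with `!export` is not visible to later commands. The following is a minimal sketch of one way around this, using only the standard library. It builds the launch command and environment for `rest_api/application.py` and only starts the process when `dry_run=False`. `launch_rest_api` and its parameters are hypothetical names; the script path, YAML path, and port come from the text above, and `/code/PaddleNLP/pipelines` is the directory used by the install cells.\n", |
"\n", |
"```python\n", |
"import os\n", |
"import subprocess\n", |
"import sys\n", |
"\n", |
"\n", |
"def launch_rest_api(pipelines_dir, yaml_path='rest_api/pipeline/dense_faq.yaml', port=8891, dry_run=True):\n", |
"    # Copy the current environment and add the variable exported in the text above.\n", |
"    env = dict(os.environ, PIPELINE_YAML_PATH=yaml_path)\n", |
"    cmd = [sys.executable, 'rest_api/application.py', str(port)]\n", |
"    if dry_run:\n", |
"        return cmd, env, None\n", |
"    # Start the server in the background so the notebook cell returns immediately.\n", |
"    proc = subprocess.Popen(cmd, cwd=pipelines_dir, env=env)\n", |
"    return cmd, env, proc\n", |
"\n", |
"\n", |
"cmd, env, _ = launch_rest_api('/code/PaddleNLP/pipelines')\n", |
"print(' '.join(cmd), env['PIPELINE_YAML_PATH'])\n", |
"```\n", |
"\n", |
"Calling it with `dry_run=False` returns a `Popen` handle that can later be stopped with `proc.terminate()`." |
] |
}, |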
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "244c35ac", |
|
|
|
"metadata": { |
|
|
|
"scrolled": true, |
|
|
|
"tags": [] |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!curl -X POST -k http://localhost:8891/query -H 'Content-Type: application/json' -d '{\"query\": \"企业如何办理养老保险?\",\"params\": {\"Retriever\": {\"top_k\": 5}, \"Ranker\":{\"top_k\": 5}}}'\n" |
|
|
|
] |
|
|
|
}, |
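{ |
"cell_type": "markdown", |
"id": "sketch-query-request", |
"metadata": {}, |
"source": [ |
"The same request can be sent from Python. The following is a minimal sketch using only the standard library; `build_query_request` is a hypothetical helper, the endpoint and payload layout mirror the curl command above, and the request is only built here, not sent.\n", |
"\n", |
"```python\n", |
"import json\n", |
"import urllib.request\n", |
"\n", |
"\n", |
"def build_query_request(query, top_k=5, base_url='http://localhost:8891'):\n", |
"    # Payload layout copied from the curl command above.\n", |
"    payload = {'query': query, 'params': {'Retriever': {'top_k': top_k}, 'Ranker': {'top_k': top_k}}}\n", |
"    data = json.dumps(payload, ensure_ascii=False).encode('utf-8')\n", |
"    headers = {'Content-Type': 'application/json'}\n", |
"    return urllib.request.Request(base_url + '/query', data=data, headers=headers, method='POST')\n", |
"\n", |
"\n", |
"req = build_query_request('企业如何办理养老保险?')\n", |
"print(req.get_method(), req.full_url)\n", |
"# To send it once the service is running:\n", |
"# with urllib.request.urlopen(req, timeout=30) as resp:\n", |
"#     print(json.loads(resp.read().decode('utf-8')))\n", |
"```" |
] |
}, |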
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"id": "1cfeb300", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"不明白为什么8891没连上" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "97ad68bc", |
|
|
|
"metadata": { |
|
|
|
"scrolled": true, |
|
|
|
"tags": [] |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [ |
|
|
|
"!netstat -an |grep 889" |
|
|
|
] |
|
|
|
}, |
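{ |
"cell_type": "markdown", |
"id": "sketch-port-check", |
"metadata": {}, |
"source": [ |
"A `Connection refused` error usually means nothing is listening on the port. As a complement to `netstat`, the following is a minimal sketch of a port check using only the standard `socket` module; `is_port_open` is a hypothetical helper name, and ports 9200 and 8891 are the ES and RestAPI ports used in this notebook.\n", |
"\n", |
"```python\n", |
"import socket\n", |
"\n", |
"\n", |
"def is_port_open(host, port, timeout=1.0):\n", |
"    # connect_ex returns 0 when the TCP connection succeeds.\n", |
"    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:\n", |
"        sock.settimeout(timeout)\n", |
"        return sock.connect_ex((host, port)) == 0\n", |
"\n", |
"\n", |
"for port in (9200, 8891):\n", |
"    print(port, 'open' if is_port_open('127.0.0.1', port) else 'closed')\n", |
"```\n", |
"\n", |
"If 8891 is reported as closed, the model service is not running, or not yet ready, and the curl query cannot succeed." |
] |
}, |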
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "cef87ba1", |
|
|
|
"metadata": { |
|
|
|
"scrolled": true |
|
|
|
}, |
|
|
|
"outputs": [], |
|
|
|
"source": [] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "markdown", |
|
|
|
"id": "ccaf3cee", |
|
|
|
"metadata": {}, |
|
|
|
"source": [ |
|
|
|
"# 调试\n", |
|
|
|
"## 8891链接报错curl: (7) Failed to connect to localhost port 8891: Connection refused\n", |
|
|
|
"启动 RestAPI 模型服务章节:\n", |
|
|
|
"```\n", |
|
|
|
"!curl -X POST -k http://localhost:8891/query -H 'Content-Type: application/json' -d '{\"query\": \"企业如何办理养老保险?\",\"params\": {\"Retriever\": {\"top_k\": 5}, \"Ranker\":{\"top_k\": 5}}}'\n", |
|
|
|
"\n", |
|
|
|
"报错:curl: (7) Failed to connect to localhost port 8891: Connection refused\n", |
|
|
|
"\n", |
|
|
|
"\n", |
|
|
|
"```\n", |
|
|
|
"## 报错cannot import name 'deprecated' from 'typing_extensions'\n", |
|
|
|
"File \"/opt/conda/lib/python3.7/site-packages/fastapi-0.100.1-py3.7.egg/fastapi/params.py\", line 6, in <module>\n", |
|
|
|
" from typing_extensions import Annotated, deprecated\n", |
|
|
|
"ImportError: cannot import name 'deprecated' from 'typing_extensions' (/opt/conda/lib/python3.7/site-packages/typing_extensions.py)\n", |
|
|
|
" \n", |
|
|
|
" 把typing_extensions从4.4升级到4.5,问题解决\n", |
|
|
|
" \n", |
|
|
|
"## 报错\n", |
|
|
|
" from typing import (\n", |
|
|
|
"ImportError: cannot import name 'TypedDict' from 'typing' (/opt/conda/lib/python3.7/typing.py)\n", |
|
|
|
" \n", |
|
|
|
" 这个是python3.7下会出的问题,现在python3.10,问题应该是已经解决了。" |
|
|
|
] |
|
|
|
}, |
|
|
|
{ |
|
|
|
"cell_type": "code", |
|
|
|
"execution_count": null, |
|
|
|
"id": "6b9aca69", |
|
|
|
"metadata": {}, |
|
|
|
"outputs": [], |
|
|
|
"source": [] |
|
|
|
} |
|
|
|
], |
|
|
|
"metadata": { |
|
|
|
"kernelspec": { |
|
|
|
"display_name": "Python 3 (ipykernel)", |
|
|
|
"language": "python", |
|
|
|
"name": "python3" |
|
|
|
}, |
|
|
|
"language_info": { |
|
|
|
"codemirror_mode": { |
|
|
|
"name": "ipython", |
|
|
|
"version": 3 |
|
|
|
}, |
|
|
|
"file_extension": ".py", |
|
|
|
"mimetype": "text/x-python", |
|
|
|
"name": "python", |
|
|
|
"nbconvert_exporter": "python", |
|
|
|
"pygments_lexer": "ipython3", |
|
|
|
"version": "3.10.10" |
|
|
|
} |
|
|
|
}, |
|
|
|
"nbformat": 4, |
|
|
|
"nbformat_minor": 5 |
|
|
|
} |