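Exploring Named-Entity Recognition (NER), Part 1
Named-entity recognition (NER), also known as entity identification, entity chunking, and entity extraction, is a subtask of information extraction. It aims to locate the named entities mentioned in unstructured text and classify them into predefined categories such as person names, organizations, locations, medical terms, time expressions, quantities, monetary values, and percentages.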
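Contents
- TensorFlow 1.x virtual environment setup
- Data collection and cleaning
- Automatic annotation: converting text into a format for deep learning
TensorFlow 1.x virtual environment setup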
Create a virtual environment
E:\>python -m venv 2020_vms_tensorflow_1
Activate the virtual environment
E:\>cd E:\2020_vms_tensorflow_1\Scripts
E:\2020_vms_tensorflow_1\Scripts>activate.bat
(2020_vms_tensorflow_1) E:\2020_vms_tensorflow_1\Scripts>
Install TensorFlow 1.x from the wheel tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>pip install tensorflow-1.15.0-cp36-cp36m-win_amd64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
Processing d:\2020_vir_tensorflow1\install_whl\tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
Collecting wheel>=0.26 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a7/00/3df031b3ecd5444d572141321537080b40c1c25e1caa3d86cdd12e5e919c/wheel-0.35.1-py2.py3-none-any.whl
Collecting tensorflow-estimator==1.15.1 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/de/62/2ee9cd74c9fa2fa450877847ba560b260f5d0fb70ee0595203082dafcc9d/tensorflow_estimator-1.15.1-py2.py3-none-any.whl
Collecting keras-applications>=1.0.8 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
100% |████████████████████████████████| 51kB 276kB/s
Collecting absl-py>=0.7.0 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b9/07/f69dd3367368ad69f174bfe426a973651412ec11d48ec05c000f19fe0561/absl_py-0.10.0-py3-none-any.whl (127kB)
100% |████████████████████████████████| 133kB 488kB/s
Collecting google-pasta>=0.1.6 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a3/de/c648ef6835192e6e2cc03f40b19eeda4382c49b5bafb43d88b931c4c74ac/google_pasta-0.2.0-py3-none-any.whl
Collecting keras-preprocessing>=1.0.5 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/79/4c/7c3275a01e12ef9368a892926ab932b33bb13d55794881e3573482b378a7/Keras_Preprocessing-1.1.2-py2.py3-none-any.whl (42kB)
100% |████████████████████████████████| 51kB 2.1MB/s
Collecting grpcio>=1.8.6 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/15/3f/f311f382bb658387fe78a30e1ed55193fe94c5e78b37abd134c34bd256eb/grpcio-1.31.0-cp36-cp36m-win_amd64.whl
Collecting gast==0.2.2 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/4e/35/11749bf99b2d4e3cceb4d55ca22590b0d7c2c62b9de38ac4a4a7f4687421/gast-0.2.2.tar.gz
Collecting protobuf>=3.6.1 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/f6/fe/9d8e70a86add02cb1ef35540ec03fd5b210d76323fe4645d7121b13ae33e/protobuf-3.13.0-cp36-cp36m-win_amd64.whl (1.1MB)
100% |████████████████████████████████| 1.1MB 99kB/s
Collecting astor>=0.6.0 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/c3/88/97eef84f48fa04fbd6750e62dcceafba6c63c81b7ac1420856c8dcc0a3f9/astor-0.8.1-py2.py3-none-any.whl
Collecting numpy<2.0,>=1.16.0 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/05/1d/d7b100264346a8722325987f10061b66d3c560bfb292f2c0254736e7531e/numpy-1.19.1-cp36-cp36m-win_amd64.whl (12.9MB)
100% |████████████████████████████████| 12.9MB 42kB/s
Collecting termcolor>=1.1.0 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Collecting opt-einsum>=2.3.2 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/bc/19/404708a7e54ad2798907210462fd950c3442ea51acc8790f3da48d2bee8b/opt_einsum-3.3.0-py3-none-any.whl (65kB)
100% |████████████████████████████████| 71kB 157kB/s
Collecting six>=1.10.0 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ee/ff/48bde5c0f013094d729fe4b0316ba2a24774b3ff1c52d924a8a4cb04078a/six-1.15.0-py2.py3-none-any.whl
Collecting wrapt>=1.11.1 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/82/f7/e43cefbe88c5fd371f4cf0cf5eb3feccd07515af9fd6cf7dbf1d1793a797/wrapt-1.12.1.tar.gz
Collecting tensorboard<1.16.0,>=1.15.0 (from tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl (3.8MB)
100% |████████████████████████████████| 3.8MB 90kB/s
Collecting h5py (from keras-applications>=1.0.8->tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/0b/fa/bee65d2dbdbd3611702aafd128139c53c90a1285f169ba5467aab252e27a/h5py-2.10.0-cp36-cp36m-win_amd64.whl (2.4MB)
100% |████████████████████████████████| 2.4MB 89kB/s
Requirement already satisfied: setuptools in e:\2020_vms_tensorflow_1\lib\site-packages (from protobuf>=3.6.1->tensorflow==1.15.0)
Collecting markdown>=2.6.8 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a4/63/eaec2bd025ab48c754b55e8819af0f6a69e2b1e187611dd40cbbe101ee7f/Markdown-3.2.2-py3-none-any.whl (88kB)
100% |████████████████████████████████| 92kB 138kB/s
Collecting werkzeug>=0.11.15 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/cc/94/5f7079a0e00bd6863ef8f1da638721e9da21e5bacee597595b318f71d62e/Werkzeug-1.0.1-py2.py3-none-any.whl (298kB)
100% |████████████████████████████████| 307kB 109kB/s
Collecting importlib-metadata; python_version < "3.8" (from markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
The installation then fails with an error:
Collecting zipp>=0.5 (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/b2/34/bfcb43cc0ba81f527bc4f40ef41ba2ff4080e047acb0586b56b3d017ace4/zipp-3.1.0-py3-none-any.whl
Building wheels for collected packages: wrapt
Running setup.py bdist_wheel for wrapt ... error
Failed building wheel for wrapt
Running setup.py clean for wrapt
Failed to build wrapt
Installing collected packages: wrapt, werkzeug, zipp, importlib-metadata, markdown, tensorboard, tensorflow
Running setup.py install for wrapt ... error
Exception:
Traceback (most recent call last):
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\compat\__init__.py", line 73, in console_to_str
return s.decode(sys.__stdout__.encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 44: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\basecommand.py", line 215, in main
status = self.run(options, args)
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\commands\install.py", line 342, in run
prefix=options.prefix_path,
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\req\req_set.py", line 784, in install
**kwargs
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\req\req_install.py", line 878, in install
spinner=spinner,
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\utils\__init__.py", line 676, in call_subprocess
line = console_to_str(proc.stdout.readline())
File "e:\2020_vms_tensorflow_1\lib\site-packages\pip\compat\__init__.py", line 75, in console_to_str
return s.decode('utf_8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa1 in position 44: invalid start byte
You are using pip version 9.0.1, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
Modify the code at line 73 of pip\compat\__init__.py (the file named in the traceback). The original code is:
if sys.version_info >= (3,):
    def console_to_str(s):
        try:
            return s.decode(sys.__stdout__.encoding)
        except UnicodeDecodeError:
            return s.decode('utf_8')
Change it to:
if sys.version_info >= (3,):
    def console_to_str(s):
        try:
            # return s.decode(sys.__stdout__.encoding)
            return s.decode('cp936')
        except UnicodeDecodeError:
            return s.decode('utf_8')
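The patch works because the console output of the failed build was not UTF-8: cp936 is the Windows code page for Simplified Chinese (GBK), and a byte such as 0xa1 is a valid GBK lead byte but not a valid UTF-8 start byte. The following is only a minimal standalone sketch of the same fallback idea, not pip's actual code; the function name `decode_console_bytes` and the order of the candidate encodings are assumptions made for illustration.

```python
def decode_console_bytes(raw, encodings=("utf-8", "cp936")):
    """Decode console bytes by trying each candidate encoding in turn.

    Hypothetical helper: the name and the default encoding order are
    illustrative assumptions, not part of pip.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    # Last resort: never raise, mark undecodable bytes instead.
    return raw.decode("utf-8", errors="replace")


# Bytes as a Chinese-locale Windows console would emit them.
gbk_bytes = "错误".encode("cp936")
print(decode_console_bytes(gbk_bytes))
```

Hardcoding `cp936` in the installed pip, as done above, is a local workaround for this one machine; the log itself also suggests upgrading the very old pip 9.0.1 with `python -m pip install --upgrade pip`.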
TensorFlow 1.x installed successfully!
(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>pip install tensorflow-1.15.0-cp36-cp36m-win_amd64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
Processing d:\2020_vir_tensorflow1\install_whl\tensorflow-1.15.0-cp36-cp36m-win_amd64.whl
Requirement already satisfied: google-pasta>=0.1.6 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting tensorboard<1.16.0,>=1.15.0 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/1e/e9/d3d747a97f7188f48aa5eda486907f3b345cd409f0a0850468ba867db246/tensorboard-1.15.0-py3-none-any.whl
Requirement already satisfied: protobuf>=3.6.1 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: wheel>=0.26 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: opt-einsum>=2.3.2 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: six>=1.10.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: astor>=0.6.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: keras-applications>=1.0.8 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting wrapt>=1.11.1 (from tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/82/f7/e43cefbe88c5fd371f4cf0cf5eb3feccd07515af9fd6cf7dbf1d1793a797/wrapt-1.12.1.tar.gz
Requirement already satisfied: grpcio>=1.8.6 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: numpy<2.0,>=1.16.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: absl-py>=0.7.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: keras-preprocessing>=1.0.5 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: gast==0.2.2 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: tensorflow-estimator==1.15.1 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Requirement already satisfied: termcolor>=1.1.0 in e:\2020_vms_tensorflow_1\lib\site-packages (from tensorflow==1.15.0)
Collecting markdown>=2.6.8 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/a4/63/eaec2bd025ab48c754b55e8819af0f6a69e2b1e187611dd40cbbe101ee7f/Markdown-3.2.2-py3-none-any.whl
Collecting werkzeug>=0.11.15 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/cc/94/5f7079a0e00bd6863ef8f1da638721e9da21e5bacee597595b318f71d62e/Werkzeug-1.0.1-py2.py3-none-any.whl
Collecting setuptools>=41.0.0 (from tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/b0/8b/379494d7dbd3854aa7b85b216cb0af54edcb7fce7d086ba3e35522a713cf/setuptools-50.0.0-py3-none-any.whl (783kB)
100% |████████████████████████████████| 788kB 121kB/s
Requirement already satisfied: h5py in e:\2020_vms_tensorflow_1\lib\site-packages (from keras-applications>=1.0.8->tensorflow==1.15.0)
Collecting importlib-metadata; python_version < "3.8" (from markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/8e/58/cdea07eb51fc2b906db0968a94700866fc46249bdc75cac23f9d13168929/importlib_metadata-1.7.0-py2.py3-none-any.whl
Collecting zipp>=0.5 (from importlib-metadata; python_version < "3.8"->markdown>=2.6.8->tensorboard<1.16.0,>=1.15.0->tensorflow==1.15.0)
Using cached https://pypi.tuna.tsinghua.edu.cn/packages/b2/34/bfcb43cc0ba81f527bc4f40ef41ba2ff4080e047acb0586b56b3d017ace4/zipp-3.1.0-py3-none-any.whl
Building wheels for collected packages: wrapt
Running setup.py bdist_wheel for wrapt ... done
Stored in directory: C:\Users\lenovo\AppData\Local\pip\Cache\wheels\68\e3\d7\4b6eee6f5d547bdfd97ba406128db66c5654dfb831fda163a2
Successfully built wrapt
Installing collected packages: zipp, importlib-metadata, markdown, werkzeug, setuptools, tensorboard, wrapt, tensorflow
Found existing installation: setuptools 28.8.0
Uninstalling setuptools-28.8.0:
Successfully uninstalled setuptools-28.8.0
Successfully installed importlib-metadata-1.7.0 markdown-3.2.2 setuptools-50.0.0 tensorboard-1.15.0 tensorflow-1.15.0 werkzeug-1.0.1 wrapt-1.12.1 zipp-3.1.0
You are using pip version 9.0.1, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>
(2020_vms_tensorflow_1) D:\2020_vir_tensorflow1\install_whl>python
Python 3.6.4 (v3.6.4:d48eceb, Dec 19 2017, 06:54:40) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
...
>>>
>>> print(tf.__version__)
1.15.0
>>>
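The interactive check above can also be scripted, which is convenient when the same virtual environment has to be verified repeatedly. This is a small sketch under assumptions: the helper names `major_version` and `tensorflow_is_1x` are made up for illustration, and the only TensorFlow attribute used is `tf.__version__`, the same one printed in the session above.

```python
import importlib


def major_version(version):
    """Return the leading integer of a dotted version string such as '1.15.0'."""
    return int(version.split(".")[0])


def tensorflow_is_1x():
    """Return True or False for the installed TensorFlow, or None if it cannot be imported.

    Hypothetical helper for illustration only.
    """
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        return None
    return major_version(tf.__version__) == 1
```

Calling `tensorflow_is_1x()` inside the activated environment should return `True` for the 1.15.0 install shown here, and `None` in an environment where TensorFlow is missing.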
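Data collection and cleaning
This article uses a case study of electronic medical record (EMR) analysis in the healthcare industry; the data and code come from material found online. Natural language processing research on EMR text focuses on processing the record text itself, including sentence boundary detection, part-of-speech tagging, and syntactic parsing. Information extraction builds on that work and focuses on recognizing the named entities or medical concepts that express medical knowledge in the record text, and on extracting the relations among them.
- Manually annotated entity data source 0.ann: the first column is the annotation ID, the second column is the entity type (for example, Disease), the third and fourth columns are the start and end positions of the entity in the corresponding 0.txt file, and the fifth column is the annotated entity text. This file was labeled by hand.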
......
T1 Disease 1845 1850 1型糖尿病
T2 Disease 1983 1988 1型糖尿病
T4 Disease 30 35 2型糖尿病
T5 Disease 1822 1827 2型糖尿病
T6 Disease 2055 2060 2型糖尿病
T7 Disease 2324 2329 2型糖尿病
T8 Disease 4325 4330 2型糖尿病
T9 Disease 5223 5228 2型糖尿病
.......
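The five columns described above can be read into structured records with a few lines of Python. This is a minimal sketch, assuming every line has exactly the five whitespace-separated fields shown in the sample; the names `Entity` and `parse_ann_line` are hypothetical and do not come from the original code. In the sample rows the difference between the two offsets equals the length of the entity text (1850 - 1845 = 5 characters for 1型糖尿病), which is consistent with an exclusive end offset.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    """One annotation row (hypothetical record type for illustration)."""
    ann_id: str   # column 1: annotation ID, e.g. "T1"
    label: str    # column 2: entity type, e.g. "Disease"
    start: int    # column 3: start offset in 0.txt
    end: int      # column 4: end offset in 0.txt
    text: str     # column 5: annotated entity text


def parse_ann_line(line):
    """Split one .ann row into its five fields.

    maxsplit=4 keeps any whitespace inside the entity text intact.
    """
    ann_id, label, start, end, text = line.rstrip("\n").split(None, 4)
    return Entity(ann_id, label, int(start), int(end), text)


# Two rows copied from the sample above.
sample_lines = [
    "T1 Disease 1845 1850 1型糖尿病",
    "T4 Disease 30 35 2型糖尿病",
]
entities = [parse_ann_line(line) for line in sample_lines]
for entity in entities:
    print(entity)
```

With the full 0.txt loaded as a string `doc_text`, a useful sanity check would be that `doc_text[entity.start:entity.end] == entity.text` holds for every row.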
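A doctor's diagnostic and treatment activity for a patient can be summarized as follows: from the patient's own account (self-reported symptoms) and the examination results (test items), the doctor identifies how the disease manifests (symptoms), reaches a diagnosis (disease), and, based on that diagnosis, prescribes treatment measures (treatment plan). The information involved therefore covers symptoms, diseases, tests, and treatments.
- Original text data source 0.txt, which 0.ann annotates: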
......
1.一般将HBA1C 。控制于