Journal of Computer Applications   2017, Vol. 37 Issue (4): 1056-1060  DOI: 10.11772/j.issn.1001-9081.2017.04.1056

### Cite this article

ZHANG Shuowang, OUYANG Chunping, YANG Xiaohua, LIU Yongbin, LIU Zhiming. Word semantic similarity computation based on integrating HowNet and search engines[J]. Journal of Computer Applications, 2017, 37(4): 1056-1060. DOI: 10.11772/j.issn.1001-9081.2017.04.1056.

Word semantic similarity computation based on integrating HowNet and search engines
ZHANG Shuowang, OUYANG Chunping, YANG Xiaohua, LIU Yongbin, LIU Zhiming
College of Computer Science and Technology, University of South China, Hengyang Hunan 421001, China
Abstract: The semantic descriptions of words in HowNet do not always match people's subjective perception of those words. To address this mismatch while making full use of the rich knowledge available on the Web, a word semantic similarity method that combines HowNet with search engines is proposed. First, taking into account the inclusion relation between a word and its sememes, a preliminary similarity is computed with an improved concept similarity method. Next, a second similarity is computed from search-engine results using a double correlation detection algorithm (CODC) and pointwise mutual information (PMI). Finally, a fitting function is designed, its weights are learned by batch gradient descent, and the two similarity results are fused. Experiments show that, compared with methods based on HowNet alone or on search engines alone, the fusion method improves both the Spearman and the Pearson coefficient by 5%. The match between the semantic description of specific words and the subjective perception of vocabulary also improves. These results indicate that integrating Web knowledge into concept similarity calculation is effective for computing Chinese word semantic similarity.
Key words: semantic similarity    HowNet    search engine    weight    network
0 Introduction

1 HowNet-based word semantic similarity computation

 $\begin{array}{l} Sim({s_1}, {s_2}) = \sum\limits_{i = 1}^4 {{\beta _i}Si{m_i}\left( {{s_1}, {s_2}} \right)} \\ {\beta _1} + {\beta _2} + {\beta _3} + {\beta _4} = 1, {\beta _1} \ge {\beta _2} \ge {\beta _3} \ge {\beta _4} \end{array}$
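
The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `combine_sememe_similarities` and the weight values used in the usage note are hypothetical placeholders chosen only to satisfy the two constraints on the weights (they sum to 1 and are non-increasing).

```python
def combine_sememe_similarities(sims, betas, tol=1e-9):
    """Return sum_i beta_i * Sim_i for the four partial similarities.

    sims  -- the four partial similarities Sim_1..Sim_4 (assumed to lie in [0, 1])
    betas -- the four weights beta_1..beta_4
    """
    if len(sims) != 4 or len(betas) != 4:
        raise ValueError("exactly four similarities and four weights are required")
    # Constraint 1: beta_1 + beta_2 + beta_3 + beta_4 = 1
    if abs(sum(betas) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    # Constraint 2: beta_1 >= beta_2 >= beta_3 >= beta_4
    if any(betas[i] < betas[i + 1] for i in range(3)):
        raise ValueError("weights must be non-increasing")
    return sum(b * s for b, s in zip(betas, sims))
```

For example, with the placeholder weights `(0.5, 0.2, 0.17, 0.13)` and partial similarities `(1.0, 0.5, 0.0, 1.0)`, the function returns 0.5 + 0.1 + 0 + 0.13 = 0.73.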

 $\begin{array}{l} Sim({s_1}, {s_2}) = \sum\limits_{i = 1}^4 {{\beta _i}Si{m_i}\left( {{s_1}, {s_2}} \right)} \\ {\beta _i} = \frac{{{k_i}}}{{m + n}}, {\rm{(}}i = 1, 2, 3, 4{\rm{)}} \end{array}$

2 Proposed algorithm

2.1 Improved HowNet-based word semantic similarity algorithm

 $Sim(s_1, s_2) = \sum\limits_{i = 1}^4 \beta_i Sim_i(s_1, s_2) + \frac{1}{Num(s_j) + 1}$ (1)

2.2 Improved search-engine-based word semantic similarity algorithm

 $PMI(a,b) = \frac{\log_2 \frac{N \times N(a,b)}{N(a) \times N(b)}}{\log_2 N}$

 $PMIB(a,b) = \frac{\log_2 \frac{N_b \times N(a,b)}{N(a) + N(b)}}{\log_2 N_b}$
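
The two measures above translate directly into code once the hit counts are available. The following Python sketch assumes that `n_a`, `n_b` and `n_ab` are the search-engine hit counts N(a), N(b) and N(a, b), and that `n_total` and `n_base` stand for N and N_b; all function and parameter names are hypothetical. Returning 0 when a count is zero is an assumption made here to avoid taking the logarithm of zero, not something the formulas specify.

```python
import math

def pmi(n_a, n_b, n_ab, n_total):
    """Normalized PMI: log2(N * N(a,b) / (N(a) * N(b))) / log2(N)."""
    if n_a == 0 or n_b == 0 or n_ab == 0:
        return 0.0  # assumption: no evidence of co-occurrence -> similarity 0
    return math.log2(n_total * n_ab / (n_a * n_b)) / math.log2(n_total)

def pmib(n_a, n_b, n_ab, n_base):
    """PMIB variant: log2(N_b * N(a,b) / (N(a) + N(b))) / log2(N_b)."""
    if n_ab == 0 or n_a + n_b == 0:
        return 0.0  # assumption: same zero-count convention as pmi()
    return math.log2(n_base * n_ab / (n_a + n_b)) / math.log2(n_base)
```

Both functions require `n_total` and `n_base` to be greater than 1, since the denominator is the base-2 logarithm of that value.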

 $CODC(a, b) = \begin{cases} 0, & f(a@b) = 0 \text{ or } f(b@a) = 0 \\ \exp\left[ \beta \times \ln\left( \frac{f(a@b)}{H(a)} \times \frac{f(a@b)}{H(b)} \right) \right], & \text{otherwise} \end{cases}$

 $CODCB(a, b) = \begin{cases} 0, & f(a@b) = 0 \text{ and } f(b@a) = 0 \\ \exp\left( \beta \times \ln\left[ \left( \frac{f(a@b)}{H(a)} + \frac{f(a@b)}{H(b)} \right) / 2 \right] \right), & \text{otherwise} \end{cases}$

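
As a rough sketch, the two piecewise definitions above can be written in Python as follows. The code follows the formulas exactly as printed, so `f_ba` is used only in the zero test. The function and parameter names are hypothetical: `f_ab` and `f_ba` stand for f(a@b) and f(b@a), and `h_a` and `h_b` for H(a) and H(b). Returning 0 when the logarithm's argument is not positive is an extra guard assumed here, and the default `beta=0.15` is only an illustrative value.

```python
import math

def codc(f_ab, f_ba, h_a, h_b, beta=0.15):
    """CODC: 0 if either directional count is 0, else exp(beta * ln(product of ratios))."""
    if f_ab == 0 or f_ba == 0:
        return 0.0
    x = (f_ab / h_a) * (f_ab / h_b)
    return math.exp(beta * math.log(x))

def codcb(f_ab, f_ba, h_a, h_b, beta=0.15):
    """CODCB: 0 only if both directional counts are 0, else exp(beta * ln(mean of ratios))."""
    if f_ab == 0 and f_ba == 0:
        return 0.0
    x = (f_ab / h_a + f_ab / h_b) / 2.0
    if x <= 0:
        return 0.0  # assumption: guard against ln(0) when only f_ab is zero
    return math.exp(beta * math.log(x))
```

Because exp(β ln x) equals x^β, replacing the product of the two ratios with their mean makes CODCB larger than CODC whenever both ratios are below 1 and differ, and it returns 0 only when neither direction shows any co-occurrence.
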
2.3 Word semantic similarity computation integrating HowNet and search engines

 $\begin{array}{l} Sim(a, b) = (1 - w_1) \times Sim_{\rm Z}(a, b) + w_1 \times Sim_{\rm S}(a, b)\\ w_1 = {\rm sigmoid}(w_2 \times \lg n_1 + w_3 \times \lg n_2)\\ Sim_{\rm S}(a, b) = {\rm sigmoid}(w_4) \times CODCB(a, b) + (1 - {\rm sigmoid}(w_4)) \times PMIB(a, b)\\ L(y, w) = \frac{1}{m}\sum\limits_{i = 1}^m (y_i - f(x_i, w))^2 \end{array}$
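
The fusion function and its mean squared error loss can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: all function names are hypothetical, `n1` and `n2` are treated simply as positive numbers, and the gradient is approximated by central finite differences to keep the example short. Each update still uses the whole training set, which is what makes it batch gradient descent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fused_sim(sim_z, codcb, pmib, n1, n2, w):
    """Sim = (1 - w1) * Sim_Z + w1 * Sim_S, with w = (w2, w3, w4)."""
    w2, w3, w4 = w
    w1 = sigmoid(w2 * math.log10(n1) + w3 * math.log10(n2))
    g = sigmoid(w4)
    sim_s = g * codcb + (1.0 - g) * pmib
    return (1.0 - w1) * sim_z + w1 * sim_s

def mse_loss(samples, targets, w):
    """L(y, w) = (1/m) * sum_i (y_i - f(x_i, w))^2."""
    total = sum((y - fused_sim(*x, w)) ** 2 for x, y in zip(samples, targets))
    return total / len(samples)

def batch_gradient_descent(samples, targets, w, lr=0.1, steps=200, eps=1e-6):
    """Minimize the loss over (w2, w3, w4) using the full batch at every step."""
    w = list(w)
    for _ in range(steps):
        grad = []
        for k in range(len(w)):
            up, down = w[:], w[:]
            up[k] += eps
            down[k] -= eps
            # Central finite difference stands in for the analytic gradient.
            diff = mse_loss(samples, targets, up) - mse_loss(samples, targets, down)
            grad.append(diff / (2.0 * eps))
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w
```

Each sample is a tuple `(sim_z, codcb, pmib, n1, n2)` and each target is a reference similarity score. With all weights at zero, both sigmoids equal 0.5, so the fused score is the average of the HowNet-based score and the mean of CODCB and PMIB.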

3 Experiments and analysis

4 Conclusion
