Word semantic similarity computation based on integrating HowNet and search engines
ZHANG Shuowang, OUYANG Chunping, YANG Xiaohua, LIU Yongbin, LIU Zhiming
College of Computer Science and Technology, University of South China, Hengyang Hunan 421001, China
Abstract: To address the mismatch between the semantic descriptions of words in HowNet and people's subjective understanding of vocabulary, a word semantic similarity calculation method that combines HowNet with search engines was proposed, making full use of the rich knowledge available on the Web. Firstly, taking into account the inclusion relation between words and their sememes, preliminary semantic similarity results were obtained with an improved concept similarity calculation method. Secondly, further semantic similarity results were obtained with a search-engine-based double correlation detection algorithm and the pointwise mutual information method. Finally, a fitting function was designed, its weights were learned by batch gradient descent, and the similarity results of the first two steps were fused. The experimental results show that, compared with methods based solely on HowNet or solely on search engines, the fusion method improves both the Spearman coefficient and the Pearson coefficient by 5%, and it also improves the agreement between the semantic description of specific words and the subjective cognition of vocabulary. These results indicate that integrating Web knowledge into concept similarity calculation is effective for computing Chinese word semantic similarity.
Key words: semantic similarity; HowNet; search engine; weight; network
0 Introduction

1 HowNet-based word semantic similarity calculation method

 $\begin{array}{l} Sim({s_1}, {s_2}) = \sum\limits_{i = 1}^4 {{\beta _i}Si{m_i}\left( {{s_1}, {s_2}} \right)} \\ {\beta _1} + {\beta _2} + {\beta _3} + {\beta _4} = 1, {\beta _1} \ge {\beta _2} \ge {\beta _3} \ge {\beta _4} \end{array}$
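
As a minimal sketch of the weighted sum above, the Python function below combines four partial similarities $Sim_1, \ldots, Sim_4$ and checks the two constraints on the weights $\beta_i$ (they sum to 1 and are non-increasing). The function name and all numeric values are illustrative assumptions and are not taken from the paper.

```python
import math


def combine_partial_similarities(partial_sims, betas):
    """Return sum_i beta_i * Sim_i after checking the weight constraints."""
    if len(partial_sims) != 4 or len(betas) != 4:
        raise ValueError("expected exactly four partial similarities and four weights")
    # Constraint 1: beta_1 + beta_2 + beta_3 + beta_4 = 1
    if not math.isclose(sum(betas), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    # Constraint 2: beta_1 >= beta_2 >= beta_3 >= beta_4
    if any(betas[i] < betas[i + 1] for i in range(3)):
        raise ValueError("weights must be non-increasing")
    return sum(b * s for b, s in zip(betas, partial_sims))


# Hypothetical inputs, chosen only to exercise the function.
example = combine_partial_similarities([1.0, 0.5, 0.0, 0.2], [0.5, 0.2, 0.17, 0.13])
print(example)
```

Validating the constraints up front keeps an invalid weight vector from silently producing a score outside the intended range.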

 $\begin{array}{l} Sim({s_1}, {s_2}) = \sum\limits_{i = 1}^4 {{\beta _i}Si{m_i}\left( {{s_1}, {s_2}} \right)} \\ {\beta _i} = \frac{{{k_i}}}{{m + n}}, {\rm{(}}i = 1, 2, 3, 4{\rm{)}} \end{array}$

2 Proposed algorithm

2.1 Improved HowNet-based word semantic similarity algorithm

 $Sim(s_1, s_2) = \sum\limits_{i = 1}^4 \beta_i Sim_i(s_1, s_2) + \frac{1}{Num(s_j) + 1}$ (1)
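
To make the effect of the extra term in Eq. (1) concrete, the short Python sketch below adds $1/(Num(s_j) + 1)$ to an already computed weighted sum. It assumes that $Num(s_j)$ is a non-negative integer count supplied by the caller; how that count is obtained is not stated here, and the function name and sample values are hypothetical.

```python
def add_sememe_count_term(weighted_sum, num_sj):
    """Eq. (1): weighted sum of partial similarities plus 1 / (Num(s_j) + 1)."""
    if num_sj < 0:
        raise ValueError("Num(s_j) is assumed to be a non-negative count")
    return weighted_sum + 1.0 / (num_sj + 1)


# The added term shrinks as the (assumed) count grows.
for num_sj in (0, 1, 4, 9):
    print(num_sj, add_sememe_count_term(0.5, num_sj))
```

The sketch does not clamp the result, because no upper bound is given alongside the formula.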

2.2 Improved search-engine-based word semantic similarity algorithm

 $PMI(a, b) = \frac{\log_2 \frac{N \times N(a, b)}{N(a) \times N(b)}}{\log_2 N}$

 $PMIB(a, b) = \frac{\log_2 \frac{N_b \times N(a, b)}{N(a) + N(b)}}{\log_2 N_b}$
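
The two formulas above can be sketched in Python as follows. The sketch assumes that $N(a)$, $N(b)$ and $N(a, b)$ are search-engine hit counts passed in by the caller, and that $N$ and $N_b$ are normalizing constants greater than 1. Returning 0 when the co-occurrence count is zero is an added assumption, since the logarithm is undefined there. All names and numbers are hypothetical.

```python
import math


def pmi(n_a, n_b, n_ab, n_total):
    """PMI(a, b) = log2(N * N(a,b) / (N(a) * N(b))) / log2(N)."""
    if n_ab == 0:
        return 0.0  # assumption: no co-occurrence is treated as zero similarity
    return math.log2(n_total * n_ab / (n_a * n_b)) / math.log2(n_total)


def pmib(n_a, n_b, n_ab, n_base):
    """PMIB(a, b) = log2(N_b * N(a,b) / (N(a) + N(b))) / log2(N_b)."""
    if n_ab == 0:
        return 0.0  # same assumption as in pmi()
    return math.log2(n_base * n_ab / (n_a + n_b)) / math.log2(n_base)


# Hypothetical hit counts, chosen as powers of two for easy checking.
print(pmi(32, 32, 16, 1024))
print(pmib(32, 32, 16, 1024))
```

The only structural difference between the two functions is the denominator inside the logarithm: a product of the individual counts in PMI and their sum in PMIB.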

 $CODC(a, b) = \left\{ \begin{array}{ll} 0, & f(a@b) = 0 \ {\rm or} \ f(b@a) = 0 \\ \exp\left[ \beta \times \ln\left( \frac{f(a@b)}{H(a)} \times \frac{f(a@b)}{H(b)} \right) \right], & {\rm otherwise} \end{array} \right.$

 $CODCB(a, b) = \left\{ \begin{array}{ll} 0, & f(a@b) = 0 \ {\rm and} \ f(b@a) = 0 \\ \exp\left( \beta \times \ln\left[ \left( \frac{f(a@b)}{H(a)} + \frac{f(a@b)}{H(b)} \right) / 2 \right] \right), & {\rm otherwise} \end{array} \right.$
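
A minimal Python sketch of CODC and CODCB follows. It transcribes the formulas as printed, so both ratios use $f(a@b)$. It assumes that $f(a@b)$, $f(b@a)$, $H(a)$ and $H(b)$ are counts supplied by the caller, with $H(a)$ and $H(b)$ positive. The extra guard in `codcb` for $f(a@b) = 0$ is an added assumption that avoids taking the logarithm of zero. Function and parameter names are hypothetical.

```python
import math


def codc(f_ab, f_ba, h_a, h_b, beta):
    """CODC: zero if either direction has no co-occurrence, else exp(beta * ln(product))."""
    if f_ab == 0 or f_ba == 0:
        return 0.0
    return math.exp(beta * math.log((f_ab / h_a) * (f_ab / h_b)))


def codcb(f_ab, f_ba, h_a, h_b, beta):
    """CODCB: zero only if both directions are empty, else exp(beta * ln(mean of ratios))."""
    if f_ab == 0 and f_ba == 0:
        return 0.0
    if f_ab == 0:
        return 0.0  # assumption: guard against ln(0), not stated in the formula
    return math.exp(beta * math.log((f_ab / h_a + f_ab / h_b) / 2))


# Hypothetical counts, chosen only to exercise both functions.
print(codc(10, 5, 100, 50, 0.5))
print(codcb(10, 5, 100, 50, 0.5))
```

When both ratios lie in [0, 1], their arithmetic mean is never smaller than their product, so for the same positive $\beta$ the CODCB score in this sketch is at least as large as the CODC score.
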
2.3 Word semantic similarity calculation fusing HowNet and search engines

 $\begin{array}{l} Sim(a, b) = (1 - w_1) \times Sim_{\rm Z}(a, b) + w_1 \times Sim_{\rm S}(a, b) \\ w_1 = {\rm sigmoid}(w_2 \times \lg n_1 + w_3 \times \lg n_2) \\ Sim_{\rm S}(a, b) = {\rm sigmoid}(w_4) \times CODCB(a, b) + (1 - {\rm sigmoid}(w_4)) \times PMIB(a, b) \\ L(y, w) = \frac{1}{m}\sum\limits_{i = 1}^m (y_i - f(x_i, w))^2 \end{array}$
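
The fusion function and its training loop can be sketched in Python as below. Several points are assumptions rather than details given here: $Sim_{\rm Z}$ is read as the HowNet-based score and $Sim_{\rm S}$ as the search-engine-based score, $n_1$ and $n_2$ are treated as positive per-pair quantities supplied by the caller, and the gradient is approximated by central finite differences instead of an analytic derivation. The data, learning rate and step count are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse(sample, w):
    """sample = (sim_z, codcb, pmib, n1, n2); w = [w2, w3, w4]."""
    sim_z, codcb, pmib, n1, n2 = sample
    w2, w3, w4 = w
    w1 = sigmoid(w2 * math.log10(n1) + w3 * math.log10(n2))
    gate = sigmoid(w4)
    sim_s = gate * codcb + (1.0 - gate) * pmib
    return (1.0 - w1) * sim_z + w1 * sim_s


def mse_loss(samples, targets, w):
    """L(y, w) = (1/m) * sum_i (y_i - f(x_i, w))^2."""
    return sum((y - fuse(x, w)) ** 2 for x, y in zip(samples, targets)) / len(samples)


def batch_gradient_descent(samples, targets, w, lr=0.1, steps=200, eps=1e-6):
    """Each step uses the full data set; gradients are finite-difference estimates."""
    w = list(w)
    for _ in range(steps):
        grad = []
        for k in range(len(w)):
            up, down = list(w), list(w)
            up[k] += eps
            down[k] -= eps
            diff = mse_loss(samples, targets, up) - mse_loss(samples, targets, down)
            grad.append(diff / (2 * eps))
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


# Hypothetical training data: (sim_z, codcb, pmib, n1, n2) and human ratings in [0, 1].
samples = [(0.9, 0.3, 0.5, 10, 100), (0.2, 0.7, 0.6, 100, 10), (0.6, 0.8, 0.4, 10, 10)]
targets = [0.5, 0.6, 0.7]
w0 = [0.0, 0.0, 0.0]
w_fit = batch_gradient_descent(samples, targets, w0)
print(mse_loss(samples, targets, w0), mse_loss(samples, targets, w_fit))
```

Passing $w_4$ through the sigmoid keeps the CODCB and PMIB weights in (0, 1) and summing to 1 without a constrained optimizer, and the same holds for $w_1$ in the outer mix.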

3 Experiments and analysis

4 Conclusion

