Journal of Computer Applications, 2017, Vol. 37, Issue (4): 1061-1064. DOI: 10.11772/j.issn.1001-9081.2017.04.1061

Cite this article

HUANG Yu, ZHANG Hong. Cross-media retrieval based on latent semantic topic reinforce[J]. Journal of Computer Applications, 2017, 37(4): 1061-1064. DOI: 10.11772/j.issn.1001-9081.2017.04.1061.


Cross-media retrieval based on latent semantic topic reinforce
HUANG Yu1,2, ZHANG Hong1,2
1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan Hubei 430065, China;
2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System (Wuhan University of Science and Technology), Wuhan Hubei 430065, China
Abstract: Cross-media retrieval is an important and challenging problem in the multimedia area: a common semantic topic is expressed differently across modalities, and traditional cross-media retrieval methods usually neglect to explore the intrinsic semantic information of different modalities in a collaborative manner. To address this problem, a Latent Semantic Topic Reinforce cross-media retrieval (LSTR) method was proposed. Firstly, text semantics were represented based on Latent Dirichlet Allocation (LDA) and the corresponding images were represented with the Bag of Words (BoW) model. Secondly, multiclass logistic regression was used to classify both texts and images, and the posterior probabilities under the learned classifiers were exploited to indicate the latent semantic topics of images and texts. Finally, the learned text posterior probabilities were used to regularize their image counterparts, reinforcing the image semantic topics and greatly improving the semantic similarity between the two modalities. On the Wikipedia data set, the mean Average Precision (mAP) of retrieving text with image and retrieving image with text is 57.0%, which is 35.1%, 34.8% and 32.1% higher than that of the Canonical Correlation Analysis (CCA), Semantic Matching (SM) and Semantic Correlation Matching (SCM) methods respectively. Experimental results show that the proposed method effectively improves the average precision of cross-media retrieval.
Key words: cross-media retrieval; latent semantic topic; multiclass logistic regression; posterior probability; regularization
0 Introduction

1) Multiclass logistic regression is applied to classify the images and texts, yielding classification models; the models then compute the multiclass posterior probability of each image and text, and these posterior probability vectors represent the latent semantic topics of images and texts.

2) Since the latent semantic topics of texts are clearer than those of images, the text latent semantic topics are used to regularize the image latent semantic topics, maximizing the correlation between them and driving the topics of the two modalities toward consistency.

3) The Pearson correlation coefficient is used to measure the similarity between text and image vectors, enabling mutual retrieval between images and texts.

1 Extracting the latent semantic topics of images and texts

 $\boldsymbol{h}_{\boldsymbol{\theta}}\left(\boldsymbol{x}^{(i)}\right) = \left[ \begin{array}{c} p\left(y^{(i)} = 1 \mid \boldsymbol{x}^{(i)};\boldsymbol{\theta}\right) \\ p\left(y^{(i)} = 2 \mid \boldsymbol{x}^{(i)};\boldsymbol{\theta}\right) \\ \vdots \\ p\left(y^{(i)} = k \mid \boldsymbol{x}^{(i)};\boldsymbol{\theta}\right) \end{array} \right] = \frac{1}{\sum\limits_{j = 1}^{k} \exp\left(\boldsymbol{\theta}_j^{\mathrm{T}}\boldsymbol{x}^{(i)}\right)} \left[ \begin{array}{c} \exp\left(\boldsymbol{\theta}_1^{\mathrm{T}}\boldsymbol{x}^{(i)}\right) \\ \exp\left(\boldsymbol{\theta}_2^{\mathrm{T}}\boldsymbol{x}^{(i)}\right) \\ \vdots \\ \exp\left(\boldsymbol{\theta}_k^{\mathrm{T}}\boldsymbol{x}^{(i)}\right) \end{array} \right]$ (1)

 $J\left(\boldsymbol{\theta}\right) = -\frac{1}{m}\left[ \sum\limits_{i = 1}^{m} \sum\limits_{j = 1}^{k} 1\left\{ y^{(i)} = j \right\} \log\frac{\exp\left(\boldsymbol{\theta}_j^{\mathrm{T}}\boldsymbol{x}^{(i)}\right)}{\sum\limits_{l = 1}^{k} \exp\left(\boldsymbol{\theta}_l^{\mathrm{T}}\boldsymbol{x}^{(i)}\right)} \right] + \frac{\lambda}{2}\sum\limits_{i = 1}^{k} \sum\limits_{j = 0}^{n} \theta_{ij}^2$ (2)
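Eqs. (1) and (2) can be transcribed directly into a small pure-Python sketch (illustrative only; the variable names mirror the formulas, and a real system would fit θ by minimizing J with gradient descent):

```python
import math

def softmax_posterior(theta, x):
    """Eq. (1): posterior p(y = j | x; theta) for each of the k classes.

    theta: list of k weight vectors (one per class); x: one feature vector.
    """
    scores = [math.exp(sum(w * xi for w, xi in zip(theta_j, x))) for theta_j in theta]
    z = sum(scores)  # the normalizing denominator of Eq. (1)
    return [s / z for s in scores]

def cost(theta, X, y, lam):
    """Eq. (2): regularized negative log-likelihood J(theta).

    X: list of feature vectors; y: class index of each sample; lam: lambda.
    """
    m = len(X)
    loss = -sum(math.log(softmax_posterior(theta, x)[yi]) for x, yi in zip(X, y)) / m
    reg = 0.5 * lam * sum(w * w for theta_j in theta for w in theta_j)
    return loss + reg
```

With all-zero weights every class receives probability 1/k, so J reduces to log k, which is a handy sanity check.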

2 Regularization-based latent semantic topic reinforcement

 Figure 1 Latent semantic topic reinforce cross-media retrieval algorithm

 $\boldsymbol{H}:{\boldsymbol{x}_i} \to {\boldsymbol{t}_i}$ (3)

$\boldsymbol{H}$ is a linear transformation matrix:

 $\boldsymbol{T} = \boldsymbol{XH}$ (4)

 $\left[ \begin{array}{c} \boldsymbol{t}_1^{\mathrm{T}} \\ \boldsymbol{t}_2^{\mathrm{T}} \\ \vdots \\ \boldsymbol{t}_n^{\mathrm{T}} \end{array} \right] = \left[ \begin{array}{c} \boldsymbol{x}_1^{\mathrm{T}} \\ \boldsymbol{x}_2^{\mathrm{T}} \\ \vdots \\ \boldsymbol{x}_n^{\mathrm{T}} \end{array} \right] \left[ \boldsymbol{h}_1, \boldsymbol{h}_2, \ldots, \boldsymbol{h}_k \right]$ (5)

 $\boldsymbol{x}_i^{\mathrm{T}}\boldsymbol{h}_k \ge 0, \forall i = 1, 2, \ldots, N; \forall k = 1, 2, \ldots, K$ (6)
 $\sum\limits_{k = 1}^{K} \boldsymbol{x}_i^{\mathrm{T}}\boldsymbol{h}_k = 1, \forall i = 1, 2, \ldots, N$ (7)

 $\boldsymbol{b} = \boldsymbol{Mx}$ (8)

 $\left[ \begin{array}{c} \boldsymbol{t}_1 \\ \boldsymbol{t}_2 \\ \vdots \\ \boldsymbol{t}_N \end{array} \right] = \left[ \begin{array}{cccc} \boldsymbol{x}_1^{\mathrm{T}} & 0 & \cdots & 0 \\ 0 & \boldsymbol{x}_1^{\mathrm{T}} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \boldsymbol{x}_1^{\mathrm{T}} \\ \boldsymbol{x}_2^{\mathrm{T}} & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \boldsymbol{x}_N^{\mathrm{T}} \end{array} \right] \left[ \begin{array}{c} \boldsymbol{h}_1 \\ \boldsymbol{h}_2 \\ \vdots \\ \boldsymbol{h}_L \end{array} \right]$ (9)

 $\boldsymbol{S} = \left[ \begin{array}{cccc} \boldsymbol{x}_1^{\mathrm{T}} & \boldsymbol{x}_1^{\mathrm{T}} & \cdots & \boldsymbol{x}_1^{\mathrm{T}} \\ \boldsymbol{x}_2^{\mathrm{T}} & \boldsymbol{x}_2^{\mathrm{T}} & \cdots & \boldsymbol{x}_2^{\mathrm{T}} \\ \vdots & \vdots & & \vdots \\ \boldsymbol{x}_N^{\mathrm{T}} & \boldsymbol{x}_N^{\mathrm{T}} & \cdots & \boldsymbol{x}_N^{\mathrm{T}} \end{array} \right]$ (10)

 $\boldsymbol{x}^* = \mathop{\arg\min}\limits_{\boldsymbol{x}} \left\| \boldsymbol{Mx} - \boldsymbol{b} \right\|_2^2$ (11)
 ${\rm s.t.}\;\;\;\; \boldsymbol{Mx} \ge \boldsymbol{0};\ \boldsymbol{Sx} = \boldsymbol{1}$

1) Solve Eqs. (1) and (2) to obtain the latent semantic topics of the images and texts.

2) For each class ($i = 1, 2, \ldots, L$), solve:

$\boldsymbol{x}^* = \mathop{\arg\min}\limits_{\boldsymbol{x}} \left\| \boldsymbol{Mx} - \boldsymbol{b} \right\|_2^2$

${\rm s.t.}\;\;\; \boldsymbol{Mx} \ge \boldsymbol{0};\ \boldsymbol{Sx} = \boldsymbol{1}$
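One possible way to solve this constrained least-squares step is projected gradient descent; the paper does not name its solver, so the following is a sketch under that assumption. By the construction of M and S, the constraints Mx ≥ 0 and Sx = 1 force each regularized topic vector to be a probability distribution, so for simplicity the sketch treats the unknown as a single topic vector and follows each gradient step with a Euclidean projection onto the probability simplex:

```python
def project_simplex(v):
    """Euclidean projection of v onto the simplex {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:          # largest j with this property fixes the threshold
            theta = t
    return [max(vi - theta, 0.0) for vi in v]

def solve_regularized_topic(M, b, steps=500, lr=0.4):
    """Minimize ||M x - b||_2^2 over the simplex by projected gradient descent.

    M: list of rows; b: target vector; returns the constrained minimizer x*.
    """
    n = len(M[0])
    x = [1.0 / n] * n  # start from the uniform topic distribution
    for _ in range(steps):
        # residual r = M x - b, gradient g = 2 M^T r
        r = [sum(M[i][j] * x[j] for j in range(n)) - b[i] for i in range(len(M))]
        g = [2.0 * sum(M[i][j] * r[i] for i in range(len(M))) for j in range(n)]
        x = project_simplex([x[j] - lr * g[j] for j in range(n)])
    return x
```

When `M` is the identity the minimizer is simply the projection of `b` onto the simplex, which makes the routine easy to sanity-check.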

3 Experimental analysis

3.1 Dataset and data representation

3.2 Evaluation metric

 $\rho_{X,Y} = \frac{{\rm cov}\left(X, Y\right)}{\sigma_X \sigma_Y} = \frac{E\left(\left(X - \mu_X\right)\left(Y - \mu_Y\right)\right)}{\sqrt{E\left(X^2\right) - E^2\left(X\right)}\sqrt{E\left(Y^2\right) - E^2\left(Y\right)}}$
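On finite samples the expectations above reduce to means over the data, giving a direct pure-Python transcription (illustrative only):

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))   # numerator
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))            # sigma_X (up to 1/n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))            # sigma_Y (up to 1/n)
    return cov / (sd_x * sd_y)
```

The shared 1/n factors cancel between numerator and denominator, so they are omitted.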
3.3 Evaluation of experimental results

 $AP = \frac{1}{L}\sum\limits_{r = 1}^{R} {\rm prec}\left(r\right)\delta\left(r\right)$
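Taking L as the number of relevant items, prec(r) as the precision at rank r, and δ(r) as the binary relevance of the item at rank r (the standard definitions, assumed here since the surrounding text does not restate them), AP can be computed as:

```python
def average_precision(relevance):
    """AP over a ranked result list; relevance[r-1] is delta(r), 1 if relevant."""
    n_relevant = sum(relevance)  # L in the formula
    if n_relevant == 0:
        return 0.0
    hits, ap = 0, 0.0
    for r, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            ap += hits / r       # prec(r) accumulated only at relevant ranks
    return ap / n_relevant
```

For example, a ranking whose relevant items sit at positions 1 and 3 scores (1/1 + 2/3) / 2 = 5/6.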

3.4 Experimental results and analysis

 Figure 2 mAP for different classes (retrieving text with image)
 Figure 3 mAP for different classes (retrieving image with text)
 Figure 4 mAP for different classes

 Figure 5 Precision-recall curves of retrieving text with image
 Figure 6 Precision-recall curves of retrieving image with text

4 Conclusion
