Journal of Computer Applications   2017, Vol. 37 Issue (4): 924-927, 935  DOI: 10.11772/j.issn.1001-9081.2017.04.0924

### Cite this article

WANG Yaqiang, TANG Ming, ZENG Qin, TANG Dan, SHU Hongping. Cascaded and low-consuming online method for large-scale Web page category acquisition[J]. Journal of Computer Applications, 2017, 37(4): 924-927,935. DOI: 10.11772/j.issn.1001-9081.2017.04.0924.

Cascaded and low-consuming online method for large-scale Web page category acquisition
WANG Yaqiang$^{1}$, TANG Ming$^{1}$, ZENG Qin$^{2}$, TANG Dan$^{1}$, SHU Hongping$^{1}$
1. College of Software Engineering, Chengdu University of Information Technology, Chengdu Sichuan 610225, China;
2. Guangdong Meteorological Observatory, Guangzhou Guangdong 510080, China
Abstract: To balance accuracy against resource cost when building an automatic system that collects massive numbers of well-classified Web pages, a cascaded and low-consuming online method for large-scale Web page category acquisition was proposed. The method uses a cascaded strategy to integrate online and offline Web page classifiers so as to take full advantage of both. An online Web page classifier trained on features from the anchor text served as the first-level classifier, and the confidence of its classification result was computed from the information entropy of the posterior probability. The second-level classifier was triggered when the confidence value was larger than a predefined threshold obtained by Multi-Objective Particle Swarm Optimization (MOPSO). At the second level, features were extracted from the downloaded Web page, which was then classified by an offline classifier pre-trained on Web pages. In comparison experiments with online-only and offline-only classification, the proposed method increased the F1 measure by 10.85% and 4.57%, respectively. Moreover, the efficiency of the proposed method decreased by less than 30% compared with online-only classification, while it improved by about 70% compared with offline-only classification. The results demonstrate that the proposed method not only has a stronger classification ability but also significantly reduces computing overhead and bandwidth consumption.
Key words: large-scale Web page acquisition; Web page classification; cascaded classifier; confidence function; Multi-Objective Particle Swarm Optimization (MOPSO)
0 Introduction

1 Cascaded Web page classification system

1.1 Basic framework

 Figure 1 Basic architecture of cascaded classification system

1.2 Determining the confidence function

 Figure 2 Certain and uncertain discrimination conditions of the conditional probability distributions

 $H(C) = -\sum\limits_{j = 1}^m {P({c_j}|x)\,{\rm{lb}}\,P({c_j}|x)}$ (1)
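As a minimal sketch of how an entropy-based confidence value could drive the cascade, the Python below computes the Shannon entropy (with its leading minus sign, in bits) of a first-level posterior $P(c_j|x)$ and forwards the page to the second level when the entropy exceeds a threshold. The function names `posterior_entropy` and `cascade_classify`, the `offline_classify` callback, and the example threshold are hypothetical illustrations, not the authors' implementation.

```python
import math
from typing import Callable, Sequence, Tuple


def posterior_entropy(posterior: Sequence[float]) -> float:
    """Shannon entropy, in bits, of a posterior distribution P(c_j | x)."""
    # Zero-probability classes contribute nothing (p * log2(p) tends to 0).
    return -sum(p * math.log2(p) for p in posterior if p > 0.0)


def cascade_classify(
    posterior: Sequence[float],
    threshold: float,
    offline_classify: Callable[[], int],
) -> Tuple[int, str]:
    """Return (class index, which level decided).

    Assumption: high entropy means the first-level (online) classifier is
    uncertain, so the page is handed to the second-level (offline) classifier.
    """
    if posterior_entropy(posterior) > threshold:
        # Hypothetical callback: download the page and classify it offline.
        return offline_classify(), "offline"
    # Otherwise accept the most probable class from the online classifier.
    best = max(range(len(posterior)), key=lambda j: posterior[j])
    return best, "online"


if __name__ == "__main__":
    # Illustrative posteriors only; the threshold of 1.0 bit is made up.
    print(cascade_classify([0.9, 0.05, 0.05], 1.0, lambda: 2))
    print(cascade_classify([0.4, 0.3, 0.3], 1.0, lambda: 2))
```

A peaked posterior has low entropy and is accepted at the first level, so the page is never downloaded; a flat posterior has high entropy and is deferred to the second level.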

1.3 Selecting the confidence threshold
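The abstract states that the threshold is obtained by MOPSO, which searches for trade-offs between competing objectives. The sketch below is not MOPSO itself; it only illustrates the Pareto-dominance filter that underlies any such search, assuming two objectives (maximize F1, minimize cost) evaluated for a handful of candidate thresholds. The `Candidate` layout, the function names, and all numbers are hypothetical.

```python
from typing import List, Tuple

# (threshold, f1, cost): assumed layout, higher F1 and lower cost are better.
Candidate = Tuple[float, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both objectives and better on one."""
    _, f1_a, cost_a = a
    _, f1_b, cost_b = b
    no_worse = f1_a >= f1_b and cost_a <= cost_b
    strictly_better = f1_a > f1_b or cost_a < cost_b
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


if __name__ == "__main__":
    # Made-up evaluations of four thresholds, for illustration only.
    evaluated = [
        (0.2, 0.80, 10.0),
        (0.5, 0.85, 20.0),
        (0.8, 0.84, 30.0),  # worse F1 and higher cost than threshold 0.5
        (1.0, 0.90, 40.0),
    ]
    for threshold, f1, cost in pareto_front(evaluated):
        print(threshold, f1, cost)
```

The surviving candidates form the trade-off curve from which a single operating threshold is then picked according to how much cost the system can afford.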

1.4 Estimating the time cost of the cascaded classification system

 ${T_{\rm{c}}} = {T_1} + {T_2} = N \cdot {v_1} + (CN + EN) \cdot {v_2}$ (2)
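Equation (2) can be evaluated directly once its quantities are known. The sketch below makes several assumptions that the surviving text does not confirm: $N$ is the number of pages, $v_1$ and $v_2$ are the per-page processing times of the first-level and second-level classifiers, and $CN$ and $EN$ are treated as two opaque counts of pages forwarded to the second level. The function name and the sample values are hypothetical.

```python
def cascade_time_cost(n: int, v1: float, cn: int, en: int, v2: float) -> float:
    """Evaluate T_c = T_1 + T_2 = N * v1 + (CN + EN) * v2.

    Assumptions: v1 and v2 are per-page times in the same unit, and cn + en
    is the number of pages that reach the second-level classifier.
    """
    t1 = n * v1          # every page passes through the first level
    t2 = (cn + en) * v2  # only forwarded pages pay the second-level cost
    return t1 + t2


if __name__ == "__main__":
    # Made-up figures in milliseconds, for illustration only.
    n, v1, v2 = 1000, 2, 40
    cn, en = 150, 50
    print("cascade:", cascade_time_cost(n, v1, cn, en, v2))
    print("offline only:", n * v2)
```

Under these assumptions the cascade is cheaper than offline-only classification whenever $N \cdot v_1 + (CN + EN) \cdot v_2 < N \cdot v_2$, that is, whenever few enough pages are forwarded to the second level.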

2 Experimental results and analysis

2.1 Classifier parameter settings

2.2 Performance analysis of NB and SVM

 Figure 3 F1 scores of different predefined classes achieved by online and offline classifiers
2.3 Correlation analysis between confidence threshold and number of instances

 Figure 4 Correlation between confidence threshold and number of instances
2.4 Selection of the confidence threshold

 Figure 5 Pareto curve of confidence threshold by using NB+SVM cascaded classification system
2.5 Classification ability of the cascaded classification system

 Figure 6 F1 scores of different predefined classes achieved by online, offline, and cascaded classification systems

 Figure 7 F1 scores of different predefined classes achieved by online, offline, and cascaded classification systems

3 Conclusion
