Webpage text extraction algorithm based on text block density and tag path features
23 August 2022
Renjie Wang, Yangsen Zhang, Zhenyu Hou, Jianlong Li, Zhenjiang Su, Shaohui Xie, Zhuofan Huang
Proceedings Volume 12330, International Conference on Cyber Security, Artificial Intelligence, and Digital Economy (CSAIDE 2022); 123301G (2022) https://doi.org/10.1117/12.2646343
Event: International Conference on Cyber Security, Artificial Intelligence, and Digital Economy (CSAIDE 2022), 2022, Huzhou, China
Abstract
Besides the body text, web pages contain a large amount of noise, such as advertisements and navigation bars. Accurately extracting the text content of web pages is a key technology for improving the quality of web page analysis. A web page is a highly heterogeneous kind of text, and different types of web pages have different structures, which makes web page text extraction more difficult. Through extensive analysis, we found a potential correlation between the body text and two features: the tag path and the text block density. We therefore propose a webpage text extraction method based on the tag path feature and the text block density feature. Weighing the advantages and disadvantages of each feature, we design a fusion strategy to address the low accuracy of web page text extraction. The method requires no training, which improves the efficiency of webpage text extraction. Experimental results on the dataset constructed in this paper show that the classification accuracy of the method reaches 81.11% and the recall reaches 83.15%. The average accuracy over all datasets is 17.7% higher than that of the BDF algorithm and 6.21% higher than that of the CEPF algorithm. These experiments show that the method has strong generalization ability.
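To make the two features concrete, the following is a minimal Python sketch of the general idea only. It is not the authors' algorithm: the abstract does not define the tag path feature, the text block density feature, or the fusion strategy. Every name here (`collect_text_nodes`, `score_tag_paths`, `extract_main_text`) and the scoring formula are assumptions chosen for illustration. The sketch treats a tag path as the chain of open tags above a text node, uses the mean number of characters per text node as a stand-in for density, and multiplies that by the total characters on the path as a toy fusion score.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text content is never body text.
SKIP_TAGS = {"script", "style"}


class TextNodeCollector(HTMLParser):
    """Collect a (tag_path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not SKIP_TAGS.intersection(self.stack):
            self.nodes.append(("/".join(self.stack), text))


def collect_text_nodes(html):
    """Parse HTML and return the list of (tag_path, text) pairs."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def score_tag_paths(nodes):
    """Hypothetical fusion score per tag path (an assumption, not from the paper):
    total characters on the path times mean characters per text node."""
    chars = defaultdict(int)
    count = defaultdict(int)
    for path, text in nodes:
        chars[path] += len(text)
        count[path] += 1
    return {path: chars[path] * (chars[path] / count[path]) for path in chars}


def extract_main_text(html):
    """Return the text of all nodes on the highest-scoring tag path."""
    nodes = collect_text_nodes(html)
    if not nodes:
        return ""
    scores = score_tag_paths(nodes)
    best_path = max(scores, key=scores.get)
    return "\n".join(text for path, text in nodes if path == best_path)
```

Under these assumptions, short link-heavy paths such as a navigation bar score low because both their total text and their per-node text are small, while paragraph paths in the article body score high. A real implementation would need a more careful density definition and a fusion rule tuned on data.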
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Renjie Wang, Yangsen Zhang, Zhenyu Hou, Jianlong Li, Zhenjiang Su, Shaohui Xie, and Zhuofan Huang "Webpage text extraction algorithm based on text block density and tag path features", Proc. SPIE 12330, International Conference on Cyber Security, Artificial Intelligence, and Digital Economy (CSAIDE 2022), 123301G (23 August 2022); https://doi.org/10.1117/12.2646343
KEYWORDS
Feature extraction
Video
Genetic algorithms
Detection and tracking algorithms
Medicine
Visualization
Internet technology