Film Comment Collection Technology and Realization of Distributed Web Crawler Based on Python

Chao Liu, Nana Nie

DOI: 10.38007/Proceedings.0001066

Author(s)

Chao Liu, Nana Nie

Corresponding Author

Chao Liu

Abstract

Crawler technology is one of the main ways to obtain data efficiently and accurately on today's Internet, and it is widely used across Internet industries. A crawler, here written in Python, requests Web pages over HTTP, follows the URLs they contain, and automatically extracts the data on a website. When data must be captured at large scale, distributed technology is needed to improve the performance of the overall Web crawler system. This paper proposes a distributed Web crawler model and uses Douban as the experimental object. The model improves URL normalization and reduces the data repetition rate. Experimental runs show that the crawler can accurately collect more than 100,000 data items from Web pages while avoiding duplicate items, so it can extract large volumes of data for machine learning and recommendation system experiments.
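
To illustrate the two ideas the abstract names, URL normalization and duplicate avoidance, the following is a minimal Python sketch and not the authors' implementation. The function `normalize_url`, the class `SeenFilter`, the specific normalization rules (lowercasing, dropping default ports and fragments, sorting query parameters, trimming a trailing slash), and the example Douban path are all assumptions chosen for illustration; the abstract does not specify them.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url: str) -> str:
    """Map equivalent spellings of a URL to one canonical string.

    The rules below are illustrative assumptions, not the paper's rules.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it is not the default for the scheme.
    port = parts.port
    default_port = (scheme == "http" and port == 80) or (
        scheme == "https" and port == 443
    )
    if port and not default_port:
        host = f"{host}:{port}"

    # Treat "/path/" and "/path" as the same page (an assumption).
    path = parts.path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"

    # Sort query parameters so their order does not create "new" URLs.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


class SeenFilter:
    """In-memory duplicate filter keyed on a hash of the normalized URL."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL was not seen before, and record it."""
        key = hashlib.sha1(normalize_url(url).encode("utf-8")).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


# Hypothetical example URLs; the path is made up for illustration.
seen = SeenFilter()
first = seen.add_if_new(
    "HTTPS://Movie.Douban.com:443/subject/1/comments/?start=20&sort=new_score#top"
)
second = seen.add_if_new(
    "https://movie.douban.com/subject/1/comments?sort=new_score&start=20"
)
```

Here both spellings normalize to the same string, so only the first call reports a new URL. In a distributed crawler the in-memory set would have to be replaced by a store shared among the worker nodes; the abstract does not say which store the authors use.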

Keywords

Python; Web Crawler Technology; Data Extraction and Processing