Automation Distributed Cloud Based Crawler
- DOI
- 10.2991/978-94-6463-525-6_3
- Keywords
- crawler; automation; distributed systems; fog-cloud; distributed cloud
- Abstract
- Information is needed for a wide range of purposes. Online news sites rank among the ten types of site most visited by internet users in Indonesia, and they publish new articles every minute. An online news corpus is necessary for information processing. Building such a corpus generally faces two obstacles: large resource requirements, and delays caused by access restrictions when heavy access is classified as bot or spam traffic. Both reduce the speed at which information can be retrieved from online news. To overcome these problems, a framework is needed that improves performance in creating an online news corpus. In this study, such a framework was developed, based on the automation of a distributed cloud-based crawler using multi-criteria decision-making (MCDM) methods. Self-optimization of cloud tasks uses the TOPSIS method, with the alternative data as the objects to be assessed; task scheduling, which selects edge nodes, applies the AHP method to obtain the best alternative. The framework divides the crawler and information extraction system into several subsystems. The first stage develops a distributed crawler system that distributes work through a node selection mechanism. The second stage develops an information extraction system that combines pattern-based and node-density approaches. The third stage develops automated node management. The contribution of this research is an automated framework for distributed cloud-based crawlers, which previous research has not addressed. The framework activates nodes according to the priority of pending work, so that it speeds up information retrieval while using few resources. The framework's performance will be evaluated by the accuracy of the extraction results and the average time required.
The stages of this research are URL collection, URL filtering, scheduling, URL access, and data extraction. The research focuses on the automation of distributed cloud-based crawlers.
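The abstract names TOPSIS as one of the MCDM methods used but gives no criteria, weights, or data. As a rough illustration only, the following is a minimal sketch of the generic TOPSIS ranking step, not the authors' implementation. The node names, the three criteria (free CPU share, latency, queue length), and the weights are hypothetical; in the framework described, weights of this kind could plausibly come from an AHP step, but that is an assumption.

```python
import math


def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] per alternative (higher is better).

    matrix  : one row per alternative, one column per criterion
    weights : one weight per criterion (assumed to sum to 1)
    benefit : True if larger is better for that criterion, False if smaller is
    """
    n = len(weights)
    # Step 1: vector-normalise each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]

    # Step 2: build the ideal-best and ideal-worst points, column by column.
    columns = list(zip(*weighted))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]

    # Step 3: relative closeness = distance to worst / (distance to best + worst).
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        # If every alternative is identical, no ranking is possible: use 0.5.
        scores.append(d_worst / total if total else 0.5)
    return scores


# Hypothetical alternatives: (free CPU share, latency in ms, queued jobs).
names = ["node-a", "node-b", "node-c"]
matrix = [
    [0.8, 20.0, 2.0],
    [0.5, 50.0, 5.0],
    [0.2, 90.0, 9.0],
]
weights = [0.5, 0.3, 0.2]        # hypothetical criterion weights
benefit = [True, False, False]   # more free CPU is better; less latency and queue is better

scores = topsis(matrix, weights, benefit)
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy data, node-a is best on every criterion and node-c is worst on every criterion, so they coincide with the ideal-best and ideal-worst points; a real scheduler would see mixed trade-offs, which is where the weights matter.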
- Copyright
- © 2024 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - Lanang Prismana
PY  - 2024
DA  - 2024/10/29
TI  - Automation Distributed Cloud Based Crawler
BT  - Proceedings of the 2023 Brawijaya International Conference (BIC 2023)
PB  - Atlantis Press
SP  - 13
EP  - 21
SN  - 2352-5428
UR  - https://doi.org/10.2991/978-94-6463-525-6_3
DO  - 10.2991/978-94-6463-525-6_3
ID  - Prismana2024
ER  -