
Change Rate Estimation and Optimal Freshness in Web Page Crawling

Avrachenkov, K and Patil, K and Thoppe, G (2020) Change Rate Estimation and Optimal Freshness in Web Page Crawling. In: 13th EAI International Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS 2020, 18-20 May 2020, Tsukuba, Japan, pp. 3-10.

ACM_INT_CON_PRO_SER_3-10_2020.pdf - Published Version

Official URL: https://dx.doi.org/10.1145/3388831.3388846

Abstract

To provide quick and accurate results, a search engine maintains a local snapshot of the entire web. To keep this local cache fresh, it employs a crawler that tracks changes across web pages. However, finite bandwidth and server restrictions constrain the crawling frequency. Consequently, the ideal crawling rates are those that maximise the freshness of the local cache while respecting these constraints. Azar et al. [2] recently proposed a tractable algorithm to solve this optimisation problem. However, they assume knowledge of the exact page change rates, which is unrealistic in practice. We address this issue here. Specifically, we provide two novel schemes for online estimation of page change rates. Both schemes need only partial information about the page change process, i.e., they only need to know whether the page has changed since it was last crawled. For both schemes, we prove convergence and derive their convergence rates. Finally, we provide numerical experiments comparing the performance of our proposed estimators with existing ones (e.g., MLE). © 2020 ACM.
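
The partial-information setting described in the abstract can be illustrated with a minimal sketch. It is not a reproduction of the paper's two proposed schemes; it shows only the kind of baseline the abstract mentions (MLE), under assumptions the abstract does not state: page changes follow a Poisson process, and the crawler visits at equal intervals. All function names and parameter values are hypothetical and chosen for illustration.

```python
import math
import random


def simulate_change_flags(rate, interval, n_crawls, rng):
    """Return one 0/1 flag per crawl: 1 if the page changed since the last crawl.

    Assumption: changes follow a Poisson process, so the probability of at
    least one change in a gap of length `interval` is 1 - exp(-rate * interval).
    """
    p_change = 1.0 - math.exp(-rate * interval)
    return [1 if rng.random() < p_change else 0 for _ in range(n_crawls)]


def naive_estimate(flags, interval):
    """Detected changes per unit time.

    Biased low: several changes inside one interval are seen as a single flag.
    """
    return sum(flags) / (len(flags) * interval)


def mle_estimate(flags, interval):
    """Maximum-likelihood estimate of the Poisson rate for equally spaced crawls.

    Solves 1 - exp(-rate * interval) = (fraction of crawls that saw a change).
    Returns math.inf when every crawl saw a change, since no finite maximiser
    exists in that case.
    """
    frac = sum(flags) / len(flags)
    if frac >= 1.0:
        return math.inf
    return -math.log(1.0 - frac) / interval


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    flags = simulate_change_flags(rate=2.0, interval=0.5, n_crawls=10_000, rng=rng)
    print("naive:", naive_estimate(flags, 0.5))
    print("MLE:  ", mle_estimate(flags, 0.5))
```

Both estimators use only the binary "changed or not" flags. The naive count underestimates the rate because multiple changes between crawls collapse into one observation, whereas the MLE corrects for this but is undefined (infinite here) when every crawl detects a change.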

Item Type: Conference Paper
Publication: ACM International Conference Proceeding Series
Publisher: Association for Computing Machinery
Additional Information: Cited by: 0; Conference: 13th EAI International Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS 2020; Conference Date: 18 May 2020 through 20 May 2020; Conference Code: 160415
Keywords: Search engines; Websites; Convergence rates; Finite bandwidth; Numerical experiments; On-line estimation; Optimisation problems; Partial information; Rate estimation; Tractable algorithms; Web crawler
Department/Centre: Division of Electrical Sciences > Computer Science & Automation
Date Deposited: 05 Jan 2021 06:02
Last Modified: 05 Jan 2021 06:02
URI: http://eprints.iisc.ac.in/id/eprint/65728
