Discovery

Collaborating institution: Kasetsart University (Academic); number of joint works: 4

Conference Paper　2018 4 19　IEEE : Institute of Electrical and Electronics Engineers

History-enhanced focused website segment crawler（Last author）

Tanaphol Suebchua, Bundit Manaskasemsak, Arnon Rungsawang, Hayato Yamana
【Abstract】The primary challenge in focused crawling research is how to efficiently utilize computing resources, e.g., bandwidth, disk space, and time, to find as many web pages related to a specific topic as possible. To meet this challenge, we previously introduced a machine-learning-based focused crawler that aims to crawl a group of relevant web pages located in the same directory path, called a website segment, and that has achieved high efficiency so far. One limitation of our previous approach is that it may repeatedly visit a website that does not serve any relevant website segments, in the scenario where the website's segments share the same linkage characteristics as the relevant ones in the training dataset. In this paper, we propose a 'history-enhanced focused website segment crawler' to solve the problem. The idea behind it is that the priority score of an unvisited website segment should be reduced if the crawler has consecutively downloaded many irrelevant web pages from the website. To implement this idea, we propose a new prediction feature, called the 'history feature', that is extracted from the recent crawling results, i.e., relevant and irrelevant web pages gathered from the target website. Our experiment shows that our newly proposed feature could improve the crawling efficiency of our focused crawler by a maximum of approximately 5%. © 2018 IEEE.
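The history idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract describes the history feature as an input to a learned predictor, whereas this sketch simply applies a multiplicative penalty to a base score. The names SiteHistory, irrelevant_streak, and adjusted_priority, as well as the window and decay parameters, are hypothetical.

```python
from collections import defaultdict, deque


class SiteHistory:
    """Keeps the most recent relevance outcomes per website (hypothetical helper)."""

    def __init__(self, window=10):
        # Only the last `window` crawl results per site are remembered.
        self._results = defaultdict(lambda: deque(maxlen=window))

    def record(self, site, relevant):
        """Store whether the page just downloaded from `site` was relevant."""
        self._results[site].append(bool(relevant))

    def irrelevant_streak(self, site):
        """Count consecutive irrelevant pages at the end of the site's history."""
        streak = 0
        for relevant in reversed(self._results[site]):
            if relevant:
                break
            streak += 1
        return streak


def adjusted_priority(base_score, streak, decay=0.8):
    """Lower the priority of a segment as the irrelevant streak grows.

    Assumption: a simple exponential penalty stands in for the learned
    predictor that the abstract refers to.
    """
    return base_score * (decay ** streak)
```

With this penalty, a site that keeps returning irrelevant pages sinks in the crawl frontier, while a single relevant page resets the streak and restores the base score.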

Article　2018 4 1　Springer

Efficient Topical Focused Crawling Through Neighborhood Feature（Last author）

Tanaphol Suebchua, Bundit Manaskasemsak, Arnon Rungsawang, Hayato Yamana
New Generation Computing
【Abstract】A focused web crawler is an essential tool for gathering domain-specific data used by national web corpora, vertical search engines, and so on, since it is more efficient than general Breadth-First or Depth-First crawlers. The problem in focused crawling research is the prioritization of unvisited web pages in the crawling frontier, followed by crawling these web pages in the order of their priority. The most common feature for prioritizing an unvisited web page, adopted in many focused crawling studies, is the relevancy of the set of its source web pages, i.e., its in-linked web pages. However, this feature is limited, because we cannot estimate the relevancy of the unvisited web page correctly if we have few source web pages. To solve this problem and enhance the efficiency of focused web crawlers, we propose a new feature, called the “neighborhood feature”. This enables the adoption of additional already-downloaded web pages to estimate the priority of a target web page. The additionally adopted web pages consist of both web pages located in the same directory as the target web page and web pages whose directory paths are similar to that of the target web page. Our experimental results show that our enhanced focused crawlers outperform the crawlers not utilizing the neighborhood feature as well as the state-of-the-art focused crawlers, including the HMM crawler. © 2017, Ohmsha, Ltd. and Springer Japan KK, part of Springer Nature.
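A rough Python sketch of the neighborhood idea follows. The abstract does not say how directory-path similarity is measured or how neighbor relevance is combined, so both are assumptions here: similarity is the fraction of shared leading path segments, and the score is the mean relevance of qualifying neighbors on the same host. The function names and the min_similarity threshold are hypothetical.

```python
from posixpath import dirname
from urllib.parse import urlsplit


def directory_of(url):
    """Return (host, directory path) for a URL."""
    parts = urlsplit(url)
    return parts.netloc, dirname(parts.path)


def path_similarity(dir_a, dir_b):
    """Fraction of leading path segments shared by two directories (assumed measure)."""
    seg_a = [s for s in dir_a.split("/") if s]
    seg_b = [s for s in dir_b.split("/") if s]
    if not seg_a and not seg_b:
        return 1.0  # both are the site root
    shared = 0
    for a, b in zip(seg_a, seg_b):
        if a != b:
            break
        shared += 1
    return shared / max(len(seg_a), len(seg_b))


def neighborhood_score(target_url, downloaded, min_similarity=0.5):
    """Estimate the relevance of an unvisited page from already-downloaded neighbors.

    `downloaded` maps URL -> relevance in [0, 1]. Pages in the same directory
    have similarity 1.0; pages in similar directories qualify when their
    similarity reaches `min_similarity`. Returns None if no neighbor qualifies.
    """
    host, target_dir = directory_of(target_url)
    scores = []
    for url, relevance in downloaded.items():
        other_host, other_dir = directory_of(url)
        if other_host != host:
            continue
        if path_similarity(target_dir, other_dir) >= min_similarity:
            scores.append(relevance)
    return sum(scores) / len(scores) if scores else None
```

The point of the sketch is that a target page with few or no in-links can still receive an estimate, because pages stored nearby in the site's directory tree contribute evidence.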

Conference Paper　2016 12 16　IEEE : Institute of Electrical and Electronics Engineers

Adaptive Focused Website Segment Crawler（Last author）

Tanaphol Suebchua, Arnon Rungsawang, Hayato Yamana
【Abstract】The focused web crawler has become indispensable for vertical search engines that provide a search service for specialized datasets. These vertical search engines have to collect specific web pages in the web space, whereas search engines such as Google and Bing gather web pages from all over the world. The problem in focused crawling research is how to collect specific web pages with minimal computing resources. We previously addressed this problem by proposing a focused crawling strategy, which utilizes an ensemble machine learning classifier to find a group of relevant web pages, referred to as a relevant website segment. In this paper, we enhance the proposed crawler as follows: 1) We increase the accuracy of predicting website segments by preparing two predictors: a predictor learned from features extracted from relevant source website segments and another predictor learned from features extracted from irrelevant ones. The idea is that there may exist different characteristics between these two types of source website segments. 2) We also propose a noisy data elimination method for use when updating the predictor incrementally during the crawling process. A preliminary experiment shows that our enhanced crawler outperforms a crawler equipped with neither of these approaches by around 12%, at most. © 2016 IEEE.
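The two-predictor arrangement in point 1) can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's method: a tiny nearest-centroid classifier stands in for the ensemble classifier mentioned in the abstract, the feature vectors are placeholders, and the class names CentroidPredictor and DualPredictor are hypothetical. The noisy data elimination in point 2) is not covered.

```python
class CentroidPredictor:
    """Nearest-centroid classifier (a stand-in for the paper's ensemble classifier)."""

    def __init__(self):
        self.centroids = {}

    def fit(self, samples):
        """Train on a list of (feature_vector, label) pairs."""
        grouped = {}
        for features, label in samples:
            grouped.setdefault(label, []).append(features)
        self.centroids = {
            label: [sum(column) / len(rows) for column in zip(*rows)]
            for label, rows in grouped.items()
        }
        return self

    def predict(self, features):
        """Return the label whose centroid is closest (requires prior training data)."""
        def squared_distance(centroid):
            return sum((x - y) ** 2 for x, y in zip(features, centroid))

        return min(self.centroids, key=lambda label: squared_distance(self.centroids[label]))


class DualPredictor:
    """Routes each prediction to a model chosen by the source segment's relevance."""

    def __init__(self):
        # One model for links found in relevant source segments, one for irrelevant ones.
        self.models = {True: CentroidPredictor(), False: CentroidPredictor()}

    def fit(self, samples):
        """Train on (features, source_is_relevant, target_is_relevant) triples."""
        for flag, model in self.models.items():
            model.fit([(f, target) for f, source, target in samples if source == flag])
        return self

    def predict(self, features, source_is_relevant):
        return self.models[bool(source_is_relevant)].predict(features)
```

Splitting the training data this way lets the same feature values lead to different predictions depending on where the link was found, which is the motivation the abstract gives for preparing two predictors.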

Article　2011 4　Springer

Time-weighted web authoritative ranking（Last author）

Bundit Manaskasemsak, Arnon Rungsawang, Hayato Yamana
Information Retrieval
【Abstract】We investigate temporal factors in assessing the authoritativeness of web pages. We present three different metrics related to time: age, event, and trend. These metrics measure recentness, special event occurrence, and trend in revisions, respectively. An experimental dataset is created by crawling selected web pages for a period of several months. This data is used to compare page rankings by human users with rankings computed by the standard PageRank algorithm (which does not include temporal factors) and three algorithms that incorporate temporal factors, including the Time-Weighted PageRank (TWPR) algorithm introduced here. Analysis of the rankings shows that all three temporal-aware algorithms produce rankings more like those of human users than does the PageRank algorithm. Of these, the TWPR algorithm produces rankings most similar to human users', indicating that all three temporal factors are relevant in page ranking. In addition, analysis of parameter values used to weight the three temporal factors reveals that the age factor has the most impact on page rankings, while the trend and event factors have the second-most and the least impact, respectively. Proper weighting of the three factors in the TWPR algorithm provides the best ranking results. © 2010 Springer Science+Business Media, LLC.
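The abstract does not give the TWPR formula, so the following Python sketch should be read only as one plausible way to fold weighted temporal factors into PageRank, not as the published algorithm. It assumes each page has age, event, and trend scores in [0, 1], combines them linearly, and uses the result to bias the teleportation step of a standard power iteration. The function names and the weights alpha, beta, and gamma are hypothetical; their default ordering merely mirrors the abstract's finding that age matters most, trend second, and event least.

```python
def temporal_weight(age, event, trend, alpha=0.5, beta=0.2, gamma=0.3):
    """Linear combination of the three temporal factors (assumed form)."""
    return alpha * age + beta * event + gamma * trend


def weighted_pagerank(links, weights, damping=0.85, iterations=50):
    """PageRank whose teleportation is biased by per-page temporal weights.

    `links` maps each page to the list of pages it links to; every linked
    page must also be a key of `links`. `weights` maps each page to a
    non-negative temporal weight with a positive total.
    """
    nodes = list(links)
    total = sum(weights[n] for n in nodes)
    teleport = {n: weights[n] / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for node in nodes:
            targets = links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank according to the teleport vector.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * teleport[n]
        rank = new_rank
    return rank
```

With uniform weights this reduces to ordinary PageRank, so the temporal factors act purely as a bias toward pages that score well on age, event, and trend.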