Extraction of Features with Unstructured Representation From HTML Documents

Ray R. Hashemi, Charles W. Ford, Tim Vamprooyen, John R. Talburt

Research output: Contribution to book or proceeding › Chapter

Abstract

The goal of this research effort is to mine web pages for unstructured features (Names, Titles, and their associations). Unstructured features are not easily identifiable because their ASCII representations lack obvious patterns. In addition, the crucial process of establishing associations among the extracted features adds another level of complexity to the mining process. Applying our methodology to a test bed of 20 URLs with 500 total pages showed that (a) the measures of recovery and accuracy of the extracted Name and Title features are quite satisfactory, and (b) the proposed methodology is highly effective.
Original language: American English
Title of host publication: Proceedings of the International Conference on WWW/Internet
State: Published - Nov 2002

Disciplines

  • Engineering
  • Computer Sciences

Keywords

  • Extraction
  • Features
  • HTML documents
  • Mining names from text
  • Structured features
  • Text mining
  • Unstructured features
  • Unstructured representation
  • Web mining
