Abstract
The goal of this research effort is to mine web pages for unstructured features: Names, Titles, and the associations between them. Unstructured features are not easily identifiable because their ASCII representations lack obvious patterns. In addition, the crucial step of establishing associations among the extracted features adds another level of complexity to the mining process. Applying our methodology to a test bed of 20 URLs comprising 500 pages in total showed that (a) the recovery and accuracy measures for the extracted Name and Title features are quite satisfactory, and (b) the proposed methodology is highly effective.
Original language | American English |
---|---|
Title of host publication | Proceedings of the International Conference on WWW/Internet |
State | Published - Nov 2002 |
Disciplines
- Engineering
- Computer Sciences
Keywords
- Extraction
- Features
- HTML documents
- Mining names from text
- Structured features
- Text mining
- Unstructured features
- Unstructured representation
- Web mining