Comparative Study of Text Representation Methods


Several text representation methods, such as bag-of-words and N-gram models, are widely used in natural language processing, text mining, and web data analysis. The bag-of-words representation is simple to implement and performs well, but it becomes complicated to apply to documents in oriental languages, where intrinsic word separators are not available. The N-gram representation can be applied to different languages, with or without separators, because it processes a document by moving a window through it character by character. However, some problems, such as sparseness and the zero-frequency problem, remain unsolved in the N-gram model. In a previous study we proposed a pattern representation scheme using data compression (PRDC). The PRDC method not only processes text data independently, but also processes multimedia data effectively. In this study, we introduce the proposed approach and compare it with the two text representation methods above. Performance is compared in terms of clustering ability, and we analyze the text representation methods based on the experimental results.
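
The two baseline representations can be illustrated with a minimal sketch. This is not the implementation used in the study: the function names `bag_of_words` and `char_ngrams`, the whitespace tokenization, the lowercasing, and the default window size `n=2` are all assumptions made for illustration only.

```python
from collections import Counter


def bag_of_words(text):
    """Count tokens split on whitespace.

    Assumption: the language marks word boundaries with whitespace,
    which is exactly what fails for text without intrinsic separators.
    """
    return Counter(text.lower().split())


def char_ngrams(text, n=2):
    """Count character N-grams by sliding a window of n characters
    through the text, advancing one character at a time.

    No separators are needed, so this works on any character sequence.
    Texts shorter than n yield an empty Counter.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Hypothetical inputs, chosen only to show the shape of the output.
bow = bag_of_words("the cat saw the dog")
grams = char_ngrams("abab", n=2)
```

Because most possible N-grams never occur in a given document, the resulting count vectors are mostly zeros, which is the sparseness and zero-frequency issue mentioned above.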

