Knowledge-poor approach to constructing word frequency lists, with examples from Romance languages

  1. Makagonov, Pavel
  2. Gelbukh, Alexander F.
  3. Alexandrow, Mikhail
  4. Blanco Escoda, Xavier
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Publication year: 2004

Issue: 33

Pages: 127-132

Type: Article

Abstract

Word frequency lists extracted from documents are widely used in text clustering and categorization. Such lists are usually compiled with morphology-based approaches (such as the Porter stemmer) that join words sharing the same base meaning. However, such approaches require many language-dependent linguistic resources or much linguistic knowledge when applied to multilingual data and multi-thematic document collections. We suggest two procedures based on empirical formulae of word similarity. Simple adjustment of the formulae's parameters allows tuning them to different European languages. We demonstrate the application of o