Abui Verb Clustering

This is a simple tool to visualize Abui verbs, projected with Principal Component Analysis (PCA), together with their grouping by K-means clustering. Each cluster has its own colour, and size corresponds to cluster size. Examining the principal components can support linguistic description by indicating which morphological properties contribute most to the clustering and should therefore be documented with priority.
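As a rough illustration of how principal components can point to the most influential properties, the sketch below computes only the first principal component of a small binary feature table and ranks the features by the absolute value of their loadings. It uses power iteration on the covariance matrix rather than a full PCA. The feature names and data are hypothetical placeholders, not taken from the Abui data sets, and this is not necessarily how the tool itself computes its projection.

```python
import math

def first_component(rows, iterations=200):
    """Return the first principal component (a unit vector) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Non-uniform start vector, to reduce the chance of starting
    # orthogonal to the dominant direction.
    vec = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return vec

# Hypothetical features and verb representations (one row per verb).
features = ["feature_A", "feature_B", "feature_C"]
rows = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
]

component = first_component(rows)
# Features with the largest absolute loading contribute most to this component.
ranked = sorted(zip(features, component), key=lambda p: abs(p[1]), reverse=True)
for name, loading in ranked:
    print(f"{name}: {loading:+.3f}")
```

The sign of a loading is arbitrary; only its magnitude relative to the other loadings matters for this kind of ranking.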

There are two data sets: the so-called Curated data set and Olga's data set. They contain mostly the same verbs, but their representations differ. The Curated data set is first clustered with K-means. A Naive Bayes classifier is then trained on the resulting cluster labels and used to classify the verbs in Olga's data set.
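The second step of this pipeline can be sketched as follows: a Naive Bayes model is fitted to cluster labels and then applied to verbs from another data set. The sketch assumes binary feature vectors and a Bernoulli model with Laplace smoothing; the source does not say which Naive Bayes variant is used. The cluster labels are written out by hand here in place of real K-means output, and all verb names and vectors are hypothetical.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(vectors, labels, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model with Laplace smoothing (alpha)."""
    n_features = len(vectors[0])
    counts = defaultdict(int)                      # verbs per cluster
    ones = defaultdict(lambda: [0] * n_features)   # per-cluster counts of 1-valued features
    for vec, lab in zip(vectors, labels):
        counts[lab] += 1
        for j, value in enumerate(vec):
            ones[lab][j] += value
    total = len(labels)
    model = {}
    for lab, n in counts.items():
        log_prior = math.log(n / total)
        probs = [(ones[lab][j] + alpha) / (n + 2 * alpha) for j in range(n_features)]
        model[lab] = (log_prior, probs)
    return model

def predict(model, vec):
    """Return the cluster label with the highest log posterior for vec."""
    best_label, best_score = None, -math.inf
    for lab in sorted(model):
        log_prior, probs = model[lab]
        score = log_prior + sum(
            math.log(p if value else 1 - p) for value, p in zip(vec, probs)
        )
        if score > best_score:
            best_label, best_score = lab, score
    return best_label

# Hypothetical learning data: feature vectors plus cluster labels
# standing in for the output of the K-means step.
learning_vectors = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]]
cluster_labels = [0, 0, 1, 1]

# Hypothetical verbs from the second data set.
second_set = {"verb_x": [1, 0, 0], "verb_y": [0, 1, 1]}

model = train_bernoulli_nb(learning_vectors, cluster_labels)
predicted = {verb: predict(model, vec) for verb, vec in second_set.items()}
print(predicted)
```

Training the classifier on cluster labels lets verbs from a differently built data set be placed into the existing clusters without re-running the clustering.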

The Curated data set is very unbalanced. It contains 351 verbs but only 86 unique representations, and more than 60% of the verbs share just two of those representations.
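Figures of this kind can be obtained by counting identical representations. The sketch below assumes each verb is stored as a tuple of feature values; the toy data is hypothetical and far smaller than the real learning set.

```python
from collections import Counter

def representation_summary(representations, top=2):
    """Return (number of unique representations, share of verbs covered by the `top` most frequent)."""
    counts = Counter(representations)
    covered = sum(n for _, n in counts.most_common(top))
    return len(counts), covered / len(representations)

# Hypothetical toy data: 10 verbs, each represented as a tuple of binary features.
toy = (
    [(1, 0, 0)] * 4
    + [(0, 1, 0)] * 3
    + [(0, 0, 1), (1, 1, 0), (1, 0, 1)]
)

unique, share = representation_summary(toy)
print(f"{len(toy)} verbs, {unique} unique representations, "
      f"top two cover {share:.0%}")
```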

Learning set

Corpus data

Number of clusters:

Scroll to zoom the 3D picture; use the mouse to rotate it.

If more than one verb lies at the centre of a cluster, the first in alphabetical order is selected as the cluster representative.
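One possible reading of this rule is sketched below: the representative is the verb closest to the cluster centre, and ties are broken alphabetically. The function name, the tolerance and the data are hypothetical, and the tool may define "in the centre" differently.

```python
import math

def cluster_representative(verbs, vectors, centre, tol=1e-9):
    """Pick the verb nearest to the centre; break ties alphabetically."""
    distances = {verb: math.dist(vec, centre) for verb, vec in zip(verbs, vectors)}
    nearest = min(distances.values())
    # All verbs whose distance equals the minimum, up to a small tolerance.
    candidates = [verb for verb, d in distances.items() if d - nearest <= tol]
    return min(candidates)

# Hypothetical cluster: two verbs share the same representation.
verbs = ["verb_b", "verb_a", "verb_c"]
vectors = [[1, 0], [1, 0], [0, 1]]
centre = [2 / 3, 1 / 3]  # mean of the three vectors

representative = cluster_representative(verbs, vectors, centre)
print(representative)
```

Because many verbs share identical representations, ties at the minimum distance are likely, which is why a deterministic tie-break is useful.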

Verbs in bold are correctly listed in the same cluster in both data sets. Verbs in italics are also included in the corpus, but in a different cluster. Verbs in normal font are not included in the second data set.

Data