Automatic classification of textual content in an under-resourced language is challenging because such languages lack lexical resources and preprocessing tools. The bag-of-words (BoW) representation of text in these languages is usually highly sparse and noisy, and classifiers built on it perform poorly. In this project, we explored how effectively lexical normalization of terms and statistical feature pooling improve text classification in an under-resourced language. We focused on classifying citizen feedback on government services, submitted as SMS texts written predominantly in Roman Urdu (an informal, forward-transliterated version of the Urdu language). Our proposed methodology normalized lexical variations of terms using phonetic and string similarity. It then applied a supervised feature extraction technique to obtain highly discriminating, category-specific features. Our experiments with classifiers showed that lexical normalization combined with feature pooling significantly improved classification performance over standard representations.
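To make the normalization step concrete, the following is a minimal sketch of how spelling variants might be merged when they share a phonetic key and are also close in string similarity. The key function (`phonetic_key`), the similarity threshold of 0.7, and the sample tokens are assumptions made for illustration only; they are not the phonetic encoding or similarity measure used in the project.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import groupby


def phonetic_key(word: str) -> str:
    """Hypothetical phonetic key: collapse repeated letters, then drop
    every vowel except a leading one. A stand-in for a real phonetic scheme."""
    collapsed = "".join(ch for ch, _ in groupby(word.lower()))
    if not collapsed:
        return ""
    head, tail = collapsed[0], collapsed[1:]
    return head + "".join(ch for ch in tail if ch not in "aeiouy")


def build_normalizer(tokens, min_ratio=0.7):
    """Map each token to a canonical spelling.

    Two tokens are merged only if they share a phonetic key AND their
    string similarity (difflib ratio) is at least `min_ratio`.
    The most frequent spelling in a group becomes the canonical form.
    """
    counts = Counter(t.lower() for t in tokens)
    canonical = {}  # phonetic key -> canonical spellings seen so far
    mapping = {}    # token -> canonical spelling
    # Visit frequent spellings first so they are chosen as canonical forms.
    for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        key = phonetic_key(word)
        for cand in canonical.get(key, []):
            if SequenceMatcher(None, word, cand).ratio() >= min_ratio:
                mapping[word] = cand
                break
        else:
            # No sufficiently similar canonical form: start a new group.
            canonical.setdefault(key, []).append(word)
            mapping[word] = word
    return mapping


def normalize(text: str, mapping: dict) -> str:
    """Replace each whitespace-separated token by its canonical spelling."""
    return " ".join(mapping.get(t.lower(), t.lower()) for t in text.split())


# Illustrative Roman Urdu spelling variants (made-up sample, not project data).
sample = ["bohat", "bohat", "bohot", "bohaat", "acha", "acha", "achha", "pani"]
mapping = build_normalizer(sample)
normalized = normalize("pani bohot achha", mapping)
```

Merging variants in this way shrinks the vocabulary before the BoW representation is built, which is how normalization would reduce sparsity. A supervised feature extraction step would then operate on the normalized terms.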