Hybrid Approach for Extractive Text Summarization of Indonesian News Articles using Machine Learning and Heuristic Features
| Issue | 2026 |
| Published | 14 May 2026 |
| Section | Articles |
Abstract
The rapid growth of Indonesian digital news content highlights the need for effective automated summarization methods tailored to morphologically rich, low-resource languages. This study proposes a linguistically informed hybrid approach to extractive text summarization designed specifically for the characteristics of the Indonesian language. The framework integrates machine learning classification with carefully engineered linguistic features to improve summary relevance while maintaining computational efficiency. The methodology combines Logistic Regression and TF-IDF vectorization with additional heuristic features, including positional weighting, keyword relevance, and sentence length scoring. The system is evaluated on a dataset of 750 Indonesian news documents (10,159 sentences) annotated by three linguistic experts and spanning multiple news domains, which allows cross-domain behavior to be assessed. Experimental results show that the proposed approach achieves a classification accuracy of 82.53% and a classification F1-score of 0.640. The system also maintains high computational efficiency, requiring only 0.18 seconds per document with a compact model size of 124 MB. Evaluation of summarization quality further indicates competitive content preservation, with a ROUGE-1 F1-score of 0.778. Compared to traditional rule-based baselines, the hybrid system provides a more balanced trade-off between effectiveness and efficiency. Despite these advantages, performance varies across document structures, which indicates limitations in handling less structured content and suggests the need for improved structural adaptability and cross-domain robustness. Overall, this work contributes a practical and linguistically tailored summarization framework that supports scalable deployment for Indonesian digital news processing.
Keywords: text summarization, machine learning, hybrid classification, extractive summarization, low-resource languages
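The abstract names three heuristic sentence features (positional weighting, keyword relevance, sentence length scoring) and reports a ROUGE-1 F1-score, but does not give their formulas. The following is a minimal illustrative sketch of how such features and the metric could be computed, not the authors' implementation. The function names `tokenize`, `heuristic_features`, and `rouge1_f1`, the `ideal_len` parameter, and every scoring formula are assumptions made for illustration. In a hybrid setup such as the one described, feature rows like these would presumably be concatenated with TF-IDF vectors before being passed to the Logistic Regression classifier.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic tokenizer (assumption: the paper's tokenizer is not specified)."""
    return re.findall(r"[a-zA-Z]+", text.lower())


def heuristic_features(sentences, keywords, ideal_len=20):
    """Return one [position, keyword, length] feature row per sentence.

    All three formulas are hypothetical stand-ins for the features
    named in the abstract.
    """
    n = len(sentences)
    keyword_set = {k.lower() for k in keywords}
    rows = []
    for i, sentence in enumerate(sentences):
        tokens = tokenize(sentence)
        # Positional weighting: earlier sentences score higher (lead bias in news).
        position = 1.0 - i / n
        # Keyword relevance: share of tokens that appear in the keyword set.
        keyword = sum(t in keyword_set for t in tokens) / len(tokens) if tokens else 0.0
        # Length scoring: 1.0 at ideal_len tokens, decaying linearly to 0.0.
        length = max(0.0, 1.0 - abs(len(tokens) - ideal_len) / ideal_len)
        rows.append([position, keyword, length])
    return rows


def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall with clipped counts."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared unigram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    doc = [
        "Pemerintah mengumumkan kebijakan baru hari ini.",
        "Cuaca cerah.",
    ]
    for row in heuristic_features(doc, ["pemerintah", "kebijakan"], ideal_len=6):
        print(row)
    print(rouge1_f1("pemerintah mengumumkan kebijakan", "pemerintah umumkan kebijakan baru"))
```

Keeping the heuristic features as a small dense vector per sentence makes them cheap to compute, which is in keeping with the efficiency emphasis of the abstract. The weights that combine them with lexical evidence would be learned by the classifier rather than set by hand.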
