A generic tool to identify and extract regions of text by analyzing connected components
Everyone who has to deal with document image analysis.
Layout analysis is the first major step in a document image analysis workflow. The correctness of the output of page segmentation and region classification is crucial as the resulting representation is the basis for all subsequent analysis and recognition processes.
The system identifies and extracts regions of text by analyzing connected components constrained by black and white (background) separators. The rest is filtered out as non-text. First, the image is binarized, any skew is corrected and black page borders are removed. Subsequently, connected components are extracted and filtered according to their size (very small components are filtered out).
Any Posix compliant system