Text Detection and Classification of Construction Documents

Narges Sajadfar; Sina Abdollahnejad; Ulrich Hermann; Yasser Mohamed

Abstract:

Large construction projects generate thousands of documents that require careful management. Classification is an important step in document management and control. Construction documents are generated in different formats, many of which are unstructured and contain drawings and images, which makes document classification and control even more challenging. In this paper, a dataset of 5000 documents is used as a case study. Optical Character Recognition (OCR) with bounding boxes is applied to extract text from the set of documents. Two classification methods are then applied: one based on a predefined set of keywords, and another based on a deep learning long short-term memory (LSTM) network. The challenges of the proposed approaches are discussed in relation to the location of OCR bounding boxes across different document layouts and how to obtain a set of representative keywords for each class. Initial results of the study are encouraging and show that OCR combined with text classification is a powerful method for construction document control and can reach an accuracy of 92%.
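
As a rough illustration of the keyword-based approach mentioned above, the following minimal sketch scores OCR-extracted text against a keyword set per class and picks the best match. The class names, the keyword lists, and the function names `tokenize` and `classify_by_keywords` are hypothetical and chosen only for illustration; they are not taken from the study, which may define and weight its keywords differently.

```python
import re
from collections import Counter

# Hypothetical classes and keywords, for illustration only.
# The keyword sets actually used in the study are not given here.
CLASS_KEYWORDS = {
    "drawing": {"elevation", "scale", "plan", "section"},
    "specification": {"shall", "requirements", "standard", "material"},
    "transmittal": {"transmittal", "attached", "enclosed", "distribution"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify_by_keywords(text, class_keywords=CLASS_KEYWORDS, default="unknown"):
    """Return the class whose keywords occur most often in the text.

    Falls back to `default` when no keyword from any class is found.
    """
    counts = Counter(tokenize(text))
    scores = {
        label: sum(counts[keyword] for keyword in keywords)
        for label, keywords in class_keywords.items()
    }
    best_label, best_score = max(scores.items(), key=lambda item: item[1])
    return best_label if best_score > 0 else default


if __name__ == "__main__":
    ocr_text = "The contractor shall meet all material requirements."
    print(classify_by_keywords(ocr_text))
```

A simple count-based score like this depends heavily on how representative the keywords are for each class, which is one of the challenges the abstract raises.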