In this article, we'll detail the methodology behind a Proof of Concept (POC) workflow tailored for ID card text extraction. This solution was developed using five open source deep learning models implemented in Python.
We will offer insights into the use of instance segmentation, classification and Optical Character Recognition (OCR) techniques.
In a world that's rapidly going digital, the ability to extract information quickly and accurately from physical documents is becoming indispensable. Whether it's for customer onboarding in the banking sector, verifying identity in online services, or streamlining administrative tasks in various industries, ID card information extraction plays a pivotal role.
But as anyone who has manually entered data can attest, manual extraction is prone to errors, tedious, and time-consuming. With advances in machine learning and Computer Vision, we now have the tools to automate this process, making it faster, more accurate, and adaptable to a wide range of ID card formats.
This work was conducted by Ambroise Berthe, an R&D Computer Vision Engineer at Ikomia, in early 2022. The insights shared in this article draw inspiration from his comprehensive report.
The solution we've designed comprises a series of independent open source algorithms capable of:
Below, we delve into the components that make up the identity document reading system. For this POC, our goal is to fine-tune the algorithms for several document variants, including:
Nevertheless, with the right dataset, this solution can be tailored to accommodate any kind of document.
In today's world, the most effective methods for creating algorithms to perform complex tasks are based on deep learning. This supervised technique requires a substantial amount of reliable data to operate accurately. Therefore, dataset creation was the first step in this project.
Multiple datasets were essential for this project, as several different task models are involved.
We needed a database to train a model capable of segmenting identification documents within an image. We chose segmentation over simple detection to precisely extract the image area for processing. This decision was crucial as the documents photographed might contain extra text that could have disrupted the subsequent algorithm steps:
For each image, a file was produced detailing the class and the polygon outlining each identification document. This dataset comprises approximately 100 images per document type, totaling nearly 1,100 annotated images.
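To give an idea of what such an annotation can look like, here is an illustrative layout with one entry per document found in the image; the keys and file format shown below are hypothetical and may differ from those used in the POC.

```python
# Hypothetical per-image annotation: class label and outline polygon for each
# identification document visible in the image (pixel coordinates).
annotation = {
    "image": "sample_0001.jpg",
    "documents": [
        {
            "class": "passport",
            "polygon": [[112, 80], [1030, 95], [1018, 760], [98, 742]],
        }
    ],
}
```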
From this initial dataset, we cropped and straightened all the images. This created the image database that would be annotated for OCR and Key Information Extraction (KIE).
This image set was also used to train the model to straighten the images so that the text is horizontal.
We decided to annotate text at the word level rather than at the sentence level. While annotating at the word level is more time-consuming, it allows for easier subsequent manipulation of the database. Specifically, it's simpler to merge detections than to split one.
This involves assigning to every word in the image:
This project incorporates five open source deep learning models implemented in Python. Choosing the right algorithm for each task is crucial. Beyond an algorithm's inherent performance, it must also be compatible with the others, both in terms of its inputs and outputs and of its execution environment.
Since Ikomia's AI team continually monitors scientific advancements in the field of Computer Vision, we keep track of the latest top-performing algorithms for each application domain (classification, object detection, segmentation, OCR, etc.).
For this task, we tested two algorithms, each offering a balance between execution time and performance. We aimed for real-time responsive algorithms to ensure that the project's entire process remains efficient.
SparseInst offers a slightly faster model, but its performance is far inferior to YOLOv7's. We favored the latter, since this step forms the foundation of the entire process and requires high precision.
Here, we employed two variants of ResNet. The first model identifies the four corners of an ID from its segmentation mask. The second determines the document's orientation—whether it's tilted at 0°, 90°, 180°, or 270°.
We favored ResNet over other approaches, such as Deskew and OpenCV, because it offers a highly adaptable architecture, producing compact models tailored to our specific needs.
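As an illustration of how such compact, task-specific ResNet variants can be built, here is a minimal sketch based on a stock torchvision ResNet-18. The POC models are ResNet-inspired rather than exactly this architecture, and the corner model would additionally need its first convolution adapted to single-channel mask inputs.

```python
import torch.nn as nn
from torchvision import models

def make_resnet_variant(out_features: int) -> nn.Module:
    """ResNet-18 with its final fully connected layer resized for the task."""
    model = models.resnet18(weights="DEFAULT")
    model.fc = nn.Linear(model.fc.in_features, out_features)
    return model

# 8 outputs to regress the (x, y) coordinates of the 4 document corners.
corner_regressor = make_resnet_variant(out_features=8)
# 4 outputs to classify the orientation: 0°, 90°, 180° or 270°.
orientation_classifier = make_resnet_variant(out_features=4)
```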
Numerous text detection models exist, so we focused on those from prestigious conferences, such as DBNet and DBNet++ (TPAMI'2022), Mask R-CNN (ICCV'2017), PANet (ICCV'2019), PSENet (CVPR'2019), TextSnake (ECCV'2018), DRRG (CVPR'2020) and FCENet (CVPR'2021).
We settled on DBNet and DBNet++ for their good performance on datasets with irregular texts. While DBNet++ is an improved version of DBNet and offers a slight performance boost, it's not compatible with ONNX conversion, which optimizes the model during deployment. Thus, we use the DBNet model.
As for text detection, we selected and tested two text recognition models. Even though ABINet was published two years after SATRN, SATRN simply outperforms it in terms of speed and accuracy on irregular text, like that found on ID photos. For this reason, we selected SATRN, as implemented in the OpenMMLab framework, as our text recognition algorithm.
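For reference, both model families are exposed through MMOCR's high-level wrapper (version 0.x). The sketch below assumes the stock model aliases, whereas the POC uses checkpoints fine-tuned on the annotated ID dataset.

```python
from mmocr.utils.ocr import MMOCR

# Text detection (DBNet, ResNet-18 backbone) + text recognition (SATRN)
# using MMOCR 0.x default model aliases.
ocr = MMOCR(det="DB_r18", recog="SATRN")
results = ocr.readtext("id_snippet.jpg")
```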
We chose SDMG-R for its smooth integration with SATRN. Using spatial relationships and the features of detected text regions, SDMG-R employs a dual-modality graph-based deep learning network for end-to-end KIE. This enables it to classify and link words effectively.
After selecting and implementing the algorithm, we carried out multiple training sessions using the database created earlier, retaining the best results. The training was conducted on a powerful computation server equipped with GPUs, essential for achieving operations within a reasonable time frame.
We obtained excellent precision metrics, making this model a robust foundation for subsequent steps. Once trained, the model can now predict a binary segmentation mask and the document type.
While the segmentations are accurate, it is still worthwhile to smooth these segmentation masks. Typically, under normal shooting conditions, the photographed documents are quadrilaterals. Thus, we developed a deep learning model inspired by ResNet that is capable of converting any segmentation mask into a quadrilateral.
We trained this model without manual annotation, using automatically generated binary masks.
At this stage, the program can detect documents in an image and provide the coordinates of their four corners. By determining the smallest bounding rectangle around the four points of an ID, we can then crop the image to obtain well-defined ID snippets.
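A minimal sketch of this cropping step, assuming the four predicted corners are available as a NumPy array of (x, y) points:

```python
import cv2
import numpy as np

def crop_id_snippet(image: np.ndarray, corners: np.ndarray) -> np.ndarray:
    """Crop the smallest axis-aligned rectangle enclosing the 4 predicted corners.
    `corners` is a (4, 2) array of pixel coordinates."""
    x, y, w, h = cv2.boundingRect(corners.astype(np.int32))
    return image[y:y + h, x:x + w]
```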
However, these snippets might not always be oriented correctly, so the text could be upside down instead of running horizontally from left to right. The next model we implemented, largely inspired by ResNet, determines the ID's orientation. This means that the model predicts a number between 0 and 3, indicating how many 90° rotations are required to straighten the text.
We trained this model using the cropped image database from the previous algorithms, which we manually straightened. We used 80% of the images for training, and the remaining 20% were used for evaluation. The achieved score was 100%.
After implementing MMOCR from OpenMMLab into Ikomia, we trained DBNet on a comprehensive dataset covering all document types. The aim was to develop a single text detection model suitable for all identification document types.
The highest score achieved post-training was an h-mean (the harmonic mean of detection precision and recall) of 0.86. This is particularly commendable, especially when considering the reading challenges presented by some documents and the scores achieved by competitors at the ICDAR2019 competition.
Similar to text detection, we decided to reduce the number of models in use by training a single model for all document types.
The obtained scores are quite good, but it's evident that approximately 1 in 11 words will be misspelled. Furthermore, the longer the word, the higher the likelihood of it containing an error, since every additional character is one more opportunity for a recognition mistake.
We attempted to consolidate the extraction of primary information into a single model, but it proved inconclusive. The chosen KIE model, SDMG-R, might require training on a broader dataset for diverse document types. Contrary to previous approaches, we later trained a distinct model for each document type, which yielded better results.
The achieved scores vary depending on the document types. It's also worth noting that scores are deliberately rounded to the nearest 5% since only 20% of the dataset is used for evaluation, representing at most 20 samples (this can vary based on the document type).
We see that the algorithm excels with certain documents (e.g., ID cards) but struggles with others (e.g. old driving license). The challenges with old driving licenses could be due to:
Additionally, the evaluation is based on perfect bounding boxes and character strings, not on the predictions made by the text detection and text recognition algorithms. Therefore, an evaluation under real-world conditions would likely yield different results with slightly lower performance, but this would not allow for an assessment of SDMG-R alone.
The SDMG-R model assigns a class to each word. The next step often involves merging boxes when the sought field consists of multiple words. We use a function, inspired by the stitch_boxes_into_lines function from MMOCR, that groups words based on their class, text, and geometry.
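A simplified sketch of this grouping step is shown below; the dictionary keys and box format are assumptions, and the real function also takes the text geometry into account more finely, in the spirit of MMOCR's stitch_boxes_into_lines.

```python
from collections import defaultdict

def merge_words_by_field(words):
    """Group word detections by predicted KIE class, sort them roughly in
    reading order (top to bottom, then left to right) and concatenate them.
    Each word is assumed to be {"text": str, "cls": str, "box": [x0, y0, x1, y1]}."""
    by_class = defaultdict(list)
    for word in words:
        by_class[word["cls"]].append(word)

    merged = {}
    for cls, group in by_class.items():
        group.sort(key=lambda w: (w["box"][1], w["box"][0]))
        merged[cls] = " ".join(w["text"] for w in group)
    return merged
```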
It's common for SDMG-R to make mistakes and predict multiple solutions for the same field. To address these scenarios, we established systematic rules that are as natural as possible. This step also standardizes the algorithm's output according to the application case.
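One such rule, sketched below under the assumption that each candidate value comes with a confidence score, is to keep only the most confident prediction for a given field; the actual rules in the POC are more field-specific.

```python
def resolve_field(candidates):
    """Keep the most confident of several candidate values for the same field.
    `candidates` is a list of (text, confidence) tuples; returns None if empty."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]
```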
At this stage, all algorithms involving Deep Learning are executed using the Python framework PyTorch. Designed for Deep Learning researchers, PyTorch excels in GPU model training but isn't optimized for CPU inference.
Anticipating potential deployment on such an architecture, we chose to convert the most resource-intensive models into the ONNX format. This format, coupled with its inference engine, can potentially reduce model size and computation times while minimizing efficiency loss. This conversion, when feasible, offers speed gains ranging from 1.5 to 2 times.
For this solution, we chose to optimize only the text detection and recognition models. Combined, these two models account for over 90% of the entire algorithm's computation time.
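A hedged sketch of such an export for one of these PyTorch models is shown below; the input resolution, dynamic axes and opset must of course match the configuration actually deployed.

```python
import torch

def export_to_onnx(model: torch.nn.Module, onnx_path: str,
                   height: int = 736, width: int = 736) -> None:
    """Export a trained model to ONNX with dynamic batch and image dimensions."""
    model.eval()
    dummy = torch.randn(1, 3, height, width)
    torch.onnx.export(
        model, dummy, onnx_path,
        input_names=["input"], output_names=["output"],
        dynamic_axes={"input": {0: "batch", 2: "height", 3: "width"}},
        opset_version=11,
    )
```

The exported file can then be run on CPU with ONNX Runtime's inference engine.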
We subsequently combined all the described algorithms into a single program in the form of an Ikomia algorithm. This algorithm can be used with Ikomia STUDIO and Ikomia API.
The output generated by our algorithm is provided in JSON format and contains a list with an information dictionary for each detected ID in the image.
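As an illustration, the output has roughly the following shape; the field names below are placeholders, and the actual keys depend on the document type and the application case.

```python
# Hypothetical output structure: one dictionary per detected ID in the image.
example_output = [
    {
        "document_type": "new_id_card",   # class predicted by the segmentation model
        "fields": {
            "surname": "...",
            "first_names": "...",
            "birth_date": "...",
            "document_number": "...",
        },
    }
]
```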
The solution we're proposing was assessed on approximately 30 images for each document type.
The goal of this section is to apply the algorithm to each image in the evaluation dataset, save the result in a file, and then compare it with the ground truth. Comparisons will be made between character strings. We propose two ways to score a comparison between character strings:
From this POC, our solution achieved an overall strict score of 68% and a NED score of 82%. This suggests that if our algorithm were initially deployed as a pre-fill tool, a human verifier would only need to correct, on average, less than 20% of the entered characters.
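For reference, one common reading of these two metrics is sketched below: a strict score that only rewards exact matches, and a NED score equal to one minus the edit distance normalized by the longer string. This is our interpretation, not the exact evaluation code used in the POC.

```python
def strict_score(pred: str, gt: str) -> float:
    """1.0 if the prediction exactly matches the ground truth, else 0.0."""
    return float(pred == gt)

def ned_score(pred: str, gt: str) -> float:
    """1 - Levenshtein(pred, gt) / max(len(pred), len(gt))."""
    m, n = len(pred), len(gt)
    if max(m, n) == 0:
        return 1.0
    row = list(range(n + 1))  # distances for the empty prefix of `pred`
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = min(row[j] + 1,                            # deletion
                      row[j - 1] + 1,                        # insertion
                      prev + (pred[i - 1] != gt[j - 1]))     # substitution
            prev, row[j] = row[j], cur
    return 1.0 - row[n] / max(m, n)
```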
Examining the outcomes, it becomes apparent that the scores differ significantly based on the document type and the specific field being extracted (not shown).
In general, fields related to locations and first names score lower than others. These fields often contain more characters, increasing the likelihood of text recognition errors. On the other hand, fields like dates or numbers (excluding the front of old licenses) showcase commendable scores.
For instance, the date of birth field on a passport yielded a Strict score of 0.93 and a NED score of 0.97 based on 30 samples. In contrast, the third 'first name' on the new French ID card had a Strict score of 0.23 and a NED score of 0.42 from 13 samples.
At the end of this prototyping phase, analyzing the evaluation results leads to the following observations:
Based on these findings, we suggest continuing the algorithm's development with a second R&D sequence to address the current weaknesses of the solution on certain document types. This phase would incorporate the findings of this study to implement improvements addressing the identified challenges.
The primary focus will involve refining the annotated database and making necessary adjustments in different stages of the processing chain.
The solution developed by Ikomia for this project demonstrates the potential of AI in automating repetitive tasks and improving efficiency in various processes.
The ID card information extraction solution was developed using open source algorithms implemented in Python and available on Ikomia HUB. Using the open source training algorithms, we effortlessly trained our custom deep learning models, managing all parameters with just a few lines of code.
Although we employed algorithms from various frameworks like TorchVision, YOLO, and OpenMMLab, the Ikomia API seamlessly chained the inputs and outputs of the inference models used to build the custom solution.
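To give a flavor of this chaining, here is a hedged sketch using the Ikomia API's Workflow object; the task names below are placeholders for algorithms published on Ikomia HUB and may not match the exact names of the custom algorithms built for this POC.

```python
from ikomia.dataprocess.workflow import Workflow

wf = Workflow()
# Placeholder task names: instance segmentation, then text detection/recognition.
segmentation = wf.add_task(name="infer_yolo_v7_instance_segmentation", auto_connect=True)
text_detection = wf.add_task(name="infer_mmlab_text_detection", auto_connect=True)
text_recognition = wf.add_task(name="infer_mmlab_text_recognition", auto_connect=True)

# Run the whole chain on a single image.
wf.run_on(path="id_card_photo.jpg")
```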