When you use your smartphone to photograph a landscape, the device’s orientation is detected, and the final picture is well-oriented. In other words, regardless of how you hold your phone, whether upside down or sideways, causing the camera sensor to be oriented differently, the resulting photo will automatically adjust to ensure that the sky is at the top.
But, when you use your smartphone to scan a document, it’s often laid flat, and you position your device above it. With your device parallel to the ground, the sensors can’t detect the device’s rotation when you turn your smartphone from portrait to landscape. This results in your scanned document appearing sideways or upside down. Historically, you’d need to rotate the image manually, a typical process with our competitors and even standard photo apps.
Recognizing the frustration caused by this common scanning hiccup, we set out to simplify the process for our users. Our goal was clear: develop an algorithm that determines the correct orientation of a document solely from the image’s content, eliminating the need for manual adjustments. It also had to run without slowing down the app: for privacy reasons, everything we create is embedded into Genius Scan and doesn’t rely on external servers.
The Quest for a Solution
As we delved into the challenge, we first explored existing options but found them lacking for our needs. Some solutions relied on text recognition to determine the document’s orientation, but this method wasn’t foolproof. It didn’t work well with all document types, especially those without printed text, and was too slow to be used without degrading the user experience.
The Choice of Deep Learning
In pursuing a more viable solution, we turned to Deep Learning, a field of Artificial Intelligence based on Artificial Neural Networks that is behind a vast majority of the most recent advances in Computer Vision (Object Detection, Face Recognition, Self-driving cars, etc.). This choice promised accuracy and speed, critical factors for delivering a seamless user experience.
Our Auto-Rotate feature has to operate swiftly, processing large images in the blink of an eye. To achieve this, we downsize each large image to a tiny 224x224 thumbnail and feed it to the neural network, which predicts the document’s orientation. If needed, the large image seen by the user is then rotated automatically.
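The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the code that ships in Genius Scan: images are plain nested lists of pixels, `downscale_nearest`, `rotate_quarter_turns`, and `auto_rotate` are hypothetical helper names, and `predict_orientation` stands in for the neural network. We also assume that a predicted class k means the image is turned k quarter-turns clockwise from upright.

```python
from typing import Callable, List

Image = List[List[int]]  # rows of grayscale pixels; a stand-in for a real bitmap


def downscale_nearest(image: Image, size: int = 224) -> Image:
    """Resize an image to size x size with nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    return [
        [image[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]


def rotate_quarter_turns(image: Image, turns: int) -> Image:
    """Rotate an image clockwise by `turns` x 90 degrees (negative = counter-clockwise)."""
    for _ in range(turns % 4):
        # Reversing the rows, then transposing, is one clockwise quarter-turn.
        image = [list(row) for row in zip(*image[::-1])]
    return image


def auto_rotate(
    image: Image,
    predict_orientation: Callable[[Image], int],
    size: int = 224,
) -> Image:
    """Predict the orientation on a thumbnail, then fix the full-size image."""
    thumbnail = downscale_nearest(image, size)
    turns = predict_orientation(thumbnail)  # assumed: 0..3 clockwise quarter-turns
    # Undo the detected rotation on the original, full-resolution image.
    return rotate_quarter_turns(image, -turns)
```

Only the thumbnail goes through the network, so the cost of the prediction does not depend on the resolution of the scan. The full-size image is touched once, and only when a rotation is actually needed.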
The Training in Detail
Training a Deep Learning model requires computing power and specific tools. We chose to use the external infrastructure of AWS and the Keras open-source library for their ease of use and efficiency. For the neural network architecture, we employed MobileNet, a model designed by Google researchers. It is a simple, efficient neural network with a low computational cost, designed for mobile vision applications where computing power is limited.
We adopted a common strategy in Computer Vision: starting from a model previously trained on a large public dataset named ImageNet. The version commonly used for pre-training contains over a million images annotated with 1,000 categories, such as strawberry or balloon. We replaced these 1,000 network outputs with 4 outputs representing the four potential orientations of a document (0°, 90°, 180°, 270°).
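In Keras, swapping the ImageNet classifier for a four-way orientation head can look like the sketch below. It assumes the Keras API bundled with TensorFlow; `build_orientation_model` is a hypothetical name, and the pooling layer, optimizer, and loss are illustrative assumptions rather than the exact settings we used.

```python
import tensorflow as tf

NUM_ORIENTATIONS = 4  # 0, 90, 180 and 270 degrees


def build_orientation_model(weights="imagenet"):
    """MobileNet feature extractor topped with a new 4-way softmax head.

    Pass weights=None to build the network without downloading the
    pre-trained ImageNet weights.
    """
    # include_top=False drops the original ImageNet classification layer.
    base = tf.keras.applications.MobileNet(
        input_shape=(224, 224, 3),
        include_top=False,
        weights=weights,
        pooling="avg",  # assumption: global average pooling before the head
    )
    outputs = tf.keras.layers.Dense(NUM_ORIENTATIONS, activation="softmax")(
        base.output
    )
    model = tf.keras.Model(inputs=base.input, outputs=outputs)
    # Assumed training setup: integer labels 0..3, one per orientation.
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `model.fit` on labeled documents would then fine-tune the pre-trained weights for the new task, instead of training the whole network from scratch.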
To achieve this new goal, we fine-tuned the model using an internal dataset of hundreds of thousands of documents representing typical use cases for Genius Scan.
First, we made sure that all the documents were correctly oriented, using Tesseract, an open-source text recognition library, to correct the orientation of some of them. Then, we rotated every image into each of the four orientations, giving us a dataset four times larger in which each image is labeled with its orientation.
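The four-fold augmentation can be sketched as follows. This is an illustration under simple assumptions, not our training pipeline: images are nested lists of pixels, `make_rotated_examples` and `build_dataset` are hypothetical names, and label k is assumed to mean "turned k quarter-turns clockwise from upright".

```python
from typing import Iterator, List, Tuple

Image = List[List[int]]  # rows of grayscale pixels; a stand-in for a real bitmap


def make_rotated_examples(upright: Image) -> Iterator[Tuple[Image, int]]:
    """Yield the four rotations of an upright image together with their labels.

    Label 0 is the upright original; labels 1, 2 and 3 are the image turned
    90, 180 and 270 degrees clockwise (assumed convention).
    """
    current = upright
    for label in range(4):
        yield current, label
        # One more clockwise quarter-turn: reverse the rows, then transpose.
        current = [list(row) for row in zip(*current[::-1])]


def build_dataset(upright_images: List[Image]) -> List[Tuple[Image, int]]:
    """Turn N upright documents into 4 x N labeled training examples."""
    return [
        example
        for image in upright_images
        for example in make_rotated_examples(image)
    ]
```

Because the labels come from the rotation we apply ourselves, no manual annotation is needed once the source documents are known to be upright.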
The Pursuit of Perfection
The journey wasn’t without its challenges. We continually refined the formula, scrutinizing test results until we achieved the desired level of accuracy. Once satisfied, we integrated the solution into the Genius Scan app and our SDK. We settled on TensorFlow Lite to execute the neural network: our image processing code is written in C++ and shared between Android and iOS, and we already use TensorFlow Lite for our document detection algorithm. Importantly, we didn’t want to increase our app’s size too much. So, we applied a quantization technique that replaces floating-point numbers (FP32) with integers (INT8) inside the network, reducing its size by about 60% (from 1.6 MB to 676 KB) without sacrificing the accuracy of the predictions.
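To give an intuition for what quantization does, here is a toy sketch in plain Python. It uses a simple symmetric scheme (one shared scale, zero mapped to zero) and is not the TensorFlow Lite converter itself; `quantize_int8`, `dequantize`, and `size_reduction` are hypothetical helper names.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127], sharing a single float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


def size_reduction(before_bytes: int, after_bytes: int) -> float:
    """Fraction of the original size that was saved."""
    return 1 - after_bytes / before_bytes


weights = [-0.8, -0.1, 0.0, 0.3, 1.2]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# Each restored weight is within half a quantization step of the original.
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored in one byte instead of four, at the cost of a small rounding error. With the sizes quoted above, `size_reduction(1_600_000, 676_000)` comes out at roughly 0.58.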
The End Feature Hidden in Plain Sight
The result is a feature that may seem unassuming within the app’s interface, as it works invisibly, but is remarkably practical. Relying on advanced technology, specifically Deep Learning, Genius Scan’s Auto-Rotate feature saves users valuable time. It exemplifies our commitment to enhancing the user experience, even in the subtlest of ways.
We stand dedicated to innovation, ensuring Genius Scan remains intuitive, quick, and convenient, and Auto-Rotate is just one example of our relentless pursuit to simplify your document scanning journey.
Would you like to know how other features work as well?