Implementing a Large Scale Document Storage Service

In this post, we explain why we decided to work on a document storage service for Genius Scan and how we are approaching this engineering challenge. We think that the process is worthy of being shared and we also hope to get feedback from the community.

A little background

At The Grizzly Labs, we create helpful productivity apps. Our main product, Genius Scan, is a scanner for iOS and Android. It’s one of the most popular mobile apps, netting more than 15 million downloads since its debut in June 2010 — we never invested a cent in advertising — and constantly receiving great feedback.

Genius Scan is at the interface between the physical and the digital world: it enables you to snap a photo of a paper document, to correct any perspective distortion and to process it like a real scanner. With several scans, you can build multi-page PDF documents. Genius Scan is also particularly good at exporting them wherever you want (Box, Dropbox, Email, Evernote, Expensify, Google Drive, OneNote, OneDrive, FTP, WebDAV…).

While we expected users to use Genius Scan to scan and export documents, we noticed that a lot of them are using Genius Scan to store their scans for the long term.

Solving problems for our users

Genius Scan stores documents on the phone in the local app folder. Due to the app sandboxing, this folder is only accessible to Genius Scan and documents are not synced between different devices. They also never leave the phone unless you decide to export them (the exception being your automatic iOS backups). This has two advantages: simplicity of implementation and privacy.

Yet, after 4 years of answering support emails, we are clearly seeing patterns in user requests:

Backup. Users can lose documents for many reasons: accidental deletion, a botched OS update, a lost or stolen phone. The unsatisfying workaround is to restore a complete phone backup from iCloud, provided that users have one.
Device migration (transfer of documents from an older device to a newer one). The problem arises during cross-platform transfers, typically between iOS and Android. There is no obvious way for users to access the documents they created on the previous platform. The workaround is to export the documents to a cloud service such as Dropbox. Users can later access them within the Dropbox app on the new device. We experimented with a backup and restore system for the Android app, but it’s more a patch than a satisfactory solution. This issue can also arise when upgrading iOS devices if the user doesn’t have device backups set up.
Synchronization (accessing documents from any device). Users scan documents on their phones and want to read them afterwards on a tablet. The only way to do this currently is to export the documents to a service such as Dropbox and view them on the tablet within the Dropbox app. Likewise, users want the documents to magically appear on their computer.
Extra features which require processing power or a backend. Examples of such features are: OCR, OCR indexing (similar to what Evernote does), document signing and sharing.

All this feedback keeps showing us the need for a document storage backend with synchronization logic.

Self-imposed requirements

One of the main challenges we face is that Genius Scan already operates at a large scale. Our millions of monthly active users scan about 250,000 new documents every day and have already generated more than 200 million scans. While we can find ways to stagger the launch, we can’t start with a simple architecture and only scale it later.

Looks like Mondays are scanning days, and that nobody scans turkeys :)

Our backend needs to meet the following requirements:

Security and privacy. Obviously, the first feature of this backend.
Data integrity. Users store very important documents in Genius Scan (losing receipts can translate into thousands of dollars that cannot be expensed). No document may be lost as a consequence of users turning on synchronization.
Ability to handle network conditions. Mobile devices can go offline for a long time, and then pop back online (e.g., a user scans documents during a flight).
Cross-platform. We need to support iOS, Android and potentially other platforms.
Optionality. Users don’t have to use synchronization, particularly if they have privacy concerns.
Financial sustainability. We need to be able to cover the costs associated with this backend over the long run. A simple spreadsheet and a few estimates are enough to understand the cost structure. In our case, the monthly storage cost (S3) will dominate.
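The spreadsheet estimate mentioned above can be sketched in a few lines of Python. Only the figure of 250,000 documents per day comes from this post; the average document size and the storage price per GB are placeholder assumptions, not measured or quoted values, and the model ignores the fact that only some users would turn synchronization on.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass
class StorageModel:
    docs_per_day: int          # from the post: about 250,000 new documents per day
    avg_doc_mb: float          # ASSUMPTION: placeholder average document size
    price_per_gb_month: float  # ASSUMPTION: placeholder price, not a quoted rate

    def average_upload_rate(self) -> float:
        """Average number of new documents per second."""
        return self.docs_per_day / SECONDS_PER_DAY

    def stored_gb_after(self, days: int) -> float:
        """Total data stored after `days` days if nothing is ever deleted."""
        return self.docs_per_day * days * self.avg_doc_mb / 1024

    def monthly_cost_after(self, days: int) -> float:
        """Monthly storage bill once `days` days of uploads have accumulated."""
        return self.stored_gb_after(days) * self.price_per_gb_month


model = StorageModel(docs_per_day=250_000, avg_doc_mb=1.0, price_per_gb_month=0.03)
print(f"{model.average_upload_rate():.2f} documents per second on average")
for days in (30, 365):
    print(f"after {days} days: {model.stored_gb_after(days):,.0f} GB, "
          f"${model.monthly_cost_after(days):,.2f} per month")
```

Under these assumptions the monthly bill grows with every document ever stored, not with daily traffic, which is one way to see why storage would end up dominating the cost structure.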

Different options

Synchronization is a typical problem and there are various alternatives:

With iCloud or iCloud Drive, no server-side work would be required. However, backup and device migration are not addressed, and synchronization only partially. One of our worries with iCloud was its instability, but it overcame this bad reputation with iOS 7 and 8.
CloudKit is also a potential solution. Given our scale, though, it won’t be a good fit due to the storage limits. On top of that, it’s limited to iOS.
Cross-platform cloud storage services: Dropbox, Box, OneDrive… Using these services as a backend tackles backup, device migration and synchronization, and we wouldn’t be reinventing the wheel. There are several drawbacks: we need to adapt to their synchronization model (with Dropbox, the sync unit is the file) and we also become dependent on a third-party platform. Last but not least, any extra features (indexing, search, OCR) will be harder to offer since we won’t manage the document storage.
Ensembles, developed by Drew McCormack, is an impressive project and could be an appropriate solution. However, it only synchronizes objects stored in Core Data: it wouldn’t be trivial to synchronize the scans themselves. It also doesn’t support Android (that being said, Ensembles 1.x is an open source project and could be extended). Last, the cloud is used for temporary transfer and not for data storage (peer-to-peer synchronization model), so it wouldn’t be usable for backup and extra features.
Our own service. This would address the different feature requests although the main drawback is that we would be reinventing the wheel.

After meeting with several top enterprise companies in San Francisco and in France, we decided to settle on implementing our own service due to the drawbacks of other solutions.

Steps we’ve taken

It’s crucial for us to identify trouble early. However, it’s easy to keep listing potential issues and corner cases and postpone the implementation, so we decided to take iterative steps towards our goal.

Metrics

We are estimating the load that our millions of MAU would generate on our backend through a couple of channels: metrics and surveys. There are a lot of options for metrics. As for surveys, we can query users by displaying an in-house banner ad in Genius Scan. We also ask people who contact us for support.

An important metric is how many documents a typical user (vs. a power user) has. We are also interested in estimating the peak rate of document scanning during a day.
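As a rough illustration of these two metrics, the sketch below computes a median and a 95th percentile from per-user document counts, and finds the busiest hour from scan timestamps. The function names are hypothetical and the sample data is invented for the example; none of it comes from Genius Scan.

```python
from collections import Counter
from datetime import datetime
from statistics import median, quantiles


def typical_and_power_user(doc_counts):
    """Return (median, 95th percentile) of documents per user."""
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    p95 = quantiles(doc_counts, n=100, method="inclusive")[94]
    return median(doc_counts), p95


def peak_hour(timestamps):
    """Return (hour of day, number of scans) for the busiest hour."""
    per_hour = Counter(ts.hour for ts in timestamps)
    return per_hour.most_common(1)[0]


# Invented sample data: documents per user, and a handful of scan times.
doc_counts = [1, 2, 3, 3, 4, 5, 8, 12, 40, 300]
scans = [
    datetime(2014, 11, 24, 9, 5),
    datetime(2014, 11, 24, 9, 40),
    datetime(2014, 11, 24, 9, 59),
    datetime(2014, 11, 24, 13, 0),
    datetime(2014, 11, 24, 18, 30),
]

typical, power = typical_and_power_user(doc_counts)
print("typical user:", typical, "documents; power user:", power, "documents")
print("busiest hour and scan count:", peak_hour(scans))
```

A median and a high percentile are used here rather than a mean because a few heavy users can skew an average considerably.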

On top of that, surveying users enables us to understand the financial sustainability: Are users ready to pay for such a service? Should we offer a limited trial to everyone? etc.

Using the Dropbox SDK in our Android app

To get started, we wanted to put together a version of Genius Scan with some kind of synchronization. We first implemented a simple prototype by modifying the Android version of Genius Scan to use the Dropbox Sync API. This was fairly quick: a rough implementation took a couple of days.

Having such a prototype had several benefits: first, it gave us some self-confidence regarding the feasibility of this project. Second, it let us play with something concrete in order to think about the UX of the client apps; and finally, this was a first step in identifying potential issues on Android and in general (what are the typical synchronization conflicts, what behavior would we expect as users).

Custom implementation

Our next step was to develop our own client-server solution. This time, we focused on the iOS app. We developed a straightforward backend using Sinatra. We covered this implementation with extensive specs, both on iOS for the client and in Ruby for the server.

We based our architecture on Evernote’s synchronization. The brain of the synchronization is located in the client and the server holds the state. In some ways, the model structure is similar: in Genius Scan, documents contain pages and can optionally be tagged. In Evernote, notebooks contain notes and can optionally be tagged.
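To make this division of labor concrete, here is a minimal in-memory sketch, assuming an Evernote-style update sequence number (USN): the server only stores objects and bumps a counter on every write, while the client decides what to pull, what to push and what counts as a conflict. All class and method names are hypothetical, and setting the local version aside on conflict is just one possible policy; this is not the Genius Scan implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Holds state only: objects plus a counter bumped on every write."""
    update_count: int = 0
    objects: dict = field(default_factory=dict)  # id -> (usn, payload)

    def changes_since(self, usn):
        return {oid: rec for oid, rec in self.objects.items() if rec[0] > usn}

    def put(self, oid, payload):
        self.update_count += 1
        self.objects[oid] = (self.update_count, payload)
        return self.update_count


@dataclass
class Client:
    """Holds the sync logic: what to pull, what to push, what conflicts."""
    last_usn: int = 0
    objects: dict = field(default_factory=dict)    # id -> payload
    dirty: set = field(default_factory=set)        # ids edited locally, maybe offline
    conflicts: list = field(default_factory=list)  # (id, local payload set aside)

    def edit(self, oid, payload):
        self.objects[oid] = payload
        self.dirty.add(oid)

    def sync(self, server):
        # 1. Pull everything the server recorded after our last sync.
        for oid, (usn, payload) in server.changes_since(self.last_usn).items():
            if oid in self.dirty:
                # Both sides changed: set the local version aside for the user.
                self.conflicts.append((oid, self.objects[oid]))
                self.dirty.discard(oid)
            self.objects[oid] = payload
        # 2. Push the remaining local edits, then remember where we stopped.
        for oid in sorted(self.dirty):
            server.put(oid, self.objects[oid])
        self.dirty.clear()
        self.last_usn = server.update_count


server, phone, tablet = Server(), Client(), Client()
phone.edit("doc-1", "receipt v1")
phone.sync(server)
tablet.sync(server)
print(tablet.objects)  # the tablet now holds the phone's document
```

Edits made while offline simply accumulate in `dirty` until the next `sync` call, which is how a sketch like this would cope with a device that disappears for the duration of a flight.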

Our client implementation fits nicely with the existing Core Data code Genius Scan relies on. It’s also surprisingly unintrusive: all the synchronization happens in a separate managed object context, and the contexts use decoupled notifications to communicate.

The goal of this iOS prototype was to deepen our understanding of the synchronization. The first iteration of this implementation took a bit less than a week to develop without the conflict handling.

We are refining this early implementation (we plan to write a more technical blog post on the subject). Note that it still focuses a lot on the client side and that our server component is running on a fast local network. One of the next moves will be to test this implementation under real network conditions, including latency and errors.
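One way to approximate real network conditions before leaving the local network is to wrap the client’s transport in a layer that injects latency and failures. The sketch below is only an assumption about what such a test harness could look like: `FlakyTransport` and `upload_with_retry` are hypothetical names, and the failure rate and delays are arbitrary.

```python
import random
import time


class FlakyTransport:
    """Wraps a send function and injects latency and random failures."""

    def __init__(self, send, failure_rate=0.3, max_latency=0.05, seed=None):
        self._send = send
        self._failure_rate = failure_rate
        self._max_latency = max_latency
        self._rng = random.Random(seed)

    def send(self, payload):
        time.sleep(self._rng.uniform(0, self._max_latency))  # simulated latency
        if self._rng.random() < self._failure_rate:
            raise ConnectionError("simulated network failure")
        return self._send(payload)


def upload_with_retry(transport, payload, attempts=5, base_delay=0.01):
    """Retry with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return transport.send(payload)
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


received = []
transport = FlakyTransport(lambda p: received.append(p) or "ok", seed=1)
try:
    print(upload_with_retry(transport, "page-1"))
except ConnectionError:
    print("gave up after repeated simulated failures")
```

In this sketch a failure always happens before the payload is delivered. On a real network a request can also fail after the server has processed it, so retried uploads would need to be idempotent.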

Conclusion

The two of us are super excited to be working on this project that we had postponed for a while. We want to provide this service to our users as soon as we can; however, we are not in a hurry. Since we don’t have any pressure, we can take our time to get it right.

If you’ve read this far, we are also very interested in your feedback (Hacker News). In any case, rest assured that we will keep you updated on our progress.
