Dropbox recently published how it made the camera upload process for Android faster and more reliable. Dropbox engineers removed shared Android and iOS C++ code and replaced it with a platform-native Kotlin implementation. The engineers are pleased with the decision to rewrite the process, stating that error rates went down and upload performance greatly improved.
In their blog post, Sarah Tappon and Andrew Haigh, software engineers at Dropbox, advise that it “is important to make sure that the benefits will justify the effort” when embarking on such a rewrite. “At the end, it will help you determine whether you got the results you wanted.” They also advise shipping risky projects to a small audience that is “large enough to give you the data you need to evaluate success. Then, watch and wait until your data gives you the confidence to continue.”
Measurements showing a decline in error interactions and a rise in “all done” interactions following the release
One of the main design constraints of the camera upload process is Android’s strict limits on how often apps can run in the background and what they can do while running there. “For example, App Standby limits our background network access if the Dropbox app hasn’t recently been foregrounded.” In practice, the app might be allowed to access the network for only a 10-minute interval once every 24 hours. The cross-platform version was not designed around these platform-specific limits, so uploads often failed. In contrast, Dropbox engineers tailored the new Kotlin implementation to them.
As part of the rewrite, the Dropbox engineers designed the new native process for performance. First, they started utilizing parallel uploads. The C++ version uploaded only one file at a time. Implementing parallel uploads with Kotlin coroutines was significantly more manageable than it would have been with manual thread management in C++. Second, they optimized memory usage by “dynamically varying the number of simultaneous uploads based on the amount of available system memory” and reusing ByteArray buffers to avoid pressure on the garbage collector.
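The three ideas in this paragraph (parallel uploads, a memory-based concurrency limit, and buffer reuse) can be illustrated with a minimal sketch. It is not Dropbox's code: `maxParallelUploads`, `BufferPool`, `uploadAll`, the 16 MiB-per-slot rule, and the cap of four uploads are all hypothetical, and the sketch assumes the `kotlinx.coroutines` library is on the classpath. For simplicity the limit is computed once per batch rather than adjusted continuously.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Semaphore
import kotlinx.coroutines.sync.withPermit
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical sizing rule: one upload slot per 16 MiB of free memory, clamped to 1..4.
fun maxParallelUploads(freeBytes: Long): Int =
    (freeBytes / (16L * 1024 * 1024)).coerceIn(1L, 4L).toInt()

// Hands out reusable buffers so each upload does not allocate a fresh ByteArray.
class BufferPool(private val bufferSize: Int) {
    private val free = ConcurrentLinkedQueue<ByteArray>()

    fun acquire(): ByteArray = free.poll() ?: ByteArray(bufferSize)

    fun release(buffer: ByteArray) {
        free.offer(buffer)
    }
}

// Uploads all photos concurrently, never running more than the computed number at once.
suspend fun uploadAll(
    photos: List<String>,
    freeBytes: Long,
    pool: BufferPool,
    uploadOne: suspend (photo: String, buffer: ByteArray) -> Unit
) {
    val slots = Semaphore(maxParallelUploads(freeBytes))
    coroutineScope {
        for (photo in photos) {
            launch(Dispatchers.IO) {
                slots.withPermit {
                    val buffer = pool.acquire()
                    try {
                        uploadOne(photo, buffer)
                    } finally {
                        pool.release(buffer) // return the buffer even if the upload fails
                    }
                }
            }
        }
    } // coroutineScope returns only after every launched upload has finished
}

fun main() {
    runBlocking {
        val pool = BufferPool(bufferSize = 64 * 1024)
        val photos = listOf("a.jpg", "b.jpg", "c.jpg")
        val uploaded = ConcurrentLinkedQueue<String>()
        uploadAll(photos, freeBytes = 40L * 1024 * 1024, pool = pool) { photo, _ ->
            delay(10) // stand-in for real network I/O
            uploaded.add(photo)
        }
        println("uploaded ${uploaded.size} of ${photos.size} photos")
    }
}
```

On Android, the free-memory figure would presumably come from a platform API; here it is passed in as a plain parameter so the sketch stays independent of the platform.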
The Dropbox team employed several methods to validate that the rewritten process works as intended. One of the critical techniques was to validate “many low-level components by running them in production alongside their C++ counterparts and then comparing the outputs.” This technique allowed the team to confirm that the new components were working correctly before relying on their results.
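A side-by-side comparison of this kind might look like the following sketch. The names `runShadowed` and `ShadowResult` are hypothetical, and the file-name normalizers are made-up stand-ins for the "low-level components", which the blog post summary does not identify. The idea is that the existing implementation's output stays authoritative while the new one is only observed.

```kotlin
// Hypothetical result type: the trusted (legacy) output plus a mismatch description, if any.
data class ShadowResult<T>(val trusted: T, val mismatch: String?)

// Runs both implementations on the same input. The legacy result is always returned;
// the candidate can only produce a mismatch report and can never break the caller.
fun <I, O> runShadowed(
    input: I,
    legacy: (I) -> O,
    candidate: (I) -> O
): ShadowResult<O> {
    val expected = legacy(input)
    val mismatch = try {
        val actual = candidate(input)
        if (actual == expected) null else "input=$input legacy=$expected candidate=$actual"
    } catch (e: Exception) {
        "input=$input candidate threw ${e.javaClass.simpleName}"
    }
    return ShadowResult(expected, mismatch)
}

fun main() {
    // Made-up components for illustration: the candidate deliberately forgets to trim.
    val legacy: (String) -> String = { it.trim().lowercase() }
    val candidate: (String) -> String = { it.lowercase() }

    for (name in listOf("IMG_1.JPG", " IMG_2.JPG")) {
        val result = runShadowed(name, legacy, candidate)
        println(result.mismatch ?: "match for '$name'")
    }
}
```

In a real rollout, the mismatch string would be sent to logging or metrics rather than printed, so that discrepancies can be counted before the new component's results are trusted.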
Another technique the team used was enforcing stricter state transitions in the system. Each photo upload had a state assigned to it, and the engineers proactively validated each state transition against the list of allowed transitions. Tappon and Haigh describe the outcome:
“These checks helped us detect a nasty bug early in our rollout. We started to see a high volume of exceptions in our logs that were caused when camera uploads tried to transition photos from DONE to DONE. This made us realise we were uploading some photos multiple times!”
Possible photo upload state transitions
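Validating transitions against an allow-list can be sketched in a few lines. Only the `DONE` state is named in the source; the other states, the transition table, and the `PhotoUpload` class below are hypothetical and chosen purely for illustration.

```kotlin
// Hypothetical states; the source names only DONE.
enum class UploadState { PENDING, UPLOADING, DONE, FAILED }

// Hypothetical allow-list: each state maps to the states it may move to.
val allowedTransitions: Map<UploadState, Set<UploadState>> = mapOf(
    UploadState.PENDING to setOf(UploadState.UPLOADING),
    UploadState.UPLOADING to setOf(UploadState.DONE, UploadState.FAILED),
    UploadState.FAILED to setOf(UploadState.PENDING),
    UploadState.DONE to emptySet()
)

class PhotoUpload(val photoId: String) {
    var state: UploadState = UploadState.PENDING
        private set

    // Throws IllegalStateException instead of silently accepting an unexpected transition.
    fun transitionTo(next: UploadState) {
        val allowed = allowedTransitions[state].orEmpty()
        check(next in allowed) { "Illegal transition $state -> $next for photo $photoId" }
        state = next
    }
}

fun main() {
    val upload = PhotoUpload("IMG_0001")
    upload.transitionTo(UploadState.UPLOADING)
    upload.transitionTo(UploadState.DONE)
    try {
        // A second upload of the same photo would attempt DONE -> DONE.
        upload.transitionTo(UploadState.DONE)
    } catch (e: IllegalStateException) {
        println(e.message)
    }
}
```

Because `DONE` has no outgoing transitions in this table, a repeated upload surfaces as an exception in the logs rather than passing unnoticed, which is the kind of signal the quote describes.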
When it was time to roll out the new implementation, the team made sure that they could roll back to the C++ implementation. In addition, the team first released the implementation to an opt-in pool of beta users. “This pool of users was large enough to surface rare errors and collect key performance metrics such as upload success rate.” They monitored these key metrics in this population for several months to gain confidence that it was ready to ship widely. The team concludes that the months spent in beta paid off and that they eventually completed the project ahead of schedule.