Down the Rabbit Hole: An Adventure in Over-Engineering
Software engineers are often presented with problems that seem conceptually straightforward but, on closer consideration, raise a number of complex issues that must be resolved for the proposed solution to work as a whole.
It’s easy to become focused on solving these issues one by one as they appear, but each fix often uncovers further complexity inherent in the chosen approach, and very quickly a simple solution to a simple problem turns into a multitude of complicated engineering tasks.
Sometimes all that complexity can be avoided altogether by reassessing the original problem we were trying to solve and applying a simpler solution—one that may not seem as elegant but is ultimately more robust, easier to maintain, and easier to explain. Treat discovered complexities in the implementation as a warning sign that the original approach might need to be reassessed.
This post describes an example of one such engineering problem, the journey down the rabbit hole that can ensue, and the simpler solution that could have avoided most of the complexity.
An existing piece of software generates a report that gets printed out and filed at the end of the system workflow. Our software needs to be extended so that each report is also submitted electronically to another legacy online system for archival. The prescribed process is to copy each report to a directory on a shared network drive, where the other system picks it up and archives it. In our case, transferring a report within 60 minutes of it being written to disk was acceptable, and reports are never modified once they have been written.
The proposed solution is to build a dispatcher service that takes files written to a local directory and moves them to the shared network drive. If the operation fails (e.g. the destination network drive is down or full), abort and retry the move every 30 minutes until it succeeds.
The first step in the implementation was to create a native process with a filesystem watcher that monitors the directory into which the application writes its reports. When the watcher reports that files have changed, the software moves them to the network drive; if a move fails, it leaves the files locally and tries again in 30 minutes, allowing one additional retry within the preferred 60-minute window.
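As a sketch, the core of this dispatcher might look like the following. This is illustrative only: the function name, directory arguments, and the use of Python’s `threading.Timer` for the 30-minute retry are assumptions for the example, not the actual implementation.

```python
import shutil
import threading
from pathlib import Path

RETRY_INTERVAL = 30 * 60  # retry failed moves every 30 minutes (seconds)

def move_reports(src_dir, dest_dir):
    """Move every file in src_dir to dest_dir; schedule a retry on failure."""
    any_failed = False
    for path in Path(src_dir).iterdir():
        try:
            shutil.move(str(path), str(Path(dest_dir) / path.name))
        except OSError:
            any_failed = True  # leave the file in place for the next attempt
    if any_failed:
        threading.Timer(RETRY_INTERVAL, move_reports,
                        args=(src_dir, dest_dir)).start()
```

Even at this size the sketch hides all the timing and concurrency questions that the rest of this post walks through.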
Now we start to discover problems in our approach and attempt to solve them, one by one.
Problem: The filesystem watcher notifies us of changes before files have finished writing, so we end up copying half-written files.
Solution: The copy is not urgent, so create a timer that performs the copy 30 minutes after the file has been written. The same timer can handle retries on failure.
Problem: The filesystem watcher often reports multiple notifications for a single file write, causing additional 30-minute timers to be created.
Solution: If a timer is already pending, don’t create another one.
Problem: We see intermittent crashes and timers that aren’t reset correctly; references to the timer object aren’t thread-safe.
Solution: Put a lock around the timer object to prevent concurrent access across threads.
Problem: An incomplete file was copied before it had finished writing, because that write happened exactly 30 minutes after the timer from a previous write was queued.
Solution: Use file locking, so that a file still being written is held exclusively by the writer and any read or copy of it fails gracefully. Also maintain a single list of pending file copies.
Problem: Some file copies take a long time, presumably because the network share is under heavy load at certain times of day. During testing with a shorter interval, one copy operation took longer than the timer interval; the old timer was no longer “pending” but its handler was still running, so the filesystem watcher handler was able to create a new timer while the previous handler was still executing. Left long enough, multiple timer handlers end up running in parallel, causing concurrency issues.
Solution: Store a thread-safe flag indicating whether the timer function is running, and clear it at the end. If the flag is already set when a new timer handler starts, abort.
This is getting messy!
Problem: An exception thrown in the timer function by a failed file copy means the “busy” flag never gets cleared.
Solution: Wrap the whole function in an exception handler so it always cleans up after itself, clearing the “busy” flag and closing any open files.
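The combined “busy” flag and always-clean-up fix can be sketched like this. It is a minimal Python sketch, not the original code: a non-blocking lock stands in for the thread-safe flag, and `try`/`finally` plays the role of the big exception handler.

```python
import threading

# A non-blocking lock doubles as the thread-safe "busy" flag.
_handler_guard = threading.Lock()

def timer_handler(do_copy):
    """Run do_copy unless a previous handler instance is still executing.

    Returns True if the copy was attempted, False if it was skipped.
    """
    if not _handler_guard.acquire(blocking=False):
        return False  # another handler is still running: abort this one
    try:
        do_copy()
    finally:
        _handler_guard.release()  # always clear the flag, even on exceptions
    return True
```

Note how much machinery already exists just to stop two timer callbacks from tripping over each other.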
Problem: File copies are still very slow. It turns out the folder on the network share contains over 10,000 files, which makes access slow; many filesystems perform poorly with very large numbers of files in a single directory.
Solution: Create sub-folders based on the first few characters of the filenames being written (or some other reasonable partitioning strategy).
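The partitioning step might look something like this (a sketch only; the two-character prefix is just one reasonable choice of partition key):

```python
from pathlib import Path

def partitioned_dest(dest_root, filename, prefix_len=2):
    """Map a filename into a sub-folder named after its first characters,
    e.g. "report_1234.pdf" -> dest_root/re/report_1234.pdf."""
    sub_folder = filename[:prefix_len].lower()
    target_dir = Path(dest_root) / sub_folder
    target_dir.mkdir(parents=True, exist_ok=True)  # create on first use
    return target_dir / filename
```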
Problem: If the system loses power half-way through writing a file, or after writing it but before the timer fires, that file will never be moved: the filesystem watcher produces no further change notifications for it after a reboot.
Solution: Spend a long time implementing persistence of the pending-copy list to disk, handling power-outage cases so that the list itself isn’t corrupted either.
We now have a highly complicated solution with many moving parts and lots of special-case checks to work around specific timing and concurrency issues. Given that complexity, there is still a high chance of further bugs: race conditions, filesystem locking issues, and loss of data during crashes or power outages. Beyond the impact on system stability, the solution is also challenging to maintain, especially if multiple people work on or with it over time.
Reassess the Problem
Let’s take a look back at the original problem description. At its core, all we are being asked to do is to take new reports written to the intermediate output directory and move them to a network share in a way that handles errors and automatically retries should they occur.
All the problems with timer overlaps and restarts were simply because we decided early on to move files a fixed amount of time after they were generated, which was never part of the original problem description. The problem with losing track of files still requiring a copy came about because we were trying to make the solution too smart.
We went so far down the rabbit hole that many of the problems we were spending the time to solve were a result of our choice of implementation and assumptions made, not because of the original problem description. We did, however, identify a number of practical limitations on our journey in file locking, synchronisation and how files are stored on the shared network drive—useful information to be mindful of when reassessing our approach.
It’s time to take a step back and reassess our implementation.
A Simpler, More Robust Solution
Let’s go back to basics and start with a simple shell script that moves all files in a specific directory to the network share every 30 minutes. The example is a bash script, but the same logic can be implemented in any language.
```bash
while true; do
  mv reports/* /mnt/document_archive/
  sleep 1800  # 30 minutes
done
```
/mnt/document_archive is the network share mounted on the filesystem, and the assumption is that this script will be set to run once on system startup (or ideally via systemd, flock or similar to ensure it’s restarted if something goes wrong).
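For illustration, a minimal systemd service unit for the script might look like the following (the unit name and script path are hypothetical):

```ini
[Unit]
Description=Report dispatcher

[Service]
ExecStart=/usr/local/bin/report-dispatcher.sh
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target
```

`Restart=always` gives us the “restarted if something goes wrong” behaviour for free.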
This simple script doesn’t suffer from any of the problems we encountered above, except one: copying or moving files that are still being written to disk. This is easily solved by always writing reports with a different file extension, then renaming them once written successfully. A file rename is atomic, so the final name never refers to a partially written file; it provides a clean barrier by which we can signal that a write is complete. e.g.
// save the report with the extension .tmp, then rename if successful
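In Python, for example, the write-then-rename pattern might look like this (a sketch; the function name and paths are illustrative, and the actual report writer may well be in another language):

```python
import os

def save_report(final_path, data):
    """Write data under a temporary name, then atomically rename into place."""
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ensure the bytes are on disk before renaming
    os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems
```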
Then we can change the script to only move files with the .pdf extension, as any half-written files will always have the .tmp extension:
```bash
while true; do
  mv reports/*.pdf /mnt/document_archive/
  sleep 1800  # 30 minutes
done
```
Done. That’s it. There’s no need to monitor the filesystem at all, beyond the quick scan of the reports directory every 30 minutes as part of the move operation. If a move fails, the original file remains in place and is picked up by the next iteration of the loop. If the system loses power, any pending files are also picked up automatically, because the filesystem itself maintains the state of the files rather than something we hold in memory. A half-moved file remains present and whole on the local filesystem after a reboot, as the original is only deleted once the move has completed.
Most engineers will face situations like this at some point in their career, and there are some straightforward principles to fall back on as you peer into the rabbit hole:
- Never assume the current approach is the only way to solve a problem.
- Be wary of assumptions made in the current implementation. Always refer back to the original problem description!
- If the solution to a problem seems more complicated than it needs to be, listen to your gut and reassess the original problem description against the current approach; there’s often a less problematic way to solve it.
- Don’t be afraid to throw away existing code if there is a more robust and maintainable approach.
Header image courtesy of Iswanto Arif.