On August 3, 2015 we were alerted that the system was not generating absence attendance records correctly. During our investigation we realized that this was an unnoticed aftereffect of the outage on July 31.
We learned that when Amazon went down, a locking mechanism that normally prevents certain types of background jobs from running simultaneously was left in a locked state, which prevented those jobs from running at all. This caused a backlog of what we call “Reminder” jobs to build up unnoticed.
Once we identified the root cause, we re-ran the missing jobs and regenerated the absence records. No data was lost, although the absence records were unavailable for the 3 days between the AWS outage and the fix on August 3, 2015.
In order to avoid similar issues in the future we implemented a few measures in the system:
We replaced the locking mechanism on the jobs with one that has a proper timeout. This way, even if a job were to get stuck again, the lock would expire a few minutes later and would no longer block other jobs.
We added alerting for when a large number of “Reminder” jobs are waiting, so that we will notice such issues in the future.
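For readers curious what these two measures look like in practice, here is a minimal sketch in Python. It is an illustration only, not our production code: the names ExpiringLock and backlog_alert, the 300-second timeout, and the threshold of 100 are all hypothetical, and the lock is kept in memory here, whereas a real one would live in a store shared by all workers.

```python
import time


class ExpiringLock:
    """In-memory lock that frees itself once ttl_seconds have passed.

    Hypothetical illustration: a real lock would live in a shared store
    so that every worker process sees the same state.
    """

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.expires_at = None  # None means the lock is free

    def acquire(self):
        now = self.clock()
        if self.expires_at is not None and now < self.expires_at:
            return False  # still held and not yet expired
        # Either free or abandoned past its timeout: take it over.
        self.expires_at = now + self.ttl_seconds
        return True

    def release(self):
        self.expires_at = None


def backlog_alert(pending_count, threshold):
    """Return an alert message if too many jobs are waiting, else None."""
    if pending_count > threshold:
        return f"{pending_count} Reminder jobs waiting (threshold {threshold})"
    return None


if __name__ == "__main__":
    lock = ExpiringLock(ttl_seconds=300)
    if lock.acquire():
        try:
            pass  # run the job here
        finally:
            lock.release()
    print(backlog_alert(pending_count=150, threshold=100))
```

The important property is that a crashed holder cannot keep the lock forever: once the timeout passes, the next worker that calls acquire() simply takes over. The trade-off is that the timeout has to be longer than the slowest legitimate job, otherwise two jobs could end up running at once.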
If you have any questions please let us know by clicking here.
— Piotr Banasik