Web Interface Partial Outage

Incident Report for Amplifi.io

Postmortem

One of our RethinkDB servers went offline because of high I/O load on the Ceph disk subsystem. The disk was busy because a few backups were taking longer than usual and had not finished by the time they were supposed to. Once the RethinkDB server failed, the tables residing on it fell into an unusual condition: one of the servers still had a connection, so the tables were not released to migrate automatically to another server. This is a rare condition in which failover cannot take place, as noted in this article: https://rethinkdb.com/docs/failover/ Once Ops determined that it was a RethinkDB issue, they released that server so that recovery could proceed until the server came back up. There is an open GitHub issue for this condition (issue #4357), and it has not been resolved. The team is now writing a script to detect this condition and attempt to mitigate it while the RethinkDB maintainers address the open issue.
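
As a rough illustration of what the detection half of such a script might look like (this is a sketch, not the team's actual script), the Python below scans documents from RethinkDB's `table_status` system table and reports any table that is not ready for writes, along with the replicas that are not in the `ready` state. The function names, the default host and port, and the choice of `ready_for_writes` as the trigger are assumptions made for this example; it does not attempt the mitigation step.

```python
from typing import Any, Dict, Iterable, List


def find_stuck_tables(table_status: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one entry per table that is not ready for writes.

    `table_status` is an iterable of documents shaped like the rows of the
    `rethinkdb.table_status` system table.
    """
    stuck = []
    for doc in table_status:
        status = doc.get("status") or {}
        if status.get("ready_for_writes", False):
            continue  # table is healthy enough to accept writes
        # Collect the servers whose replica is in any state other than "ready".
        not_ready = sorted({
            str(replica.get("server"))
            for shard in (doc.get("shards") or [])
            for replica in shard.get("replicas", [])
            if replica.get("state") != "ready"
        })
        stuck.append({
            "db": doc.get("db"),
            "table": doc.get("name"),
            "not_ready_replicas": not_ready,
        })
    return stuck


def fetch_table_status(host: str = "localhost", port: int = 28015) -> List[Dict[str, Any]]:
    """Read table_status from a live cluster (assumes the `rethinkdb` driver is installed)."""
    from rethinkdb import RethinkDB  # imported lazily so the checker works without it

    r = RethinkDB()
    conn = r.connect(host=host, port=port)
    try:
        return list(r.db("rethinkdb").table("table_status").run(conn))
    finally:
        conn.close()
```

A scheduled job could call `find_stuck_tables(fetch_table_status())` and alert Ops when the result is non-empty, leaving the decision to release a server to a human.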

Posted Sep 09, 2020 - 04:23 PDT

Resolved

One of our production servers hung and had to be restarted. This affected communication services and caused a partial outage on multiple instances. The issue has been resolved and systems are back up to full speed.

Posted Sep 02, 2020 - 14:00 PDT