Overview
On February 18, 2021, CommCare HQ users experienced intermittent errors while using Web Apps. As part of a planned maintenance effort to modernize the codebase and improve performance, we released an update to the underlying Web Apps code on the evening of February 17th. The update inadvertently changed how the database system handles certain app connections, which left the affected connections in a broken state.
Summary of the incident
At 14:35 UTC on February 18, 2021, our engineering team received internal reports that users attempting to use Web Apps were experiencing intermittent errors.
Because the previous night's release was the likely cause, our engineers rolled it back to see whether that would resolve the issue. The rollback was completed by 14:50 UTC. It improved the situation but did not resolve the issue completely.
By 15:20 UTC, we had traced the remaining intermittent errors to a specific network machine and restarted it. By 15:48 UTC the machine was back up and running, and the issue was fully resolved.
Our Next Steps
Further analysis of the issue revealed that the errors were caused by an increase in concurrent connections on the affected machines. Our engineering team will improve our pre-release testing to include concurrency testing for code in this area.
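For readers curious what such a check might look like, here is a minimal Python sketch. It is an illustration only, not our actual test suite: FakeConnectionPool, handle_request and run_concurrency_check are hypothetical names, and the pool is a stand-in for a real database connection limit. The idea is to send many simultaneous requests through a limited pool and confirm that none fail, that the limit is never exceeded, and that no connections are left open afterwards.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeConnectionPool:
    """Hypothetical stand-in for a database connection pool with a hard limit."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.in_use = 0  # connections currently checked out
        self.peak = 0    # highest number checked out at the same time

    def acquire(self, timeout=5.0):
        # Wait for a free slot; give up with an error if none appears in time.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free connection")
        with self._lock:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)

    def release(self):
        with self._lock:
            self.in_use -= 1
        self._slots.release()


def handle_request(pool, work_seconds=0.01):
    """Simulate one request that holds a connection briefly."""
    pool.acquire()
    try:
        time.sleep(work_seconds)
        return "ok"
    finally:
        # Always return the connection, even if the work fails.
        pool.release()


def run_concurrency_check(pool, workers, requests):
    """Send `requests` simulated requests through `workers` threads."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(handle_request, pool) for _ in range(requests)]
        for future in futures:
            try:
                results.append(future.result())
            except TimeoutError:
                results.append("error")
    return results


if __name__ == "__main__":
    pool = FakeConnectionPool(max_connections=5)
    results = run_concurrency_check(pool, workers=20, requests=100)
    print(results.count("ok"), "ok,", results.count("error"), "errors")
    print("peak connections:", pool.peak, "still open:", pool.in_use)
```

A real test would replace the fake pool with the actual service under a realistic connection limit; the three checks (no errors, peak within the limit, nothing left open) would stay the same.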
We are sorry for the inconvenience this caused. Thank you for your patience and support. Please reach out to support@dimagi.com if you have further questions about the incident.
– CommCare HQ Support Team