Intermittent errors on Web Apps
Incident Report for CommCare HQ
Postmortem

Overview

CommCare HQ experienced a period during which users faced intermittent errors while using Web Apps. As part of a planned maintenance effort to modernize the codebase and improve performance, an update to the underlying code for Web Apps was released on the evening of February 17th. This update inadvertently changed how the database system handles specific app connections, which put the affected connections into a broken state.

Summary of the incident

At 14:35 UTC on February 18, 2021, our engineering team received internal reports that users were experiencing intermittent errors when using Web Apps.

Because the previous night’s release was the likely culprit, our engineers decided to roll it back to test whether doing so would resolve the issue. The rollback was completed by 14:50 UTC. It improved the situation but did not resolve the issue completely.

By 15:20 UTC, we had traced the intermittent errors to a specific network machine and restarted it. By 15:48 UTC, the machine was back up and running and the issue was fully resolved.

Our Next Steps

Further analysis of the issue revealed that the errors were caused by an increase in concurrent connections on the affected machines. Our engineering team will improve our tests to include concurrency testing before code in this area is released.
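
As an illustration only, a concurrency test of the kind described above might look like the following Python sketch. It does not reflect CommCare HQ's actual code or test suite: `BoundedPool`, `run_concurrent_requests`, and the connection limit are hypothetical names and values chosen for the example. The idea is to send many simultaneous requests through a connection pool and check that the number of open connections never exceeds the configured limit and that every connection is released afterwards.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BoundedPool:
    """Hypothetical stand-in for a database connection pool (not CommCare HQ code)."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._lock = threading.Lock()
        self.in_use = 0  # connections currently checked out
        self.peak = 0    # highest number of simultaneous connections seen

    def __enter__(self):
        # Block until a connection slot is free, then record the checkout.
        self._slots.acquire()
        with self._lock:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always return the slot, even if the request raised an error.
        with self._lock:
            self.in_use -= 1
        self._slots.release()
        return False


def run_concurrent_requests(pool, workers, requests):
    """Send `requests` simulated requests through `pool` using `workers` threads."""

    def handle_request(_):
        with pool:
            time.sleep(0.001)  # simulate a short database call
            return True

    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(handle_request, range(requests)))


if __name__ == "__main__":
    pool = BoundedPool(max_connections=4)
    results = run_concurrent_requests(pool, workers=16, requests=64)
    assert all(results)
    assert pool.peak <= 4    # never more connections than the limit
    assert pool.in_use == 0  # nothing left checked out
```

A check like this, run before release with more workers than available connections, is one way to surface connection-handling problems that only appear under concurrent load.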

We are sorry for the inconvenience this caused. Thank you for your patience and support. Please reach out to support@dimagi.com if you have further questions about the incident.

– CommCare HQ Support Team

Posted Feb 25, 2021 - 10:54 UTC

Resolved
Dear users, this issue has now been fully resolved. We appreciate your understanding and patience while we worked to resolve this problem.
Posted Feb 18, 2021 - 16:18 UTC
Monitoring
Dear users, our engineers have identified the issue and implemented a resolution. Users should no longer be seeing error messages on Web Apps as of 15:48 UTC. We're closely monitoring the site. Thank you for your patience while we work on this issue.
Posted Feb 18, 2021 - 16:00 UTC
Update
Dear users, our engineering team is continuing to look into the issue, and we will provide an update within the next 30 minutes. We apologize for the inconvenience being caused for users.
Posted Feb 18, 2021 - 15:46 UTC
Investigating
Dear users, since 14:58 UTC, CommCare has been experiencing an issue resulting in intermittent errors on Web Apps. Our engineers are actively investigating this with the highest priority. We will be sure to provide frequent updates.
Posted Feb 18, 2021 - 14:50 UTC
This incident affected: www.commcarehq.org (Syncs with CommCare HQ, Form Submissions, Web Apps).