On 11/17/22, a standard deploy was released at 2:00 am ET. The deploy included code changes for a new validation that checks the maximum number of cases a user can select on a multi-select case list screen. A subset of BHA users first experienced an issue at 8:42 am ET, when they were presented with an error message while attempting to choose a case from a multi-select case list. By 11:46 am ET, our incident response team had resolved the issue, and Web Apps was fully functional.
Summary of the incident
On 11/17/2022, a standard code deploy went out to CommCare’s production environment. The deploy contained code changes that prevented some users from choosing a case within a multi-select case list.
At 10:02 am ET, a ticket was opened by the Dimagi BHA project team alerting support to several errors in the error log.
At 10:42 am ET, a Priority 1 incident was declared and an emergency response team was assembled to address the issue. Because the errors began after the deploy and the deploy included a code change related to multi-select, the team had a strong hypothesis that the deploy had introduced the issue.
At 10:52 am ET, the developer who had created the change confirmed the origin of the issue and that the code needed to be removed. A response team then began work to reverse the change that introduced the issue and re-deploy the system.
At 11:34 am ET, the reversion of the problematic change was deployed. Our error monitoring tool confirmed that the errors had stopped, and we asked users to confirm functionality.
By 11:46 am ET, the BHA delivery team had heard back from the users and were able to confirm that they were no longer experiencing the issue.
Our Next Steps
The errors users saw were caused by new functionality that constrains the total number of cases a user can select from a multi-select case list. Thursday morning's deploy changed the code in a way that broke selection for some users who had previously used multi-select case lists.
We have protection against these types of issues (changes to the serialization schema) when updating model definitions, but in this case it was a single attribute's value that changed, which was not caught.
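A minimal sketch of this kind of gap, written in Python with entirely hypothetical class, field, and function names (it is not the actual CommCare code, and the values are invented for illustration): a check that fingerprints only a model's field names and types sees no difference when a single attribute's value changes, while a check that also covers the values does.

```python
from dataclasses import MISSING, dataclass, fields
import hashlib
import json


# Hypothetical "before" model: all names and values are assumptions.
@dataclass
class CaseListConfigBefore:
    multi_select: bool = False
    max_select_value: int = 0


# Hypothetical "after" model: same fields and types, one value changed.
@dataclass
class CaseListConfigAfter:
    multi_select: bool = False
    max_select_value: int = 100


def _digest(payload):
    """Stable hash of a JSON-serializable description of a model."""
    encoded = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def _type_name(field):
    return getattr(field.type, "__name__", str(field.type))


def structure_fingerprint(model):
    """Covers field names and types only (a schema-level check)."""
    return _digest([[f.name, _type_name(f)] for f in fields(model)])


def value_fingerprint(model):
    """Also covers each field's default value."""
    return _digest([
        [f.name, _type_name(f), None if f.default is MISSING else f.default]
        for f in fields(model)
    ])


# The schema-level check sees no change between the two models...
same_structure = (
    structure_fingerprint(CaseListConfigBefore)
    == structure_fingerprint(CaseListConfigAfter)
)
# ...while the value-aware check flags the changed attribute.
same_values = (
    value_fingerprint(CaseListConfigBefore)
    == value_fingerprint(CaseListConfigAfter)
)
```

In this sketch `same_structure` is true and `same_values` is false, which is the shape of change a schema-only safeguard would let through. Whether a value-aware check is practical depends on how often such values legitimately change.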
Going forward, the engineering team will incorporate awareness of this type of risk into code review. The engineering team is also exploring a method to reduce the number of errors reported so that this type of issue no longer blocks users.
We understand that our users expect a positive user experience, and we are sorry for the inconvenience this has caused. Thank you for your patience and support. Please reach out to firstname.lastname@example.org if you have further questions about the incident.