Mobile asset upload outage in EU

Incident Report for Fullstory

Postmortem

From Wednesday, July 16, 8:05 PM UTC through Thursday, July 17, 9:42 AM UTC, an update to the Native Mobile asset upload service caused 50-70% of all asset management requests to fail because requests for rs.eu1.fullstory.com and mr.eu1.fullstory.com were routed incorrectly.

  • All Native Mobile builds targeting the EU environment would likely have failed during the incident
  • iOS Native Mobile session capture in the EU environment was impacted once an asset upload failed
  • Android Native Mobile session capture in the EU environment was not impacted
  • Native Mobile builds targeting the NA environment were not impacted
  • Native Mobile session capture in the NA environment was not impacted
  • Web session capture in NA and EU was not impacted

Customer Impact

During the incident window, any new Native Mobile build that performed asset uploading or checking would have failed due to 404 errors when calling the required endpoint. Because multiple calls are made during the build process, almost all builds were likely impacted during the partial outage. Retries of failed builds would have resulted in the same failures until the underlying issue was resolved.

Additionally, session capture was disrupted for applications using the iOS SDK that perform asset uploads at runtime, because a 404 error during capture causes capture to stop immediately. Sessions in progress were interrupted, new sessions could not be created, and this capture data is not available. Android applications were not interrupted; however, assets uploaded during these sessions are likely not available during playback.

Root Cause

An update to our asset management service was deployed on July 16 to improve the reliability of asset uploads for Native Mobile builds. To mitigate a source of failures during asset uploads, the service was split into two parts, separating the asset upload endpoints from the resource crawling and fetching functionality so that uploads could be handled more robustly. To support this change, a routing update for our public endpoints was necessary to direct asset upload requests to the appropriate service. While the necessary routing changes were applied programmatically in our NA production and staging environments, the update was not successful in our EU production environments. As a result, subsequent asset upload requests failed 50-70% of the time because they were not delegated to the appropriate service.
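As a rough illustration of this failure mode, the sketch below models prefix-based routing in Python. It is not our actual routing configuration: the path prefixes, service names, and the `route` and `handle` helpers are all hypothetical, and it does not attempt to reproduce the 50-70% partial failure rate. It only shows how a missing routing rule can send an upload request to a service that no longer serves that path, which then answers with a 404.

```python
# Hypothetical path prefixes and service names, for illustration only.
ROUTES_WITH_UPDATE = {
    "/assets/upload": "asset-upload-service",
    "/": "resource-fetch-service",
}
# The same table without the new rule, as if the routing update had not applied.
ROUTES_WITHOUT_UPDATE = {
    "/": "resource-fetch-service",
}

# Path prefixes each service answers after the split (assumed).
SERVED_PREFIXES = {
    "asset-upload-service": ("/assets/upload",),
    "resource-fetch-service": ("/fetch", "/crawl"),
}


def route(path, routes):
    """Return the service registered for the longest matching prefix."""
    best = max((prefix for prefix in routes if path.startswith(prefix)), key=len)
    return routes[best]


def handle(path, routes):
    """Return the HTTP status the routed service would give for this path."""
    service = route(path, routes)
    if path.startswith(SERVED_PREFIXES[service]):
        return 200
    # The request reached a service that does not own this path.
    return 404
```

With the rule in place, `handle("/assets/upload/abc", ROUTES_WITH_UPDATE)` returns 200; without it, the same request falls through to the catch-all rule and returns 404.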

Resolution

The issue was addressed on July 17 at 9:20 AM UTC by manually applying the necessary routing changes. As soon as these changes were completed, all requests were properly routed to the correct asset upload service, and normal functionality was restored.

Process Changes and Prevention

We are committed to preventing a recurrence of this incident. We’ve completed the following action items:

  • We have increased the frequency of checks that detect configuration differences, both between each environment and its expected configuration and directly between our NA and EU production environments.
  • We have added public monitoring checks for all public-facing asset management endpoints used by the Native Mobile SDK, expanding upon our existing coverage of capture and API endpoints.
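A configuration check of the kind described in the first item can be sketched as follows. This is a minimal Python sketch under stated assumptions, not our production tooling: it assumes each environment's routing configuration can be exported as a flat dictionary with string keys, and the `diff_config` and `check_environments` helpers are hypothetical names.

```python
def diff_config(expected, actual):
    """Return {key: (expected_value, actual_value)} for every mismatch.

    A key missing on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(expected.keys() | actual.keys()):
        expected_value = expected.get(key)
        actual_value = actual.get(key)
        if expected_value != actual_value:
            drift[key] = (expected_value, actual_value)
    return drift


def check_environments(expected, environments):
    """Compare every environment to the expected config; keep only drifted ones."""
    report = {name: diff_config(expected, cfg) for name, cfg in environments.items()}
    return {name: drift for name, drift in report.items() if drift}
```

The same `diff_config` helper covers the direct comparison as well: calling `diff_config(na_config, eu_config)` lists every key on which the two production environments disagree, even if the expected configuration itself is out of date.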

In addition, we are taking the following steps:

  • We will update the iOS SDK so that asset upload failures will not disrupt session capture, aligning with the current behavior of the Android SDK.
  • We will add prompts and alerts when production infrastructure changes are applied to ensure that all environments are thoroughly validated.
  • We will add automated integration testing that covers all expected uses of our exposed asset management endpoints to ensure proper functionality.
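The planned SDK behavior in the first item, where a failed asset upload no longer stops capture, follows a common failure-isolation pattern. The sketch below shows that pattern in Python rather than in the SDK's own code; the `CaptureSession` class and its `uploader` callable are hypothetical and stand in for whatever the SDK actually uses.

```python
class CaptureSession:
    """Toy capture session; `uploader` is any callable that may raise."""

    def __init__(self, uploader):
        self.uploader = uploader
        self.active = True
        self.events = []
        self.failed_assets = []

    def upload_asset(self, asset_id):
        """Try to upload one asset; never let a failure stop the session."""
        try:
            self.uploader(asset_id)
        except Exception:
            # Record the failure and keep capturing. The asset may be
            # missing at playback, but the session data is preserved.
            self.failed_assets.append(asset_id)
            return False
        return True

    def record(self, event):
        """Append an event while the session is active."""
        if self.active:
            self.events.append(event)
```

The design choice here is that an upload error degrades playback fidelity for one asset instead of ending the whole session, which matches the Android behavior described above.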

We deeply regret this incident and invite any Fullstory customer who was materially affected to contact support@fullstory.com. We stand ready to fully address all of your concerns.

Posted Jul 18, 2025 - 13:24 EDT

Resolved

An issue was identified with mobile capture for EU orgs. This incident has now been resolved.
Posted Jul 17, 2025 - 05:58 EDT

Monitoring

An issue was identified with mobile capture for EU orgs. A fix has been implemented and we are now monitoring.
Posted Jul 17, 2025 - 05:52 EDT

Identified

An issue was identified with mobile capture for EU orgs. We have determined the cause and rolled back the change that resulted in this issue.
Posted Jul 17, 2025 - 05:11 EDT

Investigating

We are currently investigating an issue with mobile capture for EU orgs.
Posted Jul 17, 2025 - 05:08 EDT
This incident affected: Data Capture (Native Mobile Capture).