Recorded Future Sandbox Service Disruption

Incident Report for Recorded Future

Postmortem

Summary

On July 1, 2025, starting at approximately 15:42 UTC, Recorded Future Sandbox services (excluding those for US-hosted enterprise users) experienced an outage due to a critical incident at Scaleway's nl-ams-1 datacenter. The datacenter's thermal management system was unable to cope with extreme temperatures in the Netherlands, which forced a shutdown of Scaleway's services in the affected region.

Our team detected the outage via automated alerts and quickly began restoring service by provisioning resources in an alternate availability zone. We have contacted Scaleway, and they will publish a post-mortem for this incident soon.

Impact

The incident resulted in the complete unavailability of several Recorded Future Sandbox environments.

Crucially, despite the service unavailability, no data was lost or compromised. All data stored in the affected AZ remained encrypted at rest for the entire duration of the incident.

Root Cause

The root cause of this outage was an external infrastructure failure at Scaleway's nl-ams-1 datacenter. Specifically, their thermal management system failed to maintain operational temperatures, necessitating a shutdown of their services within that datacenter.

Timeline of Events

  • July 1, 2025 - 15:42 UTC: Our internal monitoring systems triggered alerts indicating unavailability of several sandbox-related services. This was also the time Scaleway stated services were shut down.
  • July 1, 2025 - Shortly after 15:42 UTC: Our team immediately began investigating the alerts. Concurrently, we observed an incident posted on the Scaleway status page (https://status.scaleway.com/incidents/1vz4xfgy2gcl) confirming a widespread issue in their nl-ams-1 datacenter due to thermal management failures and subsequent service shutdowns.
  • July 1, 2025 - 15:45 UTC (Approx.): Our team initiated the process of provisioning replacement services in a different Scaleway Availability Zone (AZ) to mitigate the impact of the nl-ams-1 outage.
  • July 1, 2025 - 19:26 UTC: Recorded Future Sandbox was successfully restored and became accessible to users.
  • July 1, 2025 - 23:17 UTC: Internal pipelines providing data to the Recorded Future Platform were successfully restored.
  • July 2, 2025 - 00:42 UTC: Scaleway officially reported the issue in their nl-ams-1 datacenter as resolved. (Note: Our services were recovered significantly earlier due to our AZ failover strategy.)
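
For readers who want the recovery durations implied by the timeline above, the following is a minimal Python sketch that computes the elapsed time from the start of the outage to each milestone. The timestamps are taken directly from the timeline. The helper names (utc, elapsed, summarize) are illustrative assumptions and are not part of any Recorded Future tooling.

```python
from datetime import datetime, timezone


def utc(text):
    """Parse 'YYYY-MM-DD HH:MM' as a timezone-aware UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps copied from the timeline (all UTC).
OUTAGE_START = utc("2025-07-01 15:42")

MILESTONES = {
    "Sandbox restored": utc("2025-07-01 19:26"),
    "Internal pipelines restored": utc("2025-07-01 23:17"),
    "Scaleway reports nl-ams-1 resolved": utc("2025-07-02 00:42"),
}


def elapsed(start, end):
    """Format the time between two datetimes as '<hours>h<minutes>m'."""
    minutes = int((end - start).total_seconds() // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


def summarize():
    """Map each milestone label to its elapsed time since the outage began."""
    return {label: elapsed(OUTAGE_START, when) for label, when in MILESTONES.items()}


if __name__ == "__main__":
    for label, duration in summarize().items():
        print(f"{label}: {duration} after outage start")
```

By this calculation, the Sandbox was restored 3h44m after the outage began, the internal pipelines 7h35m after, and Scaleway reported its own resolution 9h00m after.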

What Went Well

  • Automated Alerting: Our monitoring systems detected the service unavailability immediately, allowing for rapid initiation of incident response.
  • Quick External Cause Identification: The prompt update from Scaleway's status page allowed our team to quickly confirm the external nature of the incident and shift focus from internal debugging.
  • Effective AZ Failover: Our team's ability to immediately begin provisioning services in another Availability Zone was crucial in significantly reducing the overall downtime, allowing us to restore service well before Scaleway fully resolved their datacenter issue.

Remediations

Following the incident and the restoration of service in the affected Availability Zone:

  • Infrastructure Retirement: After the nl-ams-1 Availability Zone came back online, we retired the old infrastructure that was running in that single AZ.
  • Improved Frontend Failover Automation: We have improved our failover automation for the frontend service to handle entire Availability Zones going down. The essential data and services required to set up frontend deployments now run across multiple AZs.
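
As a purely illustrative sketch of the kind of decision that zone-level failover automation has to make, the Python below picks a healthy Availability Zone when one zone is marked as failed. It does not describe our actual automation: the ZoneStatus type, the pick_failover_zone function, and the zone names other than nl-ams-1 are hypothetical, and a real system would derive health from monitoring data rather than a hard-coded flag.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ZoneStatus:
    """Hypothetical health record for one Availability Zone."""
    name: str
    healthy: bool


def pick_failover_zone(zones: Iterable[ZoneStatus], failed_zone: str) -> Optional[str]:
    """Return the first healthy zone other than the failed one, or None if none exists."""
    for zone in zones:
        if zone.name != failed_zone and zone.healthy:
            return zone.name
    return None


if __name__ == "__main__":
    # "alt-zone-a" and "alt-zone-b" are placeholder names, not real zones.
    zones = [
        ZoneStatus("nl-ams-1", healthy=False),
        ZoneStatus("alt-zone-a", healthy=False),
        ZoneStatus("alt-zone-b", healthy=True),
    ]
    print(pick_failover_zone(zones, failed_zone="nl-ams-1"))
```

Returning None when no healthy zone remains makes the "nowhere to fail over to" case explicit, so a caller can escalate to a human instead of deploying into a degraded zone.
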
Posted Jul 03, 2025 - 17:19 EDT

Resolved

Dear Recorded Future Customers,

We want to inform you that the Sandbox service disruption and degraded performance have now been resolved. System performance and response times are back to normal.

Please contact our support team at support@recordedfuture.com if you have any questions or concerns.


Regards,
Recorded Future Platform Operations
Posted Jul 01, 2025 - 19:13 EDT

Monitoring

Dear Customer,

A fix has been applied for all Sandbox sites, including sandbox.recordedfuture.com. Our team is monitoring the service and completing the remaining restoration work, but the Sandbox feature should be working normally at this point. We will post an update when we consider the incident fully resolved.

Please contact our support team at support@recordedfuture.com if you have any questions.

Regards,
Recorded Future Platform Operations
Posted Jul 01, 2025 - 15:40 EDT

Identified

Dear Customer,

We are making progress in restoring the Sandbox feature, and our private.tria.ge site is now available and operational. We are still working on restoring full access to the Sandbox on the sandbox.recordedfuture.com site, and will continue to provide updates as they become available.

Please contact our support team at support@recordedfuture.com if you have any questions.

Regards,
Recorded Future Platform Operations
Posted Jul 01, 2025 - 14:26 EDT

Investigating

Dear Customer,

We are experiencing a service disruption to our Sandbox feature, with access currently unavailable on sandbox.recordedfuture.com and private.tria.ge. We are reviewing the incident with high urgency, and will continue to provide updates as they become available.

Please contact our support team at support@recordedfuture.com if you have any questions.

Regards,
Recorded Future Platform Operations
Posted Jul 01, 2025 - 12:34 EDT
This incident affected: Sandbox.