Summary
On July 1, 2025, starting at approximately 15:42 UTC, Recorded Future Sandbox services (excluding those for US-hosted enterprise users) experienced an outage caused by a critical incident at Scaleway's nl-ams-1 datacenter. The datacenter's thermal management solution was unable to cope with extreme temperatures in the Netherlands, which forced a shutdown of Scaleway's services in the affected region.
Our team detected the outage via automated alerts and began restoring service by provisioning resources in an alternate availability zone. We have reached out to Scaleway, and they will publish a post-mortem for this incident soon.
Impact
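The incident resulted in the complete unavailability of several Recorded Future Sandbox environments from approximately 15:42 UTC until 19:26 UTC on July 1, 2025 (about 3 hours 44 minutes). Internal pipelines providing data to the Recorded Future Platform were restored at 23:17 UTC.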
Crucially, despite the service unavailability, no data was lost or compromised. All data stored in the affected AZ remained encrypted at rest for the entire duration of the incident.
Root Cause
The root cause of this outage was an external infrastructure failure at Scaleway's nl-ams-1 datacenter. Specifically, their thermal management system failed to maintain operational temperatures, necessitating a shutdown of their services within that datacenter.
Timeline of Events
- July 1, 2025 - 15:42 UTC: Our internal monitoring systems triggered alerts indicating that several sandbox-related services were unavailable. Scaleway stated that its services were shut down at this same time.
- July 1, 2025 - Shortly after 15:42 UTC: Our team immediately began investigating the alerts. Concurrently, we observed an incident posted on the Scaleway status page (https://status.scaleway.com/incidents/1vz4xfgy2gcl) confirming a widespread issue in their nl-ams-1 datacenter due to thermal management failures and subsequent service shutdowns.
- July 1, 2025 - 15:45 UTC (Approx.): Our team initiated the process of provisioning replacement services in a different Scaleway Availability Zone (AZ) to mitigate the impact of the nl-ams-1 outage.
- July 1, 2025 - 19:26 UTC: Recorded Future Sandbox was successfully restored and became accessible to users.
- July 1, 2025 - 23:17 UTC: Internal pipelines providing data to the Recorded Future Platform were successfully restored.
- July 2, 2025 - 00:42 UTC: Scaleway officially reported the issue in their nl-ams-1 datacenter as resolved. (Note: Our services recovered significantly earlier thanks to our AZ failover strategy.)
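The elapsed times implied by the timeline above can be checked with a short calculation. The following is a minimal sketch that uses only the timestamps listed in the timeline; the helper names (`utc`, `elapsed`, `summarize`) are illustrative and not part of any Recorded Future or Scaleway tooling.

```python
from datetime import datetime, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps taken directly from the timeline above.
OUTAGE_START = utc("2025-07-01 15:42")
MILESTONES = {
    "Sandbox restored": utc("2025-07-01 19:26"),
    "Internal pipelines restored": utc("2025-07-01 23:17"),
    "Scaleway reported resolution": utc("2025-07-02 00:42"),
}


def elapsed(start: datetime, end: datetime) -> str:
    """Format the time between two timestamps as hours and minutes."""
    minutes = int((end - start).total_seconds() // 60)
    hours, rem = divmod(minutes, 60)
    return f"{hours}h {rem:02d}m"


def summarize() -> dict:
    """Return the elapsed time from outage start to each milestone."""
    return {label: elapsed(OUTAGE_START, when) for label, when in MILESTONES.items()}


if __name__ == "__main__":
    for label, duration in summarize().items():
        print(f"{label}: {duration} after outage start")
```

Under these timestamps, Sandbox access returned roughly 3 hours 44 minutes after the outage began, about 5 hours 16 minutes before Scaleway reported the datacenter issue as resolved.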
What Went Well
- Automated Alerting: Our monitoring systems detected the service unavailability immediately, allowing incident response to begin without delay.
- Quick External Cause Identification: The prompt update from Scaleway's status page allowed our team to quickly confirm the external nature of the incident and shift focus from internal debugging.
- Effective AZ Failover: Our team began provisioning services in another Availability Zone within minutes of the first alerts. This significantly reduced overall downtime and allowed us to restore service well before Scaleway fully resolved their datacenter issue.
Remediations
Following the incident and the restoration of service in the affected Availability Zone:
- Infrastructure Retirement: After the nl-ams-1 Availability Zone came back online, we retired the old infrastructure that was running in that single AZ.
- Improved Frontend Failover Automation: We have improved our failover automation for the frontend service to handle entire Availability Zones going down. The essential data and services required to set up frontend deployments now run across multiple AZs.
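As an illustration of the idea behind zone-level failover, the sketch below picks the first healthy zone from a preference list. It is a hypothetical example only and does not describe our actual automation: the function name `pick_active_zone`, the health map, and the zone name `alt-zone-a` are all assumptions made for illustration, and only `nl-ams-1` comes from this report.

```python
from typing import Mapping, Optional, Sequence


def pick_active_zone(
    preferred: Sequence[str], healthy: Mapping[str, bool]
) -> Optional[str]:
    """Return the first zone in preference order that is reported healthy.

    Zones missing from the health map are treated as unhealthy, so an
    unreachable health check never routes traffic to an unknown zone.
    """
    for zone in preferred:
        if healthy.get(zone, False):
            return zone
    return None  # No healthy zone available; the caller must escalate.


if __name__ == "__main__":
    # Hypothetical state during the incident: the primary zone is down.
    zones = ["nl-ams-1", "alt-zone-a"]
    health = {"nl-ams-1": False, "alt-zone-a": True}
    print(pick_active_zone(zones, health))
```

A selection step like this only helps if the data and services needed to bring up a deployment are already available outside the failed zone, which is why the remediation above focuses on running them across multiple AZs.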