Cloud service reliability alarm sounds again: ChatGPT goes down, and Cloudflare discloses details of its 5-hour outage

Alarm bells are ringing once again over the reliability of cloud services.

On November 18 local time, cloud infrastructure giant Cloudflare experienced a service outage, causing many major websites around the world to become inaccessible.

According to Downdetector, an outage-tracking website (which was itself temporarily inaccessible to some users), Anthropic's Claude chatbot, Trump's Truth Social, and Musk's social media platform X were among the services affected. Some digital services of the New Jersey public transportation system were also knocked offline.

Later that day, OpenAI's status page showed that ChatGPT and its Sora short-video app had fully recovered from an outage caused by an issue at a "third-party service provider."

Cloudflare was founded at Harvard University in 2009 and officially launched its first beta version in 2010. It went public on the New York Stock Exchange in 2019 and currently serves 30% of Fortune 1000 companies. Its core services include defense against DDoS (Distributed Denial of Service) attacks, in which a target website is overwhelmed with a massive number of fake requests until it crashes. According to foreign media reports, the company's traffic management and security services cover approximately 20% of internet traffic.

As a result of the incident, Cloudflare's stock price fell 2.83% by the close of trading on the US stock market on the 18th.

Cloudflare co-founder and CEO Matthew Prince stated that this was Cloudflare's most severe outage since 2019, adding, "Such an outage today is unacceptable... On behalf of the entire Cloudflare team, I apologize for the inconvenience caused to the internet."

Error messages appearing on affected websites

Cloudflare CTO Dane Knecht also posted on social media, apologizing for the outage. He said the incident was caused by a latent bug in a service that underpins the company's bot mitigation capability. After a routine configuration change, the service began to crash, which led to widespread degradation of Cloudflare's network and other services. He added that this was not an attack.

Knecht stated that the outage, its impact, and the recovery time were unacceptable. "We have begun work to ensure such incidents do not happen again, but we are well aware that it has indeed had a real impact. The trust our customers place in us is our most valuable asset, and we will do everything in our power to regain that trust."

Screenshot of a tweet from Cloudflare CTO Dane Knecht

On the morning of November 19 local time, Cloudflare released a full report detailing the incident, which lasted more than five hours. The impact began at 11:28 AM local time on the 18th, when errors were first observed in customer HTTP traffic. By 2:30 PM, the main impact had been resolved: errors in affected downstream services began to fall, and most services started running correctly again. At 5:06 PM, all downstream services had been restarted, all operations were fully restored, and the impact ended.

Cloudflare stated that when the outage occurred, the company "initially mistakenly suspected that the observed symptoms were caused by a massive DDoS attack," but subsequently identified the real cause: a change in the behavior of the underlying ClickHouse query that generates the feature file used by its Bot Management module. After the change, the file contained a large number of duplicate "feature" rows, which caused the Bot Management module to trigger an error. As a result, the core proxy system returned HTTP 5xx error codes for any traffic that relied on this module. When the faulty file, which now held more entries than the feature limit allowed, propagated to the servers, it caused the software to panic. The failure also affected Workers KV and Access, services used by Cloudflare's customers that rely on the core proxy.

Cloudflare subsequently resolved the issue by stopping the generation and propagation of the faulty feature file, manually inserting a known good file into the feature file distribution queue, and then forcibly restarting the core proxy. The number of 5xx error codes then returned to normal.
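The failure mode described above can be illustrated with a minimal Python sketch. This is not Cloudflare's code: the limit value, the feature names, and the function names are all hypothetical, and the real system is not written this way. The sketch only shows how duplicated rows can push an otherwise valid file past a fixed limit, so that a loader that treats the limit as a hard invariant fails outright.

```python
FEATURE_LIMIT = 5  # hypothetical cap; the article does not give the real value


class FeatureLimitExceeded(Exception):
    """Raised when a feature file has more rows than the loader allows."""


def load_features(lines):
    """Strict loader: refuses any file with more rows than FEATURE_LIMIT."""
    if len(lines) > FEATURE_LIMIT:
        raise FeatureLimitExceeded(f"{len(lines)} rows > limit {FEATURE_LIMIT}")
    return list(lines)


def dedupe(lines):
    """Drop duplicate rows while keeping first-seen order."""
    return list(dict.fromkeys(lines))


good_file = ["f1", "f2", "f3"]
# Simulate a query that suddenly returns every row twice (hypothetical).
bad_file = good_file * 2

load_features(good_file)  # loads normally: 3 rows is under the limit
try:
    load_features(bad_file)  # 6 rows exceeds the limit of 5
except FeatureLimitExceeded as exc:
    print("load failed:", exc)
```

In this toy version the same three features, once deduplicated, would still fit under the limit; the failure comes purely from the unexpected duplicates.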

Timeline of Cloudflare's Outage

Cloudflare stated, "Given Cloudflare's importance to the internet ecosystem, any disruption to any of our systems is unacceptable," and expressed deep regret for the impact on customers and the internet as a whole.

Cloudflare stated that it has begun working out how to harden its systems against similar failures in the future. The measures include hardening the ingestion of Cloudflare-generated configuration files in the same way it would for user-generated input; enabling more global kill switches for features; eliminating the possibility that core dumps or other error reports exhaust system resources; and reviewing the failure modes for error conditions across all core proxy modules.
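Two of these measures, treating internally generated files as untrusted input and adding an emergency stop (kill) switch, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a description of Cloudflare's implementation: the class name `BotModule`, the limit `MAX_FEATURES`, and the `enabled` flag are all hypothetical.

```python
from dataclasses import dataclass, field

MAX_FEATURES = 5  # hypothetical limit, chosen only for the example


@dataclass
class BotModule:
    """Toy module that keeps serving with the last file that validated."""
    active: list = field(default_factory=list)
    enabled: bool = True  # hypothetical global kill switch

    def validate(self, candidate):
        """Return a list of problems; an empty list means the file is acceptable."""
        problems = []
        if len(candidate) > MAX_FEATURES:
            problems.append("too many features")
        if len(set(candidate)) != len(candidate):
            problems.append("duplicate features")
        return problems

    def ingest(self, candidate):
        """Swap in the new file only if it validates; otherwise keep the old one."""
        problems = self.validate(candidate)
        if problems:
            return False, problems
        self.active = list(candidate)
        return True, []

    def score(self, request):
        """Return None (module skipped) when the kill switch is off."""
        if not self.enabled:
            return None
        return len(self.active)  # placeholder for a real scoring function


module = BotModule()
module.ingest(["f1", "f2", "f3"])                    # accepted
ok, problems = module.ingest(["f1", "f2", "f3"] * 2)  # rejected, old file kept
print(ok, problems, module.active)
```

The design choice illustrated here is that a bad file is rejected at ingestion and the previous known good file stays in service, while the kill switch lets operators bypass the module entirely instead of failing requests.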

According to foreign media reports, less than a month before this incident, Amazon's cloud services had experienced a full-day outage that crippled multiple online services, and Microsoft's Azure cloud services and 365 office suite subsequently suffered global outages as well.

Back in July 2024, the cybersecurity company CrowdStrike caused a massive system failure with a flawed software update, setting off a chain reaction of flight cancellations, disruptions to financial services, and delayed surgeries at hospitals.

(Source: The Paper)
