AMD EPYC Rome processors hang after 1044 days of continuous operation, and AMD will not fix the bug

AMD has published a notice (PDF) about an erratum in its EPYC 7002 Rome server processors: a core hangs after 1044 days of continuous operation. In other words, to keep working properly the server must be rebooted at least once every 2.86 years or so. AMD does not plan to fix the bug.
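As a rough illustration of what "reboot before 1044 days" means in practice, here is a minimal sketch that estimates how many days of uptime remain. It assumes a Linux host where `/proc/uptime` is available, and it treats time since boot as a stand-in for time since the last system reset; the function names `days_until_hang` and `read_uptime_seconds` are hypothetical, not part of any AMD tool.

```python
from pathlib import Path

HANG_AFTER_DAYS = 1044      # figure quoted from AMD's notice
SECONDS_PER_DAY = 86_400


def days_until_hang(uptime_seconds: float, limit_days: float = HANG_AFTER_DAYS) -> float:
    """Return days left before uptime reaches the limit (negative if already past)."""
    return limit_days - uptime_seconds / SECONDS_PER_DAY


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read seconds since boot; the first field of /proc/uptime on Linux."""
    return float(Path(path).read_text().split()[0])


if __name__ == "__main__":
    try:
        remaining = days_until_hang(read_uptime_seconds())
        print(f"Days of uptime left before the 1044-day mark: {remaining:.1f}")
    except OSError:
        # Not a Linux host, or /proc is unavailable.
        print("Could not read /proc/uptime on this system.")
```

A monitoring job could call `days_until_hang` and raise an alert once the result drops below a chosen safety margin, for example 30 days.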

The problem is that a core fails to exit the CC6 power-saving state (Core C6), which lowers the core's voltage and frequency while it is idle. AMD clarified that the exact time at which the error occurs can depend on spread-spectrum modulation and on the REFCLK reference frequency, which help the chip keep track of time.

A plausible hypothesis about the cause of the error came from Reddit user acid_migrain. According to this version, the error actually appears not after 1044 days but after 1042 days and 12 hours. The timestamp counter runs at 2800 MHz, and a simple calculation shows that 2800 × 10^6 ticks per second × 1042.5 days (expressed in seconds) is roughly 0x380000000000000, which the user considered too many zeros to be a coincidence. There are two simple workarounds: either reboot the server before it reaches 1044 days of uptime (the figure from AMD) or disable the CC6 power-saving state.
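The arithmetic behind the hypothesis can be checked with a short script. This is only a sketch of the calculation described above: it assumes a counter ticking at exactly 2800 MHz and takes the constant as 0x38 followed by 13 hexadecimal zeros (that is, 56 × 2^52); the variable names are illustrative.

```python
TSC_HZ = 2_800_000_000           # 2800 MHz, the counter rate cited in the hypothesis
SECONDS_PER_DAY = 86_400
THRESHOLD = 0x380000000000000    # 0x38 followed by 13 hex zeros = 56 * 2**52

# Counter ticks accumulated after exactly 1042.5 days (integer arithmetic).
ticks_at_1042_5_days = TSC_HZ * 10_425 * SECONDS_PER_DAY // 10

# Uptime, in days, at which the counter would reach the round hex value.
days_to_threshold = THRESHOLD / TSC_HZ / SECONDS_PER_DAY

print(f"ticks after 1042.5 days: {ticks_at_1042_5_days:#x}")
print(f"round hex threshold:     {THRESHOLD:#x}")
print(f"days to reach threshold: {days_to_threshold:.4f}")
print(f"gap in seconds:          {(ticks_at_1042_5_days - THRESHOLD) / TSC_HZ:.2f}")
```

Under these assumptions the two tick counts differ by only a few seconds' worth of ticks, which is why the round hexadecimal value looks like a plausible trigger point.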

AMD EPYC Rome processors were released in 2019, so some of their owners may already have run into this problem. The manufacturer added that it has no plans to fix the bug, possibly because a fix would be too expensive or because the bug affects too few customers.


