System node reboots and memory errors Miarka closed
On the evening of August 18, miarka-q, which manages Slurm, reported several severe memory errors that caused reboots and unreliable operation. During troubleshooting, the memory configuration of the miarka1 login node was also affected, and both machines experienced unscheduled hard reboots.
Both machines are now in operation again and it’s possible to submit jobs. A few queued jobs may have been lost, but jobs that were running during the outage should not be affected. miarka1 currently has a reduced memory size and slightly reduced memory performance. This will be remedied at a later date, which will require another reboot.
Affected systems: miarka
Written by Support Team on