Memory issues on some rackham nodes (closed)

We’re seeing memory-related issues on some rackham nodes. The underlying problem is that the kernel is at times unable to get the memory it needs.

This can show up as a wide range of symptoms, including failed I/O operations, failed memory allocations, and possibly bus errors.

We’re still investigating the issue and do not yet have a timeline for when it will be resolved.

Final ticket report

We changed kernel and slurm settings to make it more likely that memory is available to the kernel when it needs it.

Update 2018-08-07 09:57

We believe we have found at least part of the cause (a configuration setting that went astray during a cleanup). We’ve fixed this and will monitor to see if it helps.

Update 2018-08-13 13:40

Changing this parameter appears to have helped, although we have seen a few more issues since. We’ve tuned the configuration slightly and will continue monitoring.

Update 2018-08-16 11:06

After additional tuning on the evening of Tuesday, 14 August, we think this issue should be fully resolved, but we will continue monitoring.

Update 2018-08-17 08:03

We have seen nothing new that indicates a problem, so we’re confident this is resolved for now.