Lower performance when running jobs on Rackham and Snowy (closed)

We have received reports from several users that jobs in some cases run much slower today than a few months ago. We have started an investigation to see whether any of the recent changes, such as BIOS and/or InfiniBand HCA firmware upgrades, have known performance issues. We will update this news item as our investigation continues.

If you have any questions or want to share information regarding the above, please let us know.

UPPMAX Support Team

Update 2020-01-21 10:00

We have found that on a majority of compute nodes a kernel thread (“kworker”) is pinned to a CPU core and consumes 100% of it. We traced the thread to the Lustre client, which, among other things, is responsible for mounting the project storage system Crex. Further investigation shows that this behavior is a reported bug linked to the same filesystem feature for which we had to provide a workaround in December due to stability issues. The workaround was installed on December 9th. The underlying issue was permanently fixed in the January 8th service window, and we believe the remaining performance problems are caused by the workaround still being in place (to be clear: the workaround should no longer be needed). We are now removing the workaround, and we expect the kernel thread to stop consuming 100% of a CPU core and performance to return to normal. Since the kernel thread is believed to exist on all compute nodes, this problem affects any job (not just MPI jobs) that is unlucky enough to be allocated the core the kworker thread is pinned to.
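
For users who want to check whether a node they are running on shows the symptom described above, the following is a minimal sketch in Python. It is not the tool we used in our investigation. The function names `find_busy_kworkers` and `scan_node` and the 90% threshold are assumptions made for illustration. It relies on the standard procps `ps` output columns `pid`, `psr` (the core the thread last ran on), `pcpu` and `comm`. Note that `pcpu` is an average over the lifetime of the thread, so a thread that only recently became busy may not show up.

```python
import subprocess


def find_busy_kworkers(ps_output, threshold=90.0):
    """Return (pid, core, cpu_percent, name) for each kworker thread
    whose CPU usage is at or above `threshold` percent.

    `ps_output` is the text printed by `ps -eo pid,psr,pcpu,comm`.
    The threshold of 90% is an arbitrary choice for this sketch.
    """
    busy = []
    for line in ps_output.strip().splitlines():
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, core, cpu, name = fields
        try:
            entry = (int(pid), int(core), float(cpu), name.strip())
        except ValueError:
            # Header row or a line we cannot parse: skip it.
            continue
        if entry[3].startswith("kworker") and entry[2] >= threshold:
            busy.append(entry)
    return busy


def scan_node():
    """Run ps on the current node and report busy kworker threads."""
    result = subprocess.run(
        ["ps", "-eo", "pid,psr,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_busy_kworkers(result.stdout)
```

Calling `scan_node()` inside a job returns a list of busy kworker threads together with the core each one last ran on. An empty list suggests that the node does not show the symptom, within the limits of the averaging noted above.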

Update 2020-01-28 10:00

The problem was fixed last week, and we have received reports that performance has returned to normal. Please contact support if you have any questions or if you find that your jobs are still running slowly.

Affected systems: Rackham and Snowy

Written by Support Team on January 16, 2020
