Files and directories may be hidden on Bianca (closed)
We have received reports of missing files and directories inside the /proj and /proj/nobackup directories on Bianca. Upon inspection, the files are actually there but are not shown by the “ls” command. If you are working on Bianca, be aware that jobs of the type “process all files in directory X and compile the result” might finish without errors yet produce false results because of missing input, leading to incorrect conclusions.
A workaround that mitigates this issue was implemented on Wednesday 2018-03-07.
In all cases we have investigated so far, the files (or directories) appear to have been removed but actually exist. The likely cause is the storage driver for GlusterFS.
We implemented a workaround during the maintenance day that mitigates this issue. The trade-off is that metadata operations are slower in some cases. We aim to address this issue during our next maintenance day.
During the Wednesday maintenance window we will stop the queues and unmount the storage system from the Bianca clusters. We have reason to suspect that an optimization feature might be part of the problem. We will also likely try a more recent Gluster storage driver.
We are able to treat the symptoms by manually correcting directory entries. In all known cases, hidden files should be visible again. However, it is still possible to trigger the problem and make files hidden again, so we are still working on finding the root cause.
We have sent an email to all active Bianca users to inform them about this problem and the need to be extra careful when performing analyses.
We are still working on finding what caused this issue.
Unfortunately the problem persists, and we have reason to believe the storage server software is the culprit. Making changes to the server will require us to stop the clients, which means we will have to stop any running jobs. We have created a service reservation for March, which prevents long-running jobs from starting before 2018-03-07. We have, however, decided to allow currently running jobs to complete and short pending jobs to start.
We are thankful to SciLifeLab for providing an informative warning to all users who are currently working or planning to work on Bianca while UPPMAX is working on fixing the problem:
The problem is a bit tricky to catch since some but not all files in a directory might be hidden. If you are working on Bianca, you should be aware of this as for example jobs of type “process all files in directory X and compile the result” might finish fine but create false results due to missing input, thus risking incorrect results and conclusions. I’m informing our staff here and you should be aware of this in your projects.
The UPPMAX storage team is still investigating this issue. We have made progress and understand the problem better. Analysis of the network traffic suggests that the server sends the client a malformed list of files. This might unfortunately be a bug introduced in Gluster 3.10.10, which we updated to in early February. We are discussing a downgrade to 3.10.9, but we need to gather more information first (as 3.10.9 had its own set of problems).
Finally, many users are naturally worried about their data. In no known case so far have files actually been missing; they have only been hidden from “ls”.
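Since hidden files still exist and only the directory listing is incomplete, a job of the type “process all files in directory X” can guard itself by comparing the listing against a list of file names it expects to find. The following Python sketch illustrates the idea. It is not an UPPMAX tool: the function name check_inputs and the notion of an expected-names list are our own assumptions for illustration, and it further assumes that looking a file up directly by its path succeeds even when the listing omits it.

```python
import os


def check_inputs(directory, expected_names):
    """Compare a directory listing against a list of expected file names.

    Hypothetical helper, for illustration only. Returns (hidden, missing):
      hidden  - names absent from the listing but found by direct lookup
      missing - names that cannot be found at all
    """
    listed = set(os.listdir(directory))
    hidden, missing = [], []
    for name in expected_names:
        if name in listed:
            continue
        # A lookup by full path does not rely on the directory listing.
        if os.path.lexists(os.path.join(directory, name)):
            hidden.append(name)
        else:
            missing.append(name)
    return hidden, missing
```

A job could call this before processing and abort if either returned list is non-empty, rather than silently compiling a result from incomplete input.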