After reading this [http://nfs.sourceforge.net/#faq_d10] I think I might understand the statement about the incompatibility of SQLite and NFS (which, BTW, I am using to export storage from the NAS VM (*) to the StorJ VM running on the same ESXi host). But I still do not understand why this is an issue when the NFS export in question is used by only one client: the locks do not work across different NFS clients, but they should still work on a single NFS client.
Nonetheless, I will try to share the NAS space to the StorJ VM through iSCSI (running StorJ on the NAS itself is not an option for me) after I return home, but that sounds like a major remodeling of the setup…
(*) The HBA is passed through as a raw device to the NAS VM, which eliminates most issues of running a ZFS-based NAS as a VM.
The corrupted database should not affect audits.
The satellite requests a random piece and expects the correct hash from the storage node. The storage node confirms that it accepted the challenge and should return the right hash for the requested piece.
If the storage node accepted the challenge (so it's online) but did not return the requested hash within a timeout (5 minutes), it is placed into containment mode and will not receive any new pieces from that satellite until it delivers the requested hash; it will be asked for it three more times. If the node is unable to deliver the requested hash, the audit is considered failed.
If the storage node fails a significant number of audits in a row, it will be disqualified.
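The challenge/containment flow described above can be sketched roughly like this. This is a minimal, hypothetical model: the function names and the way the timeout and retries are wired are only a mirror of this description, not Storj's actual implementation.

```python
# Hedged sketch of the audit/containment flow described above.
# Constants mirror the forum description, not Storj's real code.
AUDIT_TIMEOUT_SECONDS = 5 * 60   # node must answer within 5 minutes
CONTAINMENT_RETRIES = 3          # re-asked three more times while contained

def run_audit(node_responds, correct_hash, node_hash_for):
    """Return 'passed' or 'failed' for a single audited piece.

    node_responds(timeout) -> bool: did the node answer within the timeout?
    node_hash_for() -> str: the hash the node returns for the piece.
    """
    if not node_responds(AUDIT_TIMEOUT_SECONDS):
        # Node accepted the challenge but never delivered the hash:
        # it goes into containment and is re-asked up to three times.
        for _ in range(CONTAINMENT_RETRIES):
            if node_responds(AUDIT_TIMEOUT_SECONDS):
                break
        else:
            # Never answered during containment: the audit fails.
            return "failed"
    # The node answered; the audit passes only if the hash matches.
    return "passed" if node_hash_for() == correct_hash else "failed"
```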
Thanks @Alexey for the expansion. I have two questions regarding the logging of this auditing process:
Can I see in the logs whether the piece was found and/or a hash sent?
When a hash is sent, does the satellite reply with the info whether or not the hash was correct? If yes, is this logged on the storage node?
And is it possible for me to compute the hash of one of the pieces in question (which apparently reside on my disk) manually and somehow check its correctness? I am asking because this would let me rule out that the data got corrupted at rest on the HDD, which would mean the reason must be something else.
Yes, you'll see a downloaded line with the type GET_AUDIT if it was successful. If you see a download failed line with GET_AUDIT, the audit failed for some reason. The error will mention the specific reason, like "file does not exist".
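As a sketch, tallying audit outcomes from such log lines could look like this. The matching strings follow the description above ("downloaded" vs. "download failed" together with GET_AUDIT); real log lines carry more fields, so treat this as an assumption about the format:

```python
# Hedged sketch: count audit outcomes in a node log.
# The substrings matched here follow the description above,
# not a guaranteed log schema.
def audit_outcomes(lines):
    results = {"success": 0, "failed": 0}
    for line in lines:
        if "GET_AUDIT" not in line:
            continue  # only audit traffic is of interest here
        if "download failed" in line:
            results["failed"] += 1
        elif "downloaded" in line:
            results["success"] += 1
    return results
```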
I don't believe there is a log for this, but satellites do communicate back the total number of audits and the number of successful ones. You can find these using the Storage node dashboard API (v0.19.0).
I don't think it is. During an audit the satellite retrieves a stripe (a small bit of a piece) from all nodes hosting it. The satellite then tries to recreate the data using erasure codes and detects which pieces are corrupt. This is not a process you can do as a single node; it requires the stripes from the other nodes.
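To illustrate with a toy analogy (simple XOR parity, not the Reed–Solomon coding Storj actually uses): once you hold shares from several nodes plus some redundancy, corruption becomes detectable, but only to the party that holds all the related shares.

```python
# Toy analogy only: XOR parity across shares, standing in for erasure
# codes. A single node holding one share cannot run this check alone.
from functools import reduce

def parity(shares):
    """XOR all shares together byte-wise."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shares)

def shares_consistent(shares, parity_share):
    """True if the redundancy still matches the data shares."""
    return parity(shares) == parity_share

# Three data shares on three nodes, a parity share on a fourth.
data_shares = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
parity_share = parity(data_shares)
```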
Thanks for the hints and infos! This is what I can see in the logs: all lines from four days before the disqualification until about three days after that contain (case-insensitively) "audit" are "GET_AUDIT" messages of severity INFO, of either type "download started" or "downloaded". None of them mentions any kind of error. The only strange thing I noticed about these messages is that there are more "download started" entries (95) than "downloaded" entries (32), but after a short check it seems that some pieces have multiple "download started" entries but fewer, or just a single, "downloaded" entry. (*)
The only "download failed" messages were all not of type "GET_AUDIT" and were related to network issues, probably from being too slow (use of closed network connection, connection reset by peer, broken pipe).
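The comparison of "download started" against "downloaded" entries can be sketched like this. The regex for extracting the piece ID is an assumption about the log format (a JSON-ish `"Piece ID": "…"` field) and would need adjusting to the actual lines:

```python
import re

# Hedged sketch: find piece IDs with a "download started" GET_AUDIT entry
# but no matching "downloaded" entry. The piece-ID regex is an assumption
# about the log format, not a guaranteed schema.
PIECE_RE = re.compile(r'"Piece ID":\s*"([^"]+)"')

def missing_downloads(lines):
    started, finished = set(), set()
    for line in lines:
        if "GET_AUDIT" not in line:
            continue
        m = PIECE_RE.search(line)
        if not m:
            continue
        if "download started" in line:
            started.add(m.group(1))
        elif "downloaded" in line:
            finished.add(m.group(1))
    # Pieces that started an audit download but never finished one.
    return started - finished
```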
I'm still pretty much in the dark here. Is there a way of increasing the log level? For my next install, that is, as in the meantime I have been disqualified from all four satellites…
Edit:
(*) It seems there are three more unique piece IDs with "download started" than with "downloaded":