Missing Piece before the wipe? - Critical audit alert

Sorry, I thought you were part of the StorJ team, my bad.

No problem :slight_smile:
I'm just an SNO like you :slight_smile:

Abnormal container termination would, from the outside, be mostly indistinguishable from a power/network cut.

The satellites should not assume that a storage node has successfully received a piece until the node returns success to it, which should only happen after the node ensures that the content has been successfully persisted.
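
As a rough sketch of that ordering (hypothetical function names, not the actual storagenode code), the node would write the piece, fsync it, and only then report success:

// Hypothetical sketch of the required ordering, not the actual storagenode code:
// the piece must be durably on disk before the node reports success.
package main

import (
	"log"
	"os"
)

// storePiece writes the piece data and forces it out of the OS page cache.
// Only if it returns nil is it safe to acknowledge the upload.
func storePiece(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // fsync: the data survives a crash from here on
		f.Close()
		return err
	}
	return f.Close()
}

func main() {
	if err := storePiece("piece.sj1", []byte("piece bytes")); err != nil {
		log.Fatal(err) // do NOT report success: the piece is not durable
	}
	// Only at this point is it safe to return success to the uplink/satellite.
}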

-t 300 should absolutely not be required. If it is required then this is a serious bug that needs to be remedied ASAP.

3 Likes

As many pointed out, if shutting down a node (even force-shutting it down) creates problems or corrupts the database, it should be considered a bug or a flaw in the software design. Every transaction should be atomic and leave the node and satellite in the same (or a recoverable) state. But I guess there have been performance optimizations that make a lot of operations take place in RAM and only get flushed to disk once in a while. So when you shut down your node there is still a lot to flush to disk, and if the node is forcefully shut down or killed, all those transactions are lost. This is, however, just a wild guess.

2 Likes

Upload transactions must be persisted to disk before the node returns success. This is a fundamental principle of durability; if this is not true of the node software due to transient write caching, then the system is not durable.

The only acceptable failure mode is that the node has persisted the uploaded data but then is killed and cannot report success back to the uploader. In this case, the node has extra data that it doesn’t need, but this doesn’t cause data loss which is what we’re talking about in this thread. (This problem would have to be solved by periodically auditing node contents against a list of what it should have and throwing out data that it shouldn’t have.)

The Two Generals' Problem means there is no perfect solution; we have to choose between “node doesn’t have data it should” and “node has extra data it shouldn’t.” The second of those options is the less problematic, so most systems that try to provide atomicity in unreliable circumstances will ensure that the first never happens, at the expense of allowing the second.
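
Since the chosen failure mode is “node has extra data it shouldn’t,” a cleanup sweep along the lines mentioned above could look roughly like this. This is only an illustration with made-up names and layout, not Storj’s actual garbage-collection protocol:

// Hypothetical illustration of the "node has extra data it shouldn't" cleanup,
// not Storj's actual garbage-collection protocol: delete stored pieces that
// are not in the satellite's list of expected piece IDs.
package main

import (
	"log"
	"os"
	"path/filepath"
)

func sweepOrphans(dir string, expected map[string]bool) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() || expected[e.Name()] {
			continue
		}
		// Stored but never acknowledged to the satellite, so it is safe to drop.
		if err := os.Remove(filepath.Join(dir, e.Name())); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	expected := map[string]bool{"piece-a.sj1": true, "piece-b.sj1": true}
	if err := sweepOrphans("./blobs", expected); err != nil {
		log.Fatal(err)
	}
}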

1 Like

I agree that the satellite should never think pieces are there when the transfer was interrupted. However, the timeout would still be required to make sure data is not being written to info.db when the process gets interrupted, to prevent corruption. I’m not sure there is a whole lot you can do to prevent that without increasing the time to process these changes.

As for power failure, there is nothing you can do to completely prevent damage when power fails while data is being written to HDDs. That’s what UPSes are for.

1 Like

Yes, but a UPS is instructed to start shutting down the NAS when power is interrupted. Even if you set the shutdown to begin 5 minutes after the power failure, that signal doesn’t flow down to the Docker container, which will ultimately be killed abruptly after those 5 minutes.

A transaction should never exist only in cache or memory while the satellite already believes the node has received the data; the node should only confirm once the piece has been downloaded and written to file.

@Alexey, it would be great to see an official response from the StorJ team on the details we’ve discussed here in this thread and what is being done to improve and prevent this issue :+1:

1 Like

The database is SQLite. As long as transactions are correctly used, the database cannot be corrupted by interrupting a write; SQLite keeps either a rollback journal or WAL, both of which guarantee that interrupting a write will leave the database in a fully consistent state.

In WAL mode, SQLite will write details about the committed transaction to the WAL and then return success to the caller; if the write process to the actual database file is interrupted at that point, the WAL will be replayed the next time the database is opened, applying the transaction’s changes to the database file.

As long as the storagenode software waits for the SQLite library to return success when committing the transaction(s) before returning success to the satellite, everything should be fine.

So the question is: does the storagenode software do this? If it does not, the database structure itself will not be damaged, but it’s possible for records to be in a semantically inconsistent state.
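
For illustration, a minimal sketch of that ordering, assuming the mattn/go-sqlite3 driver and a made-up orders table (not the actual storagenode schema): the node should only acknowledge to the satellite after the commit has returned successfully.

// Minimal sketch, assuming the mattn/go-sqlite3 driver and a made-up orders
// table (not the actual storagenode schema): enable WAL and commit inside a
// transaction; only acknowledge to the satellite after Commit returns nil.
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver
)

func main() {
	db, err := sql.Open("sqlite3", "info.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// WAL keeps committed transactions in a write-ahead log that SQLite replays
	// automatically the next time the database is opened after a crash.
	if _, err := db.Exec(`PRAGMA journal_mode=WAL;`); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS orders (piece_id TEXT, amount INTEGER)`); err != nil {
		log.Fatal(err)
	}

	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	if _, err := tx.Exec(`INSERT INTO orders (piece_id, amount) VALUES (?, ?)`, "examplepieceid", 2319360); err != nil {
		tx.Rollback()
		log.Fatal(err)
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
	// The transaction is durable here; only now report success upstream.
}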

Let’s forget about storagenode for a few minutes and look at other Linux applications that can produce heavy disk load, for example MySQL. Say you have a lot of SELECTs and UPDATEs on a single HDD, your disk response time is high, and suddenly you (or your UPS) decide to shut MySQL down. MySQL will wait to complete the current transactions and flush all its data from RAM to disk (keep in mind, your disk was already under load before the shutdown). All of these operations take time (much more than 10 seconds).
Also look at Linux service startup/shutdown scripts: how much waiting time do you see there? :wink:

So, back to storagenode: could you please tell me more about what Storj Labs should fix? :slight_smile: (keep in mind the Linux services that produce heavy disk load)

That is an oversimplification. Most services just have a generous timeout as a safety margin; they would have no problem with less time.
Of course MySQL also tries to finish its current transactions, but it wouldn’t cause a problem if you gave it less time. Typically MySQL shouldn’t have any data in RAM that is not already synced to disk; this is by design.
Also, where does the heavy load from STORJ come from? You have a parallel upload limit of maybe 15 pieces. Now don’t tell me the software needs more time to process the previous 15 uploads into the database than it takes to download the 15 new pieces… because that would cause a serious backlog.
Downloads from the node will take even less time to process.

I see no reason why it shouldn’t be possible to just kill storagenode and have everything stay fine. All orders not written to the DB never sent a confirmation to the satellite, so as far as the network is concerned your upload just failed. For downloads it is even easier, because your node wasn’t able to finish the transfer and therefore didn’t get a confirmation.
So where is the heavy load and complex problem that leads to a 5 minute shutdown window and possibly database corruption?
It would be great if a dev could explain this.

3 Likes

I know WAL mode is being used, yet we have still seen nodes running into db corruption.

@Odmin We aren’t talking about a graceful shutdown here, so your thought experiment is not valid.

Instead, consider what happens if we SIGKILL mysqld, or there is a power cut. The journal file(s) kept for each database/table ensure that any data mysqld has claimed was successfully committed will 100% be there. If it didn’t make it into the table files themselves, it’s in the journal which will be replayed when mysqld starts up next time. Any transactions for which success was not reported might be in the journal (if mysqld was killed after writing the data to the journal but before returning success) but under no circumstances is a successfully-committed transaction lost (absent hardware failure or a mysql bug).

SQLite has a similar journal mechanism that should prevent database corruption even if the host process is forcibly terminated, and guarantee that any committed transaction is in persistent storage; whether that is the journal or the database file is not relevant, as journal recovery is automatic. (If that does not happen, then there is either a bug in SQLite or some other issue caused the corruption, such as a failing storage device or failing RAM.)

@BrightSilence I’m not ruling out a bug in SQLite, but it seems more likely that there is a hardware issue at play. I guess it depends on what you mean by “corruption.” If the disk file won’t even load due to a malformed structure, that is a SQLite bug or a symptom of failing hardware. If the database will load but there’s an inconsistency in the data, that is storagenode’s fault for not properly using transactions to group statements that need to all be executed for the data to remain consistent.
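
One way to tell those two cases apart is SQLite’s built-in integrity check; here is a diagnostic sketch using the same assumed driver as above (not an official Storj tool). “ok” means the file structure is sound, while a semantic inconsistency would pass this check and has to be validated by the application itself.

// Diagnostic sketch using the same assumed driver as above: SQLite's built-in
// integrity check reports structural damage ("malformed" databases), while a
// semantic inconsistency would pass this check.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver
)

func main() {
	db, err := sql.Open("sqlite3", "info.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var result string
	// Returns "ok" when the file structure is sound; errors such as
	// "file is not a database" surface at query time.
	if err := db.QueryRow(`PRAGMA integrity_check;`).Scan(&result); err != nil {
		log.Fatal(err)
	}
	fmt.Println("integrity_check:", result)
}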

1 Like

Both “database image is malformed” and “file is not a database” errors have happened just by stopping the container without giving it enough time to finish. It could be replicated in testing as well, so it is not a hardware issue for sure.

3 Likes

Hmm, this should definitely be investigated then. This should not be possible with SQLite if configured properly.

1 Like

@littleskunk

I found another failed audit, and after searching the logs there’s a delete entry followed by an audit entry.

 sudo docker logs storagenode 2>&1 | grep "3HLJ43IJTTMWLAACFQZSASKOQ4VUBRUHXA7ZP2GY532QNI4CQOSQ"

2019-09-21T16:41:19.023Z        INFO    piecestore      deleted {"Piece ID": "3HLJ43IJTTMWLAACFQZSASKOQ4VUBRUHXA7ZP2GY532QNI4CQOSQ"}
2019-09-21T17:09:48.349Z        INFO    piecestore      download started       {"Piece ID": "3HLJ43IJTTMWLAACFQZSASKOQ4VUBRUHXA7ZP2GY532QNI4CQOSQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2019-09-21T17:09:48.350Z        INFO    piecestore      download failed {"Piece ID": "3HLJ43IJTTMWLAACFQZSASKOQ4VUBRUHXA7ZP2GY532QNI4CQOSQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = file does not exist"}

2 Likes

12345678901234567890 (20 chars to be able to post this)

5 Likes

Hi @littleskunk,
I’ve been monitoring the node logs since the update to 21.3 and I’ve found another 2 failed audits, but no record of those pieces being uploaded. Any ideas?

sudo docker logs storagenode 2>&1 | grep GET_AUDIT | grep "download failed"
2019-09-27T09:23:07.573Z        INFO    piecestore      download failed {"Piece ID": "MDJSSB7LKHHAJ7KHQF45HXB72N7CPJLSWKUX4MD7I44ASULCNNWQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = file does not exist"}
2019-09-29T09:04:47.372Z        INFO    piecestore      download failed {"Piece ID": "I5S2NGOD3PQQC2CNOZ5MHPYX5BRWJ4LOWL3IRXWNARN6JRHL3AQQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = file does not exist"}


sudo docker logs storagenode 2>&1 | grep "MDJSSB7LKHHAJ7KHQF45HXB72N7CPJLSWKUX4MD7I44ASULCNNWQ"
2019-09-27T09:23:07.572Z        INFO    piecestore      download started        {"Piece ID": "MDJSSB7LKHHAJ7KHQF45HXB72N7CPJLSWKUX4MD7I44ASULCNNWQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2019-09-27T09:23:07.573Z        INFO    piecestore      download failed {"Piece ID": "MDJSSB7LKHHAJ7KHQF45HXB72N7CPJLSWKUX4MD7I44ASULCNNWQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = file does not exist"}

sudo docker logs storagenode 2>&1 | grep "I5S2NGOD3PQQC2CNOZ5MHPYX5BRWJ4LOWL3IRXWNARN6JRHL3AQQ"
2019-09-29T09:04:47.371Z        INFO    piecestore      download started        {"Piece ID": "I5S2NGOD3PQQC2CNOZ5MHPYX5BRWJ4LOWL3IRXWNARN6JRHL3AQQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT"}
2019-09-29T09:04:47.372Z        INFO    piecestore      download failed {"Piece ID": "I5S2NGOD3PQQC2CNOZ5MHPYX5BRWJ4LOWL3IRXWNARN6JRHL3AQQ", "SatelliteID": "118UWpMCHzs6CvSgWd9BfFVjw5K9pZbJjkfZJexMtSkmKxvvAW", "Action": "GET_AUDIT", "error": "rpc error: code = NotFound desc = file does not exist"}

You could have lost the files before the update. AFAIK the logs get deleted on every update if you don’t redirect them to a file.

1 Like

I suppose this will continue to happen even if the bug has been fixed; since the bug already created the issue, it will keep appearing unless the data records on the nodes, or the data itself, are cleaned up.

Negative, I’ve reviewed the logs from before the update and there were no failed audits.