What you could try is to check and verify the values directly in the database. When the node is switched off and the value in the database is 11.2 TB, then indeed when the node starts it should show 11.2 TB.
Also, what you could try is to move all databases into a new folder (while the node is offline, of course), let it create everything from scratch, and let it run to see if the error reproduces.
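Something along these lines, as a rough sketch (a docker setup is assumed; the container name and paths are examples, so adjust them to yours):

```bash
# Stop the node first: the databases must not be in use.
docker stop -t 300 storagenode

# Move the existing .db files aside rather than deleting them,
# so they can be restored if this doesn't help.
mkdir -p /mnt/storj/db-backup
mv /mnt/storj/storage/*.db /mnt/storj/db-backup/

# On the next start, the node recreates empty databases from scratch.
docker start storagenode
```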
So, looking for non-lazy filewalker progress, I grepped the output of ls -l /proc/STORJPROCESSID/fd, using the main process ID. I found only two lines with "blobs" that seemed to match what I'd look for in the lazy version.
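For reference, roughly what I ran (the process name and paths will differ per setup):

```bash
# Find the main storagenode process ID (the oldest matching process,
# i.e. not a lazy-filewalker child).
STORJPROCESSID=$(pgrep -o storagenode)

# List its open file descriptors and keep only the ones under blobs/;
# the open directory shows which two-letter prefix it is scanning.
ls -l /proc/"$STORJPROCESSID"/fd | grep blobs
```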
So. According to this, it would seem it's still working on aa.
And it's been on aa since I restarted the node. About 30 minutes now.
It's even slower than the lazy version.
Utterly doomed.
EDIT: A couple hours later, it's on ag. Down to 611 GB of free space on the disk.
So, so doomed.
EDIT 2: So, just as a test, I shut down my other process to see if this non-lazy walker would get any faster. Nope, at least not in any perceptible way. It's been stuck on "am" for at least 15 minutes since I did that. I seriously think the lazy one was faster.
As for this advice of moving the databases and letting them be recreated… wouldn't it still need to do a full filewalker run in time for that to help?
Yeah, I don't think that would help then. We've confirmed my databases are fine (at high cost). The issue now is that there is no way to get the filewalkers to correct the incorrect stats soon enough to prevent the node from trying to write to a completely full disk in, I estimate, around 18 hours.
I'm sorry about that. Then you need to adjust the setup.
This is good: fewer places to check.
So I would assume that CPU usage is not the issue here; only the disk is slow for some reason.
I'm sorry again, but I have no other ways to check the databases' integrity, because resetting the usage is always related to problems with a database (either malformed, corrupted, or locked).
But at least it doesn't generate "database is locked" errors, right?
Could you please check that? I would like to do it myself, but my very bad Windows setup doesn't have this issue to check…
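For a docker node, something like this should do it (the container name is an example):

```bash
# Count "database is locked" errors in the node's log since
# the container was created.
docker logs storagenode 2>&1 | grep -c "database is locked"
```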
We have precautionary checks there. The node checks the actual free space on every upload. If there is 5 GB or less of free space, either in the allocation or on the disk, the node will report to the satellites that it's full, so uploads will stop eventually.
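As a rough shell analogue of that check (the real check happens inside the node; the mount path here is an assumption):

```bash
# Free bytes on the disk holding the storage (example path).
FREE=$(df --output=avail -B1 /mnt/storj | tail -n 1)

# 5 GB threshold; the node applies it to both the disk
# and the remaining allocation.
if [ "$FREE" -le $((5 * 1000 * 1000 * 1000)) ]; then
  echo "node would report itself full to the satellites"
fi
```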
Likely yes, because it will run with normal priority (affecting uploads, but after the latest changes the satellites will automatically reduce the upload rate if your node's success rate starts falling).
Of course; it was not able to update the databases last time.
Maybe. But why does only this exact setup, out of thousands, have this issue? Something is still not revealed.
I guess that for some reason the disk is not able to keep up. What about S.M.A.R.T.? Is it OK?
It sounds from your error description that either the filewalker does not correctly update the databases or your databases cannot hold the values over a node restart. I would not call that fine. I have just restarted one of my nodes and could not observe the issue that you have mentioned:
All my values remain exactly as before the restart.
SMART is perfectly fine. I GAVE YOU THE SPEC SHEETS FOR THE DRIVE. It is a high-speed enterprise-class drive that I used for other purposes for two years before putting it on Storj. This drive can easily do 240 MiB/s+, 24/7. It is literally one of the best drives in the world for this.
I also noted that the non-lazy walker is NOT running any faster than the lazy one; in fact it seems to be going slower. But, yeah, good idea on checking for database locks.
And look at that! 20 database locks since I restarted with the non-lazy filewalker a few hours ago!
Must be my setup that's the problem.
It is obvious that there is literally no amount of information I can provide that will convince you that this is a perfectly normal node running on specs that exceed all recommendations, configured by someone who has an EXTREME amount of experience with storage and has NEVER had anything remotely like this happen in any of half a dozen other storage-based cryptos or other uses.
It can't be a bug on your end. It has to be my setup. No matter what.
Yes, you gave them. I do not trust any Seagate, but that's my personal problem.
However, your case only strengthens my opinion about Seagate HDDs. But perhaps I'm wrong and they actually work well for someone. Just not for you and me, it seems.
This is an interesting result; I wouldn't call it expected.
Then I think our experiment is over. You shouldn't have any.
In this exact case, the only solution is to move the databases off this disk.
I have over 20 other drives of this exact same model. I showed you a screenshot of several of them easily running over 240 MB/s, 24/7. They have done so continuously for months. And I have similar drives doing the same in 4 other machines.
Amazing that the one out of 20 or so disks of this make and model, which easily passes SMART and all testing and evaluation under any other circumstance and load, is the only one with any problem. What are the odds, right?
This doesn't matter. It cannot keep up; you convinced me. So the only solution is to move the databases off it to any other drive, one that won't get stuck updating them with locks.
I have a couple other friends whom I (now regret having) convinced to join me in running a Storj node.
Earlier today, I asked them to grep for "database locked". They get those messages. Not as many as me, but quite a few a day.
You certainly seem to suggest that that should never happen. But 3 out of 3 are getting them fairly consistently. I just happen to be the one getting it the worst out of the 3.
Must be my fault.
There is no point in reporting bugs on this forum. I've run support forums for projects I've been involved in over the decades, and I have NEVER been so dismissive of someone who's posted so much information confirming he has done everything possible to comply. I'm so done.
Yes, and it's proven on my far-from-optimal setup. It doesn't have them…
Unfortunately I cannot check that on the RPi 3, because the SD card is not bootable and that setup is in a different country…
But I cannot use my case as a template anyway.
The other 20k nodes can.
Unlikely. More likely this setup is just struggling with the current load pattern. But you said this problem existed for a while before the load tests started.
So, just an unhappy case: some combination of hardware and software.
Could you please try to move the databases to another disk (preferably one on your HBA, or an SSD)?
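A sketch of what that looks like for a docker node, using the storage2.database-dir option (the paths and container name are examples, and you should keep all of your existing docker run options):

```bash
# 1. Stop and remove the container; databases must not be in use.
docker stop -t 300 storagenode
docker rm storagenode

# 2. Copy the databases to the new disk (example path on an SSD).
mkdir -p /mnt/ssd/storagenode-dbs
cp -p /mnt/storj/storage/*.db /mnt/ssd/storagenode-dbs/

# 3. Re-create the container with an extra mount for the databases,
#    pointing storage2.database-dir at it. Keep all of your other
#    options (identity/storage mounts, ports, env vars) as they were.
docker run -d --name storagenode \
  --mount type=bind,source=/mnt/ssd/storagenode-dbs,destination=/app/dbs \
  storjlabs/storagenode:latest \
  --storage2.database-dir=/app/dbs
```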
And if I take the trouble to do that (and I'm not even sure how I would rewrite my docker run to split the storage files from the databases; is there a guide for this somewhere?), and the problem continues, is there ANY chance you'll concede that there's a bug in play here?
'Cause you were sure there was corruption before, I was sure there wasn't, I did what you said, and all it did was prove I was right, and it killed my node.
If there is the remotest possibility you'll acknowledge there's a bug, then okay.
If there isn't, then nah.
(BTW, it amazes me, reading through these forums and seeing tons of people struggling with similar issues, that you can act like all 20k other nodes are running just dandy. I just told you 3 out of 3 users are seeing these database locks, and you didn't even blink.)
By the way… I know I seem very sure that it's a bug, and you guys may not understand why. I'm not sure if I mentioned it in my Database Locking Madness thread, but I just remembered why I've been so sure, all this time, that it's a bug and not a performance issue.
Because, randomly, each time I restart, it changes which database gets ALL the locking.
One restart, EVERY SINGLE database lock error is on piece_expiration.db.
Another restart, EVERY SINGLE LOCK is on bandwidth.db.
I've been a programmer for the last 42 years. I know how to tell software bugs from performance issues.
If it were a performance issue, both databases would be affected on any given run.
But it never is. It's always one of those databases, or the other. Every single lock affects only one on any given restart of the node.
You tell me. Is that consistent with a performance issue? Or a software issue?
Feel free to blame it on a sqlite3 bug. But I know how to vet shit and my shit is vetted.
I do not see a bug, only a problem with this particular hardware, sorry.
I would say that losing the collected stats while the DB is unable to store them is behavior I'd personally consider not great. But the bug here is the hardware, which we cannot fix. However, a parameter to let the node know that we're OK with growing in RAM indefinitely, in the hope that at some point the disk won't be too busy to do the work it was designed for, could be a temporary solution (RAM is not infinite anyway).
Which, again, proves that there is a bug in this particular setup, because otherwise almost everyone would have this "bug", and that is not the case, I'm sorry. Even a Raspberry Pi doesn't have it; @ACarneiro can confirm.
I see. So it's a HARDWARE bug on my end that causes every single lock to be on one database, or the other database, on any given run.
HARDWARE can do that. That doesn't sound like software. It must be HARDWARE that reflexively chooses only one sqlite3 database to lock up on, randomly, each time the node is started.
This is not a problem with you. Please try to understand: some HW works way outside of standard expectations. It's not your fault. The HW has some issues, and we cannot fix them (at least I do not have enough knowledge to fix a disk that is too slow on writes without adding an additional layer of complexity, like an SSD layer).
Please, try to move the databases to another disk. Please. If it's too hard, OK, leave it as is, but it's not likely to solve itself.
Explain to me how a disk being slow on writes could possibly be slow on only one database in one run, and only slow on a different database the next run.