Fatal Error on my Node / Timeout after 1 min

I noticed something in the “trash” folder of the node (because it was terribly slow, not to access, but to gather the total size of its contents).

There are 6 folders, each containing 1024 subfolders with two-character names. Three of the six folders take a really long time when I want to know how big they are.
I got suspended on 3 satellites today along with the “timeout” error.
Can this cause the drive to take a long time when trash (not garbage) is deleted?
This was hidden by the long downtime a week ago.

My guess here:
1. The satellites start trash deletion / move-to-trash en masse.
2. The node tries to move pieces to trash and/or delete from trash (simultaneously?) for thousands of pieces, needs more than one minute, and times out. Since 1.75.2.
3. Other trash operations are canceled, causing “can not move to trash” etc., visible in the log at the same time.

Before 1.75.2, normal reads/writes filled the cache and RAM and got done maybe some minutes later, including audits.
Everybody knows Windows is bad at deleting or moving lots of small files, even on CMR disks.

@Alexey and @Knowledge explain everything away with “your disk has a problem”, “it’s only Windows”, “no audit problem”, or more questions?

Could it take too long to calculate the used space of the trash?

After 141 responses from different people with different issues, I’m not sure what the issue we are focusing on is.

Does deleting things out of trash or counting used space cause audit failures? I would think the problem would be more widespread if that were the case. I mean, if you have five nodes sharing one drive in five different VMs and they start doing heavy drive work, it seems logical that the drive can’t keep up with all of the virtual machines, so one of them is low man on the chain, gets the most wait states, and times out. The larger the data grows, the longer these processes take, and the more your nodes will suffer performance issues.

Now sure, you might have a high speed array that can handle fifty connections, I dunno. But generally speaking it’s not a good idea to segment a single drive into multiple nodes. File walker, deletes, or heavy traffic will cause it to take a beating.

But it won’t get better.
I mean this is an interesting phase with lots of ingress and nodes growing.
We are seeing SNOs purchasing 20TB disks to be able to offer more space.
Also we should expect more customers and additional satellite operators.
This will lead to heavier garbage collections, file walkers and I don’t know what else.
So maybe additional optimizations for heavier usage scenarios might become necessary.


If end users upload 1 KB files by the millions, it is difficult for any file system to manage the operations on so many files. Perhaps they could store smaller data inside a table instead of the file system. That may be more optimal for indexing but would still require drive ops, although one could overwrite expired records rather than delete them, which would eliminate a significant portion of the overhead.

Larger segments would still be better served on the file system. SQL Server has a feature called FILESTREAM that links the records with the file system. So you can either store data in the table directly (varchar(max)) or use FILESTREAM to keep it on the drive but linked to the table. It would be very quick to launch batch commands to gather properties or conduct operations in batches.
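
Just to illustrate the table-vs-filesystem idea (SQLite is used here purely as an example, since FILESTREAM is SQL Server specific, and every name and the size cutoff are invented), a minimal Go sketch might look like this:

```go
// Illustrative only: keep pieces below an arbitrary size cutoff as rows in a
// table, larger pieces as plain files. All names and the cutoff are invented.
package main

import (
	"database/sql"
	"os"
	"path/filepath"

	_ "github.com/mattn/go-sqlite3" // assumed SQLite driver for the example
)

const smallPieceLimit = 64 * 1024 // example cutoff, not a real Storj value

func storePiece(db *sql.DB, blobDir, pieceID string, data []byte) error {
	if len(data) <= smallPieceLimit {
		// Small piece: one row insert/overwrite instead of creating a file.
		// Expired rows could later be overwritten instead of deleted.
		_, err := db.Exec(
			`INSERT OR REPLACE INTO small_pieces (piece_id, data) VALUES (?, ?)`,
			pieceID, data)
		return err
	}
	// Large piece: the filesystem still handles big sequential blobs well.
	return os.WriteFile(filepath.Join(blobDir, pieceID), data, 0o644)
}

func main() {
	db, err := sql.Open("sqlite3", "pieces.db")
	if err != nil {
		panic(err)
	}
	defer db.Close()
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS small_pieces
		(piece_id TEXT PRIMARY KEY, data BLOB)`); err != nil {
		panic(err)
	}
	if err := storePiece(db, ".", "example-piece", []byte("tiny payload")); err != nil {
		panic(err)
	}
}
```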


There are probably still tons of things that could be done code-wise to improve performance, I don’t know.
From what I remember (I don’t know if it has been implemented or not), there were suggestions to run things like the filewalker with lower priority, for example. In general it sounds like a bad idea to run such things when the load is high anyway.
The other thing that has been mentioned (again, I don’t know if it is the case) is that file deletion is a copy operation instead of a move, causing more IOPS than necessary.
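
For what it’s worth, a rename inside the same filesystem is a single metadata operation, while copy-then-delete has to read and rewrite the data and then unlink the original. I don’t know what the storagenode code actually does; this is only a rough Go sketch of the cheap path, with invented paths:

```go
// Illustrative only: moving a piece to trash via rename is one metadata
// operation on the same filesystem; copy-then-delete reads and rewrites
// the data and then unlinks the original. Paths are invented.
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

func moveToTrash(piecePath, trashDir string) error {
	dst := filepath.Join(trashDir, filepath.Base(piecePath))
	if err := os.Rename(piecePath, dst); err == nil {
		return nil // cheap path: same filesystem, no data copied
	}
	// Fallback for cross-filesystem moves: copy, then delete.
	src, err := os.Open(piecePath)
	if err != nil {
		return err
	}
	defer src.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, src); err != nil {
		out.Close()
		return err
	}
	if err := out.Close(); err != nil {
		return err
	}
	return os.Remove(piecePath)
}

func main() {
	fmt.Println(moveToTrash("blobs/example-piece", "trash"))
}
```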

Generally I have to repeat that an SNO has no influence on the load scenarios, nor can he foresee what load will hit his nodes. Therefore I believe the node software would need to adapt to different scenarios dynamically, for example by reducing uploads when a node’s performance gets worse.
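
Just as a thought experiment on that last point (nothing here reflects actual storagenode code; the threshold and names are made up), a crude load-shedding check could look like this:

```go
// Thought experiment only: refuse new uploads while the most recent disk
// write took longer than a threshold. Names and numbers are invented.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var lastWriteLatency atomic.Int64 // duration of the most recent write, in ns

const maxAcceptableLatency = 2 * time.Second // made-up cutoff

func recordWrite(d time.Duration) { lastWriteLatency.Store(int64(d)) }

func acceptUpload() bool {
	return time.Duration(lastWriteLatency.Load()) < maxAcceptableLatency
}

func main() {
	recordWrite(300 * time.Millisecond)
	fmt.Println("accept uploads:", acceptUpload()) // true: disk keeps up

	recordWrite(5 * time.Second)
	fmt.Println("accept uploads:", acceptUpload()) // false: shed load
}
```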


Hello again. After spending a couple of days with the node left as it was before the errors, after 24 hours it started again with the error:

(FATAL Unrecoverable error {“error”: “piecestore monitor: timed out after 1m0s while verifying writability of storage directory”, “errorVerbose”: “piecestore monitor: timed out after 1m0s while verifying writability of storage)

It doesn’t matter how much waiting time you configure, because the error will happen either within 1 hour or after 6 hours.

I have observed that after about an hour and a half, with the hard drive at 5-20% usage, there is a moment when the service goes to 100% CPU. Sometimes it lasts a few seconds (5-10), and other times it takes more than 1 minute and causes the error. It must be said that everything is blocked (File Explorer, access to hard drives, etc.) until the service gives the error and the CPU goes to 0%.

I have changed the intervals as indicated by @Alexey in the config.yaml without success, so I preferred to leave everything as it was by default. I have also tried different connections without success: USB 3.0 and eSATA.

The solution that I am applying to prevent the node from failing and getting suspended… is to indicate in config.yaml that the node is already full. That way there is no error and it works correctly. This node is 6 months old and had not given any problems until 03/24/2023. It runs under Windows 10, kept updated, and only provides the node service.
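
In case it helps anyone, the “pretend it is full” workaround is just lowering the allocated space in config.yaml below what the node already uses, for example (the value is only an example for a node that already holds more than this):

```yaml
# config.yaml: set the allocation below the space the node already uses,
# so it stops accepting new uploads (the value is only an example)
storage.allocated-disk-space: 500.00 GB
```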

What I don’t understand is that before, everything was at 100% (suspension, audits, online) and it worked perfectly; now these verifications have been added, they overwhelm the node, and the service crashes. This endangers the integrity of the node.

I would like to avoid having to migrate the node to more powerful hardware in order to maintain the service.


I’ll take a rough guess at the ops problem based on my 6 TB, 60% full node:

3 folders adding up to a low three-digit number of gigabytes of data (I ignore the folders that are only a few MB).
3 x 1024 subfolders, each containing ~170 files (of those, maybe 15% are around 2 MB,
but 60% are less than 10 KB), and only about 1/7 of them to delete a day…

Ohhhh, that would be around 45,000 small files to delete a day (plus the same amount to move to trash), in less than 1 minute.
My guess is this could be the problem connected to the timeout
(if I’m not dead wrong).

ext4 maybe handles this better, so Linux nodes don’t have the problem YET.

Also, since the end user’s files get split into a minimum of 30 pieces, each file smaller than 120 KB results in pieces of less than 4 KB; even 240 KB files make a mess of small files.
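
Rough numbers, using the “minimum 30 pieces” assumption from above (the actual erasure-coding parameters may differ):

```go
// Rough arithmetic for the piece-size claim, using the "minimum 30 pieces"
// assumption from the post (actual erasure-coding parameters may differ).
package main

import "fmt"

func main() {
	const pieces = 30.0
	for _, fileKB := range []float64{120, 240, 1024} {
		fmt.Printf("%6.0f KB file -> ~%.1f KB per piece\n", fileKB, fileKB/pieces)
	}
}
```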

Until now I have no audit errors; suspension gets hit by about 5% on busy satellites, and the online score slowly goes down.

Audits also come at times when the node is up again, since the 3 x 5 min audit approach can be served better than the 1 min timeout, I think.

Also, if I had simultaneous trouble with 141 out of ~20,000 people after an update
(and those are only the non-disqualified, caring people who make a forum account for this, so the real numbers are still higher),
I would roll back quicker than 1 minute and then release feature after feature to test which one is the problem (timeout / trash / filesystem combo) or others.

I saw such behavior when something tried to write or read data from a sector with a bad block: the whole system hangs for several seconds (up to 5 minutes) and even the mouse starts to move in jumps. So it looks very suspicious. Did S.M.A.R.T. detect anything?

25 users actually participate in this thread, and I also have 1 support ticket with a similar problem.
We need to figure out what is causing the disk to respond slower than usual. Also, if there were a bug, I would expect all of @Vadim’s nodes to have the same issue.

I had something like this several times on some nodes, but the last 3-4 days have been OK. I monitor the situation all the time, because today ingress rose 1.2-1.5x compared to the weekend.


The mouse moves, I can connect via TeamViewer, I can use taskmgr to kill processes, etc. Even though the CPU is at 100% (50-70% of it is the storage node service), it won’t open File Explorer because it cannot access the hard drive. (Sorry if I didn’t explain myself correctly.) If I kill the process with taskmgr, I can access the hard drive again, copy, paste, etc.

I have done a chkdsk /f /r and the disk is healthy, 0 errors, and with the CrystalDiskInfo application there are no errors either. What’s more, in this application I have enabled email notifications when it detects a hard disk error, to avoid a disaster and losing the entire node.

If you ask about the temperature the disk reaches, it is between 36-38ºC when the disk is at 5-20% usage and 40-42ºC when it is at 100%. The disk has about 2000 power-on hours; it is relatively new.

The antivirus is Norton, with exceptions so that it does not scan the disk where the node is located and the application is fully trusted. (I have also tried with Norton disabled and it behaves the same.)

Do you want me to do a specific test?

I’m adding an option that will make the node log an error when the verification checks fail, rather than killing the node. If your main problem is that your node keeps dying because of the verification check and you don’t want it to do so, this option will help.

https://review.dev.storj.io/c/storj/storj/+/10073

If you enable this option and your node continues to lock up while performing the check, that will suggest something is wrong with the nature of the check itself.
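
To make it clearer where the timeout bites, here is a conceptual Go sketch of a timeout-bounded writability check; it is not the actual storagenode implementation, just the general shape of writing a small marker file under a deadline:

```go
// Conceptual sketch of a timeout-bounded writability check, to show where a
// slow disk turns into an error: if writing a small marker file does not
// finish within the deadline, the caller gives up even though the disk may
// still complete it later. This is not the actual storagenode code.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func verifyWritable(ctx context.Context, dir string) error {
	done := make(chan error, 1)
	go func() {
		p := filepath.Join(dir, ".writability-check")
		err := os.WriteFile(p, []byte("ok"), 0o644)
		if err == nil {
			err = os.Remove(p)
		}
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("timed out while verifying writability of %s: %w", dir, ctx.Err())
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	fmt.Println(verifyWritable(ctx, "."))
}
```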


So, any updates? I still have the same issues; every couple of hours the Storj node service is down.

please refer to Fatal Error on my Node - #158 by thepaul


In the meantime, you can downgrade to 1.74.1 and the issues go away.

Do you have any other HDDs on this PC? Please check all of them; I have seen one broken HDD make the whole PC lag.


Right now I only have 2 drives:

SSD: 460 GB for the operating system
HDD: 6 TB for the node

RAM: 8 GB

Both disks are perfect: 0 errors in chkdsk and CrystalDiskInfo.

I have another 12TB hard drive (it is not connected to the PC) that I have waiting for the node to fill up, to expand or create a new node.

I can try to migrate the data from the 6TB drive to the 12TB one, but it will take a few days…

Or I can create a 64 GB partition on the SSD dedicated to caching with PrimoCache.

I can also leave the node marked as full through the config.yaml and wait for the new version to see if it stops collapsing.

Never cache on the OS SSD, just never; it will kill it and kill Windows performance.
What disk model do you have in the node?

How? Is there a manual?