Support for problematic node

These ones:

however the last line was

so there was no indication of a crash. But if your service is stopped, then it did crash.
When the node crashes, it usually posts a reason to the log. There should be Unrecoverable and/or FATAL errors.
Please search for these errors in your logs and post the most recent ones.

I do not know if there were.
How do I do that?

Hmmm… the drive is acting reasonably well; I presume those were IOPS you posted. Regardless of whether the node is running or not, that's OK for a 2 or 3 year old WD or Seagate 8-10 TB: a bit janky on the read I/O, but it's not some new enterprise HDD either.

Check your Windows System logs for 'service stopped unexpectedly' errors, 'the service has done this N times', etc. Given that your PowerShell is not running correctly, and the lack of FATAL entries in your logs, it seems your storagenode.exe is being killed off, and that also points to overtaxed memory management.
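To dig those entries out of the System log, a quick PowerShell sketch might look like this (event IDs 7031/7034 are the Service Control Manager's "terminated unexpectedly" events; the `-match 'storagenode'` filter is an assumption based on the service name and may need adjusting):

```powershell
# Recent Service Control Manager entries about services that terminated
# unexpectedly (IDs 7031/7034), filtered down to the storagenode service
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'Service Control Manager'; Id = 7031, 7034 } |
    Where-Object { $_.Message -match 'storagenode' } |
    Select-Object -First 10 TimeCreated, Message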

You have to suspect a system issue if you can't even run a simple PowerShell command, obviously. And you really should post your system specs and provide full details in this thread; anyone coming across it otherwise has nothing to reference.

My 1/8 of a cent.

Try running it in a regular PowerShell, not as an administrator:

Select-String "Unrecoverable" "C:\Program Files\Storj\Storage Node\storagenode.log" | Select -Last 10

If you still receive a >> prompt (which usually indicates a missing quote), then I would suggest updating PowerShell to the latest version.

I do not receive a “>>” prompt.

Yes, that is what was posted.

A search found nothing relevant for those event IDs. My PowerShell is not running incorrectly. There are (perhaps were) FATAL errors in the logs. I only ever see memory usage at about half. Is there a specific process to check the memory management for STORJ?

I don’t, as the increased issues only arose after following directions from the forum.

CPUS: Intel Xeon E5540 @ 2.53 GHz (x2)
MEMORY: 72 GB
OS: Windows Server 2019 (64-bit)

But I guess it also didn’t return anything about Unrecoverable errors?
If so, please search for FATAL ones:

Select-String "Unrecoverable|FATAL" "C:\Program Files\Storj\Storage Node\storagenode.log" | Select -Last 10
C:\Program Files\Storj\Storage Node\storagenode.log:15017048:2024-07-24T03:02:34-04:00  FATAL   Unrecoverable error    {"error": "piecestore monitor: timed out after 1m0s while verifying writability of storage directory", "errorVerbose": "piecestore monitor: timed out after 1m0s while verifying
writability of storage directory\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2.1:175\n\tstorj.io/common/sync2.(*Cycle).Run:160\n\tstorj.io/storj/storagenode/monitor.(*Service).Run.func2:164\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
C:\Program Files\Storj\Storage Node\storagenode.log:17257046:2024-07-24T21:42:18-04:00  INFO    piecestore      upload started                                                                                                                                                                     {"Piece ID": "ZZUXCFATALWYWP22WEXVPSTSDYEASA5Q75HDCYHIITHNK6BKWYYA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote
Address": "79.127.219.35:33684", "Available Space": 2699265506872}
C:\Program Files\Storj\Storage Node\storagenode.log:17257088:2024-07-24T21:42:18-04:00  INFO    piecestore      uploaded                                                                                                                                                                           {"Piece ID": "ZZUXCFATALWYWP22WEXVPSTSDYEASA5Q75HDCYHIITHNK6BKWYYA", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Action": "PUT", "Remote Address":
"79.127.219.35:33684", "Size": 4203520}
C:\Program Files\Storj\Storage Node\storagenode.log:18944438:2024-07-25T09:33:54-04:00  FATAL   Unrecoverable error    {"error": "Error opening database on storagenode: database: piece_spaced_used opening file \"C:\\\\Program Files\\\\Storj\\\\database\\\\piece_spaced_used.db\" failed: context ca
nceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:364\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openExistingDatabase:341\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:316\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:281\n\tmain.cmdRu
n:65\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/commo
n/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78", "errorVerbose":
"Error opening database on storagenode: database: piece_spaced_used opening file \"C:\\\\Program Files\\\\Storj\\\\database\\\\piece_spaced_used.db\" failed: context canceled\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabase:364\n\tstorj.io/storj/storagenode/storagenodedb.(*DB)
.openExistingDatabase:341\n\tstorj.io/storj/storagenode/storagenodedb.(*DB).openDatabases:316\n\tstorj.io/storj/storagenode/storagenodedb.OpenExisting:281\n\tmain.cmdRun:65\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411
\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.
ExecWithCustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78\n\tmain.cmdRun:67\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:393\n\tstorj.io/common/process.cleanup.func1:411\n\tgith
ub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tstorj.io/common/process.ExecWithCustomConfigAndLogger:77\n\tstorj.io/common/process.ExecWith
CustomConfig:72\n\tstorj.io/common/process.Exec:62\n\tmain.(*service).Execute.func1:107\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}
C:\Program Files\Storj\Storage Node\storagenode.log:19027532:2024-07-25T10:07:25-04:00  INFO    piecestore      upload started                                                                                                                                                                     {"Piece ID": "S7QTE4PE7PUJFK4IFATALFMW3J74R3D223DEUXWYZXGOKASR5KAA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote
Address": "79.127.219.44:46638", "Available Space": 2604476126188}
C:\Program Files\Storj\Storage Node\storagenode.log:19027546:2024-07-25T10:07:25-04:00  INFO    piecestore      uploaded                                                                                                                                                                           {"Piece ID": "S7QTE4PE7PUJFK4IFATALFMW3J74R3D223DEUXWYZXGOKASR5KAA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address":
"79.127.219.44:46638", "Size": 247296}
C:\Program Files\Storj\Storage Node\storagenode.log:24829455:2024-07-26T23:11:00-04:00  INFO    piecestore      upload started                                                                                                                                                                     {"Piece ID": "RMFBFLHSNUHLPU3FATALULJ4VTQUXMOAZZQQCRGMDF42IMGZ7RPQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote
Address": "79.127.205.234:37904", "Available Space": 2171213621362}
C:\Program Files\Storj\Storage Node\storagenode.log:24829456:2024-07-26T23:11:00-04:00  INFO    piecestore      uploaded                                                                                                                                                                           {"Piece ID": "RMFBFLHSNUHLPU3FATALULJ4VTQUXMOAZZQQCRGMDF42IMGZ7RPQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address":
"79.127.205.234:37904", "Size": 36864}
C:\Program Files\Storj\Storage Node\storagenode.log:25634653:2024-07-27T05:31:56-04:00  INFO    piecestore      upload started                                                                                                                                                                     {"Piece ID": "BFATALX4ZKZTIK5GWJQEATKX2AOMIE2JO4VWSSC6UWTVNMLKNYCQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote
Address": "109.61.92.75:38382", "Available Space": 2031254319670}
C:\Program Files\Storj\Storage Node\storagenode.log:25634655:2024-07-27T05:31:56-04:00  INFO    piecestore      uploaded                                                                                                                                                                           {"Piece ID": "BFATALX4ZKZTIK5GWJQEATKX2AOMIE2JO4VWSSC6UWTVNMLKNYCQ", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT", "Remote Address":
"109.61.92.75:38382", "Size": 768}

So, disk issues. If your node waited a minute and couldn’t write to a directory: the disk is broken/read-only or very slow.

Read: 261.96
Write: 1023.19

Nothing indicates the disks are in either of those states. I am also not able to find specific performance requirements (HDD-wise), but I have noted that these results (above) are more than sufficient.

What are those numbers?

The writability check will time out if your disk subsystem is overwhelmed with I/O. An HDD can only sustain around 200 IOPS.

You need to offload as much I/O as possible. Review old threads on storage system optimizations. For example, you would need to disable sync writes, disable atime updates, disable any indexers and scanners, and move the databases to an SSD. That’s pretty much all you can do on Windows; it’s quite a limited platform.
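For the "move the databases to an SSD" part, the storagenode exposes a dedicated option in config.yaml; a sketch, assuming the current option name (the SSD path is an example, and the existing .db files must be moved there while the node is stopped):

```yaml
# config.yaml — keep the SQLite databases on a faster disk (example path;
# move the existing .db files there first, with the storagenode service stopped)
storage2.database-dir: 'D:\storj-databases'
```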

3 Likes

You need to increase the writability check timeout, e.g. by 30s.
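That timeout lives in config.yaml as well; a sketch, assuming the current option name (check your own config for the exact key), raising it from the 1m0s default seen in the log:

```yaml
# config.yaml — give the writability check more headroom than the default 1m0s
storage2.monitor.verify-dir-writable-timeout: 1m30s
```

Restart the storagenode service after changing it.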

This one is weird:

I would suggest checking the system disk for errors and fixing them. If that’s an SSD, it should not have context canceled errors (in this case it means that the storagenode was unable to open this database). Could you please exclude the folder with the databases from antivirus scans?
Please also check the databases for errors:
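The databases are standard SQLite files, so one way to check them is the sqlite3 CLI; a sketch (the first command creates an empty demo file here for illustration — on the node, stop the service and point the check at each real .db file under `C:\Program Files\Storj\database\` instead):

```shell
# Demo: create an empty database file, then run the same integrity check
# you would run against each real storagenode .db file (service stopped).
sqlite3 piece_spaced_used.db "VACUUM;"
sqlite3 piece_spaced_used.db "PRAGMA integrity_check;"   # a healthy database prints "ok"
```

Anything other than `ok` lists the corruption found.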

Nice kit S26, more than capable for this stuff.

No specific process; just look in the details: storagenode.exe may spawn its own child processes two or three times, and the aggregate memory consumption will show in Task Manager.
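The aggregate can also be pulled from PowerShell; a minimal sketch (sums the working set of every storagenode process, in MB):

```powershell
# Total working set of all storagenode processes, in MB
(Get-Process storagenode -ErrorAction SilentlyContinue |
    Measure-Object WorkingSet64 -Sum).Sum / 1MB
```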

Other than the above-noted replies, so far I’d only additionally suggest you defrag that disk and run a disk check on your db drive. There is maybe one more parameter I can think of in the config.yaml which may help. Server versions of Windows are easily blown away by high ingress, i.e. if you’ve got a 1 Gbps or faster internet connection and are being rifled test data, it will overwhelm the IOPS on a regular drive like that, plugging up RAM and killing itself, as mentioned earlier. Look in the config.yaml for filestore.write-buffer-size; note what it was, if not hash-ignored (commented out), and set it to 128 KB or 256 KB.

As noted.

Meaning what?

As I have said to you, with the exception of the SSD, these have been done. Also note that the issue has only come up since the time of the problem, not prior.

Where do I do that? And what do you mean by 30s?

After what change??

Ok.

That has been done, as already noted to you. They have never, in either location, been included. How do I do that?

Want to buy them? This is turning out not to be the best support community, and I’m losing interest because of it.

That is why I have repeatedly said it’s not a performance limitation. Nothing like that is occurring that I am able to see.

Was not able to post from my account.

Already completed the disk maintenance. The servers are on a 500 Mbps connection. RAM has never been maxed.

I only have this… already at 128, but hashed out.

in-memory buffer for uploads

filestore.write-buffer-size: 128.0 KiB
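Following the earlier suggestion, uncommenting it and bumping it would look like this (restart the storagenode service afterwards for it to take effect):

```yaml
# config.yaml — in-memory buffer for uploads
filestore.write-buffer-size: 256.0 KiB   # was commented out at the 128.0 KiB default
```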

You need to stop creating new accounts. WTF? If you can’t post, maybe there is a reason for that? Just think about it.

I don’t understand your question.

This is also a quite meaningless, redundant statement.

Esther, stop trolling and wasting everyone’s time; or, if you actually need a solution, do a little bit of homework. Don’t expect to be spoonfed every step of the way.

Show proof. Show disk IO queues.

lol. He/she is losing interest.

Things like this are clearly explained in the previous discussion that the quote is from. I think you are impatient and want things done for you, but running a node requires a level of technical understanding that may be outside of your current skill set. Either that or you need to slow down, read what you are clearly being asked to do, and work with the people helping you instead of against them. It would help if you communicated in more detail and didn’t respond like you are sending text messages.

We are happy to help you get your nodes working properly (or at least as well as everyone else’s currently are) but we need to work together and reduce the friction.

8 Likes