Disk usage discrepancy?

@Alexey…come on man :stuck_out_tongue:
That’s like basic at this point, no?

curl localhost:5999/mon/ps
[2323569931365802481,7189491070577810988] storj.io/common/process.root() (elapsed: 172h9m19.419326777s)
 [6431948499252660292,7189491070577810988] storj.io/storj/storagenode.(*Peer).Run() (elapsed: 172h9m13.568752268s)

[6959841693586459393,2602390795943692092] storj.io/common/rpc/rpctracing./piecestore.Piecestore/Download() (elapsed: 1.125447561s)
 [2093920554374450887,2602390795943692092] storj.io/storj/storagenode/piecestore.live-request() (elapsed: 1.12541156s)
  [6451371452017218188,2602390795943692092] storj.io/storj/storagenode/piecestore.(*Endpoint).Download() (elapsed: 1.12539881s)

[1630876837993984714,7189491070577810988] storj.io/storj/private/server.(*Server).Run() (elapsed: 172h9m12.644519817s, orphaned)

[903625319944339965,7189491070577810988] storj.io/storj/storagenode/bandwidth.(*Service).Run() (elapsed: 172h9m12.643533647s, orphaned)

[5119260840370239031,7189491070577810988] storj.io/storj/storagenode/collector.(*Service).Run() (elapsed: 172h9m12.64601871s, orphaned)

[3008413974829738903,7189491070577810988] storj.io/storj/storagenode/console/consoleserver.(*Server).Run() (elapsed: 172h9m12.642904309s, orphaned)

[2429036044652063580,7189491070577810988] storj.io/storj/storagenode/contact.(*Chore).Run() (elapsed: 172h9m12.643608837s, orphaned)

[6419832077942457913,7189491070577810988] storj.io/storj/storagenode/forgetsatellite.(*Chore).Run() (elapsed: 172h9m12.640853342s, orphaned)

[8962183285788663938,7189491070577810988] storj.io/storj/storagenode/gracefulexit.(*Chore).Run() (elapsed: 172h9m12.641157836s, orphaned)

[34558424677826981,7189491070577810988] storj.io/storj/storagenode/monitor.(*Service).Run() (elapsed: 172h9m12.645434473s, orphaned)

[3445976527790545990,7189491070577810988] storj.io/storj/storagenode/orders.(*Service).Run() (elapsed: 172h9m12.643777802s, orphaned)

[8459771254874523922,7189491070577810988] storj.io/storj/storagenode/pieces.(*CacheService).Run() (elapsed: 172h9m12.645972185s, orphaned)
 [4823513664626300180,7189491070577810988] storj.io/storj/storagenode/pieces.(*Store).SpaceUsedTotalAndBySatellite() (elapsed: 172h9m12.64230631s)
  [917123383319018073,7189491070577810988] storj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces() (elapsed: 16h45m10.183125722s)
   [5274574280961785374,7189491070577810988] storj.io/storj/storagenode/blobstore/filestore.(*Dir).WalkNamespace() (elapsed: 16h45m10.149983278s)
    [408653141749776868,7189491070577810988] storj.io/storj/storagenode/blobstore/filestore.(*Dir).walkNamespaceInPath() (elapsed: 16h45m10.149978553s)
  [7674976790325901763,7189491070577810988] storj.io/storj/storagenode/pieces.storedPieceAccess.Size() (elapsed: 10.488364ms)

[6059235424245186133,7189491070577810988] storj.io/storj/storagenode/pieces.(*TrashChore).Run() (elapsed: 172h9m12.64344642s, orphaned)
 [4663523655444128375,7189491070577810988] storj.io/storj/storagenode/pieces.(*Store).EmptyTrash() (elapsed: 172h9m12.576767164s)
  [9020974553086895677,7189491070577810988] storj.io/storj/storagenode/blobstore/filestore.(*blobStore).EmptyTrash() (elapsed: 172h9m12.576759289s)
   [4155053413874887170,7189491070577810988] storj.io/storj/storagenode/blobstore/filestore.(*Dir).EmptyTrash() (elapsed: 172h9m12.576738876s)

[2866598597612870667,7189491070577810988] storj.io/storj/storagenode/retain.(*Service).Run() (elapsed: 172h9m12.645342096s, orphaned)
 [274272495407841407,7189491070577810988] storj.io/storj/storagenode/retain.(*Service).retainPieces(storj.NodeID{0x7b, 0x2d, 0xe9, 0xd7, 0x2c, 0x2e, 0x93, 0x5f, 0x19, 0x18, 0xc0, 0x58, 0xca, 0xaf, 0x8e, 0xd0, 0xf, 0x5, 0x81, 0x63, 0x90, 0x8, 0x70, 0x73, 0x17, 0xff, 0x1b, 0xd0, 0x0, 0x0, 0x0, 0x0}, time.Time(2024-08-02T17:59:59.997079Z)) (elapsed: 27h39m23.641756681s)
  [4631723393050608709,7189491070577810988] storj.io/storj/storagenode/pieces.(*Store).WalkSatellitePiecesToTrash(storj.NodeID{0x7b, 0x2d, 0xe9, 0xd7, 0x2c, 0x2e, 0x93, 0x5f, 0x19, 0x18, 0xc0, 0x58, 0xca, 0xaf, 0x8e, 0xd0, 0xf, 0x5, 0x81, 0x63, 0x90, 0x8, 0x70, 0x73, 0x17, 0xff, 0x1b, 0xd0, 0x0, 0x0, 0x0, 0x0}, time.Time(2024-07-30T17:59:59.997079Z)) (elapsed: 27h39m23.641688592s)
   [8989174290693376010,7189491070577810988] storj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePiecesToTrash() (elapsed: 27h39m23.64184057s)
    [3614782909912126299,7189491070577810988] storj.io/storj/storagenode/pieces.(*FileWalker).WalkSatellitePieces() (elapsed: 27h39m23.641551277s)
     [7972233807554893600,7189491070577810988] storj.io/storj/storagenode/blobstore/filestore.(*Dir).WalkNamespace() (elapsed: 27h39m23.641537539s)
      [3106312668342885094,7189491070577810988] storj.io/storj/storagenode/blobstore/filestore.(*Dir).walkNamespaceInPath() (elapsed: 27h39m23.641617877s)
    [485485719748090775,7189491070577810988] storj.io/storj/storagenode/pieces.storedPieceAccess.ModTime() (elapsed: 115.562744ms)

[1122406596424743509,7189491070577810988] storj.io/storj/storagenode/version.(*Chore).Run() (elapsed: 172h9m12.644567993s, orphaned)

As you suggested, I searched using the ps command. How can I know when it will end…! :slight_smile:
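If it helps, one rough way to keep an eye on it is to poll the same debug endpoint and filter for the filewalker functions. A minimal sketch, assuming the debug port is 5999 as above and that grep is available (e.g. inside the container or on the host):

curl -s localhost:5999/mon/ps | grep -E "WalkSatellitePieces|WalkNamespace|EmptyTrash"

The elapsed time keeps growing while a walk is in progress; once a filewalker finishes, its line disappears from the output.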

(two screenshots attached)

Yes, but sometimes people don't do that, and many text editors (e.g. Notepad++) won't even prompt you to save changes on close.

However, if the used-space filewalker failed even in non-lazy mode, then your disk is overloaded and you need to reduce the load on it. One option is to set the allocation below the current usage shown on the dashboard; this will stop any ingress to your node and leave the filewalker more IOPS.
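For reference, a minimal sketch of what that looks like in config.yaml (the value below is just an example, and the option name is storage.allocated-disk-space to the best of my knowledge; restart the node after changing it):

# set the allocation below the currently used space to stop ingress (example value)
storage.allocated-disk-space: 500.00 GB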

Yeah, did that also; on a 16TB HDD I assigned 600GB. Open to more suggestions.
Would this be fixed in version 1.110?
The dashboard shows the disk as full with no space available for ingress, so I assume it's not an IOPS issue.

If you have errors related to filewalkers, it's an issue of not enough IOPS. This can be fixed either by optimizing the filesystem or by disabling the lazy mode and, perhaps, enabling the experimental badger feature to cache filesystem metadata; see Badger cache: are we ready?
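If it helps, the relevant config.yaml entries look roughly like this (option names to the best of my knowledge; the badger one is experimental and may differ between versions):

# run the used-space filewalker in the main process instead of the low-priority lazy subprocess
pieces.enable-lazy-filewalker: false
# experimental: cache file metadata in a badger store to reduce disk IOPS on later scans
pieces.file-stat-cache: badger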

Could you please elaborate on what you expect to be fixed in the next version?
There are many fixes; see Release preparation v1.110.

I'm not up to date with all the features being worked on regarding the startup piece scan, but it has been more than a month now with my drives reporting full because of the bugs. I can't get a successful filewalker run through 22 TB of data (updates, power outages), and this affects my nodes because they can't get any useful ingress.
The only way forward I see, and the one I will go with, is:

  • GE Saltlake and then…
  • clean the test data with forget-satellite
  • remove all the leftovers, trash, etc. if that's the case
  • do a FW run to get the nodes up to date.
    There should be about 2TB of data on each after cleaning.

So… bye bye test data, I really don't see the advantage for my setups in keeping it on.
I read something about the badger cache, but I don't see the point in using it (maybe I don't understand it very well), because I don't plan to keep the startup FW on. And it shouldn't be needed in a perfect world of storagenodes, with all the data flows accounted for and the DBs updated correctly.


There is a simpler way: you may disable the lazy mode and enable the badger cache, then restart the node. The used-space filewalker should calculate the used space faster than in lazy mode. You may also set the allocated space below the usage reported on the dashboard; this will stop any ingress and give more IOPS to the filewalkers.
Once they have updated the database, the used space should be reported correctly and you can increase the allocation again.
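For anyone on a Windows GUI node following along, applying this looks roughly like the following (assuming the default install path and the service name storagenode; run from an elevated PowerShell):

# after editing "C:\Program Files\Storj\Storage Node\config.yaml", restart the service
Restart-Service storagenode
# confirm the service came back up
Get-Service storagenode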

The badger cache speeds up all filewalkers, not only the used-space one. However, the effect only becomes noticeable after the first scan (or normal operations), because the cache first has to be filled with metadata. Subsequent requests are then served from the cache instead of hitting the disk directly.

Done all that. It finished in 7 days after some trash was removed, and the data shrank to 14TB. But I'm still GE-ing all my Saltlake nodes; I don't need test data to keep me going.
I won't activate badger… another DB that can get locked or corrupted, and more debugging for me? No thanks. I have better things to do than spend my time debugging storagenodes.


Well… fix what a lot of SNOs are complaining about: the used-space filewalker.
It can't be a coincidence that so many SNOs are complaining about the same thing, no?


Yes, all nodes have had a higher load than before (see Updates on Test Data), and slow disk subsystems are now more of an issue.
There are several solutions on the node's side (optimizing the filesystem, adding more RAM, using an SSD as a cache layer, offloading some load such as the databases to the SSD, etc.), and also some improvements in the code (like the badger cache: Badger cache: are we ready?).
And of course, if someone uses one disk for several nodes, now they know exactly why that has been forbidden by the ToS from the start. I'm not saying that you do so; however, please do not try.
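For the database offload in particular, it is a single config.yaml option pointing at a directory on the SSD. A sketch, assuming the option name storage2.database-dir and an example path; the existing *.db files have to be moved there while the node is stopped:

# keep the SQLite databases on an SSD instead of the data HDD (example path)
storage2.database-dir: D:\storagenode-dbs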

@Alexey, I follow the rules. I'm an experienced IT technician and I have operated Storj nodes since the RPi Signature Box I won in the former Twitter contest.
Following previous posts about the same issue, this has been ongoing for me for over 2 months. I tried all possible configs (lazy mode on and off, startup scan on and off, in combination with both)… I lowered the assigned storage space to 600GB and I still don't see the "Overused Space" growing on the dashboard, even after 172 hours of the node being online!
The used space filewalker is NOT working properly.
To be honest, it's starting to wear me out, because no solution is given, just workarounds!!
The databases are on an SSD and the storage space is on a 7200 rpm CMR HDD… it's NOT too slow.
Also, a lot of people are complaining about the same issues… typically where there's smoke, there's fire!


Where would I see the errors? And how do I verify that the used-space filewalker is running? Is there a doc for it? Thanks.

Does the dashboard match this disk? I see the 600GB allocated, but the average graph shows about 7TB used (the last report from the satellites, not the Average Disk Space Used This Month), while your disk shows 7.01TB free out of 8.98TB (Windows calculates in TiB but labels it with the wrong units), so the actual disk usage is 1.97TB.

Do you have all filewalkers completed for all trusted satellites?

sls "\sused-space" "C:\Program Files\Storj\Storage Node\storagenode.log" | sls "started|completed|failed" | select -last 20

Also, do you have any database errors?

sls "error" "C:\Program Files\Storj\Storage Node\storagenode.log" | sls "database" | select -last 10

@Ring_Zero see above

Nothing… and before you ask: yes, scan on startup and the lazy walker are on.

No errors on Databases either:

Not sused, it's "\sused"; the leading \s is important: it means "any whitespace character here". Logs may use a space as the delimiter, but also a tab and other characters that are treated as whitespace…

Furthermore, if it helps:

OK, so even in non-lazy mode the disk is not able to keep up and the filewalker is failing.
Do you also have filewalker errors with exit status 1, even in non-lazy mode?

Did you perform the NTFS optimizations?
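(For context, the NTFS optimizations usually mentioned on the forum are along these lines; this is my recollection rather than an official checklist, run from an elevated prompt:)

# stop NTFS from generating legacy 8.3 short names for every file
fsutil behavior set disable8dot3 1
# stop updating the last-access timestamp on every read
fsutil behavior set disablelastaccess 1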

No, no "exit 1" errors.
I even bought 2 UPSs just to enable "write cache on disk".

The filewalker does reads, not writes. The write cache helps handle uploads with fewer IOPS (giving more IOPS to other processes).
Then you may try to use the badger cache; it could help. The lazy mode should be disabled.

Hey, sorry to bother you. I am running it on Windows and I am in the config.yml file, but I am unable to find where to set the flag to false. I looked through the entire config file but couldn't find any such option. I am most likely missing something; I would appreciate it if you could point me in the right direction.