This idea came from the fact that we obviously discard database entries that can’t be written; that’s one of the reasons why the node’s numbers could be wrong.
So the original idea was: instead of discarding them, we should store them and retry later.
But then, why not store them all in a journal and replay it later, when there is no pressure on the file system anymore?
Yes, but it would have to be an append-only journal in that case. Only then would there likely be no congestion, or at least a much lower probability of it.
I’m not sure that would be great behavior. It basically requires stopping all activity of the node, as if it were taken offline. Perhaps exiting would be a solution: the docker daemon would then restart it, and SQLite would replay the journal. You see how bad that looks?
I would suggest keeping appending records to the journal until the next restart. That happens every 2-3 weeks anyway due to upgrades. Yes, the node would take longer to start, but well. It could also be a config parameter.
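Just to make the journal idea concrete, here’s a rough sketch of what I have in mind (the record format, file layout, and function names are all my assumptions, not how the storagenode actually works): during heavy load the node only appends records to a flat file, and replays them into the database on the next restart.

```go
// journal.go - hypothetical append-only journal for deferred DB updates.
package journal

import (
	"bufio"
	"encoding/json"
	"os"
)

// UsageUpdate is a made-up record type for one deferred database write.
type UsageUpdate struct {
	SatelliteID string `json:"satellite_id"`
	DeltaBytes  int64  `json:"delta_bytes"`
	UnixTime    int64  `json:"unix_time"`
}

// Append adds one record to the journal. O_APPEND keeps writes sequential,
// which is the whole point: far cheaper than random SQLite writes under load.
func Append(path string, u UsageUpdate) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	line, err := json.Marshal(u)
	if err != nil {
		return err
	}
	_, err = f.Write(append(line, '\n'))
	return err
}

// Replay is meant to run at startup: read every record and hand it to apply()
// (e.g. a batched database UPDATE); the caller truncates the file afterwards.
func Replay(path string, apply func(UsageUpdate) error) error {
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		return nil // nothing to replay
	}
	if err != nil {
		return err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var u UsageUpdate
		if err := json.Unmarshal(sc.Bytes(), &u); err != nil {
			continue // tolerate a torn last line after a crash
		}
		if err := apply(u); err != nil {
			return err
		}
	}
	return sc.Err()
}
```

That’s also why startup would take longer: the whole file has to be replayed before the numbers are correct again.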
I didn’t get a confirmation from the team about this assumption. You could be right that it’s discarded; I don’t see that in the code, but it could be a side effect.
I don’t know anything about that. The main question is: do we really need to execute database statements while IOPS are needed for customer data, or could we somehow decouple that and do it later, when the load on the file system is lower?
As said, I don’t know if that would be required, as I can’t tell how big such a journal would be and how many resources would be required to replay it.
But on the other hand, if a node is full, this gets signaled to the satellite too, so I don’t see a big issue here.
OK, I took that as already confirmed.
Yes, we have to. Otherwise the usage will not be updated, the TTL data will not be recognized as TTL (so it would be handled by GC instead), and your dashboard will show crap.
There are ways, and the simplest one is to add a cache, which requires RAM. The suggested implementation with append-only journals may work too, but I need confirmation from our developers. Perhaps it’s not easy to implement, or it doesn’t work the way we both hope.
I’m sorry if I gave you that impression, but I’m not sure that’s the case. I think it may eventually drop this information if the RAM is needed for other things and the node was unable to flush it to the database before then.
That was not meant as a general question of whether we have to do it.
It was meant as a question of whether we have to do it while the pressure on the file system is high, or if and how we could do it later, when the pressure is low.
Yes, I know, but the alternative is to keep it in memory, and that may well end with an OOM…
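To make that trade-off concrete: an unbounded in-memory cache risks exactly that OOM, while a bounded one has to either block ingress or drop updates once it is full. A minimal sketch, with made-up types and sizes:

```go
// Hypothetical sketch of a bounded in-memory buffer for deferred DB updates.
package cache

// Update is a made-up pending bookkeeping record.
type Update struct {
	Key   string
	Delta int64
}

// Buffer holds at most cap(ch) pending updates, so memory use stays bounded.
type Buffer struct {
	ch chan Update
}

func New(size int) *Buffer { return &Buffer{ch: make(chan Update, size)} }

// Add returns false when the buffer is full, i.e. the update is dropped
// instead of letting memory grow without bound (the trade-off discussed above).
func (b *Buffer) Add(u Update) bool {
	select {
	case b.ch <- u:
		return true
	default:
		return false
	}
}

// Drain hands pending updates to flush (e.g. a batched DB write) until empty.
func (b *Buffer) Drain(flush func(Update) error) error {
	for {
		select {
		case u := <-b.ch:
			if err := flush(u); err != nil {
				return err
			}
		default:
			return nil
		}
	}
}
```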
Do you know the status of these experiments? It’s been a while since I first saw a post about it, and I haven’t heard much since then.
Not much to share, but I saw an internal discussion about implementing it at least partially (it doesn’t cover all use cases right now without a penalty of some kind).
I think it would be shared if there were a success, even a partial one. It’s in the PoC stage.
See:
https://review.dev.storj.io/c/storj/storj/+/13554
and
You cannot reasonably expect there will be this “later, when the load is lower” before the node actually needs to act on the information you expect to be stored in the database. You might also end up in a situation where updates are coming in faster than they are written to the database, and simply run out of cache.
This is a standard type of system design issue and the only two regular approaches are limiting the ingress (so that the node can actually write what it needs), or rearchitecting storage (so that writes are more efficient).
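For illustration, the “limit the ingress” option could be as simple as refusing new uploads while the bookkeeping backlog is above some threshold, so the database gets a chance to catch up. Names and the threshold are made up; this is not storagenode code:

```go
// Hypothetical backpressure limiter: reject uploads while too many
// bookkeeping writes are still pending, instead of letting the backlog grow.
package ingress

import "sync/atomic"

type Limiter struct {
	pending    atomic.Int64
	maxPending int64
}

func NewLimiter(maxPending int64) *Limiter {
	return &Limiter{maxPending: maxPending}
}

// TryAccept is called before accepting an upload; it fails fast when the
// backlog is too large (the satellite would simply pick another node).
func (l *Limiter) TryAccept() bool {
	if l.pending.Load() >= l.maxPending {
		return false
	}
	l.pending.Add(1)
	return true
}

// Done is called once the upload's database update has been committed.
func (l *Limiter) Done() {
	l.pending.Add(-1)
}
```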
Not sure if you have noticed, but recently we’ve had a lot of new users registering on the forum to complain about performance or wrong disk space reporting. I think this counts as the silent majority waking up.
I have one more question: what happens when my hard drive fills to 100%, but the node still thinks it has “free space”? Does it just stop uploads? I checked my logs, no errors for the databases btw, so I’m waiting for the filewalker to do its job.
Yes, it will stop uploads at around 5 GB of free space in the allocation/disk. That’s a hardcoded limit at which the node will urgently report to the satellites that it’s full.
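For what it’s worth, that check amounts to something like this (the ~5 GB figure is from the post above; the function shape is just my guess for illustration):

```go
// Hypothetical illustration only: stop accepting uploads once free space in
// the allocation or on the disk drops below ~5 GB.
package space

const fullThresholdBytes int64 = 5_000_000_000 // ~5 GB

// ShouldStopUploads takes the remaining allocated space and the remaining
// physical disk space and returns true when the smaller of the two is below
// the threshold, i.e. the node should report itself as full to the satellites.
func ShouldStopUploads(allocationFree, diskFree int64) bool {
	free := allocationFree
	if diskFree < free {
		free = diskFree
	}
	return free < fullThresholdBytes
}
```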
My node still uses a lot more disk space than shown in the dashboard, and my hard disk is now full. I have used the option “storage2.piece-scan-on-startup: true” for over a week now, but it has not fixed the problem. What can I do now? Do I have to wipe the disk and start from the beginning?
Has the file walker finished running? If your disk is very big, it can take many days… Mine is up to 6 days now (on an 18 TB drive).
Where can I see if the file walker has finished?
All you can do is check the logs to see if the process is completed.
I’m not entirely sure I’m doing it right (I am not a techie), but all I did was
sudo docker logs node1 |grep used-space
and noticed that the process for satellite ID “1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE” was the only one not showing “completed”. That is Saltlake, so it makes sense.
Also, still a lot of IO on the respective HDD.
EDIT: and it started at 2024-06-17T15:36:18Z
I’m afraid to run a startup file walker scan on my filled-up 21 TB drives. I have RAM, but still not enough. I think it will run for 2 weeks.
Yeah, I’m doing it to try and fix the usage discrepancy, but I’m hoping I’ll never have to do it again!
EDIT: And on a Pi 5 as well, with 8 GB of RAM. Might be fixed by Christmas, maybe.
Hi,
I am hosting 2 nodes on 2 different HDDs on the same machine (Linux, Docker). I shut down my server for maintenance, started it again after 2 minutes, and then restarted the nodes.
But without any warning, the dashboard of my first node is showing only about 50% of the data it showed before the maintenance.
There is no error shown; it looks perfectly fine, but 50% of the data is missing, and the free space in the dashboard is wrong. My HDD is nearly 100% full with Storj data and nothing has been deleted, so if Storj tries to store more data, the HDD can’t hold it.
So why is the Storj dashboard showing wrong data for my first node, and where has the data gone? There is nearly no trash, only a few GB, nothing compared to what has disappeared.
The second node is running perfectly fine, without any issue, and shows all the data it showed before my maintenance…
Hi, I used the command. The result:
root@Tower:~# sudo docker logs storagenode-v3 |grep used-space
2024-06-24T23:20:12+02:00 INFO lazyfilewalker.used-space-filewalker subprocess exited with status {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "status": -1, "error": "signal: killed"}
2024-06-24T23:20:12+02:00 INFO lazyfilewalker.used-space-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6"}
2024-06-24T23:20:12+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"Process": "storagenode", "satelliteID": "121RTSDpyNZVcEU84Ticf2L1ntiuUimbWgfATz21tuvgk3vzoA6", "error": "context canceled"}
2024-06-24T23:20:12+02:00 INFO lazyfilewalker.used-space-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-06-24T23:20:12+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"Process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "error": "context canceled"}
2024-06-24T23:20:12+02:00 INFO lazyfilewalker.used-space-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs"}
2024-06-24T23:20:12+02:00 ERROR lazyfilewalker.used-space-filewalker failed to start subprocess {"Process": "storagenode", "satelliteID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "error": "context canceled"}
2024-06-24T23:29:25+02:00 INFO lazyfilewalker.used-space-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2024-06-24T23:29:25+02:00 INFO lazyfilewalker.used-space-filewalker subprocess started {"Process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2024-06-24T23:29:25+02:00 INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"Process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Process": "storagenode"}
2024-06-24T23:29:25+02:00 INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"Process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Process": "storagenode"}
2024-06-24T23:29:26+02:00 INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker completed {"Process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB", "Process": "storagenode", "piecesTotal": 1094946816, "piecesContentSize": 1094447616}
2024-06-24T23:29:26+02:00 INFO lazyfilewalker.used-space-filewalker subprocess finished successfully {"Process": "storagenode", "satelliteID": "12rfG3sh9NCWiX3ivPjq2HtdLmbqCrvHVEzJubnzFzosMuawymB"}
2024-06-24T23:29:26+02:00 INFO lazyfilewalker.used-space-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2024-06-24T23:29:26+02:00 INFO lazyfilewalker.used-space-filewalker subprocess started {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2024-06-24T23:29:26+02:00 INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Process": "storagenode"}
2024-06-24T23:29:26+02:00 INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Process": "storagenode"}
2024-06-24T23:29:30+02:00 INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker completed {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo", "Process": "storagenode", "piecesTotal": 1119945216, "piecesContentSize": 1118760448}
2024-06-24T23:29:30+02:00 INFO lazyfilewalker.used-space-filewalker subprocess finished successfully {"Process": "storagenode", "satelliteID": "12tRQrMTWUWwzwGh18i7Fqs67kmdhH9t6aToeiwbo5mfS2rUmo"}
2024-06-24T23:29:30+02:00 INFO lazyfilewalker.used-space-filewalker starting subprocess {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-06-24T23:29:30+02:00 INFO lazyfilewalker.used-space-filewalker subprocess started {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-06-24T23:29:30+02:00 INFO lazyfilewalker.used-space-filewalker.subprocess Database started {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Process": "storagenode"}
2024-06-24T23:29:30+02:00 INFO lazyfilewalker.used-space-filewalker.subprocess used-space-filewalker started {"Process": "storagenode", "satelliteID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Process": "storagenode"}