Two weeks working for free in the waste storage business :-(

Roxor · July 11, 2024, 1:17pm

It sounds like it defaults to the data path and not the DB path. But (at least in Docker) it’s easy to override.

Toyoo · July 11, 2024, 10:52pm

The source code of the bloom filter generator is here: storj/satellite/gc/bloomfilter/observer.go at 31815729a40787a7e7dd5836268179d24cd9e6dd · storj/storj · GitHub

It’s not actually that complex, it’s a pretty straightforward implementation. “Simple” is easy to make correct… which is probably quite important not to accidentally kill pieces still in use. Though, “simple” is also usually “not optimized”.

I don’t know whether the infrastructure configuration for this process is public, I suspect it is not.

Alexey · July 12, 2024, 8:39am

The first one: it could be.
The second one is a consequence of the GC collecting this piece and sending it to the trash before the TTL collector could delete it directly. Kind of expected for trashed TTL data.

Alexey · July 12, 2024, 8:40am

Nothing fishy, just another bug. Unfortunately.

Vadim · July 13, 2024, 1:53pm

Do we need to turn this cache on separately, or is it on by default?

elek · July 13, 2024, 2:31pm

This is an experimental feature, not yet stable (I haven’t yet tested it for multiple days on dozens of storagenodes, which is the plan).

The problem with the filewalker is getting the file size and the file modification time. The first is required for calculating used space (which is required to estimate how much more data you need). The modification time is required for safe deletes and TTL.

The problem is that we need a file system stat call to get them. This is slow, as it requires a disk read (an inode read on Linux, from different disk locations). Storing this information in a central location (a cache) can make this process faster, with minimal overhead.
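
To illustrate the idea only: a minimal Python sketch of a metadata cache that answers size and modification-time queries without repeating the stat call. The `StatCache` class and its `stat_calls` counter are hypothetical names made up for this example; this is an in-memory stand-in, not the storagenode's actual cache, and a real one would also have to be persisted and updated on every write and delete.

```python
import os
import tempfile


class StatCache:
    """Hypothetical in-memory cache of (size, mtime) keyed by file path."""

    def __init__(self):
        self._entries = {}
        self.stat_calls = 0  # counts real file system stat calls

    def get(self, path):
        """Return (size_bytes, mtime), calling os.stat only on a cache miss."""
        entry = self._entries.get(path)
        if entry is None:
            st = os.stat(path)
            self.stat_calls += 1
            entry = (st.st_size, st.st_mtime)
            self._entries[path] = entry
        return entry

    def used_space(self, paths):
        """Sum the sizes of all given files, served from the cache where possible."""
        return sum(self.get(p)[0] for p in paths)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i, size in enumerate([10, 20, 30]):
            p = os.path.join(d, f"piece{i}.sj1")
            with open(p, "wb") as f:
                f.write(b"x" * size)
            paths.append(p)
        cache = StatCache()
        print(cache.used_space(paths), cache.stat_calls)  # first pass reads the disk
        print(cache.used_space(paths), cache.stat_calls)  # second pass is cache-only
```

The second used-space pass touches no inodes at all, which is the saving being aimed at.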

That’s the plan.

Vadim · July 13, 2024, 5:31pm

Do I need to turn it on separately, or does it build the cache by default?

Ruskiem · July 13, 2024, 5:40pm

I thought such a cache or file database should already exist somewhere in the node’s .db files?
Oh, it doesn’t? So the node always needs to iterate over all files.
I mean, doesn’t the satellite know it already (which files the node has, their sizes, and when they were sent and accepted by the node)? That way the payments are always accurate regardless of what the node’s data says. All that’s needed is to run audits to make sure those files are still on the nodes, not to compute all that constantly from scratch on the nodes’ side, where I/O is limited.

Why then force poor HDDs to iterate over all those files for days or weeks, which robs the disk of its I/O? It just blocks the node from receiving new files.

I mean, why should such a cache exist on the node side if the satellite already knows all that?
Why force the node to collect all the data over and over again? Even if I run the full used-space filewalker, after one week the composition of files changes (even drastically with the new pattern with 30-day TTL), so the node needs to do another full filewalker run to update the cache. Isn’t that insane?

If you need the cache locally, well, a database of files, why not just download it from the satellite?
With all the files the node SHOULD have, the size of each file, and timestamps from the point the node accepted it. Then proceed from that, and IF somehow the node can’t process the files, like a file is missing, that would indicate that the node lost the file, so you get a free audit at the same time! How about that?

Starting to understand arrogantrabbit here

Maybe not so much that the dashboard has to go, but some simplification is needed, maybe like this idea freshly posted:

elek · July 13, 2024, 6:48pm

Because the satellite can’t share that information: it’s too big. That’s the reason for using bloom filters instead of sending a list of the pieces that are supposed to exist…

There are multiple ways to solve this problem, each with strengths and weaknesses. For example, HDFS does something like the opposite direction (SNs report the stored data back to the satellites), because it was created to solve the problem of big files (at Yahoo… a long time ago…). And it had lots of problems with small files… (one improvement was to adopt incremental piece reports…)
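
A rough sketch of why a bloom filter works here, in Python. It is not the satellite's implementation (that is the Go code linked earlier in the thread); the `BloomFilter` class, the SHA-256 based hashing and the `garbage_candidates` helper are all assumptions made for illustration. The filter is a fixed-size bit array built from the pieces to keep. The node then treats every local piece that does not match as garbage. False positives mean some garbage survives a run, but a piece that was added to the filter can never be reported as missing.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: a fixed-size bit array with num_hashes probes per item."""

    def __init__(self, size_bits, num_hashes):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def garbage_candidates(local_pieces, keep_filter):
    """Pieces on disk that the filter definitely does not contain."""
    return [p for p in local_pieces if p not in keep_filter]
```

The size of the filter depends on the chosen bit count, not on the length of the piece IDs, which is why it can be sent where a full list cannot.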

Alexey · July 14, 2024, 8:51am

No, it doesn’t. The satellite has information about the segments and pointers (to the nodes), that’s all.

It’s based on completely different data: on orders signed by three sides, the satellite, the customer’s uplink and the node.

Mostly to have pretty numbers/graphs on the dashboard. And so that the node can quickly report how much free space it still has.

It doesn’t have this info. It has the information provided by the node, including the reported free space and the signed orders. No info about the pieces on your node. They can be calculated (when the auditor requests it), but in general, the satellite doesn’t store details that can be calculated.

Zetanova · July 17, 2024, 7:15pm

@elek I did my own small research into the filewalker and ext4 & co. issue over the past few days and came across your commit. Best filesystem for storj - #89 by Zetanova

Even your badger cache code will not solve the underlying issue of directory fragmentation.

A refactor of the directory layout would be required, including your badger cache as an index.

My idea would be to change the “./config/storage/blobs/” directory structure from a hash-set to a journal. This requires including a continuous directory, like a date, in the path. “./config/storage/trash/” already takes a similar approach.

This will force new writes to be placed close together and in the same inode structure, and will reduce directory fragmentation. The downside is that direct access by chunk hash will no longer be possible, and this is where your badger cache would be used as the index and file path locator.

Example:
instead of “/config/storage/blobs/6r2…aa/a2/fj4…cq.sj1”
the following path could be used:
“/config/storage/blobs/{yyyy-ww}/{bucketId}/fj4…cq.sj1”

The chunk hash, modified date and bucketId can be looked up in the badger db and located on disk. The bucketId would be a counter used only to cap the number of chunk files stored inside a directory.

The badger db could be recreated at any time by filewalking, but this would be much faster because the reads of directory inodes and files will always be closer to sequential on the drive.

On node startup only the latest directory “/config/storage/blobs/{yyyy-ww}/{bucketId}/” needs to be checked, because only it could contain unindexed or changed files.
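
A small Python sketch of how such a journal layout might assign paths, just to make the proposal concrete. Everything here is hypothetical: the `JournalLayout` class, the use of the ISO week for {yyyy-ww}, and a plain dict standing in for the badger index. It is not existing storagenode code.

```python
import datetime


class JournalLayout:
    """Hypothetical journal layout: "{yyyy-ww}/{bucketId}/{chunk_hash}.sj1"."""

    def __init__(self, max_per_bucket):
        self.max_per_bucket = max_per_bucket
        self.index = {}      # stand-in for the badger index: chunk hash -> path
        self._week = None    # current "{yyyy-ww}" directory
        self._bucket = 0     # current bucketId counter
        self._count = 0      # files already placed in the current bucket

    def place(self, chunk_hash, when):
        """Assign the next journal path to a new chunk and record it in the index."""
        iso = when.isocalendar()  # (ISO year, ISO week, weekday)
        week = f"{iso[0]:04d}-{iso[1]:02d}"
        if week != self._week:
            # A new week starts a new directory, with bucket numbering reset.
            self._week, self._bucket, self._count = week, 0, 0
        elif self._count >= self.max_per_bucket:
            # The current bucket is full: move on to the next one.
            self._bucket += 1
            self._count = 0
        self._count += 1
        path = f"{week}/{self._bucket}/{chunk_hash}.sj1"
        self.index[chunk_hash] = path
        return path

    def locate(self, chunk_hash):
        """Look a chunk up by hash; returns None if it is not indexed."""
        return self.index.get(chunk_hash)
```

New chunks always land in the newest week/bucket directory, so after a restart only that directory would need to be rescanned, and lookups by hash go through the index instead of the path.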

Mad_Max · July 18, 2024, 8:47am

I thought so too, but I checked, and it turns out partial canceled uploads are still written to the /blobs/ folder anyway. Even with a large RAM buffer (for the test, I ran the node with filestore.write-buffer-size: 2.5 MiB), which can accommodate any of the current pieces (because it exceeds their current maximum size), so that files do not even need to be temporarily written to disk. But they are stored and immediately turn into more unaccounted, uncollected garbage. As if there were a lack of other sources of it.

And then they lie there for at least a few more weeks before deletion, as needed for the full cycle: next satellite backup ==> generation of Bloom Filters from that backup and their distribution to nodes ==> GC run on nodes to move them from /blobs/ to /trash/ ==> plus another 7-8 days before final removal from /trash/.

Fresh example from latest (running v 1.108.3) storagenode:

2024-07-18T10:56:29+03:00	INFO	piecestore	upload started	{"Piece ID": "JQDWSGC4L5LCCIXXOKF5BQX37W7EST3T6O4PRQAQ6ZWIFSCXZDMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT_REPAIR", "Remote Address": "5.161.243.123:34658", "Available Space": 1967392485923}
---cut---
2024-07-18T10:56:33+03:00	INFO	piecestore	upload canceled (race lost or node shutdown)	{"Piece ID": "JQDWSGC4L5LCCIXXOKF5BQX37W7EST3T6O4PRQAQ6ZWIFSCXZDMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT_REPAIR", "Remote Address": "5.161.243.123:34658"}

Note: I also searched the log by piece ID (to make sure the piece was NOT uploaded again later), and these are the only lines found for this piece ID.

Let’s go to /blobs/ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa/jq/
where “ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa” is the storage dir for the US1 satellite with ID “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”
and /jq/ is the sub-dir (prefix) taken from the first two chars of the piece ID.
And guess what we found there?

dir D:\Storj_Data\blobs\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\jq\dwsgc4l5lccixxokf5bqx37w7est3t6o4prqaq6zwifscxzdma.sj1
 Folder content D:\Storj_Data\blobs\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\jq

18.07.2024  10:56         2 319 872 dwsgc4l5lccixxokf5bqx37w7est3t6o4prqaq6zwifscxzdma.sj1
             1  file      2 319 872 bytes

A fresh 2.3 MB portion of unaccounted garbage, waiting the next few weeks for the next BF+GC+TFW runs!
I checked it a few more times: it’s not a single glitch but the usual (I can’t call it “normal”) behavior of the current storagenode. Garbage generation is currently “by design”.
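
For anyone who wants to repeat the check, the manual steps above could be scripted roughly like this in Python. The helper names are made up for this sketch. It assumes the tab-separated log layout shown above (the message in the fourth field, the JSON payload in the fifth) and the path rule seen in this example (lowercased piece ID, first two chars as the sub-dir, the rest plus “.sj1” as the file name). The satellite's storage dir name has to be supplied by hand, as in the post. It does not rule out a later successful upload of the same piece, which still has to be checked in the log.

```python
import json
import posixpath


def piece_blob_path(blobs_dir, satellite_dir, piece_id):
    """Build the expected blob path for a piece ID, following the example above."""
    pid = piece_id.lower()
    return posixpath.join(blobs_dir, satellite_dir, pid[:2], pid[2:] + ".sj1")


def canceled_piece_ids(log_lines):
    """Return the piece IDs of all "upload canceled" log lines, in order."""
    ids = []
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or not fields[3].startswith("upload canceled"):
            continue
        ids.append(json.loads(fields[4])["Piece ID"])
    return ids
```

Each returned ID can be passed to `piece_blob_path` and the result tested with `os.path.exists` to see whether a canceled upload left a file behind.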

Vadim · July 18, 2024, 9:18am

If all people’s reports looked like this, there would be far fewer problems today.
@littleskunk I think Mad_Max’s report could help reduce the amount of garbage, or even explain why node sizes are wrong.

jammerdan · July 18, 2024, 9:33am

Mad_Max:

Fresh example from latest (running v 1.108.3) storagenode:

2024-07-18T10:56:29+03:00	INFO	piecestore	upload started	{"Piece ID": "JQDWSGC4L5LCCIXXOKF5BQX37W7EST3T6O4PRQAQ6ZWIFSCXZDMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT_REPAIR", "Remote Address": "5.161.243.123:34658", "Available Space": 1967392485923}
---cut---
2024-07-18T10:56:33+03:00	INFO	piecestore	upload canceled (race lost or node shutdown)	{"Piece ID": "JQDWSGC4L5LCCIXXOKF5BQX37W7EST3T6O4PRQAQ6ZWIFSCXZDMA", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Action": "PUT_REPAIR", "Remote Address": "5.161.243.123:34658"}

Mad_Max:

And guess what we found there?

dir D:\Storj_Data\blobs\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\jq\dwsgc4l5lccixxokf5bqx37w7est3t6o4prqaq6zwifscxzdma.sj1
 Folder content D:\Storj_Data\blobs\ukfu6bhbboxilvt7jrwlqk7y2tapb5d2r2tsmj2sjxvw5qaaaaaa\jq

18.07.2024  10:56         2 319 872 dwsgc4l5lccixxokf5bqx37w7est3t6o4prqaq6zwifscxzdma.sj1
             1  file      2 319 872 bytes

This looks concerning. With the huge amounts of test data uploaded and with lots of uploads canceled, this could result in massive space occupation by canceled uploads.
And they would take days if not weeks to get cleaned up.

These BGT-runs are killing us.

I don’t know if I should laugh about that or just cry…

Julio · July 18, 2024, 9:48am

So noted … It’s not just the choice of 2, it’s the choice of Zool! The beatings will continue until morale improves! lol Those with poor performance, having abominable success rates, are rapidly crowding the mosh pit… all the while the main concert succeeds in throughput. Everyone try not to get trampled underfoot! - Check your success rates.

2 cents.

Vadim · July 18, 2024, 9:48am

To be honest, you don’t need to do anything; the problem has been reported. I understand that every SNO needs to write about his disappointment with this problem, but this is the main problem: devs can’t read all of this and get work done at the same time. So if you have something technical to add that helps investigate and fix this problem, you are welcome. This is directed not only at you but at everyone.
Sorry if this is a little offensive, but I don’t know how to put it more softly.

littleskunk · July 18, 2024, 9:55am

Please file a GitHub bug for that. I know that on most of your previous bug reports you haven’t received a lot of feedback yet. I am going to address that internally. In the meantime don’t give up and keep the bug reports coming.

Mad_Max · July 18, 2024, 10:33am

I can do it a bit later. But is this really a bug to report, and not intentional behavior?
I may be wrong, but once upon a time (last year, it seems to me) I read a discussion on GitHub or here on the forum saying that saving canceled uploads was left in intentionally and was (then) necessary for something. I don’t remember the details now; probably just another “precautionary measure” in case something goes wrong. For example, the satellite may assume that the piece was successfully uploaded to the node even though the node itself considers the upload canceled or failed.
Because this becomes known for sure only after the satellite receives the corresponding order for the piece. And this happens with a delay (up to 1 hour at the default settings).

littleskunk · July 18, 2024, 11:34am

I am not sure. In general I prefer to have one bug report too many than a missing one. We can still close the bug report if it turns out to be intentional. I would also argue that one way or the other the behavior (store piece on disk) doesn’t match the log message. So if the behavior is correct we could still improve the log message.

Toyoo · July 18, 2024, 11:51am

Last time I checked the code, for failed or canceled uploads the code is written to explicitly remove these files immediately (just like what would happen in the old ~~trash~~ temp directory)… unless the node is explicitly killed, in which case the code has no chance to execute. The latter is not supposed to happen too often… And if you observe something different, then this is a bug in the code.
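
The pattern described here can be sketched in a few lines of Python. This is not the storagenode code (which is written in Go); `store_piece` and `UploadCanceled` are hypothetical names, and a canceled upload is simulated by an exception raised while the chunks are being read.

```python
import os
import tempfile


class UploadCanceled(Exception):
    """Raised by the chunk source when the upload is canceled (hypothetical)."""


def store_piece(chunks, final_path):
    """Write chunks to final_path; delete the partial file on failure or cancel."""
    committed = False
    f = open(final_path, "wb")
    try:
        for chunk in chunks:
            f.write(chunk)
        committed = True
    finally:
        f.close()
        if not committed:
            # Cleanup runs for any exception, but not if the process is
            # killed outright, since this code then never executes.
            os.remove(final_path)


if __name__ == "__main__":
    def canceled_upload():
        yield b"partial data"
        raise UploadCanceled()

    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "piece.sj1")
        try:
            store_piece(canceled_upload(), path)
        except UploadCanceled:
            pass
        print(os.path.exists(path))
```

If a leftover file shows up for an upload logged as canceled while the node kept running, the cleanup step is the place to look, since only a hard kill should be able to skip it.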