Two weeks working for free in the waste storage business :-(

I’m sure there’s another reason. Although what you wrote is also true, it does not apply in this particular case, because the last processed BF for SLC on this node was created from data dated BEFORE June 9 (more than a month ago), as can be seen from the logs above:

2024-07-07T14:37:09+03:00	INFO	retain	Prepared to run a Retain request.	{"cachePath": "C:\\Program Files\\Storj\\Storage Node/retain", "Created Before": "2024-06-09T20:59:59+03:00", "Filter Size": 6918625, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}

Note: that latest big 17 MB filter for GC arrived only yesterday, AFTER I noticed hundreds of thousands of erroneous requests by the TTL collector to delete files that no longer exist. So GC couldn’t remove anything before the TTL collector for SL, simply because there was no BF for SL for a long time.

So I think my first guess that the reason (or one of the reasons) is that the records of deleted TTL pieces are not being deleted properly from the database is correct.
This leads to massive repeated attempts to delete previously deleted files again.
To check this, today I opened and dumped the piece_expiration.db database (by the way, thanks for your sqlite3 tutorial for DB repair, I just slightly modified it) and found several million already-expired records in it, as I was actually expecting.

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE piece_expirations (
						satellite_id       BLOB      NOT NULL,
						piece_id           BLOB      NOT NULL,
						piece_expiration   TIMESTAMP NOT NULL, -- date when it can be deleted
						deletion_failed_at TIMESTAMP, trash INTEGER NOT NULL DEFAULT 0,
						PRIMARY KEY (satellite_id, piece_id)
					);
INSERT INTO piece_expirations VALUES(X'af2c42003efc826ab4361f73f9d890942146fe0ebe806786f8e7190800000000',X'5cffe70a5ff8af68da766af8e885a0344286fca4e8424bf3e80a8fc3891d7363','9999-12-31 22:59:59+00:00',NULL,0);
INSERT INTO piece_expirations VALUES(X'af2c42003efc826ab4361f73f9d890942146fe0ebe806786f8e7190800000000',X'95d2fc24afa6c80ffea95ee846a4fbe4705e9f95d03c914ca5e44ee57a157aad','9999-12-31 22:59:59+00:00',NULL,0);
----------cut-------------------- (about 20k records with 9999-12-31 timestamps skipped)
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'83e28e948275d44dcd244e405828b412df6ebc082ed275d064d0a22532b595a4','2024-07-06 16:12:35.131048+00:00',NULL,1);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'eacbb2c6c516c41e867aa7584b3eb0e646090a03eb898515d6aa9dc8ef1e606e','2024-07-06 16:13:03.296976+00:00',NULL,1);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'3f381d23408a5e7f30ae15323642b08c702bd2f7888a8129b0ee1855aff9913d','2024-07-06 16:21:42.895462+00:00',NULL,0);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'a56315fcbaca89e6643dbcb4eecc6019e444f1146ad67476b0f0ddc9bb37770d','2024-07-06 16:22:31.145661+00:00',NULL,0);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'2adce981ff04c24da26a1b4be657a0d24b71a14f103201029a073daa48da296e','2024-07-06 16:22:34.327309+00:00',NULL,0);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'114db2673253d86fee8a16a96258b2c36cbee712659fbc34af22f9202de4f736','2024-07-06 16:22:38.648136+00:00',NULL,1);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'4912653437b098af37d2c0b2840d0cc6c4057569bb99c095cb244c49a14a7afc','2024-07-06 16:23:57.52352+00:00',NULL,1);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'e2a7159229ccd77f6c4abe84e8346ef52b1be8cb2491624534a03c95121575c1','2024-07-06 16:24:30.421535+00:00',NULL,1);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'589923128f48d07bd96490696bf6f167f17bf892415e8618389e3a5b50ffc9a5','2024-07-06 16:25:18.656935+00:00',NULL,1);
---------cut----------------- (several MILLION records with TTL timestamps earlier than the time of the DB dump)
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'cd43e089023a35970680edb6aa5c9737fa4fc242f2597e8ac8ba318407dc7cdb','2024-07-10 23:42:46.360634+00:00',NULL,1);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'07fe5ac4aac528e7a746b4bdc7b3219b0342eb2d080468b518e0c8790a2d29fe','2024-07-10 23:42:50.80005+00:00',NULL,0);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'7c47f5a47a711d9c7e884080804b20c98510079eb8c62c1756f861835d863d57','2024-07-10 23:42:57.35328+00:00',NULL,0);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'12a1e0b36f9f4f37e67f8f3e8774e6cb3c864bd84254a9b60a31ba921f1417f0','2024-07-10 23:43:06.684222+00:00',NULL,1);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'396467b3fe9472e0c05f411dc7f2ce29f91435b08d60e992e4caa756bbb1116c','2024-07-10 23:43:08.859516+00:00',NULL,1);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'ae6954d11b3807ea1061838f397eb8ce75ab8b1328ad649056d78539528f10fa','2024-07-10 23:43:10.61353+00:00',NULL,1);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'54664b72169fdaa7b057c4c92b216b3b8c41621a474ffab7b9ca3bae328f46c1','2024-07-10 23:43:19.732135+00:00',NULL,1);
INSERT INTO piece_expirations VALUES(X'a28b4f04e10bae85d67f4c6cb82bf8d4c0f0f47a8ea72627524deb6ec0000000',X'18225b0412709785a5db89edc31a746f57a5615c6c50e0b40831dae581a824b7','2024-07-10 23:43:23.135277+00:00',NULL,1);
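
To get a quick count of such stale rows, a check along these lines can be used (a minimal sketch, not part of the node; the database path is a placeholder to adjust to your installation):

import sqlite3

# Count rows whose TTL has already passed but which are still present in
# piece_expiration.db (table layout as in the dump above).
db_path = r"C:\path\to\storage\piece_expiration.db"  # placeholder: adjust to your node
con = sqlite3.connect(db_path)
(expired,) = con.execute(
    "SELECT COUNT(*) FROM piece_expirations "
    "WHERE datetime(piece_expiration) < datetime('now')"
).fetchone()
print(f"records already past their TTL: {expired}")
con.close()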


My nodes received the BF for SLC too, but the date is more recent

2024-07-09T19:15:29Z    INFO    retain  Prepared to run a Retain request.       {"Process": "storagenode", "cachePath": "config/retain", "Created Before": "2024-06-30T14:28:54Z", "Filter Size": 6496965, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}

Likely it’s the other way around: you likely have had WARNs from the collector (which is the TTL collector); the GC ones are called gc-filewalker and retain.

However, since you received a BF for the older date, perhaps you are correct and the TTL collector could be faster.
My nodes have had dozens of WARNs related only to the collector, not gc-filewalker or retain. Well, they actually had one such WARN too, but not dozens as for the collector.

Yes, it’s a known bug and should be fixed in 1.108.x as far as I know.

And I added your comment about piece_expiration.db to the issue.

No.
As I wrote a few days ago, my node also already received them back in June, processed them, and deleted them (I checked: the GC was not running at that time, and the /retain/ folder was empty). But in July, for some reason, it received them again from the corresponding satellites:

2024-06-14T10:30:02+03:00	INFO	retain	Prepared to run a Retain request.	{"cachePath": "C:\\Program Files\\Storj\\Storage Node/retain", "Created Before": "2024-06-09T17:59:59Z", "Filter Size": 6918625, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-06-20T01:17:29+03:00	INFO	retain	Prepared to run a Retain request.	{"cachePath": "C:\\Program Files\\Storj\\Storage Node/retain", "Created Before": "2024-06-13T17:59:59Z", "Filter Size": 8959048, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-06-25T09:52:00+03:00	INFO	retain	Moved pieces to trash during retain	{"cachePath": "C:\\Program Files\\Storj\\Storage Node/retain", "Deleted pieces": 8439223, "Failed to delete": 1, "Pieces failed to read": 0, "Pieces count": 36878846, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Duration": "263h21m57.4555898s", "Retain Status": "enabled"}
2024-06-30T09:39:55+03:00	INFO	retain	Moved pieces to trash during retain	{"cachePath": "C:\\Program Files\\Storj\\Storage Node/retain", "Deleted pieces": 1196785, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 18936780, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "248h22m25.9166617s", "Retain Status": "enabled"}
2024-07-07T12:51:07+03:00	INFO	retain	Prepared to run a Retain request.	{"cachePath": "C:\\Program Files\\Storj\\Storage Node/retain", "Created Before": "2024-06-09T20:59:59+03:00", "Filter Size": 6918625, "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-07-07T12:51:07+03:00	INFO	retain	Prepared to run a Retain request.	{"cachePath": "C:\\Program Files\\Storj\\Storage Node/retain", "Created Before": "2024-06-13T20:59:59+03:00", "Filter Size": 8959048, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S"}
2024-07-08T22:12:16+03:00	INFO	retain	Moved pieces to trash during retain	{"cachePath": "C:\\Program Files\\Storj\\Storage Node/retain", "Deleted pieces": 0, "Failed to delete": 0, "Pieces failed to read": 0, "Pieces count": 14100792, "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Duration": "31h35m6.8568286s", "Retain Status": "enabled"}

Note 1: The time of the filters differs slightly (by 3 hours), but this is simply due to a format change, from UTC to local time.
Apparently this change was made in version 105, because the first time these filters were processed was on version 104, and the second time on version 105.
In fact it is the same time, and the size of the filters is also the same down to the byte.

Note 2: Please note that not a single file was deleted during the re-processing ("Deleted pieces": 0), since all of them (1,196,785 files) had already been deleted in June ("Deleted pieces": 1196785).
In July, GC just scanned everything again with the same filter and did not find anything matching it.
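
To spot such resent filters automatically, one could group the "Prepared to run a Retain request" lines by satellite, normalized creation time and filter size, for example (a rough sketch assuming the tab-separated log format shown above; the log path is a placeholder):

import json
from collections import Counter
from datetime import datetime, timezone

def normalized_key(fields):
    # Normalize "Created Before" to UTC, since newer versions log it in local
    # time instead of UTC (see Note 1 above).
    created = datetime.fromisoformat(fields["Created Before"].replace("Z", "+00:00"))
    return (fields["Satellite ID"], created.astimezone(timezone.utc), fields["Filter Size"])

seen = Counter()
with open("storagenode.log", encoding="utf-8") as log:  # placeholder path
    for line in log:
        if "Prepared to run a Retain request" in line:
            seen[normalized_key(json.loads(line.split("\t")[-1]))] += 1

for (sat, created, size), count in seen.items():
    if count > 1:
        print(f"{sat}: filter created before {created} ({size} bytes) received {count} times")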

You’re not reading carefully enough. I didn’t have any warnings from the collector in the main storagenode.log.
Above, I posted the file system log (recorded by OS I/O monitoring), which shows that the storagenode.exe process makes massive attempts to delete non-existent files. The collector itself did not issue any warnings to the log at that time; it just silently continued trying to delete non-existent files.

Yes, it’s a known bug and should be fixed in 1.108.x as far as I know.

That’s another Issue.

I see, I interpreted your words as meaning you had warnings about non-existent files in the storagenode’s logs; I’m sorry.

This is why I added your comment to the same issue related to the not-updated databases: it seems only the used-space databases were fixed there, but not the TTL database.

This is weird; for some reason my nodes didn’t receive any duplicates. So maybe the sending job was killed as explained there:

and if it’s restarted, I guess it may resend BFs which were not properly marked as sent.

Is the Uncollected Garbage reported by earnings.py accurate? My node ran out of space, but the script says that 8TB are uncollected garbage. If that is correct, then I’ll just wait for it to be collected. If it’s incorrect and most of the 8TB is real data, then I have to consider expanding it. Hard drives are expensive, so I would not want to buy a bunch of drives and then discover that the node had 8TB of free space.

7 - It doesn’t make any sense to me.
Nodes may indeed not be able to process a BF in time. But that happens due to slow disk I/O, since each GC pass requires reading the properties of, and moving, millions of files.

But the number of files processed DOES NOT DEPEND on the size of the BF. Regardless of the size of the BF, the GC goes through ALL the files stored for this satellite.
The only difference is that a bigger BF has a lower false-positive rate and will therefore delete a slightly larger share of the files it checks.
But this is a relatively small difference.
Let’s say a 10 MB BF checks 30 million files and deletes (moves to /trash/) 3 million of them: a total of 33 million file operations.
A 30 MB BF scans the same 30 million files and deletes 5 million of them thanks to more complete garbage collection: a total of 35 million file operations.
The difference in terms of disk I/O is less than 10%. And those additional disk operations will still need to be performed sooner or later; with a small BF it just happens later, after a few more filters.
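
A quick sanity check of this arithmetic (using the illustrative numbers above, not measurements):

pieces_scanned = 30_000_000                    # every stored piece is checked either way
small_bf_trashed = 3_000_000                   # 10 MB filter: higher false-positive rate, keeps more garbage
large_bf_trashed = 5_000_000                   # 30 MB filter: catches more of the garbage in one pass

small_ops = pieces_scanned + small_bf_trashed  # ~33 million stat + move-to-trash operations
large_ops = pieces_scanned + large_bf_trashed  # ~35 million
print(f"extra disk operations with the larger filter: {(large_ops - small_ops) / small_ops:.1%}")  # ~6.1%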

The big difference with a large BF lies in significantly greater use of RAM and CPU. But this is not a problem for the vast majority of nodes. I do not see any significant load on memory or CPU from the GC run, even on my oldest node (with an AMD Phenom II processor produced about 15 years ago). With “lazy” mode enabled it is very easy to check/track, as it runs in a separate process. The bottleneck is always disk I/O, even though this node uses a new server-class HDD dedicated to Storj and an SSD for DBs + orders + logs. It’s just that hard drives are slow; their performance has hardly changed for several decades now, and the specific performance (performance divided by capacity) is constantly decreasing.

So it looks like the bottleneck for larger BFs is memory and processing power only on your side, when generating larger filters, not on the nodes.

In fact, you admit this implicitly in paragraphs 5 and 6, describing the (real!) difficulties of scaling up to larger BFs on the satellite side. And then you try to deny it in paragraph 7, pushing the responsibility for not having larger BFs onto the nodes.


Yes, I agree, it looks very plausible.
This can easily explain both the very fact that exactly the same filters were sent again and why only some nodes received such repetitions!

It’s based on the latest report from the satellites; if they all sent a report, then it is highly likely that this information is correct, at least for the next 12 hours.

      --nodestats.storage-sync duration                          how often to sync storage (default 12h0m0s)

However, the difference is calculated from the usage information from the databases.
So, if the databases aren’t updated, the difference would be incorrect, too.
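
Roughly speaking (a sketch of the idea, not the script’s actual code), the figure comes out of a difference like this, using the numbers from the earnings.py output quoted further down in this thread:

node_blobs_tb = 47.78               # blob usage according to the node's own databases
satellite_last_report_tb = 39.76    # latest storage report received from the satellites

uncollected_garbage_tb = node_blobs_tb - satellite_last_report_tb
print(f"uncollected garbage estimate: {uncollected_garbage_tb:.2f} TB")  # 8.02 TB
# If the local databases are stale (e.g. TTL deletes not reflected in used space),
# the node-side number is inflated and so is this difference.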

Roughly at the end of last month there was a purge of apparently unused accounts on the EU sat (which is probably still ongoing), and I believe we have yet to receive a BF for that. Then there were some smaller deletes on the US sat as well.
Not sure how big that node is, but on some of the bigger ones I see a delta between the satellite and the local usage of up to 5 TB.
So I would wait with upgrades until the dust settles.


No. The earnings.py script displays the numbers that the storagenode hands out. The storagenode version you are running is most likely deleting TTL pieces without decreasing used space. You can check df output to see if your drive is as full as the storagenode claims.

The solution for this would be in version 107, but we had to stop the rollout because it crashes some nodes. The fix for that has been cherry-picked, and the rollout should continue shortly with v1.108.3.

In the meantime, to unblock your storagenode you can run the used-space filewalker. That would correct the numbers but might take a long time to finish. Also, your node will still not update used space on TTL deletes, and therefore you might have to repeat this once per week or so.
The better workaround might be to just increase the allocated size by the amount df shows as free space. You would still have to correct that once per week, but it takes only a few seconds.
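
For the df-style check mentioned above, something like this compares the file system’s view with the node’s claim (a sketch; the mount point and the dashboard value are placeholders):

import shutil

storage_path = "/mnt/storj/storage"  # placeholder: mount point of the node's data
node_claimed_used_tb = 50.0          # placeholder: "used" value shown by the node dashboard

usage = shutil.disk_usage(storage_path)
print(f"file system used: {usage.used / 1e12:.2f} TB, node claims: {node_claimed_used_tb:.2f} TB")
print(f"free space you could temporarily add to the allocation: {usage.free / 1e12:.2f} TB")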

After your node has been updated to v1.108.3 you would have to run the used space filewalker once to correct the numbers. And if you did increase the allocated size you would have to reduce that to the old value again.


This should be added to the release notes and there:

We had a similar situation a few releases ago. It didn’t help. The information about the trash cleanup bug was quickly forgotten. The problem is that the community starts to reinforce itself. I understand that running a storagenode can be frustrating and people need to vent. I am just observing the negative side effects of that. If the venting part gets too strong, then people will find it harder and harder to figure out which information is correct.


Node says 50 TB used, df says 48 TB; the difference is likely due to different types of TB (decimal vs binary).
earnings.py output looks like this:

July 2024 (Version: 14.0.0)                                             [snapshot: 2024-07-11 10:04:42Z]
REPORTED BY     TYPE      METRIC                PRICE                     DISK  BANDWIDTH        PAYOUT
Node            Ingress   Upload                -not paid-                       10.12 TB
Node            Ingress   Upload Repair         -not paid-                       77.28 GB
Node            Egress    Download              $  2.00 / TB (avg)              327.35 GB       $  0.65
Node            Egress    Download Repair       $  2.00 / TB (avg)              272.08 GB       $  0.54
Node            Egress    Download Audit        $  2.00 / TB (avg)               96.78 MB       $  0.00
Node            Storage   Disk Current Total    -not paid-            49.56 TB
Node            Storage              ├ Blobs    -not paid-            47.78 TB
Node            Storage              └ Trash  ┐ -not paid-             1.78 TB
Node+Sat. Calc. Storage   Uncollected Garbage ┤ -not paid-             8.02 TB
Node+Sat. Calc. Storage   Total Unpaid Data <─┘ -not paid-             9.80 TB
Satellite       Storage   Disk Last Report      -not paid-            39.76 TB
Satellite       Storage   Disk Average So Far   -not paid-            41.32 TB
Satellite       Storage   Disk Usage Month      $  1.49 / TBm (avg)   14.07 TBm                 $ 20.97

According to it, there is 1.78 TB in the trash (so that space will be freed in less than a week), but also 8 TB as “uncollected garbage”.


I have some of these entries from the GC of Saltlake:

2024-07-11T12:18:57+02:00 WARN pieces failed to migrate v0 piece. Piece may not be recoverable {"Process": "storagenode"}
2024-07-11T12:18:57+02:00 WARN retain failed to trash piece {"Process": "storagenode", "cachePath": "config/retain", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Piece ID": "UCPFUPOC7CVV5TVOOUEMOOQE3DNYORIGJHQ4VK2RCJ5RL2F3EQRA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:122\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:248\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).trash:293\n\tstorj.io/storj/storagenode/pieces.(*Store).trash:411\n\tstorj.io/storj/storagenode/retain.(*Service).trash:422\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces.func1:381\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*trashHandler).processTrashPiece:111\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*trashHandler).writeLine:98\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*trashHandler).Write:77\n\tio.copyBuffer:432\n\tio.Copy:389\n\tos/exec.(*Cmd).writerDescriptor.func1:560\n\tos/exec.(*Cmd).Start.func2:717"}

Is it possible that these are files that have already been deleted by TTL? Do I have to worry about audits?

Ah… I thought there was something fishy with the counts of 1.107 since yesterday. I guess I’ll wait to see when 108 shows up…

Will this bootstrap buffer file be on the data storage HDD, with the DBs by default, or in a configurable place? If it is with the data, then there will be problems writing it, just as the databases have when they are on the data HDD.

I am sorry, you lost me. There is no bootstrap file.


What will this thing do then?

  • 2fceb6c storagenode/blobstore: blobstore with caching file stat information (mod time, size)