Updates on Test Data

The last few days look like the normal 75%-crazy: nothing special.

1 Like

Two nodes were updated to version 1.107.3. After the update, many records like this appeared in the logs:

2024-07-02T23:11:27Z WARN collector unable to delete piece {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Piece ID": "DA3NATNM5X6BWSRR6XUSOOB4AIP745ERCBSPV5VDVZXXF5G2TFRA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:122\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:270\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).DeleteWithStorageFormat:250\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteSkipV0:355\n\tstorj.io/storj/storagenode/collector.(*Service).Collect.func1:88\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:71\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpired:575\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:87\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:65\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:61\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

Why did they appear? Is this expected?
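For context, the stack trace points at the TTL collector: it reads expired piece IDs out of piece_expiration.db and then stats and deletes the matching blob files. Roughly, the flow looks like this simplified Go sketch (illustrative only; the helper names and the blob path layout are assumptions, not the actual storj.io/storj code):

```go
package main

import (
	"errors"
	"io/fs"
	"log"
	"os"
	"path/filepath"
	"time"
)

// expiredPiece stands in for a row read from piece_expiration.db.
type expiredPiece struct {
	satelliteID string
	pieceID     string
}

// getExpired is a placeholder for the pieceExpirationDB.GetExpired step in the
// trace: it would return every piece whose TTL passed before `now`.
func getExpired(now time.Time) []expiredPiece {
	return nil // illustrative only
}

// collect mirrors the collector.Collect step in the trace: for each expired
// piece it stats and removes the blob file. If the file is already gone (for
// example because garbage collection got to it earlier), it only logs a
// warning and moves on -- that is the "file does not exist" WARN above.
func collect(blobsDir string, now time.Time) {
	count := 0
	for _, p := range getExpired(now) {
		path := filepath.Join(blobsDir, p.satelliteID, p.pieceID) // layout simplified
		if _, err := os.Stat(path); errors.Is(err, fs.ErrNotExist) {
			log.Printf("WARN collector unable to delete piece %q: file does not exist", p.pieceID)
			continue
		}
		if err := os.Remove(path); err != nil {
			log.Printf("WARN collector failed to delete piece %q: %v", p.pieceID, err)
			continue
		}
		count++
	}
	log.Printf("INFO collector collect count=%d", count)
}

func main() {
	collect("/path/to/storage/blobs", time.Now())
}
```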

Before the update there were no errors (log grepped for "collect"):

2024-07-02T09:43:15Z INFO collector collect {"Process": "storagenode", "count": 28}
2024-07-02T10:43:14Z INFO collector collect {"Process": "storagenode", "count": 18}
2024-07-02T11:43:44Z INFO collector collect {"Process": "storagenode", "count": 727}
2024-07-02T12:43:43Z INFO collector collect {"Process": "storagenode", "count": 1027}
2024-07-02T13:43:22Z INFO collector collect {"Process": "storagenode", "count": 304}
2024-07-02T14:43:55Z INFO collector collect {"Process": "storagenode", "count": 1356}
2024-07-02T15:43:42Z INFO collector collect {"Process": "storagenode", "count": 696}
2024-07-02T16:43:47Z INFO collector collect {"Process": "storagenode", "count": 1075}
2024-07-02T17:43:38Z INFO collector collect {"Process": "storagenode", "count": 970}
2024-07-02T18:44:12Z INFO collector collect {"Process": "storagenode", "count": 2015}
2024-07-02T19:43:43Z INFO collector collect {"Process": "storagenode", "count": 1016}
2024-07-02T20:43:14Z INFO collector collect {"Process": "storagenode", "count": 13}
2024-07-02T21:43:14Z INFO collector collect {"Process": "storagenode", "count": 23}
2024-07-02T22:43:15Z INFO collector collect {"Process": "storagenode", "count": 27}

On a full 16 TB node, the database migration from version 1.105.? to 1.107.3 took over 70 minutes.
It runs on an Intel Xeon, the databases are on a cheap 512 GB SATA SSD, and piece_expiration.db was 17 GB before the migration.
This should be deployed with caution, as it will probably take much longer on nodes that keep their databases on HDDs.

2024-07-03T01:06:19+02:00       INFO    db.migration.58 Remove unused trash column      {"Process": "storagenode"}
2024-07-03T01:26:27+02:00       INFO    db.migration.59 Remove unused deletion_failed_at column {"Process": "storagenode"}
2024-07-03T01:44:42+02:00       INFO    db.migration.60 Overhaul piece_expirations      {"Process": "storagenode"}
2024-07-03T02:18:45+02:00       INFO    db.migration    Database Version        {"Process": "storagenode", "version": 60}
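Diffing consecutive timestamps in that log gives a rough per-step breakdown (about 20, 18, and 34 minutes, a bit over 72 minutes in total), though which migration a gap belongs to depends on whether each line is written at the start or the end of a step. A throwaway Go sketch of that arithmetic, using only the timestamps shown above:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps copied from the migration log above (local time, UTC+02:00).
	lines := []struct {
		step string
		at   string
	}{
		{"db.migration.58 Remove unused trash column", "2024-07-03T01:06:19+02:00"},
		{"db.migration.59 Remove unused deletion_failed_at column", "2024-07-03T01:26:27+02:00"},
		{"db.migration.60 Overhaul piece_expirations", "2024-07-03T01:44:42+02:00"},
		{"db.migration Database Version 60", "2024-07-03T02:18:45+02:00"},
	}

	var total time.Duration
	for i := 1; i < len(lines); i++ {
		prev, _ := time.Parse(time.RFC3339, lines[i-1].at)
		cur, _ := time.Parse(time.RFC3339, lines[i].at)
		gap := cur.Sub(prev)
		total += gap
		fmt.Printf("%v  between %q and %q\n", gap, lines[i-1].step, lines[i].step)
	}
	fmt.Printf("%v  total\n", total) // a bit over 72 minutes
}
```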

I'm also seeing a significant number of "pieces error: filestore error: file does not exist" messages while the trash-cleanup-filewalker is running on the node.

4 Likes

70 minutes is quite long. My piece expiration DB was around 4 GB and the migration took around 5 minutes with a 12600K, also on an old SATA SSD.

Maybe, but it could have been regular downloads. It shouldn't cause your node to faceplant; that can happen if the OOM killer nukes it due to slow storage.

FWIW, I also see a massive amount of repair requests. Curious to know why.

I’m curious… what bandwidth does this potential customer have that can’t be satisfied by 22K nodes?

They are a home security company, so plenty of bandwidth, IMO.

And their name starts with V?

And ends with „t“? :wink:

Hopefully it does… :innocent:

1 Like

Just a guess, but this seems related to the fact that some of the initial space reservation data was overwritten due to insufficient randomness.

It would make sense that these pieces were already deleted by GC, as we experienced a couple of weeks ago, and now that the TTL defined in the db has been reached, it’s unable to find these previously deleted files.

If this is the case, the warnings should not be cause for concern.

On the other hand, if we start to experience elevated amounts of reservation data being deleted by GC instead of by TTL, it could mean that there is an underlying issue with the TTL collector and it’s not correctly deleting the files (because it can’t find them).
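One rough way to keep an eye on that is to count, per day, how many collector warnings mention a missing file versus how many collector passes ran. This is just an illustrative sketch, not an official tool; it scans a storagenode log on stdin and assumes the log format shown earlier in the thread:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

func main() {
	missing := map[string]int{}  // per-day "file does not exist" collector warnings
	collects := map[string]int{} // per-day "collector collect" summary lines

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // stack traces make for long lines
	for sc.Scan() {
		line := sc.Text()
		if len(line) < 10 || !strings.Contains(line, "collector") {
			continue
		}
		day := line[:10] // e.g. "2024-07-02"
		switch {
		case strings.Contains(line, "file does not exist"):
			missing[day]++
		case strings.Contains(line, `"count"`):
			collects[day]++
		}
	}

	days := make([]string, 0, len(missing))
	for day := range missing {
		days = append(days, day)
	}
	sort.Strings(days)
	for _, day := range days {
		fmt.Printf("%s  missing-file warnings: %d  collector runs: %d\n",
			day, missing[day], collects[day])
	}
}
```

Run it as `go run main.go < storagenode.log`; a warning count that keeps growing relative to the number of collector runs would support the theory above.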

3 Likes

Out of curiosity, I started a Saltlake-only node today. This seems to be what the new node selection algo likes; it runs like crazy. Not a big surprise. :wink:

If you want to see those, turn on debug logging. I see them, though I’m starting to think debug logging is a bad idea, as those log writes are now a significant part of the I/O load for GC and TTL deletes.
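For reference, debug logging is usually switched on via the log level setting, e.g. something like this in the node’s config.yaml (assuming a standard storagenode setup; the equivalent command-line flag is --log.level=debug, and it’s worth reverting to info afterwards given the extra I/O mentioned above):

```yaml
# config.yaml (assumed standard storagenode config) -- revert to "info" when done
log.level: debug
```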

1 Like

That is never a good idea. A test-data-only strategy sounds good short term but will backfire long term. The size of such a node is limited (by the TTL) and fully dependent on Storj itself. Sure, it sounds like we will get additional test data uploads for up to 3 months, but one way or the other I would expect this income source to be reduced and ideally replaced, if not outmatched, by real customer uploads. At that point you would be looking at the wrong satellite.

2 Likes

Once the deal is signed I will switch it to US1 only. :wink:

BTW:
Just noticed this:


2024-07-03T16:45:19+02:00       INFO    collector       collect {"Process": "storagenode", "count": 2}

I wonder if this is a bug or a TTL of only a few hours?

And you will have to wait for vetting at the same time as we get all the juicy customer data. Smart approach. :sweat_smile:
Believe me, I have thought about every strategy, including this one. This one is bad.
You will only see your node shrinking, and we will end up with the same occupancy.

But why are you actively trying to prevent the node from making you more money?

1 Like

Vetting takes a day or two atm, so I don’t care about it.

Actually, I am making more money this way; ingress is significantly higher compared to my other nodes.

Did you get to the bottom of it? Why would limiting sources of ingress increase it?

1 Like

Just wanted to provide a quick update. For the upcoming holiday / long weekend, we are going to pause the uploads for a few days. The repair queue has grown a bit, and we don’t want to be paging engineers over the holiday. We will re-evaluate on Monday, determine next steps from there, and let you know.

20 Likes