Updates on Test Data

The last few days look like the normal 75%-crazy: nothing special.

1 Like

Two nodes were updated to version 1.107.3. After the update, many records like this appeared in the logs:

2024-07-02T23:11:27Z WARN collector unable to delete piece {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Piece ID": "DA3NATNM5X6BWSRR6XUSOOB4AIP745ERCBSPV5VDVZXXF5G2TFRA", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:122\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:270\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).DeleteWithStorageFormat:250\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteSkipV0:355\n\tstorj.io/storj/storagenode/collector.(*Service).Collect.func1:88\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).GetExpired:71\n\tstorj.io/storj/storagenode/pieces.(*Store).GetExpired:575\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:87\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:65\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:61\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:51\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"}

Why did they appear? Is this expected?
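For context, the stack trace points at the TTL collector: it reads expired piece IDs out of piece_expiration.db and then stats and deletes the matching blob files. Roughly, the flow looks like this simplified Go sketch (illustrative only; the helper names and the blob path layout are assumptions, not the actual storj.io/storj code):

```go
package main

import (
	"errors"
	"io/fs"
	"log"
	"os"
	"path/filepath"
	"time"
)

// expiredPiece stands in for a row read from piece_expiration.db.
type expiredPiece struct {
	satelliteID string
	pieceID     string
}

// getExpired is a placeholder for the pieceExpirationDB.GetExpired step in the
// trace: it would return every piece whose TTL passed before `now`.
func getExpired(now time.Time) []expiredPiece {
	return nil // illustrative only
}

// collect mirrors the collector.Collect step in the trace: for each expired
// piece it stats and removes the blob file. If the file is already gone (for
// example because garbage collection got to it earlier), it only logs a
// warning and moves on -- that is the "file does not exist" WARN above.
func collect(blobsDir string, now time.Time) {
	count := 0
	for _, p := range getExpired(now) {
		path := filepath.Join(blobsDir, p.satelliteID, p.pieceID) // layout simplified
		if _, err := os.Stat(path); errors.Is(err, fs.ErrNotExist) {
			log.Printf("WARN collector unable to delete piece %q: file does not exist", p.pieceID)
			continue
		}
		if err := os.Remove(path); err != nil {
			log.Printf("WARN collector failed to delete piece %q: %v", p.pieceID, err)
			continue
		}
		count++
	}
	log.Printf("INFO collector collect count=%d", count)
}

func main() {
	collect("/path/to/storage/blobs", time.Now())
}
```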

Before the update there were no errors (log grepped for "collect"):

2024-07-02T09:43:15Z INFO collector collect {"Process": "storagenode", "count": 28}
2024-07-02T10:43:14Z INFO collector collect {"Process": "storagenode", "count": 18}
2024-07-02T11:43:44Z INFO collector collect {"Process": "storagenode", "count": 727}
2024-07-02T12:43:43Z INFO collector collect {"Process": "storagenode", "count": 1027}
2024-07-02T13:43:22Z INFO collector collect {"Process": "storagenode", "count": 304}
2024-07-02T14:43:55Z INFO collector collect {"Process": "storagenode", "count": 1356}
2024-07-02T15:43:42Z INFO collector collect {"Process": "storagenode", "count": 696}
2024-07-02T16:43:47Z INFO collector collect {"Process": "storagenode", "count": 1075}
2024-07-02T17:43:38Z INFO collector collect {"Process": "storagenode", "count": 970}
2024-07-02T18:44:12Z INFO collector collect {"Process": "storagenode", "count": 2015}
2024-07-02T19:43:43Z INFO collector collect {"Process": "storagenode", "count": 1016}
2024-07-02T20:43:14Z INFO collector collect {"Process": "storagenode", "count": 13}
2024-07-02T21:43:14Z INFO collector collect {"Process": "storagenode", "count": 23}
2024-07-02T22:43:15Z INFO collector collect {"Process": "storagenode", "count": 27}

On a full 16 TB node, the database migration from version 1.105.? to 1.107.3 took over 70 minutes.
It runs on an Intel Xeon, the databases are on a cheap 512 GB SATA SSD, and piece_expiration.db was 17 GB before the migration.
This should be deployed with caution, as it will probably take much longer on nodes that keep their databases on HDDs.

2024-07-03T01:06:19+02:00       INFO    db.migration.58 Remove unused trash column      {"Process": "storagenode"}
2024-07-03T01:26:27+02:00       INFO    db.migration.59 Remove unused deletion_failed_at column {"Process": "storagenode"}
2024-07-03T01:44:42+02:00       INFO    db.migration.60 Overhaul piece_expirations      {"Process": "storagenode"}
2024-07-03T02:18:45+02:00       INFO    db.migration    Database Version        {"Process": "storagenode", "version": 60}
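Diffing consecutive timestamps in that log gives a rough per-step breakdown (about 20, 18, and 34 minutes, a bit over 72 minutes in total), though which migration a gap belongs to depends on whether each line is written at the start or the end of a step. A throwaway Go sketch of that arithmetic, using only the timestamps shown above:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps copied from the migration log above (local time, UTC+02:00).
	lines := []struct {
		step string
		at   string
	}{
		{"db.migration.58 Remove unused trash column", "2024-07-03T01:06:19+02:00"},
		{"db.migration.59 Remove unused deletion_failed_at column", "2024-07-03T01:26:27+02:00"},
		{"db.migration.60 Overhaul piece_expirations", "2024-07-03T01:44:42+02:00"},
		{"db.migration Database Version 60", "2024-07-03T02:18:45+02:00"},
	}

	var total time.Duration
	for i := 1; i < len(lines); i++ {
		prev, _ := time.Parse(time.RFC3339, lines[i-1].at)
		cur, _ := time.Parse(time.RFC3339, lines[i].at)
		gap := cur.Sub(prev)
		total += gap
		fmt.Printf("%v  between %q and %q\n", gap, lines[i-1].step, lines[i].step)
	}
	fmt.Printf("%v  total\n", total) // a bit over 72 minutes
}
```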

I'm also seeing a significant number of "pieces error: filestore error: file does not exist" messages while the trash-cleanup-filewalker is running on the node.

4 Likes

70 minutes is quite long. My piece expiration DB was around 4 GB and the migration took around 5 minutes with a 12600K, also on an old SATA SSD.

Maybe, but it could have been regular downloads. It shouldn't cause your node to faceplant; that can happen if the OOM killer nukes it due to slow storage.

FWIW, I also see a massive amount of repair requests. Curious to know why.

I’m curious… what bandwidth does this potential customer have that can’t be satisfied by 22K nodes?

They are a home security company, so plenty of bandwidth, IMO.

And their name starts with V?

And ends with „t“? :wink:

Hopefully it does… :innocent:

1 Like

Just a guess, but this seems related to the fact that some of the initial space reservation data was overwritten due to insufficient randomness.

It would make sense that these pieces were already deleted by GC, as we experienced a couple of weeks ago, and now that the TTL defined in the db has been reached, it’s unable to find these previously deleted files.

If this is the case, the warnings should not be cause for concern.

On the other hand, if we start to experience elevated amounts of reservation data being deleted by GC instead of by TTL, it could mean that there is an underlying issue with the TTL collector and it’s not correctly deleting the files (because it can’t find them).
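One rough way to keep an eye on that is to count, per day, how many collector warnings mention a missing file versus how many collector passes ran. This is just an illustrative sketch, not an official tool; it scans a storagenode log on stdin and assumes the log format shown earlier in the thread:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

func main() {
	missing := map[string]int{}  // per-day "file does not exist" collector warnings
	collects := map[string]int{} // per-day "collector collect" summary lines

	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // stack traces make for long lines
	for sc.Scan() {
		line := sc.Text()
		if len(line) < 10 || !strings.Contains(line, "collector") {
			continue
		}
		day := line[:10] // e.g. "2024-07-02"
		switch {
		case strings.Contains(line, "file does not exist"):
			missing[day]++
		case strings.Contains(line, `"count"`):
			collects[day]++
		}
	}

	days := make([]string, 0, len(missing))
	for day := range missing {
		days = append(days, day)
	}
	sort.Strings(days)
	for _, day := range days {
		fmt.Printf("%s  missing-file warnings: %d  collector runs: %d\n",
			day, missing[day], collects[day])
	}
}
```

Run it as `go run main.go < storagenode.log`; a warning count that keeps growing relative to the number of collector runs would support the theory above.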

3 Likes

Out of curiosity, I started a Saltlake-only node today. This seems to be what the new node selection algo likes; it runs like crazy. Not a big surprise. :wink:

If you want to see those, turn on debug logging. I see them, though I’m starting to think debug logging is a bad idea, as those log writes are now a significant part of the I/O load for GC and TTL deletes.
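For reference, debug logging is usually switched on via the log level setting, e.g. something like this in the node’s config.yaml (assuming a standard storagenode setup; the equivalent command-line flag is --log.level=debug, and it’s worth reverting to info afterwards given the extra I/O mentioned above):

```yaml
# config.yaml (assumed standard storagenode config) -- revert to "info" when done
log.level: debug
```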

1 Like

That is never a good idea. A test-data-only strategy sounds good short term but will backfire long term. The size of such a node is limited (by the TTL) and fully dependent on Storj itself. Sure, it sounds like we will get additional test data uploads for up to 3 months, but one way or the other I would expect this income source to be reduced and ideally replaced, if not outmatched, by real customer uploads. At that point you would be looking at the wrong satellite.

2 Likes

Once the deal is signed I will switch it to US1 only. :wink:

BTW:
Just noticed this:


2024-07-03T16:45:19+02:00       INFO    collector       collect {"Process": "storagenode", "count": 2}

I wonder if this is a bug or a TTL of only a few hours?

And you will have to wait for vetting at the same time as we get all the juicy customer data. Smart approach. :sweat_smile:
Believe me, I have thought about every strategy, including this one. This one is bad.
You will only see your node shrinking, and we will end up with the same occupancy.

But why are you actively trying to prevent the node from making you more money?

1 Like

Vetting takes a day or two atm, so I don’t care about it.

Actually, I am making more money this way; ingress is significantly higher compared to my other nodes.

Did you get to the bottom of it? Why would limiting sources of ingress increase it?

1 Like

Just wanted to provide a quick update. For the upcoming holiday / long weekend, we are going to pause the uploads for a few days. The repair queue has grown a bit, and we don’t want to be paging engineers over the holiday. We will re-evaluate on Monday, determine next steps from there, and let you know.

20 Likes