Suddenly no more ingress

Could you please share the logs for the period before and after?

Funny you should mention that.
Yesterday one of my nodes (which was previously full and holding a lot of trash) purged its trash data and so had free space again (about 2TB).
It wouldn’t start accepting ingress data until I restarted it (I waited about 30 minutes to see if it had to “tell” the satellites something).

Other nodes didn’t exhibit this behaviour after trash deletion.

No idea what happened.

Out of interest, was your offending node in similar circumstances to this one?

EDIT: Crap, I just noticed how old this thread was. Sorry!

The default check-in interval is 1h…
However, a restart should work too. The node will try to check in immediately.
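Roughly, that behaviour looks like the simplified sketch below (this is only an illustration, not the actual storagenode code): one check-in right at startup, then one per interval.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// checkInLoop mimics the behaviour described above: one check-in right at
// start-up (which is why a restart helps), then one per interval (1h by default).
// checkIn is a stand-in for the node's real contact chore.
func checkInLoop(ctx context.Context, interval time.Duration, checkIn func()) {
	checkIn() // immediate check-in on start
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			checkIn()
		}
	}
}

func main() {
	// Short interval and timeout so the sketch finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	checkInLoop(ctx, time.Second, func() {
		fmt.Println("check-in at", time.Now().Format(time.RFC3339))
	})
}
```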

I’m not sure if I should continue this thread or start a new topic, but my issue seems rather similar. Ingress just stopped, although I’m constantly online, there is plenty of space left on the disk and the internet speed is excellent. In fact, not only is ingress too low, but disk usage keeps falling. To make things worse, it didn’t happen on just one node, but on several nodes hosted at the same location.

I think there is no point analyzing all of them, as the picture is similar for each one and the problems started all together. The worst online time I see is 99.67%, so that can’t be the problem. As I remember, it used to be using around 5TB, increasing by about 20GB/hour. Then the increase stopped and usage started dropping. Now it’s using 4.3TB. For the first 13 hours of September, I only have 8.43GB of ingress. Success rates look good in my eyes…

========== AUDIT =============
Critically failed: 0
Critical Fail Rate: 0,00%
Recoverable failed: 0
Recoverable Fail Rate: 0,00%
Successful: 506
Success Rate: 100,00%
========== DOWNLOAD ==========
Failed: 20
Fail Rate: 0,14%
Canceled: 239
Cancel Rate: 1,72%
Successful: 13649
Success Rate: 98,14%
========== UPLOAD ============
Rejected: 0
Acceptance Rate: 100,00%
---------- accepted ----------
Failed: 141
Fail Rate: 0,70%
Canceled: 14
Cancel Rate: 0,07%
Successful: 20118
Success Rate: 99,24%
========== REPAIR DOWNLOAD ===
Failed: 0
Fail Rate: 0,00%
Canceled: 0
Cancel Rate: 0,00%
Successful: 26
Success Rate: 100,00%
========== REPAIR UPLOAD =====
Failed: 0
Fail Rate: 0,00%
Canceled: 0
Cancel Rate: 0,00%
Successful: 590
Success Rate: 100,00%

Same yesterday on one of my nodes. Free space was around 1 TB.
Several restarts did not help. The only thing that helped was increasing the assigned space; right after that it started accepting uploads again.

Can you see how fast it receives uploads? It seems to receive some uploads, but the download rate seems to be rather higher.

I don’t know if it’s somehow connected, but in the logs I have plenty of
WARN collector unable to delete piece “error”: “pieces error: filestore error: file does not exist”

How many per second or per minute, you mean?

My logs for this node aren’t good at the moment, as I am rejecting everything but 1.
But from the rate of rejections I can see something like 1-5 rejected uploads per second.

The test data uploads completed and ingress fell by 98% for me. Today, September 1st, after 14 hours I have about 32GB of ingress.

I mean, what’s the approximate ingress rate for your node? For example, for the current month, mine shows 10GB of ingress. Considering it’s the 1st of September and the time is approximately 15:30, that works out to 0.65GB/hour. That’s very low for no apparent reason, and on top of that, used space is actually decreasing.

Just to make things clear, vetting completed a month ago.

This node has only around 8GB of ingress today so far. But as I said, I have limited the number of uploads, so it is expected to be low.

This is exactly the real ingress from before the test. Maybe a little more than the past average. So… be happy!

This is normal. We are not running any testing on Salt Lake servers right now. It is also a holiday weekend in the US so customers are likely not in the office doing work either. Possibly data will pick up early next week.

Yup: this is 100% authentic home-grown grass-fed antibiotic-free free-range ingress… no artificial test data. Data that’s much more likely to stick around more than 30 days. The good stuff!

It just feels bad to watch that slow, consistent climb in used space get totally overwhelmed by expiring TTL data for more than a week now. But we’re getting quality over quantity!

This needs to be checked. The node reports its calculated allocated and used space to the satellites, and if it believes that it’s full, it will report so to the satellites and there will be no ingress.
So, please make sure that your node has free space according to your dashboard and that it matches the usage reported by the OS. If not, you need to enable the scan on startup if you disabled it (it’s enabled by default) and restart the node. When the used-space-filewalker finishes its scan for all trusted satellites, it should update the usage in the databases.
You may also search for the "less than allocated" message in your logs.
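If it helps, here is a small sketch (not an official tool) that scans a log file for that message. The log path is only an assumption for a typical Windows GUI install; adjust it to your setup.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	// Assumed path for a Windows GUI install; change it to wherever your logs go.
	const logPath = `C:\Program Files\Storj\Storage Node\storagenode.log`

	f, err := os.Open(logPath)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // storagenode log lines can be long
	for scanner.Scan() {
		if line := scanner.Text(); strings.Contains(line, "less than allocated") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```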

Well, the node is using the space of a 9TB vhdx. I have set the node to use 6TB for now.
Yesterday, it was using 3.75TB, with 2.07TB free and 173GB trash.
Today, it’s using 3.17TB, with 2.66TB free and 173GB trash.
The Windows VM reports 5.08TiB free out of 8.18TiB, 3.38TB used (which approximately matches what the node reports: 3.19TB + 0.17TB = 3.36TB).
I guess that rules out any available space problem.

Scan on startup is default, never touched it.

I searched for the word “allocated” in the logs and it found nothing.

However, what makes me suspicious is that I see a lot of lines like this:
collector unable to delete piece {“Satellite ID”: “1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE”, “Piece ID”: “HDMVDGZGL7YM6COOIDW2RVDIAFGKFZ5NV7S7U7XEHLB6OI6FNRPQ”, “error”: “pieces error: filestore error: file does not exist”, “errorVerbose”: “pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:124\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:340\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).DeleteWithStorageFormat:320\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteSkipV0:359\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:112\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:68\n\tstorj.io/common/sync2.(*Cycle).Run:99\n\tstorj.io/storj/storagenode/collector.(*Service).Run:64\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:87\n\truntime/pprof.Do:44\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:86\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78”}

I’m wondering if there is a chance that these files were deleted in the past, but it was never recorded that they were actually deleted? Because, as far as I can see, it never tries to delete the same file again. Although that doesn’t explain the non-existent ingress, as there is plenty of space.

By the way, they are all saltlake.tardigrade.io:7777.

Update:
I also found a lot of these, which are all us1.storj.io:7777:
2024-09-03T12:45:18+03:00 WARN pieces failed to migrate v0 piece. Piece may not be recoverable
2024-09-03T12:45:18+03:00 WARN retain failed to trash piece {“cachePath”: “C:\Program Files\Storj\Storage Node/retain”, “Satellite ID”: “12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S”, “Piece ID”: “DYJPKYB7UVBPZIV6E6YEDHPM534YFZ325H76LH6GIA6CV5OREZIA”, “error”: “pieces error: filestore error: file does not exist”, “errorVerbose”: “pieces error: filestore error: file does not exist\n\tstorj.io/storj/storagenode/blobstore/filestore.(*blobStore).Stat:124\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:340\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).TrashWithStorageFormat:406\n\tstorj.io/storj/storagenode/pieces.(*Store).Trash:422\n\tstorj.io/storj/storagenode/retain.(*Service).trash:428\n\tstorj.io/storj/storagenode/retain.(*Service).retainPieces.func1:387\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*TrashHandler).processTrashPiece:112\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*TrashHandler).writeLine:99\n\tstorj.io/storj/storagenode/pieces/lazyfilewalker.(*TrashHandler).Write:78\n\tio.copyBuffer:429\n\tio.Copy:386\n\tos/exec.(*Cmd).writerDescriptor.func1:560\n\tos/exec.(*Cmd).Start.func2:717”}

And then I saw entries like this:
INFO collector expired pieces collection completed {“count”: 6576}
The average count is approximately 7000 pieces. So, as I understand it, every hour the collector finds approximately 7000 expired pieces. Could that be normal?
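For anyone wanting to check their own node, a quick sketch like the one below can tally those collector runs from the log (the log path is just an assumption for a Windows GUI install; nothing official).

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"regexp"
	"strconv"
)

func main() {
	// Assumed log path; point it at your node's log file.
	const logPath = `C:\Program Files\Storj\Storage Node\storagenode.log`

	// Matches the "count" field of the collector's completion message.
	re := regexp.MustCompile(`expired pieces collection completed.*"count":\s*(\d+)`)

	f, err := os.Open(logPath)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var runs, total int
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		if m := re.FindStringSubmatch(scanner.Text()); m != nil {
			n, _ := strconv.Atoi(m[1])
			runs++
			total += n
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
	if runs > 0 {
		fmt.Printf("%d collector runs, %d expired pieces, avg %.0f per run\n",
			runs, total, float64(total)/float64(runs))
	}
}
```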

Hi Christos, I don’t think these log lines are related to the issue you’re facing. They typically appear when a piece has expired but was garbage collected earlier and therefore does not exist anymore. This is a common occurrence and shouldn’t be a cause for concern.

If the WARN comes from the TTL collector or from the retain process, then the piece was likely either already moved to the trash by the retain process, or the collector managed to delete the expired piece before the retain process could move it to the trash (which case applies depends on where you took this WARN from).

It would be a completely different story if you got it from GET_AUDIT or GET_REPAIR; in that case it would affect the audit score.

This seems to be the second case: the TTL collector was faster. In that case it can be ignored.
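Just to illustrate that harmless race with a toy sketch (this is not the storagenode code, only an illustration): the collector deletes the expired piece first, then retain tries to trash the same file and gets “file does not exist”.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	dir, err := os.MkdirTemp("", "pieces")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// A hypothetical piece file; the name and layout are illustrative only.
	piece := filepath.Join(dir, "HDMVDGZG.sj1")
	if err := os.WriteFile(piece, []byte("piece data"), 0o644); err != nil {
		panic(err)
	}

	// "TTL collector": the piece expired, so it is deleted outright.
	_ = os.Remove(piece)

	// "Retain" (garbage collection): tries to move the same piece to trash
	// and finds it already gone.
	err = os.Rename(piece, filepath.Join(dir, "trashed-"+filepath.Base(piece)))
	if errors.Is(err, fs.ErrNotExist) {
		fmt.Println("WARN retain failed to trash piece: file does not exist (harmless)")
	}
}
```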

So, @edo is correct, it’s not related to the issue.
Do you have a suspension score and online score higher than 60%?

Looks like the TTL collector removes the expired pieces.

Well, yes, the servers report online scores of 99.89, 99.56, 99.89 and 100.00.

So, are you actually saying that it’s normal that I had 5TB used and now just 2TB? I mean, it was filling up like crazy and then one fine day it was like, “ok, you know what? I was wrong, remove that data from there”. What’s more, I see it’s not just one node; at least 3 of my oldest ones were affected. Small nodes, so not so painful, but still. A good example: two full 1TB disks, where the first one all of a sudden has 220GB of usage and the second one 210GB free and 630GB trash (!).

The only thing I could think of is that maybe, by enabling my public (“white”) static IP and turning off the multiple IPs, the Storj system decided to relocate old data. But still, this doesn’t explain why my big disk was emptied, as it started working only after I got the new IP.