Release preparation v1.110

Do you maybe have a retain process running which moves pieces to the trash?

Nice to hear that I am not the only one waiting for this feature to be completed and deployed.

Last time I checked, there was no visible progress and no answer:

2 Likes

10 posts were split to a new topic: Pieces failed to migrate v0 piece. Piece may not be recoverable

@Alexey
I see the link to the work in progress has received 25 hits.
The need for this feature has been expressed by several SNOs now, but there has been no response and no visible progress.
Maybe you can get an answer and pass it on to those of us waiting?

2 Likes

Yes I notified the team.

1 Like

I’d even say this feature would be crucial and a gamechanger for a huge chunk of SNOs!

1 Like

Hello, I’d like to say that v1.110.3 for Windows is absolutely rough: it is attacking the last bit of free space on my SSD, reserved for logs, with “WARN collector” messages about pieces that cannot be deleted because they do not exist. I call it an attack because, just as I was about to go to sleep, I spotted that the node had gone offline: it had updated to 1.110.3 and within minutes flooded the log with gigabytes of data, which exceeded the available free capacity, so the node went offline. Please don’t roll that version out to my other Windows nodes. Thanks.

Edit: I resolved it just by manually rotating my log, which is set to rotate every other day.
But still, freaking annoying.

You can hide collector logs using something like log.custom-level: collector=FATAL, but you should probably have a log rotation policy based on file size anyway.
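For reference, a minimal sketch of both suggestions, assuming a Docker setup with the container name storagenode used elsewhere in this thread. The log.custom-level key is the one quoted above; the --log-opt flags are standard Docker json-file log rotation options and only apply if the node logs to Docker rather than to its own log file:

# config.yaml: silence the collector warnings (key quoted above)
log.custom-level: collector=FATAL

# cap Docker's own log size when starting the container
docker run ... \
    --log-opt max-size=100m \
    --log-opt max-file=5 \
    storjlabs/storagenode:latest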

2 Likes

Yes, if they used logrotate or a custom script which could detect that. Very often the script is set to rotate logs no more than once a week, and if they used this one:

then it doesn’t check the size at all; it is written to rotate the log whenever it is called.
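A size-based logrotate rule avoids that problem. A minimal sketch, with an example path and threshold (adjust both for your node, and note that logrotate itself still only runs as often as its cron/timer schedule calls it):

# /etc/logrotate.d/storagenode (example path and size)
/mnt/storj/node.log {
    size 100M
    rotate 10
    copytruncate
    compress
    missingok
    notifempty
}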

This has been happening since 1.108, if I remember correctly. I think I was the first one to report it. The solution is what the fellow member said: use a custom log level for the collector.
https://forum.storj.io/t/log-custom-level/25839/18?u=snorkel

1 Like

Yes, because even the non-lazy filewalker does not get to finish and keeps restarting unless turned off completely:

ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "error": "filewalker: context canceled",

So let’s see when there will be an answer about when this is going to be resolved.

1 Like

@Alexey
Is that context canceled error like a timeout or something?

Yes, it’s a timeout: the filewalker did not receive a response from the disk in time and was canceled. I do not think it would be restarted later, but you may check: do you have a started used-space-filewalker entry for that satellite later in the log?

docker logs storagenode 2>&1 | grep "\sused-space" | grep -E "started|completed"

Why is there even a timeout on this? Does it matter how long it takes to receive a response for the used-space filewalker? I don’t think so. IMHO waiting would be better than canceling and restarting the whole thing.

It seems it does, which makes the whole process worse:

2024-08-17T04:00:26Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-08-18T04:00:37Z    ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "error": "filewalker: context canceled"
2024-08-18T04:01:43Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}

From the logs it looks like it has been running for a day then got interrupted and restarted from the beginning. This seems terrible.

Yes, it is. Otherwise it may hang. A timeout must exist for any disk operation. We have had nodes disqualified because they were able to respond to audit requests but were unable to provide a piece, and since there was no timeout, the node started to fail every single audit after the partial hang.
So now we have readable/writable checks and have added timeouts to all operations.
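If you want to see or tune those checks, the storagenode config exposes interval and timeout options for them; as far as I know the keys look like the following, but verify the exact names and defaults in your own config.yaml, since they can change between versions:

# config.yaml (illustrative values; confirm key names and defaults for your version)
storage2.monitor.verify-dir-readable-interval: 1m30s
storage2.monitor.verify-dir-readable-timeout: 1m0s
storage2.monitor.verify-dir-writable-interval: 5m0s
storage2.monitor.verify-dir-writable-timeout: 1m0s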

In your case a resume likely wouldn’t help: if the filewalker is suffering from disk unresponsiveness, it would continue to fail. If even the non-lazy filewalker is unable to finish, you need to set the allocation below the usage shown on the dashboard (to stop any ingress) and restart the node to allow the filewalker to finish the scan. I would also suggest enabling Badger caching.
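A minimal sketch of those two changes in config.yaml; the allocation value is only an example (pick something below the usage your dashboard reports), and the Badger cache key is the one I believe current versions use, so double-check it against your own config before relying on it:

# config.yaml
# example only: if the dashboard shows ~7.4 TB used, allocate less to stop ingress
storage.allocated-disk-space: 7.0 TB

# enable the Badger file-stat cache (assumed key name, verify in your config), then restart the node
pieces.file-stat-cache: badger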

Maybe the timeout can be increased then?

Exactly, it restarts from scratch and makes things worse!!!

From my logs it seems that at 04:00 something happened that resulted in a context canceled error. I don’t know what it was, but maybe trash deletion kicked in or something; it looks like some timed, automated process. If it was something the node itself started, that would be even worse: such an action would cause context canceled errors and kill the long-running filewalker.
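One way to narrow that down (just a sketch, reusing the container name from the command earlier in the thread and the UTC timestamps from the excerpt above) is to grep the log for that time window and the usual suspects:

docker logs storagenode 2>&1 | grep "T04:0" | grep -Ei "retain|trash|collector|used-space"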

This is why we need the resume feature for this!!!

Badger cache is on; this is why the filewalker is in non-lazy mode in the first place. But it is on the same disk, so I don’t know how much it can help.

It’s better to fix the underlying issue. The Badger cache should help to speed up the filewalker: even if the filewalker fails, the next run will be faster and can scan further every time, and it should finally finish the scan successfully.

It’s on the same disk for this node too:

However, if you run a docker node, you can provide a different path for the file cache.
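As a sketch, and assuming the cache lives under storage/filestatcache inside the container (please verify that path on your own node first), you could bind-mount a directory on a faster disk over it when starting the container:

docker run ... \
    --mount type=bind,source=/mnt/ssd/storj-filestatcache,destination=/app/config/storage/filestatcache \
    storjlabs/storagenode:latest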

I agree. But for tasks like the filewalker maybe we don’t need such short timeouts. I don’t know the right value, but maybe longer timeouts would do and would help.

Let’s hope so.

That only works if you have a different disk; I currently don’t have one for this node.

I wouldn’t like to increase disk timeouts where they can be dangerous.

If it gets interrupted every 24 hours, then there must be something running on your system that restarts the node every 24 hours; otherwise you wouldn’t get such timing. One perfectly timed restart might be luck, a second one is already unlikely, and a third one is practically impossible without something like a cron job triggering it.
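A couple of quick checks (a sketch, assuming the Docker setup and container name used earlier in the thread) to see whether the container really is being restarted on a schedule:

# when did the container last start, and has Docker restarted it?
docker inspect -f '{{.State.StartedAt}} restarts={{.RestartCount}}' storagenode

# look for scheduled jobs that touch the node
crontab -l | grep -i storagenode
grep -ri storagenode /etc/cron* 2>/dev/null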