Release preparation v1.110

Do you maybe have a retain process running which moves pieces to the trash?

Nice to hear that I am not the only one waiting for this feature to be completed and deployed.

Last time I checked, there was no visible progress and no answer:

2 Likes

10 posts were split to a new topic: Pieces failed to migrate v0 piece. Piece may not be recoverable

@Alexey
I see the link to the work in progress has received 25 hits.
The need for this feature has been expressed by several SNOs now, but there has been no response and no visible progress.
Maybe you can get an answer and pass it on to those of us waiting?

2 Likes

Yes I notified the team.

1 Like

I’d even say this feature would be crucial and a gamechanger for a huge chunk of SNOs!

1 Like

Hello, I’d like to say that v1.110.3 for Windows is absolutely rough: it is attacking the last bit of free space on my SSD, reserved for logs, with “WARN collector” messages about pieces that cannot be deleted because they do not exist. I call it an attack because, just as I was about to go to sleep, I spotted that the node had gone offline: it had updated to 1.110.3 and within minutes flooded the log with gigabytes of data, which exceeded the available free capacity, so the node went offline. Please don’t roll that version out to my other Windows nodes. Thanks.

Edit: I resolved it just by manually rotating my log, which is set to rotate every other day.
But still, freaking annoying.

You can hide collector logs using something like log.custom-level: collector=FATAL, but you should probably have a log rotation policy based on file size anyway.
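For reference, a minimal sketch of both suggestions, assuming a Docker setup with the container name storagenode used elsewhere in this thread. The log.custom-level key is the one quoted above; the --log-opt flags are standard Docker json-file log rotation options and only apply if the node logs to Docker rather than to its own log file:

# config.yaml: silence the collector warnings (key quoted above)
log.custom-level: collector=FATAL

# cap Docker's own log size when starting the container
docker run ... \
    --log-opt max-size=100m \
    --log-opt max-file=5 \
    storjlabs/storagenode:latest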

2 Likes

Yes, if they used logrotate or a custom script which could detect that. Very often the script is set to rotate logs no more than once a week, and if they used this one:

then it doesn’t check the size at all; it is written to rotate the log whenever it is called.
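A size-based logrotate rule avoids that problem. A minimal sketch, with an example path and threshold (adjust both for your node, and note that logrotate itself still only runs as often as its cron/timer schedule calls it):

# /etc/logrotate.d/storagenode (example path and size)
/mnt/storj/node.log {
    size 100M
    rotate 10
    copytruncate
    compress
    missingok
    notifempty
}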

This has been happening since 1.108, if I remember correctly. I think I was the first one to report it. The solution is what the fellow member said: use a custom log level for the collector.
https://forum.storj.io/t/log-custom-level/25839/18?u=snorkel

1 Like

Yes, because even the non-lazy filewalker does not get to finish and keeps restarting unless turned off completely:

ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S", "Lazy File Walker": false, "error": "filewalker: context canceled",

So let’s see when there will be an answer about when this is going to be resolved.

1 Like

@Alexey
Is that context canceled error like a timeout or something?

Yes, it’s a timeout: the filewalker did not receive a response from the disk in time and was canceled. I do not think it would be restarted later, but you may check: do you have a started used-space-filewalker entry for that satellite later in the log?

docker logs storagenode 2>&1 | grep "\sused-space" | grep -E "started|completed"

Why is there even a timeout on this? Does it matter how long it takes to receive a response for the used-space filewalker? I don’t think so. IMHO waiting would be better than canceling and restarting the whole thing.

It seems it does, which makes the whole process worse:

2024-08-17T04:00:26Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}
2024-08-18T04:00:37Z    ERROR   pieces  used-space-filewalker failed    {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE", "Lazy File Walker": false, "error": "filewalker: context canceled"
2024-08-18T04:01:43Z    INFO    pieces  used-space-filewalker started   {"Process": "storagenode", "Satellite ID": "1wFTAgs9DP5RSnCqKV1eLf6N9wtk4EAtmN5DpSxcs8EjT69tGE"}

From the logs it looks like it has been running for a day then got interrupted and restarted from the beginning. This seems terrible.

Yes, it is. Otherwise it may hang. A timeout must exist for any disk operation. We have had nodes disqualified because they were able to respond to audit requests but were unable to provide a piece, and since there was no timeout, the node started to fail every single audit after the partial hang.
So now we have readable/writable checks and have added timeouts to all operations.
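If you want to see or tune those checks, the storagenode config exposes interval and timeout options for them; as far as I know the keys look like the following, but verify the exact names and defaults in your own config.yaml, since they can change between versions:

# config.yaml (illustrative values; confirm key names and defaults for your version)
storage2.monitor.verify-dir-readable-interval: 1m30s
storage2.monitor.verify-dir-readable-timeout: 1m0s
storage2.monitor.verify-dir-writable-interval: 5m0s
storage2.monitor.verify-dir-writable-timeout: 1m0s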

In your case a resume likely wouldn’t help: if the filewalker is suffering from disk unresponsiveness, it would continue to fail. If even the non-lazy filewalker is unable to finish, you need to set the allocation below the usage shown on the dashboard (to stop any ingress) and restart the node to allow the filewalker to finish the scan. I would also suggest enabling Badger caching.
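A minimal sketch of those two changes in config.yaml; the allocation value is only an example (pick something below the usage your dashboard reports), and the Badger cache key is the one I believe current versions use, so double-check it against your own config before relying on it:

# config.yaml
# example only: if the dashboard shows ~7.4 TB used, allocate less to stop ingress
storage.allocated-disk-space: 7.0 TB

# enable the Badger file-stat cache (assumed key name, verify in your config), then restart the node
pieces.file-stat-cache: badger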

Maybe the timeout can be increased then?

Exactly, it restarts from scratch and makes things worse!!!

From my logs it seems that at 04:00 something happened that resulted in a context canceled error. I don’t know what it was, but maybe trash deletion kicked in or something; it looks like some timed, automated process. If it was something the node itself started, that would be even worse: such an action would cause context canceled errors and kill the long-running filewalker.
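One way to narrow that down (just a sketch, reusing the container name from the command earlier in the thread and the UTC timestamps from the excerpt above) is to grep the log for that time window and the usual suspects:

docker logs storagenode 2>&1 | grep "T04:0" | grep -Ei "retain|trash|collector|used-space"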

This is why we need the resume feature for this!!!

Badger cache is on; this is why the filewalker is in non-lazy mode in the first place. But it is on the same disk, so I don’t know how much it can help.

It’s better to fix the underlying issue. The Badger cache should help to speed up the filewalker: even if the filewalker fails, the next run will be faster and can scan further every time, and it should finally finish the scan successfully.

It’s on the same disk for this node too:

However, if you run a docker node, you can provide a different path for the file cache.
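As a sketch, and assuming the cache lives under storage/filestatcache inside the container (please verify that path on your own node first), you could bind-mount a directory on a faster disk over it when starting the container:

docker run ... \
    --mount type=bind,source=/mnt/ssd/storj-filestatcache,destination=/app/config/storage/filestatcache \
    storjlabs/storagenode:latest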

I agree. But for tasks like the filewalker maybe we don’t need such short timeouts. I don’t know the right value, but maybe longer timeouts would do and would help.

Let’s hope so.

That only works if you have a different disk; I currently don’t have one for this node.

I wouldn’t like to increase disk timeouts where they can be dangerous.

If it gets interrupted every 24 hours, then there must be something running on your system that restarts the node every 24 hours; otherwise you wouldn’t get such timing. One perfectly timed restart might be luck, a second one is already unlikely, and a third one is practically impossible without something like a cron job triggering it.
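A couple of quick checks (a sketch, assuming the Docker setup and container name used earlier in the thread) to see whether the container really is being restarted on a schedule:

# when did the container last start, and has Docker restarted it?
docker inspect -f '{{.State.StartedAt}} restarts={{.RestartCount}}' storagenode

# look for scheduled jobs that touch the node
crontab -l | grep -i storagenode
grep -ri storagenode /etc/cron* 2>/dev/null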