Testing Garbage Collector

This error is an indication of a slow filesystem. The delete operation took more time than a specified timeout.
Unfortunately, I do not know what the timeout is. However, it's a sign to check your configuration.
Usually this happens on slow systems like a Raspberry Pi, or if you use network-attached storage, but maybe we have something else on our plate.
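
For example, something like this can give a rough idea whether the disk itself is the bottleneck (iostat is from the sysstat package; the path is only an example, use your own storage location):

iostat -x 5 3                                          # watch %util and await on the data disk for a few samples
time du -s /mnt/storagenode/storage/blobs > /dev/null  # walking the blobs directory exercises metadata I/O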


It is based on a Rock64… so it could be slow.
I am planning to migrate it to another board that should be faster.

It is not happening that often; it has happened twice this month.

Is that a big deal, or will the garbage collection cycle catch this?

Thanks!

None of the numbers I’ve seen so far are a big deal, because yes, garbage collection will clean up those pieces. And even a few hundred don’t amount to much total space.

From my end, I can confirm that they stopped happening frequently after my RAID repair finished, which supports what @Alexey mentioned.

Okay, I’ll keep an eye on this.

Thanks everyone!

I still owe you that list of tests. I haven’t created the list yet because there are still changes in the current and in the next release. Please decide for yourself whether you want to execute the tests early, even if that might mean executing them more than once, or hold off and wait for the final call. I would wait for the final call and let garbage collection handle it for now.

For the tests you need a storj-sim instance. Disqualification is not a risk because storj-sim is a test network and you have full control over it. You can remove any disqualification in your local storj-sim test network.
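
A minimal sketch of bringing up the local test network (default paths and details may differ on your machine):

storj-sim network setup    # one-time: generates configs for the satellite, storage nodes and uplink
storj-sim network run      # starts the whole local test network in the foreground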

Tests:

  1. Uplink long tail cancellation.
    1.1 Upload a file with 1 segment.
    1.2 Check the storj-sim log output. Does one storage node get cut off?
    1.3 Check the storage directory of that storage node. There should be an empty folder.

  2. Uplink deletes a file.
    2.1 Upload a file that has at least 3 segments.
    2.2 Delete the file.
    2.3 Verify that all pieces got removed from the storage nodes. Only empty folders.

  3. Uplink overwrites a file.
    3.1 Upload a file that has at least 3 segments.
    3.2 Upload a different file to the same bucket with the same file name.
    3.3 Verify that the old file was removed from the storage nodes.

  4. Uplink cancels an upload.
    4.1 Upload a file with at least 3 segments, wait until at least the first segment is finished, cancel the upload.
    4.2 Verify that all pieces got removed from the storage nodes. Only empty folders.

  5. Uplink cancels a delete.
    5.1 Upload a file with at least 3 segments.
    5.2 Delete the file but cancel the execution.
    5.3 Verify that all pieces got removed from the storage nodes. Only empty folders.

I would place my bet on tests 3 and 4, and if I am not mistaken, the developer team is working on that. I am not aware of any changes that would touch tests 1 and 5. Maybe start with those tests and see how it goes. If you have any questions, I am happy to help. Also feel free to add tests that might be missing from the list.
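
For illustration, test 2 could look roughly like this, assuming your uplink is already configured against the local satellite (bucket and file names are made up, and the storage path is a guess, so adjust to your setup):

dd if=/dev/urandom of=/tmp/testfile bs=1M count=200    # large enough for at least 3 segments, assuming the default 64 MiB segment size
uplink mb sj://testbucket
uplink cp /tmp/testfile sj://testbucket/testfile       # step 2.1
uplink rm sj://testbucket/testfile                     # step 2.2
find ~/.local/share/storj/local-network/storagenode/*/storage/blobs -type f    # step 2.3: should list no leftover pieces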


Thank you for the list of scenarios.

I downloaded and compiled storj-sim, but haven’t run it yet. I will get the test environment running this week and start with a few of the scenarios. I’m sure I’ll have questions related to setting up the test environment… I found the blog post related to storj-sim, but is there a collection of documents on how to use the test environment?


You can find it here


A more up-to-date storj-sim guide is on the wiki: https://github.com/storj/storj/wiki/Test-network

Thanks for the update…

Just getting back into the swing of things this week… recovering from the flu. Who would think that one would get the flu if one submitted to a yearly flu shot? Oh well. At least I remember why I get the shot yearly…

docker logs storagenode 2>&1 | grep "delete failed" -c
506

du -sh trash/
22G     trash/

Isn’t this quite a lot?
I know it possibly is because I use a Rock64 and a USB3 disk,
but… the disk cannot be that slow.

I’ve been getting the same for about a week now; I never got this message before. Over the same period, my audit score has been going up and down too.

I have the same setup. About 24 hours after the update to v0.33.4 (I don’t keep persistent logs):

@rock64:~$ docker logs storagenode 2>&1 | grep "deleted" -c
8758
@rock64:~$ docker logs storagenode 2>&1 | grep "delete failed" -c
379
@rock64:~$ sudo du -h /mnt/storj1/v3alpha/storage/trash --max-depth=0
54G     /mnt/storj1/v3alpha/storage/trash

This makes for a failure rate of about 4%. I wonder how many of those failures happened during the initialization after upgrading. I’ll give it a week and revisit to see if the percentage changes (if there are no updates between now and then). There was probably data in the trash folder before the update, so I doubt that it was all from the last 24 hours. For context, the 54GB of trash is out of 2.1TB of stored data.
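
If anyone wants to reproduce the percentage, a quick sketch from the same two grep counts (bc does the division):

ok=$(docker logs storagenode 2>&1 | grep -c "deleted")
fail=$(docker logs storagenode 2>&1 | grep -c "delete failed")
echo "scale=1; 100 * $fail / ($ok + $fail)" | bc    # failed deletes as a percentage of all delete attempts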

Why would you get a context cancelled error on a delete failed event?

I’ve had errors where files couldn’t be deleted because the file was not there, but a context cancelled? Is that a delete mistake? A competition where whoever can delete the files the fastest wins?

My understanding is that the Satellite waits for a response from the node for a specified amount of time, and if it doesn’t get a response in that time you get the context cancelled error. I don’t think this necessarily means that the delete didn’t happen, just that your node didn’t respond before the timeout. Edit: See post 56 below
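
If you want a rough split of the failures by cause, you could grep inside the failed lines; the exact log phrasing may differ between versions, so treat these patterns as guesses:

docker logs storagenode 2>&1 | grep "delete failed" | grep -ci "context cancel"       # timed out / cancelled by the satellite
docker logs storagenode 2>&1 | grep "delete failed" | grep -ci "file does not exist"  # piece was already gone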

This should get fixed over time.

du -sh trash/
20G     trash/


sudo docker logs storagenode 2>&1 | grep "deleted" -c
26891


sudo docker logs storagenode 2>&1 | grep "delete failed" -c
774

running v0.33.4

This is what littleskunk says:

Slow Deletes
Fixed with v0.31.12, but we have a new bug now. The satellite is too slow to communicate with all storage nodes and will drop most of it. GC has to handle it.


After about a week of consistent uptime, I get the following:

@rock64:~$ docker logs storagenode 2>&1 | grep "deleted" -c
96736
@rock64:~$ docker logs storagenode 2>&1 | grep "delete failed" -c
2021
@rock64:~$ sudo du -sh /mnt/storj1/v3alpha/storage/trash
109G    /mnt/storj1/v3alpha/storage/trash

A failure rate of 2%.

