Upload "Remote Address" use docker Client public IP and not "ADDRESS=" endpoint

Is this supposed to have been fixed?

Because I have just checked again on one node that shows this error message, picking two randomly selected pieces, and once again the pieces were delivered into the correct blobs folders while the uploads were logged as failed.

This is a very unpleasant situation, as this error message shows up very frequently in the logs, making it impossible to judge whether the node has an issue or whether pieces are being uploaded successfully.

Interesting edit: it seems that the node got updated a couple of hours ago. I now see a new error message:

INFO piecestore upload canceled (race lost or node shutdown) {"process": "storagenode", "Piece ID": "XXXX"}

However, I don’t know if this is helpful. First of all, the message no longer indicates the satellite, so if I want to check whether the piece has been uploaded or not, I now have to check for it on every satellite.
And still, after I have checked, I see that the piece has been uploaded.
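
For reference, a rough sketch (not an official tool) of how one could search every satellite folder for a given piece ID; the storage path and the on-disk naming (lowercased base32 piece ID, first two characters as a subfolder, .sj1 extension) are assumptions based on how the blobs layout is usually described:

import sys
from datetime import datetime
from pathlib import Path

STORAGE = Path("/path/to/storagenode/storage")  # assumed storage dir, adjust for your node
piece_id = sys.argv[1].lower()                  # piece ID exactly as printed in the log
prefix, rest = piece_id[:2], piece_id[2:]

for satellite_dir in sorted((STORAGE / "blobs").iterdir()):
    # Assumed layout: blobs/<satellite>/<first two characters>/<remaining characters>.sj1
    for candidate in (satellite_dir / prefix).glob(rest + "*"):
        modified = datetime.fromtimestamp(candidate.stat().st_mtime)
        print(f"{satellite_dir.name}: {candidate.name} (modified {modified:%Y-%m-%d})")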

So basically nothing has been solved and the questions are still the same:

  1. Why is this considered a failed upload when the piece is present in the correct blobs folder?
  2. Why does the piece not get deleted outright or moved into the trash folder if it is a failed upload?
  3. If the upload is considered unsuccessful, this means I don’t get paid for the space it occupies, there is no chance of egress on this piece, and it is simply wasting space.

And just to give an idea of the size of this problem:

I have grepped the logs for the new wording of that error message:

grep "node shutdown" | wc -l
11789

And that is just since the node started producing this message a couple of hours ago.
So I need to know whether these more than 10k pieces logged in a couple of hours are indeed failed uploads that are just wasting space and have no purpose whatsoever in being there.

I guess your node was not fast enough to report this to the uplink before it canceled the connection…

Because technically your node was able to receive the piece in whole, there is no reason to delete it. But I agree: if the uplink canceled the connection, the node should mark it as failed right away. I guess it is more optimistic - if the uplink does not confirm right away, the node waits some more time for a confirmation, but then “forgets” about it. From the network’s perspective, these pieces would be deleted by the garbage collector anyway, so why bother with another round trip to tell the node that this piece’s upload was canceled… The absence of this message speeds up the upload process greatly. So…

This does not convince me yet. First of all, you are guessing, but that’s fine because I don’t know the reason for sure either. But you have seen the numbers. I personally find it very unlikely that the node is fast enough to receive and save the piece to disk but too slow to report it to the uplink in time. I mean, how many microseconds are we talking about here? And all of this 11k times in a couple of hours?

But this is what I don’t understand: the node software is telling me that the upload has failed, so there is no reason to keep this piece. So why keep it and not delete it right away? Maybe I don’t understand all the background work that is happening, but when the node already knows that the piece is garbage (otherwise it would not be able to tell me that the upload has failed), then there is also no need for additional round-trip messages or anything else. It could and should be deleted right away.

Which garbage collector would be responsible for deleting such a failed-upload piece? How long will it remain in the blobs folder? I have written down the specific piece I checked today and will try to follow up on it to see how long it remains there.

The same one that has always been there. The satellite will send a bloom filter to your node, and the node will delete the pieces (about 90% of them) which shouldn’t be on your node.

I see, so it’s the bloom filter. And this is only sent once a week, right? And then the piece is not deleted right away but moved to the trash, where it remains for another 7 days?

Seems so. Perhaps it’s the simplest solution.

As of today, the piece that was considered and logged as a failed upload is still there; it has been there since August 26th.

How did you check that?
Do you have errors related to a garbage collector (search for gc-filewalker and ERROR)?

I have the piece ID/name and I can see the piece in the blobs folder with a date of August 26th.

I don’t have any gc-filewalker errors. But when I grep for “bloomFilter” I only get log messages for the GE, A6 and Ds satellites. The piece in question is from the 3S satellite; maybe it has not sent the bloom filter yet? I don’t know.

All I can say is that the piece is still there, apparently unpaid for a week now, wasting space, which sounds really bad, especially if the node were full. Luckily it isn’t at the moment.

My node received a bloom filter from this satellite on 2023-08-29T17:32:58Z for pieces created before 2023-08-23T17:59:59Z:

2023-08-29T17:32:58Z    INFO    lazyfilewalker.gc-filewalker.subprocess gc-filewalker started   {"process": "storagenode", "satelliteID": "12EayRS2V1kEsWESU9QMRseFhdxYxKicsiFmxrsLZHeLUtdps3S","process": "storagenode", "createdBefore": "2023-08-23T17:59:59Z", "bloomFilterSize": 2097155}

So, I guess your node did not receive it yet, as mine.
The previous message was on 2023-08-22T15:51:03Z.

I am rotating logs and it seems that I don’t have any before 30th of August.
So I can’t tell if the node has received a bloom filter from that satellite between 26th and 30th.

News from today: a new bloom filter has arrived from the 3S satellite. The filewalker process has completed without error. AAAAAAAAAAAAAAAnd

the file that was logged as a failed upload is still there.

Something is not going well here.

That’s how a bloom filter works. It’s probabilistic, but it has an important property: it can exhibit false positives, but never false negatives.

In other words, the filter will tell the node either “this chunk is definitely not needed” or “this chunk is probably still needed”. Your chunk falls into the second category: it could be deleted, but the filter is not sure that it can definitely be deleted, so it isn’t. It will be tried again next time.
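
To make that property concrete, here is a minimal, self-contained bloom filter sketch; it is only an illustration of the idea, not Storj’s actual implementation, and the sizes and hashing scheme are arbitrary:

import hashlib

class BloomFilter:
    """Minimal bloom filter: never false negatives, occasionally false positives."""

    def __init__(self, size_bits=1024, hash_count=3):
        self.size = size_bits
        self.hash_count = hash_count
        self.bits = bytearray(size_bits)

    def _positions(self, item):
        # Derive hash_count bit positions from the item.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def maybe_contains(self, item):
        # True  -> "probably still needed" (could be a false positive)
        # False -> "definitely not needed" (never wrong for items that were added)
        return all(self.bits[pos] for pos in self._positions(item))

# The satellite adds every piece it still wants the node to keep; the node
# deletes only pieces for which maybe_contains() returns False. A garbage
# piece that happens to collide with wanted pieces returns True and survives
# until a later filter happens not to collide.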

A piece whose upload is already known to have failed should not make it into the blobs folder at all. Now it has been sitting there for weeks, occupying space, probably even unpaid. Such an implementation sounds terribly wrong.

What’s the error? Could you please copy it from the logs?

Error on upload is:

INFO piecestore upload canceled (race lost or node shutdown)

Please refer to my post on bloom filters, and the following thread that specifies some Storj-specific details. Any piece that is mistakenly held by your node has some non-zero probability of staying after a given bloom filter. This probability is reduced exponentially by successive bloom filters, making it very unlikely to survive 2 or 3 of them.

I don’t recall the exact probability that Storj bloom filters are tuned for; I’m a little drunk today, sorry. But if it is, let’s say, 10%, then after two bloom filters it would be 1%, and after three it would be 0.1%.
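
As a quick check of that arithmetic (the 10% figure is only the illustrative assumption above, not the actual tuning):

p = 0.10  # assumed per-filter chance that a garbage piece survives
for n in range(1, 4):
    print(f"after {n} filter(s): {p ** n:.3%} chance the piece is still there")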

BTW, are you sure this bloom filter was for the satellite this piece came from?


The piece came via the 3S satellite and the bloom filter I was referring to came from that satellite. So yes I am sure.

I am not saying the bloom filter does not work as intended. My point is that the piece should not be there at all. The second the piece was moved to the blobs folder, it was already clear that it was a failed upload; it was indicated in the log. The piece does nothing but waste space that I assume I don’t get paid for. It does not get audited, it does not get downloaded. It does not even get deleted when the customer deletes the file it belongs to.
And now, with the bloom filter doing the deletes, we can see that it can obviously take weeks until this faulty piece gets removed. And this for a piece that we have known should not be there since August 26th (the day it arrived).

When I run grep "node shutdown" | wc -l on the logs of that very same node to count the lines indicating the same error, I get a whopping 439561.
Without further investigation, that could mean there are 439560 more pieces sharing the same story, wasting space for weeks that could have been freed instantly.
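
A rough way to see how many distinct pieces are behind those lines (rather than the raw line count) could look like this; the log path is an assumption, and the message text and the “Piece ID” field match the log line quoted earlier:

import re

piece_ids = set()
with open("/path/to/storagenode.log") as log:  # assumed log location, adjust to your setup
    for line in log:
        if "race lost or node shutdown" in line:
            match = re.search(r'"Piece ID":\s*"([^"]+)"', line)
            if match:
                piece_ids.add(match.group(1))

print(len(piece_ids), "distinct pieces logged as canceled uploads")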

Ok, can you find an example of a piece that went through at least three garbage collection cycles and is still stored on your node?
