However, what strikes me here is that in roughly 99 out of 100 cases the received bytes show the same value: 262144 bytes out of 313600 bytes. In fact, when it gets canceled for this reason, 99.85% of the time it is with this specific byte value shown.
This seems strange to me, as if there might be something to fix. Even if the node is slow, why does it get canceled almost always at the same byte count?
Actually, the download completes successfully many times, so the sector/disk should be fine.
However, I get all kinds of error messages for this piece:
The download is canceled with "downloaded size does not match received message size", "context canceled", "stream closed by peer", and even "unknown reason bug in code, please report".
Interesting…
Other values I see are e.g. 294912 bytes, 278528 bytes, 0 bytes, 311296 bytes, but together these are only about 0.15% currently. To me it's weird that the cancellations are so massively concentrated on that 262144-byte value.
A bad sector may be a soft error (meaning: sometimes works, sometimes not), and then a successfully read sector is cached (so at least for some time queries succeed).
Alternatively, the link between you and the specific customer who repeatedly downloads the piece is occasionally slow, and the specific size is just the amount of data that your node manages to send before being informed about the lost race. The fact that the number repeats might just come from TCP slow start or buffer bloat effects in devices close to you.
I wouldn’t worry much about the latter if your node performs well in general. Bad reads will be visible in SMART data, so that is easy to check. It might just as well be something else, but I’d say these two are the most probable.
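To illustrate the buffering explanation (and noting that 262144 bytes is exactly 256 KiB, a very common buffer/chunk size), here is a minimal Go sketch, not the actual storagenode code, of a sender that streams a piece in fixed-size chunks and only notices the cancellation between chunks. The sizes and the chunk boundary are assumptions for illustration; the point is only that a byte count quantized by a buffer will repeat exactly.

```go
package main

import (
	"context"
	"fmt"
)

// sendPiece streams a piece in fixed-size chunks and checks for cancellation
// between chunks. Whatever the real cause of the cancellation, the byte count
// it reports can only ever be a multiple of the chunk size.
func sendPiece(ctx context.Context, pieceSize, chunkSize int64, send func([]byte) error) (int64, error) {
	buf := make([]byte, chunkSize)
	var sent int64
	for sent < pieceSize {
		if err := ctx.Err(); err != nil {
			return sent, err // this is the number the log line would show
		}
		n := chunkSize
		if pieceSize-sent < n {
			n = pieceSize - sent
		}
		if err := send(buf[:n]); err != nil {
			return sent, err
		}
		sent += n
	}
	return sent, nil
}

func main() {
	const pieceSize, chunkSize = 313600, 262144 // sizes taken from the log lines above

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Simulated receiver: accepts the first chunk, then cancels (lost the race).
	var delivered int64
	send := func(p []byte) error {
		delivered += int64(len(p))
		if delivered >= chunkSize {
			cancel()
		}
		return nil
	}

	sent, err := sendPiece(ctx, pieceSize, chunkSize, send)
	fmt.Printf("sent %d of %d bytes, err=%v\n", sent, pieceSize, err)
}
```

Run as-is, this always prints `sent 262144 of 313600 bytes, err=context canceled`: the reported count lands on the chunk boundary no matter where inside that chunk the cancel actually arrived.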
The node is running at info log level. I don't think there is a stack trace at that level.
That was the initial answer. However, the repeated appearance of the same byte count at which the race is lost seems far from random, whereas random is what I would have expected.
When I inspected the code, I didn't find any error message more specific than the ones classified in the switch statement. I added a default branch to it that prints this message for errors that aren't identified. Now that it's appearing, we should revisit it to see if we can identify other errors and provide an informative message for each of them.
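For reference, this is roughly the shape of the classification described above, as a hedged sketch rather than the actual storagenode source: the sentinel errors and strings are illustrative stand-ins, and only the structure of the switch with the added default branch mirrors the change.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Illustrative sentinel errors standing in for the cases the real switch recognizes.
var (
	errSizeMismatch = errors.New("downloaded size does not match received message size")
	errStreamClosed = errors.New("stream closed by peer")
)

// classifyCancel maps a download error to a human-readable cancellation reason.
func classifyCancel(err error) string {
	switch {
	case errors.Is(err, errSizeMismatch):
		return "size mismatch"
	case errors.Is(err, context.Canceled):
		return "context canceled"
	case errors.Is(err, errStreamClosed):
		return "stream closed by peer"
	default:
		// The added branch: anything not recognized above is surfaced
		// explicitly instead of being silently dropped.
		return "unknown reason, bug in code, please report"
	}
}

func main() {
	fmt.Println(classifyCancel(errors.New("some new failure mode")))
}
```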
Just to add for information:
I see this error on other nodes as well: download_cancel_unknown_reason_v1{action="GET",scope="storj_io_storj_storagenode_piecestore",field="high"} 899
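If you want to keep an eye on that counter across nodes, here is a small sketch that scrapes the metrics output and filters the relevant lines. It assumes the node's debug endpoint is enabled and serves the monkit metrics in Prometheus text format at /metrics; the address below is just a placeholder for whatever debug address your node is configured with.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Placeholder debug address; adjust to your node's configuration.
	resp, err := http.Get("http://127.0.0.1:5999/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Print only the counters relevant to this thread.
		if strings.Contains(line, "download_cancel") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```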