Downloaded size does not match received message size

I am aware of the other topics "Downloaded size (0 bytes) does not match received message size (512 bytes)" and "Downloaded size and received size don’t match", where I even provided an answer myself.

However, this one strikes me as odd: in about 99 out of 100 cases the received size is the same value, 262144 bytes out of 313600 bytes. In fact, when a download gets canceled for this reason, 99.85% of the time it shows this specific byte value.

This appears strange to me, as if there might be something to fix. Even if the node might be slow, why does it get canceled at pretty much the same byte figure every time?

Bad sector on disk exactly at that place in a popular file?

262144 bytes is exactly 256 KiB (256 × 1024), so perhaps that’s a common boundary to stop on?
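
As a toy illustration of that boundary idea (pure speculation on my side; the chunk size and the transfer model are just assumptions, not how the node actually works), a transfer pushed in fixed 256 KiB chunks and canceled after the first chunk would always report exactly 262144 bytes:

package main

import "fmt"

// Toy model only: pretend the piece is pushed in fixed 256 KiB chunks
// and the transfer stops after a given number of chunks.
const chunkSize = 256 * 1024 // 262144 bytes

func bytesDelivered(pieceSize, chunksBeforeCancel int) int {
	sent := 0
	for chunk := 0; sent < pieceSize && chunk < chunksBeforeCancel; chunk++ {
		n := chunkSize
		if remaining := pieceSize - sent; remaining < n {
			n = remaining
		}
		sent += n
	}
	return sent
}

func main() {
	// A 313600-byte piece is one full 256 KiB chunk plus a 51456-byte tail,
	// so a cancellation after the first chunk always lands on 262144.
	fmt.Println(bytesDelivered(313600, 1)) // 262144
	fmt.Println(bytesDelivered(313600, 2)) // 313600
}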


This is the result of the cancelation on the client side.

However, checking the disk wouldn’t hurt either.

Actually, the download completes successfully many times, so the sector/disk should be fine.
However, I see all kinds of error messages regarding this piece:

Downloads are canceled with: "downloaded size does not match received message size", "context canceled", "stream closed by peer" and even "unknown reason bug in code, please report".

Interesting…
Other values I see are e.g. 294912 bytes, 278528 bytes, 0 bytes and 311296 bytes, but these account for only about 0.15% currently. To me it is weird that the cancellations are so massively concentrated on that 262144 bytes value.

Is this a single piece or many different pieces?

This is for a specific piece.

A bad sector may be a soft error (meaning: it sometimes works, sometimes not), and a successfully read sector is then cached (so at least for some time reads succeed).

Alternatively, the link between you and the specific customer who repeatedly downloads the piece is occasionally slow, and the specific size is just the amount of data your node manages to send before being informed about the lost race. The fact that the number repeats might just come from TCP slow start or bufferbloat effects in devices close to you.

I wouldn’t worry much about the latter if your node performs well in general. Bad reads will be visible in SMART data, so that’s easy to check. It might just as well be something else, but I would say these two are the most probable.


Please copy a full error message with a stack trace.

The node is running on the info log level. I think there is no stack trace at that level.

That was the initial answer. However, the repeated appearance of the same byte count at which the race is lost seems far from the randomness I would have expected.

You would be surprised how repeatable some measurements over computer networks can be. The canonical case study is the 500-mile email story.


I opened an issue to try to identify which other errors aren’t classified on the storagenode download path and provide a proper informative message for them: storagenode reports info message "unknown reason bug in code, please report" · Issue #7534 · storj/storj · GitHub

This specific error message was introduced by me in this commit: storagenode/piecestore: Add reason download cancellation · storj/storj@3ca9625 · GitHub

When I inspected the code, I didn’t find a more specific error message beyond the ones classified in the switch statement. I added a default branch to it that prints this message for errors that aren’t identified. Now that it’s appearing, we should revisit the code to see if we can identify other errors and provide an informative message for each of them.
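
For illustration only (this is a simplified sketch, not the actual code from that commit, and the concrete error mappings are assumptions), the pattern looks roughly like this:

package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// classifyCancelReason is a hypothetical helper showing the pattern:
// known cancellation causes get a specific message, everything else
// falls through to the default branch.
func classifyCancelReason(err error) string {
	switch {
	case errors.Is(err, context.Canceled):
		return "context canceled"
	case errors.Is(err, io.ErrClosedPipe):
		return "stream closed by peer" // assumed mapping for this sketch
	default:
		// Unidentified errors end up here; this is the message in question.
		return "unknown reason bug in code, please report"
	}
}

func main() {
	fmt.Println(classifyCancelReason(context.Canceled))          // context canceled
	fmt.Println(classifyCancelReason(errors.New("new failure"))) // unknown reason bug in code, please report
}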


Just to add for information:
I see this error on other nodes as well: download_cancel_unknown_reason_v1{action="GET",scope="storj_io_storj_storagenode_piecestore",field="high"} 899

Yeah, seeing it on a regular basis as well.