Uplink: failed to upload enough pieces (needed at least 80 but got 78)

The backup (uplink -p4) failed last Sunday, 9 December:

root-node-1  |  214GiB 0:55:08 [89.9MiB/s] [66.5MiB/s] [>                    ]  5% ETA 15:40:42
root-node-1  | failed to upload part 213: uplink: encryption: metaclient: manager closed: closed: read tcp 172.19.0.3:42882->34.172.100.72:7777: read: connection reset by peer
root-node-1  | error getting reader for part 215: context canceled
root-node-1  | failed to upload part 212: uplink: failed to upload enough pieces (needed at least 80 but got 72)
root-node-1  | failed to upload part 214: uplink: context canceled
root-node-1  |  215GiB 0:55:11 [66.5MiB/s] [66.5MiB/s] [>                    ]  5%             
root-node-1  | 

And today (16 December 2023):

root-node-1  | 24.6GiB 0:07:15 [62.5MiB/s] [57.9MiB/s] [>                    ]  0% ETA 19:07:57
root-node-1  | 25.0GiB 0:07:20 [76.3MiB/s] [58.1MiB/s] [>                    ]  0% ETA 19:03:45
root-node-1  | failed to upload part 15: uplink: metaclient: manager closed: closed: read tcp 172.19.0.3:55320->34.172.100.72:7777: read: connection reset by peer
root-node-1  | error getting reader for part 25: context canceled
root-node-1  | failed to upload part 23: uplink: encryption: context canceled
root-node-1  | failed to upload part 24: uplink: context canceled
root-node-1  | 25.0GiB 0:07:21 [58.0MiB/s] [58.0MiB/s] [>                    ]  0%      

So I've restarted it again, this time with uplink -p 2.
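For context, the pipeline looks roughly like this (a sketch; the tar source and the bucket/key are made up, and the pv in the middle is a guess based on the progress lines above, but the uplink flags are the ones we actually use):

tar -cf - /data | pv | uplink cp - sj://backups/backup.tar -p 2 --progress=false

uplink cp reads from stdin when given - as the source, and -p controls how many parts are uploaded in parallel.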


Good point!
Though we are not using the firewall on our MikroTik router.

[admin@MikroTik] > /ip/firewall/connection/print 
<nothing>
[admin@MikroTik] > /ip/firewall/filter/print 
Flags: X - disabled, I - invalid; D - dynamic 
[admin@MikroTik] > /ip/firewall/nat/print 
Flags: X - disabled, I - invalid; D - dynamic 
[admin@MikroTik] > /ip/firewall/mangle/print 
Flags: X - disabled, I - invalid; D - dynamic 
[admin@MikroTik] > /ip/firewall/raw/print    
Flags: X - disabled, I - invalid; D - dynamic 
[admin@MikroTik] > 
[admin@MikroTik] > /ip/firewall/connection/tracking/print 
                   enabled: auto
               active-ipv4: no
               active-ipv6: no
      tcp-syn-sent-timeout: 5s
  tcp-syn-received-timeout: 5s
   tcp-established-timeout: 1d
      tcp-fin-wait-timeout: 10s
    tcp-close-wait-timeout: 10s
      tcp-last-ack-timeout: 10s
     tcp-time-wait-timeout: 10s
         tcp-close-timeout: 10s
   tcp-max-retrans-timeout: 5m
       tcp-unacked-timeout: 5m
        loose-tcp-tracking: yes
               udp-timeout: 10s
        udp-stream-timeout: 3m
              icmp-timeout: 10s
           generic-timeout: 10m
               max-entries: 28672
             total-entries: 0

AFAIK, the connection tracking feature primarily comes into play when the router needs to keep track of the state of network connections, which is typically necessary for certain types of firewall rules.

We don't use the firewall, so I guess the connection tracking limitation can be ignored; nor do we have QoS. We have a pretty much default, very simple MikroTik router config.
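For completeness, connection tracking could also be pinned off explicitly rather than left on auto (a sketch, using the same RouterOS v7 command path as above):

/ip/firewall/connection/tracking/set enabled=no

With no firewall/NAT/mangle rules, auto should already keep it inactive (hence the active-ipv4: no above), so this would only remove any remaining doubt.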

So it might be anything else, I guess (e.g. FW/SW/HW, cables, or ISP/upstream issues).

The temperature is below 80°C, which should not be a concern for the MikroTik CRS326-24S+2Q+ we have. And the network load is 0%, even though Storj's uplink is running the backup at this very moment.
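For reference, this is roughly how those readings can be checked on RouterOS (the interface name below is a placeholder; pick the actual uplink port):

/system/health/print
/interface/monitor-traffic sfp-sfpplus1 once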

What about the NAT table size? Or max connections on the client? Otherwise it would likely indeed be upstream (ISP modems/bridges/etc.).

That'd be the max-entries & total-entries shown above (/ip/firewall/connection/tracking/print & /ip/firewall/nat/print).
We don't use the firewall/NAT.
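On the client (Linux) side, something like this should rule out local connection limits (a sketch; the nf_conntrack files only exist if the conntrack module is loaded):

ss -s
sysctl net.ipv4.ip_local_port_range
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max

If the socket counts sit far below the limits, the client side can probably be ruled out too.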

I guess that'd then be something else.

Can you temporarily use a different router?
I have several reports about MikroTik routers dropping connections for some reason. These came from SNOs who invoked an old version of Graceful Exit, where the node attempted to transfer several TB of data to other nodes and GE always failed because of too many failed transfers (thousands). As soon as they connected to the internet without the MikroTik, GE finished with zero failed transfers.


Great point!

As we don't have an alternative router there, I went ahead and upgraded our MikroTik router's (CRS326-24S+2Q+) firmware from 7.10.2 to 7.12.1, then to 7.13.1.
Now I've run uplink with -p 8 -t 8 and, lo and behold, 2.78 TiB has been successfully backed up to Storj! :rocket:
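For anyone following along, a typical RouterOS upgrade sequence looks roughly like this (a sketch; step through intermediate versions per MikroTik's upgrade notes):

/system/package/update/check-for-updates
/system/package/update/install
/system/routerboard/upgrade
/system/reboot

The install step reboots into the new RouterOS version on its own; the routerboard (bootloader) upgrade is applied on the following reboot.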


I'm glad to read that the firmware upgrade solved the issue!
I hope it's permanent and won't require restarting the router every time (as with many other cheap SOHO routers) to make it work normally with multiple parallel connections.

Would it be possible to make uplink retry failed uploads? I mean, if 79 pieces were successfully uploaded, get one more node from the satellite or retry the failed ones a few times. It would suck if uploading a large file failed because some connections dropped at some point.


We are always working on improving our tools to speed up uploads and downloads and to better handle node failures during upload/download.

But if you want this feature now, you may use rclone instead; it already has it.
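A minimal sketch of that (the remote name and paths are made up; the flags are standard rclone options):

rclone copy /mnt/backups storj:my-bucket/backups --transfers 4 --retries 5 --low-level-retries 20

--low-level-retries retries individual failed operations (e.g. a single chunk upload), while --retries re-runs the whole copy for anything that still failed.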

Cool. I just noticed that threads about incomplete uploads pop up once in a while, and thought it would be nice if uplink could just retry the failed piece uploads instead of forcing the user to re-upload the whole file.


It's not easy; otherwise it would already be implemented. We do not want to implement it the way it's implemented in rclone (retry the whole file), because for that you may simply use rclone.

Sadly the errors are back :frowning_face:

root-node-1  |  507GiB 2:02:02 [7.36MiB/s] [70.9MiB/s] [=>                   ] 12% ETA 14:24:43
root-node-1  |  507GiB 2:02:07 [ 147MiB/s] [71.0MiB/s] [=>                   ] 12% ETA 14:23:54
root-node-1  | failed to upload part 505: uplink: encryption: metaclient: manager closed: closed: read tcp 172.19.0.3:56072->34.172.100.72:7777: read: connection reset by peer
root-node-1  | error getting reader for part 508: context canceled
root-node-1  | failed to upload part 504: uplink: failed to upload enough pieces (needed at least 80 but got 78)
root-node-1  | failed to upload part 506: uplink: encryption: context canceled
root-node-1  | failed to upload part 507: uplink: context canceled
root-node-1  |  508GiB 2:02:10 [71.0MiB/s] [71.0MiB/s] [=>                   ] 12%   

See the sent traffic in the screenshot, logged on the Proxmox (blade1) host, during which we encountered the RST (connection reset by peer).

I'll ask the ISP to see if they can find any issue on their end.
Otherwise we'll likely have to either restart the router before each backup (I wouldn't call that an acceptable workaround, though) or perhaps get another router :man_shrugging:

Unfortunately, your ISP likely cannot solve the issue. If the router started to drop connections again, it's a hardware/firmware issue of the router.
The temporary solution would be to restart it every time this starts happening again… I'm sorry…

In this case you may also switch to the S3 integration, if possible; it uses far fewer parallel connections, but it will use server-side encryption.
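A sketch of what that could look like in rclone.conf (the credentials are placeholders; the endpoint is the public Gateway-MT one):

[storj-s3]
type = s3
provider = Other
access_key_id = YOUR_ACCESS_KEY
secret_access_key = YOUR_SECRET_KEY
endpoint = https://gateway.storjshare.io

The gateway performs the erasure coding server-side, so the client only keeps a few HTTPS connections open instead of one connection per storage node.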

Alternatively, consider scheduling the backup for a weekday. I have observed that the backup process consistently succeeds on weekdays (Monday to Friday), whereas it tends to fail during the weekend.
Additionally, restarting the router is not necessary, since the backup successfully completes on weekdays.
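A minimal cron sketch for that (the script path is hypothetical):

# run the backup at 02:00, Monday through Friday only
0 2 * * 1-5 /usr/local/bin/storj-backup.sh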

This is very weird. Perhaps the router flushes its state after some time?
It's very unusual that you didn't get enough nodes on specific days.

Do you maybe know which days are bad?

Thought I'd share an update: we haven't had any network-related errors (during Storj uploads) since our ISP performed emergency hardware maintenance 9 days ago (09 Feb 2024); even the weekend backup worked fine (2.91 TiB uploaded at an average speed of 78.9 MiB/s).


It appears that the issue has returned, interestingly after a short ISP outage (~5-10 min) on April 2nd, 2024, around 4:45 PM UTC.
Same uplink version (v1.90.2) and args (-p 4 -t 4 --progress=false).

  • April 16th:
root-node-1  | failed to upload part 1350: uplink: encryption: metaclient: manager closed: closed: read tcp 172.19.0.3:43990->34.150.199.48:7777: read: connection reset by peer
root-node-1  | error getting reader for part 1353: context canceled
root-node-1  | failed to upload part 1351: uplink: encryption: context canceled
root-node-1  | failed to upload part 1352: uplink: context canceled
root-node-1  | 1.32TiB 5:04:20 [ 128MiB/s] [75.9MiB/s] [=====>               ] 29% ETA 11:58:34
  • April 20th:
root-node-1  | 1.23TiB 4:44:56 [88.4MiB/s] [75.3MiB/s] [====>                ] 27% ETA 12:34:11
root-node-1  | failed to upload part 1256: uplink: encryption: metaclient: manager closed: closed: read tcp 172.19.0.3:45308->34.150.199.48:7777: read: connection reset by peer
root-node-1  | failed to upload part 1255: uplink: metaclient: context canceled
root-node-1  | error getting reader for part 1258: context canceled
root-node-1  | failed to upload part 1257: uplink: context canceled
root-node-1  | 1.23TiB 4:45:00 [75.3MiB/s] [75.3MiB/s] [====>                ] 27%   

The ISP's DevOps Network Engineer replied to this:

The connection reset was sent from Google. This would be nearly impossible for us to provide details on this event.

I guess we are out of luck here. :man_shrugging:

Will try switching -p 4 -t 4 --progress=false back to -p 2 --progress=false.

It could be related to routing.
Could you please show the MTR results for gateway.storjshare.io (if you still use the Gateway-MT instead of the native integration, where you would not have such issues at all)?
You may use a DM if you are not comfortable exposing your IPs publicly.

Hey Alexey :wave:
There you go: mtr gateway.storjshare.io

                                        My traceroute  [v0.95]
akash-rpc-archival (184.105.162.171) -> gateway.storjshare.io (136.0.77.2)    2024-04-23T09:40:00+0000
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                              Packets               Pings
 Host                                                       Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. ve1208.core1.fmt2.he.net                                74.7%    76    0.6   3.8   0.4  27.6   7.0
 2. port-channel2.core6.fmt2.he.net                         23.7%    76    0.5   4.1   0.4  35.5   8.0
 3. port-channel12.core2.pao1.he.net                        86.7%    76    1.2   1.7   1.1   4.3   1.0
 4. 64.62.244.149                                            0.0%    76    0.9   1.2   0.9   3.8   0.5
 5. (waiting for reply)
 6. datacamp-218.losangeles2.loa.seabone.net                 0.0%    76    8.4  10.2   8.3  25.8   3.5
 7. vl224.lax-cs2-dist-2.cdn77.com                           0.0%    75    7.2   8.6   7.2  24.5   3.2
 8. 136.0.77.2                                               0.0%    75    7.7   8.9   7.6  25.1   3.0

We are using the HE.net-provided gateway directly, but through the MikroTik router as we always have. We have never attached the server directly to the HE.net gateway port (bypassing the MikroTik router), and at times it's been working flawlessly, without the need for a MikroTik router reboot.

Update: Hmm, I can see some packet loss there, even against Google (mtr google.com):

                                                My traceroute  [v0.95]
akash-rpc-archival (184.105.162.171) -> google.com (142.250.191.78)                           2024-04-23T09:43:53+0000
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                                              Packets               Pings
 Host                                                                       Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. ve1208.core1.fmt2.he.net                                                66.7%    51   19.4   2.7   0.4  19.4   5.5
 2. 100ge1-2.core2.fmt2.he.net                                               0.0%    51    7.3   2.8   0.3  21.7   5.6
 3. as15169.sfmix.org                                                        0.0%    50    0.9   0.8   0.7   1.6   0.1
 4. 142.251.254.241                                                          0.0%    50    1.3   1.2   1.1   1.5   0.1
 5. 142.251.224.31                                                           0.0%    50    1.3   1.2   1.2   1.4   0.0
 6. nuq04s43-in-f14.1e100.net                                                0.0%    50    1.2   1.1   1.1   1.5   0.1

Checking with HE.net.

HE.net ISP replied:

Both of these traces are showing no loss to the destination. Observed packet loss on our hops (and even Google's servers) is subject to control plane rate limits. As this doesn't continue to the destination, it's only cosmetic and not indicative of an issue.
https://www.cloudflare.com/learning/network-layer/what-is-mtr/

I have passed your information to the team.
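Since the resets happen on the TCP connections to port 7777, I figure a TCP-mode trace would be more representative than ICMP next time (a sketch; the satellite IP is taken from the logs above):

mtr --tcp --port 7777 --report --report-cycles 100 34.150.199.48

This probes with TCP SYNs on the same port the uplink actually uses, so the final-hop numbers reflect the real traffic rather than ICMP handling.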

But I know that HE unfortunately often has broken routing (they very often don't agree with other providers).