I found one more problem related to the graceful exit issue:
The storage node submits graceful exit successes in a single message at the end of each batch. Graceful exit failures are not batched; the storage node submits them one by one. Less powerful systems and routers can get overloaded by the resulting number of connections. This creates a cycle: the overload makes the storage node fail even more transfers in the next batch, which increases the impact of the problem until the storage node finally gets disqualified for too many failures.
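To make the scaling visible, here is a minimal Go sketch. It is not the actual storage node code. `TransferResult`, `BatchReport`, `MessagesPerBatch`, and `BuildBatchReport` are hypothetical names made up for illustration, and it assumes each submitted message costs roughly one connection. It compares the current behavior (one message for all successes, one message per failure) with a possible alternative where failures travel in the same batch message.

```go
package main

import "errors"

// ErrTransferFailed is a placeholder error used only for this illustration.
var ErrTransferFailed = errors.New("transfer failed")

// TransferResult is a hypothetical record of one piece transfer.
// A nil Err means the transfer succeeded.
type TransferResult struct {
	PieceID string
	Err     error
}

// BatchReport is a hypothetical single message that carries every outcome
// of a batch, successes and failures alike.
type BatchReport struct {
	Succeeded []string
	Failed    []string
}

// MessagesPerBatch counts how many messages one batch produces.
// With batchFailures == false it models the behavior described above:
// one message for all successes plus one message per failure.
// With batchFailures == true everything goes into a single message.
func MessagesPerBatch(results []TransferResult, batchFailures bool) int {
	if len(results) == 0 {
		return 0
	}
	if batchFailures {
		return 1
	}
	messages := 0
	hasSuccess := false
	for _, r := range results {
		if r.Err != nil {
			messages++ // every failure is submitted on its own
		} else {
			hasSuccess = true
		}
	}
	if hasSuccess {
		messages++ // all successes share one message
	}
	return messages
}

// BuildBatchReport groups all outcomes into one report, so the message
// count stays constant no matter how many transfers failed.
func BuildBatchReport(results []TransferResult) BatchReport {
	var report BatchReport
	for _, r := range results {
		if r.Err != nil {
			report.Failed = append(report.Failed, r.PieceID)
		} else {
			report.Succeeded = append(report.Succeeded, r.PieceID)
		}
	}
	return report
}
```

In this model the unbatched message count grows linearly with the number of failures, which is the feedback loop: more failures mean more connections, and more connections mean more failures on a weak router. With a single report per batch the count stays at one regardless of the failure rate.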
I am now 99% sure that this is exactly the big issue in production. I will put it at the top of the list.