Possible problem with v1.60.3 and docker in Synology. Anyone else?

Hi

I noticed a big drop in my egress traffic this month (BTW, when was v1.60.3 deployed?). At first I thought it was just a holiday thing, but then I noticed that refreshing the node stats page takes 1 or 2 minutes, which is abnormal.
Then I noticed that the node was using ~8GB of memory, and since the Synology as a whole is only using about 4 GB of RAM, the node must be disk caching like crazy, which explains the delay in refreshing the page and the low egress.
Anyone noticed this on their syno?
What could be the problem and how could I diagnose it better?

tx

The egress drop has been observed by other people, so it is probably unrelated to your node: Bandwidth utilization comparison thread - #1665 by storaje

Regarding your node problems, the most typical scenario in which a storage node consumes a large amount of memory (and, coincidentally, its UI gets slow) is when the drives are not fast enough to cope with the traffic. By large I mean more than a gigabyte. This might be the case with SMR drives, write-inefficient parity schemes, or btrfs as the file system, and it can be worked around by moving the databases to separate, faster storage (maybe an SSD, maybe just a separate drive) and reducing the allowed number of concurrent connections.
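As a rough sketch of what that workaround looks like with the Docker setup (the paths are placeholders, and the exact option names should be double-checked against your node version's config.yaml):

```
# Bind-mount a directory on faster storage for the databases
# (other required flags and environment variables omitted).
docker run -d --name storagenode \
  --mount type=bind,source=/volume1/storj/identity,destination=/app/identity \
  --mount type=bind,source=/volume1/storj/data,destination=/app/config \
  --mount type=bind,source=/volume2/storj-db,destination=/app/dbs \
  storjlabs/storagenode:latest

# Then, in config.yaml inside the data directory:
#   storage2.database-dir: /app/dbs          # keep the SQLite databases on the fast mount
#   storage2.max-concurrent-requests: 40     # cap concurrent transfers (0 = unlimited)
```

Capping concurrency trades away some ingress, but it keeps the node from queueing more writes in memory than the disks can absorb.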

I don't know what kind of diagnostics Synology provides, but on Linux the easiest tool to confirm or rule out this scenario is to run iostat -dmx 30 and observe the %util column. If it is consistently above 50%, that would be it.
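Concretely, something like this over SSH:

```
# iostat comes from the sysstat package.
# -d: per-device stats, -m: MB/s, -x: extended stats including %util,
# 30: report every 30 seconds.
iostat -dmx 30
# Watch %util for the data disks (sda, sdb, ...) and the md devices;
# values consistently near 100% mean the disks cannot keep up.
```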


Yep. I do have 90s and even 100s in the %util column on the SATA and some md lines.
Anyway, I don't really get it. I do use btrfs (on SATA disks, not SMR), but I've always used btrfs and only now do I have this high memory problem.
I can't move the DB to a single SSD. I don't trust SSDs. It would have to be a RAID 1. I'm not using 2 bays just to hold the Storj DB, and on top of that there is the cost of 2 SSDs.
I could reserve 8GB of RAM for the Storj container, though. Do you think that might solve everything (in case the low egress is due to the cache increase)?

I recall in the btrfs thread I linked that initially it was also fine, up to some point. Though, in my case I noticed that earlier, as I was annoyed by nodes not shutting down quickly. I don't remember the memory usage at that point. If I had to guess, I'd say that it was because of database updates, as each upload and download triggers an insert or two, and the frequency of downloads grows with the size of a node. As such, I might have just passed some saturation point at which the file system started lagging.

Note that the databases are not critical for storage nodes. They only collect runtime statistics or temporary information whose loss does not prevent the node from operating. Databases can also be regenerated at any time. Hence, storing them on a less reliable medium is not necessarily a bad idea.

I doubt it, though I don't have any specific arguments. If the memory usage grows due to lagging writes, it will also easily grow beyond any specific memory reservation. It will likely be better to use storage2.max-concurrent-connections, add some block-level caching (like bcache/lvmcache; in theory even offloading read IO with a writethrough-style cache should help), or maybe (though whether it would help is still hypothetical at this point) wait for someone to implement this change.
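For the block-level caching route, a generic lvmcache sketch (plain Linux with a reasonably recent LVM; the volume group, LV and device names are made up, and DSM does not expose this directly, so treat it as an illustration of the idea rather than a Synology recipe):

```
# Assumes an existing volume group "vg0" holding the node's LV "storj"
# and a spare SSD at /dev/nvme0n1 (all names are examples).
pvcreate /dev/nvme0n1
vgextend vg0 /dev/nvme0n1
lvcreate -L 100G -n storj_cache vg0 /dev/nvme0n1
# writethrough: reads get offloaded, and a cache failure cannot lose writes
lvconvert --type cache --cachevol vg0/storj_cache \
          --cachemode writethrough vg0/storj
```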


in 95% of all cases excessive memory usage is due to the HDD being unable to keep up.
though if the storagenode isn't getting enough CPU, that can also lead to excessive memory usage
(rarely the case, because storagenodes generally don't use much compute).

the memory usage will just keep going up until the issue is resolved or ingress/egress traffic drops. a single node should use at most 1 or 2 GB of memory and on avg stay between 50-250MB.
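if you want to sanity-check what DSM shows, docker's own counters are a decent reference (assuming the container is called storagenode):

```
# one-shot snapshot of the container's memory and CPU as seen by docker itself
docker stats --no-stream storagenode
```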

you could try moving your databases to an SSD, or using an SSD as a cache for your storage, which is possible within the synology ecosystem.
no idea how to configure that though.
@BrightSilence runs with an SSD cache on his synology storage setup, maybe he can give you a few pointers on how to do this…

personally i would go with the cache, as moving the storagenode DBs to SSD storage creates the issue that if either the SSD or the HDD(s) fail, you will have issues with the node.

however if a cache drive fails, you are basically just back to running without it…
which i would consider preferable.

storagenodes create a ton of IO, mostly for various tasks such as database entries and whatnot… a cache will reduce this amount greatly, by anywhere from 50-60% or more.

or you could consult the synology guides on their website, they seem to have quite a few of those, if memory serves…

Well, I did have a 250GB NVMe in my syno as cache. But it died (a little more than 1 year of life). I might buy another one and test if the node goes back to normal.
The thing with the syno cache is that you can't program the cache for what you want. You only get to assign it to a specific volume. I have lots of other stuff going on in my syno and I have no way of knowing if the cache is serving the node. Maybe I could create a small volume to hold the storage node DB and then assign the cache to that volume only…
What would be a reasonable size for a DB volume?

My largest database files are 2-3MB. So I suppose 100MB would be enough for several nodes.
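If you want to check your own, something along these lines will do (the path is just an example; remember to include the -wal/-shm files, which can be bigger than the .db files themselves):

```
# Total size of the node's SQLite files, including write-ahead logs
du -ch /volume1/storj/storage/*.db* | tail -n 1
# Largest files first
ls -lhS /volume1/storj/storage/*.db*
```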

Well, I do have bandwidth.db-wal at 97MB, bandwidth.db at 46MB and piece_expiration.db at 16MB.
Anyway, the minimum volume I can create is 10GB.
So, I will get an NVMe replacement of 250GB to serve as cache for a 10GB volume filled with less than 200MB… can't shake the feeling there should be something smarter…

My database files are sometimes a few hundred megabytes in size. And they are constantly growing, so I think you should plan for a few GB if you're planning long term.

endurance of the SSD would be my priority; caching can be very wear-heavy…

Right, the wal files! Sorry, didn't count them. Indeed, they're larger.

Sure. I had a Samsung 970 MLC with a 5-year warranty. I thought I couldn't do better. It lasted a year. 100% used up.

the 250GB model only has 150 TBW.

850 million writes at a minimum sector size of 4k would be 3.4TB,
and the 4.17 billion reads would be equal to a minimum of 16.7TB,
so a total of 20TBW used, minimum; granted, reads aren't always counted as wear, that depends a bit on the brand…
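fwiw you can read the drive's own wear counters directly, assuming smartctl or nvme-cli is installed and the device really is /dev/nvme0 (adjust to your setup):

```
# "Percentage Used" and "Data Units Written" (1 unit = 512,000 bytes)
# show how much of the rated endurance has actually been consumed.
smartctl -a /dev/nvme0
# or, with nvme-cli:
nvme smart-log /dev/nvme0
```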

it does seem like it's a 4K sector based drive, but it's possible you have been running 8k or 16k sectors on it… if so, that would double or quadruple the wear.

it also doesn't say how much internal data sorting it has done… it could be that the drive has been filled close to capacity, which makes it shuffle data around internally a lot…
which will lead to the drive dying early.

these drives are basically consumer drives afaik, i wouldn't expect them to last too long…
i'm caching at about 1MB/s, so 3600MB an hour,
so 86GB a day, or 31.5TB a year.

but that's just the avg over the last 24 days, which have been quite slow…
checking a bit further back, it does seem like during high activity i hit close to 2MB/s cache writes.

i have seen peaks of like +4MB/s for a single node; i think this is in part due to data amplification from database writes, since any write will take up 4k because of the minimum sector size.

but i digress… i would assume 1 to 2 MB/s avg cache workload for a storagenode.

so that would wear out a 970 EVO at 250GB in like two years, ofc that's not sustained…
not sure i would recommend that as a cache drive.
generally the issue is that much of what is cached is not used again and then just thrown out.
depends on the cache mechanics, i suppose… so YMMV

for cache i would use a high-endurance enterprise drive with PLP and good write iops.
ofc the iops requirement is more dependent on the amount of workload you want to put on it.
i like having good Q1D1 performance, but that can be tricky.

i have had good luck with older intel enterprise drives; their DC 3600 series is quite excellent.
i got a 1.6TB Intel DC P3600 i think… though it might be a DC P3610; it has a rating of >40 PBW.
you can also get the DC S3600, which is the sata version… you might be able to get an m.2 version of the newer ones, like say the DC P4600, but its drive endurance is much lower… still in the 5 or so PBW range for a 1.6TB SSD.

the DC S3600 comes in a 400GB version with like 5 PBW or more i guess… maybe it's 10, which would fit with the 1600GB version having about 40PBW, because the PBW scales linearly with capacity.

been a while since i looked at the exact specs for them.
i also have a DC S4600 which, oddly enough, is slower in iops than the DC P3600.
this might be down to latency… due to the higher bandwidth of the PCIe version.

since the DC 4600 series should far exceed the DC 3600 series in performance… and even though SATA can keep up with the iops, it does end up doing fewer iops, even if the bandwidth used is like 10-20MB/s.

i didn't really expect that… and no, i'm not hitting the SATA iops limit, since that exceeds like 120k iops… but for some reason, doing like Q1D1 it will cap out much earlier on the SATA drive than on the PCIe one, even though the 4600 series is far superior…

so yeah i dunno… maybe it's the SATA controller throwing me a curveball…

i also got a Fusion IO 1.6TB PCIe SSD, but i have made that into an L2ARC drive, because it was wearing out too quickly, even though it has like 5-something PBW.

but i do load my cache pretty heavily…
cache is a heavy workload… and should be well considered in advance.

on another note… i don't think your drive wore out…
because it never used any of its reserves… which is kinda weird…
usually it would get bad sectors/blocks from wear and then draw on its 10% reserve capacity,
until it ran out, and then it would be worn out…
which can be further extended by overprovisioning… but not sure if that's possible on consumer drives… depends on their firmware/controller i think.

I don't think my SSD was used up. It just went bad.
The SSD was always at full capacity… after a few days, of course…
Anyway, I bought it from Amazon at 80€. They offered a refund. I bought a new one (exactly the same) for 45€. So… good thing…


i tried the overprovisioning thing with my fusion iomemory SSD, to try and make it last longer… but it didn't seem to matter… not sure if i did it wrong or something…

i didn't exactly follow any online guides to it… but i can't imagine how they would do it better than what i did…
my main problem with it was that it started doing a ton of internal writes, wearing itself out, even though it was running at quite low sustained writes and not filled…

all this stuff is always a very educational experience lol, and there are always more subjects to expand upon.

basically the idea with overprovisioning comes down to the fact that SSDs shuffle data around internally, to convert it from SLC to MLC, TLC or QLC states… this means lots of extra writes, and the less room the SSD has to work with, the more it has to shuffle data around.

so overprovisioning is like setting a 1.6TB SSD to, say, 1TB of usable capacity, to be sure it has enough free space to easily move data around internally…

like, say, from SLC to MLC: that would require 3GB of capacity for each 1GB of data (counting both the SLC-staged copy and the final copy), and with TLC or QLC it just gets even worse… TLC being 4 to 1 and QLC 5 to 1.
so each 1GB written takes up 4GB of QLC capacity (running in SLC mode) until it's rewritten/converted into 1GB of QLC.

so one can easily see how a disk could quickly run out of capacity internally when doing that…
upload 10GB and it takes up 40GB and needs a further 10GB before it can delete the 40GB of temp data.

and if it then starts to run low on capacity, all the data gets scattered into all the little free blocks of capacity, making the workload more demanding because it gets nearer to random reads/writes.

and then, on top of that, it might need to split the work into batches due to lack of space… making it start to wear the drive.

SSDs are not always better than HDDs, even though they are in most aspects lol.
the wear on SSDs certainly is very different and something we don't think much about.

The SSDs will keep dying; SSD caching on Synology is a poorly thought out, misleading marketing gimmick. My advice would be not to use it at all and instead add more RAM.

There are other problems with it, besides significant wear.

This Reddit comment summarizes the issues with it: Reddit - Dive into anything

If random IO is the bottleneck, you can turn off checksumming on the volume, or reconfigure the pool with a different layout.

Got the point…
I never thought SSD caching on synos could be such a bad idea…
Anyway, now I have a brand new 250GB NVMe. I've decided to start by assigning 100GB to the main volume, not to the dedicated storage node database volume, which I haven't created yet. I wanted to test whether the RO cache on the main volume would take care of my problem.
It kind of did. As soon as I restarted the NAS, the “RAM” (fake!) of the storage node container started to rise, and so did the used cache on the NVMe. The container “RAM” eventually went to 6GB, but then started to decrease (as the NVMe cache went to 50GB) and sits now at 1.5GB. This is still too high, especially since I've read the whole syno Docker is limited to 2GB of real RAM. Hopefully it will decrease more…

The problem with the main advice in the links you've shown (“Increase system RAM instead of using NVMe cache!”) is that the syno Docker won't use all the system RAM. I have lots of free RAM waiting to be used, but the containers won't touch it.

Regarding expensive Intel cache SSDs: since I'm using the cache just because of storj, it makes no sense to buy one. I've never bought anything for storj. Even the NVMe was first intended for a desktop. I put it in the NAS because in the end I didn't need it for the desktop. It was just lying around…
If you buy stuff specifically for storj, you're wasting money…

Regarding your advice, I don't see how to disable checksums on a volume. As far as I know, you can only disable them on each shared directory (?!?!).
And how would a different pool layout help? Can you give me an example?

PS - Reloading the storage node stats page takes 1 second now… :slight_smile:

Oh, that's another can of worms. Synology uses an outdated fork of Docker that exhibits tons of issues, including misreporting memory use. Don't believe what it reports.

On one hand, you don't have to use Docker: storj binaries are self-contained executables (like anything else written in Go) and therefore don't benefit from the dependency isolation containers provide. You can run the storage node natively (with systemd on DSM7 or upstart on DSM6).
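Very roughly, a native setup could look like the sketch below; the binary location, user and config path are placeholders, and this is only the shape of the idea, not a tested DSM7 recipe:

```
# Write a minimal unit file and enable it (DSM7 is systemd-based).
cat > /etc/systemd/system/storagenode.service <<'EOF'
[Unit]
Description=Storj storage node
After=network-online.target

[Service]
User=storj
ExecStart=/volume1/storj/bin/storagenode run --config-dir /volume1/storj/config
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now storagenode
```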

On the other hand, you can ignore it: the free RAM was not intended for the container to use; instead it was meant to stay unused so it can serve as filesystem cache, offloading some of the random IO from the disk subsystem, hopefully enough for it to be able to handle the rest of the IO.

Btrfs allows controlling it at very fine granularity (IIRC at the file level), but Synology only allows control at sub-volume granularity, and only at sub-volume creation time. Synology calls btrfs subvolumes “shares”. Your best bet, to avoid yanking the carpet from under DSM and risking breaking some assumption it may have made, would be to create a new “share”, not enable “consistency verification” on it (IIRC this is what they call checksumming), and then copy the storj data from the old share to the new one.
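The copy itself can be as simple as the following (share names are examples; stop the node first so the databases are not being written to while you copy):

```
docker stop storagenode
# -aHAX preserves permissions, hard links, ACLs and xattrs;
# --info=progress2 needs rsync 3.1+.
rsync -aHAX --info=progress2 /volume1/storj/ /volume1/storj2/
```

Afterwards, point the node's mounts at the new share and start it again.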

That is likely less applicable to your specific case, but if you still don't get enough random IO performance, you may want to consider RAID10 instead of RAID5/SHR1. This comes dangerously close to the “wasting money” you mentioned: unless your regular workload benefits from RAID10, you would be wasting time and space, and as a result money, doing this just for storj.

i bought mine used, at like 1-2$ per 10GB a good while back, so really they are not more expensive than new, less durable SSDs; generally the used SSD market seems to be priced mostly on capacity, and the new SSDs set the pricing.

ofc there are some supply and demand considerations, some of the drives i bought have tripled in value after i got them.

Mirrors are not a good solution for storagenodes, as most of the workload is writes.
for example, right now i'm seeing 24 write IOs for every 1 read IO.
mirrors are not really faster for writes; in fact they are often slower, because both disks want to write in sync, and because their reads are not in sync, one often ends up waiting for the other.

ofc this is where the raid 0 in raid10 comes in, so it's a stripe across both, but again, a stripe across two mirrors doesn't give the storage more IO, which is the main limitation for storagenodes.
so long story short, even a raid10 wouldn't be a big performance increase, because each write goes to all disks at the same time; the only advantage of raid10 is the raid0 part, which will double the bandwidth and thus halve the time some IO takes to complete.

but i would expect this to be a very, very limited uplift.

similarly, read caching mostly sees very low usage, since the majority of storagenode IO is writes.

Yep, it makes sense. Most storage node IO should be appends to databases. I think I'll wear this SSD a bit and then buy another one (so that both won't fail at the same time) so I can use a read-write cache. I acknowledge the recommendations against it, but I've gotta do something and could not come up with a better solution. This time, however, I won't repeat the mistake of using the whole SSD for caching. I'll do 50%…
In the meantime, I think Synology is using the SSD cache differently. Previously it would just fill up regardless of cache size. Now it seems its algorithm got smarter, or they just added a simple rule stating “whatever cache size the stupid ignorant user sets up, use only 2/3 of it”. I'm saying this because, out of the 250GB NVMe, I gave the syno 100GB to play with. It's been a few days and the reported cache usage remains pretty constant at 65.9GB, even though I threw some heavy Plex database action at it.
I've noticed since then that my reported container memory use has increased again (~2 or ~3 GB). Anyway, %util is sustained at ~50% instead of the ~90-100% it was at without the NVMe RO cache. Also, access to the storage node status page is fast.
I know you can't really trust the “used container memory” shown by Synology Docker, but so far it correlates pretty well with %util and the access time to the status page (RO cache!). When I hit 5-6 GB of container used memory, everything turns to shit.
