ERROR collector unable to delete piece v1.22.2 following DB lock

Hi All,

What do you think? Is something broken, or about to break? It looks like an expired piece failed to delete due to a DB lock, and then the node got confused.

Node (v1.22.2) is running in Docker on Alpine Linux 3.13, virtualised on x64
Node data is on an F2FS filesystem, mounted at the virtual level on ReFS

I only noticed because I was about to upgrade to the new Docker release, and was checking the logs when I spotted the error.

It's always the same satellite - 12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs
And always the same piece - Q2CO4W4OONKJBJL7WDWQBDQGICNVOWD64XB23YT3WVLAZGGAHV5A

It repeats in the logs every hour, just this one piece and satellite, and it started on 25th Feb 2021.

Isn't there some scrub or garbage collection process that should clean this up? Any ideas on how I could fix it? :slight_smile:

#Edit - so, tracing that piece.

Upload started - Feb 9th 2021

2021-02-09T23:50:57.692Z INFO piecestore upload started {"Piece ID": "Q2CO4W4OONKJBJL7WDWQBDQGICNVOWD64XB23YT3WVLAZGGAHV5A", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Available Space": 758338564160}

Upload Completed - Feb 9th 2021
2021-02-09T23:51:25.896Z INFO piecestore uploaded {"Piece ID": "Q2CO4W4OONKJBJL7WDWQBDQGICNVOWD64XB23YT3WVLAZGGAHV5A", "Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Action": "PUT", "Size": 2319360}

Request to Delete expired Piece - DB locked - Feb 25th 2021
2021-02-25T00:29:09.470Z ERROR collector unable to update piece info {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Piece ID": "Q2CO4W4OONKJBJL7WDWQBDQGICNVOWD64XB23YT3WVLAZGGAHV5A", "error": "piece expiration error: database is locked", "errorVerbose": "piece expiration error: database is locked\n\tstorj.io/storj/storagenode/storagenodedb.(*pieceExpirationDB).DeleteFailed:99\n\tstorj.io/storj/storagenode/pieces.(*Store).DeleteFailed:546\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:103\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Repeating hourly error for the expired piece, from then until now
2021-02-25T00:30:22.082Z ERROR collector unable to delete piece {"Satellite ID": "12L9ZFwhzVpuEKMUNUqkaTLGzwY9G24tbiigLiXpmZWKwmcNDDs", "Piece ID": "Q2CO4W4OONKJBJL7WDWQBDQGICNVOWD64XB23YT3WVLAZGGAHV5A", "error": "pieces error: filestore error: file does not exist", "errorVerbose": "pieces error: filestore error: file does not exist\n\tstorj.io/storj/storage/filestore.(*blobStore).Stat:99\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).pieceSizes:239\n\tstorj.io/storj/storagenode/pieces.(*BlobsUsageCache).Delete:220\n\tstorj.io/storj/storagenode/pieces.(*Store).Delete:298\n\tstorj.io/storj/storagenode/collector.(*Service).Collect:97\n\tstorj.io/storj/storagenode/collector.(*Service).Run.func1:57\n\tstorj.io/common/sync2.(*Cycle).Run:152\n\tstorj.io/storj/storagenode/collector.(*Service).Run:53\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2.1:81\n\truntime/pprof.Do:40\n\tstorj.io/storj/private/lifecycle.(*Group).Run.func2:80\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:57"}

Have you tried restarting to see if you continue to see the same error message?

Hi :slight_smile: yep, I restarted it in the morning - it's been up 8 hours and the error is still appearing.

CONTAINER ID   IMAGE                          COMMAND         CREATED       STATUS       PORTS                                                NAMES
1346266dc8a8   storjlabs/storagenode:latest   "/entrypoint"   8 hours ago   Up 8 hours   0.0.0.0:14002->14002/tcp, 0.0.0.0:28967->28967/tcp   storagenode

I think I've hit a scenario that wasn't planned for - I'm still trying to pull the stats together so I'm guessing here, but it looks like SQLite was the cause of the locking issue.

I think it was caused by a long disk queue, which drove CPU load to 100% and generated lots of temp files, which in turn caused SQLite to either hit its connection limit or, more likely, time out.

I think this caused the transaction that deletes the expired piece to fail partway through, i.e. the file was deleted on disk but the DB was never updated, and there was no rollback, so the node is now out of sync with what it expects - it just keeps trying to delete a file that doesn't exist - hmm.

I'm tempted to either touch the missing piece on disk so the collector has something to delete, or edit the SQLite DB and drop the row, although the piece ID is stored as a blob :thinking: not sure if I'd break more by fiddling.
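For anyone curious, this is roughly how I confirmed the blob really is gone from disk before fiddling any further. It's only a sketch and makes a couple of assumptions: the default Docker mount at /app/config, and the usual layout of storage/blobs/<satellite folder>/<first 2 chars of piece ID>/<rest of piece ID>.sj1 with names in lowercase base32.

storj:~# # search for any blob whose filename contains a chunk of the piece ID (skip the first 2 chars, they become the folder name)
storj:~# find /app/config/storage/blobs -iname '*oonkjbjl7wdwqbdqgicnvowd64xb23yt3wvlazggahv5a*'
storj:~# # no output here, so the file really has been deleted from disk already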

It might have something to do with F2FS. What is the size of your node?

It's tiny, only 4 months old and about 460GB - yeah, F2FS could be the problem, good point! Let me check and make sure I've not hit a filesystem limit, that would be embarrassing.

#Edit - I've not run out of room :slight_smile: although the warm data dirty segment count is high, so I might need to run GC more frequently (see the tuning sketch after the status output below), although there's still plenty free.

storj:~# cat /sys/kernel/debug/f2fs/status

=====[ partition info(dm-0). #0, RW, CP: Good]=====
[SB: 1] [CP: 2] [SIT: 44] [NAT: 76] [SSA: 1169] [MAIN: 597236(OverProv:2192 Resv:1119)]

Utilization: 32% (98268137 valid blocks, 207192133 discard blocks)
  - Node: 474341 (Inode: 474340, Other: 1)
  - Data: 97793796
  - Inline_xattr Inode: 1900
  - Inline_data Inode: 347
  - Inline_dentry Inode: 11

Main area: 597236 segs, 597236 secs 597236 zones
    TYPE            segno    secno   zoneno  dirty_seg   full_seg  valid_blk
  - COLD   data:   153866   153866   153866       2373        985    1383851
  - WARM   data:   546595   546595   546595      77135     143428   96320156
  - HOT    data:      867      867      867        136        128      89789
  - Dir   dnode:      870      870      870         51          1      11664
  - File  dnode:      868      868      868       1049         15     462677
  - Indir nodes:        2        2        2          0          0          0

  - Valid: 144563
  - Dirty: 80739
  - Free: 371934 (371934)

CP calls: 382 (BG: 333)
  - cp blocks : 1567
  - sit blocks : 4767
  - nat blocks : 9113
  - ssa blocks : 522
GC calls: 288 (BG: 288)
  - data segments : 219 (219)
  - node segments : 69 (69)
Try to move 63536 blocks (BG: 63536)
  - data blocks : 42046 (42046)
  - node blocks : 21490 (21490)
Skipped : atomic write 0 (0)
BG skip : IO: 209, Other: 5

Extent Cache:
  - Hit Count: L1-1:74642 L1-2:0 L2:14
  - Hit Ratio: 81% (74656 / 91923)
  - Inner Struct Count: tree: 1321(0), node: 17

BDF: 92, avg. vblocks: 300

Memory: 166901 KB
  - static: 137497 KB
  - cached: 283 KB
  - paged : 29120 KB
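
In case it's useful to anyone else on F2FS, this is roughly how background GC can be made more aggressive. A sketch only - it assumes the kernel exposes the standard f2fs sysfs knobs for the device (dm-0 here, matching the partition info above), and the values are just examples to experiment with, not recommendations:

storj:~# # sleep times are in milliseconds; lowering them makes background GC run more often while the disk is idle
storj:~# echo 10000 > /sys/fs/f2fs/dm-0/gc_min_sleep_time
storj:~# echo 30000 > /sys/fs/f2fs/dm-0/gc_max_sleep_time
storj:~# # on newer kernels you can also force urgent GC for a while, then switch it back off
storj:~# echo 1 > /sys/fs/f2fs/dm-0/gc_urgent
storj:~# echo 0 > /sys/fs/f2fs/dm-0/gc_urgent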

I decided to manually delete the entry from the SQLite DB, as it was never going to stop logging otherwise.

Row dropped from the DB, will see if the error disappears - it was the only entry which had a value set for deletion_failed_at.

sqlite> select hex(piece_id) from piece_expirations where deletion_failed_at is not NULL;
sqlite> delete from piece_expirations where deletion_failed_at is not NULL;

The row in question:

piece_expirations VALUES(X'af2c42003efc826ab4361f73f9d890942146fe0ebe806786f8e7190800000000',X'8684ee5b8e735490a57fb0ed008e06409b57587ee5c3ade27bb5560c98c03d7a','2021-02-23 23:49:59.356297216+00:00','2021-03-03 14:08:29.850290797+00:00',0);
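
For anyone tempted to do the same, the slightly less reckless version is to stop the node and back up the database file before touching it. A sketch only, assuming the default Docker setup with the data mounted on the host, sqlite3 installed on the host, and the expirations living in piece_expiration.db (replace /path/to/storagenode with your actual mount):

storj:~# docker stop -t 300 storagenode
storj:~# cp /path/to/storagenode/storage/piece_expiration.db /path/to/storagenode/storage/piece_expiration.db.bak
storj:~# sqlite3 /path/to/storagenode/storage/piece_expiration.db
sqlite> delete from piece_expirations where deletion_failed_at is not NULL;
sqlite> .quit
storj:~# docker start storagenode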

Did it resolve the issue? Did you notice any other issues?

Hi, yep it has resolved the issue - no more "pieces error" entries in the log. I reckon that as it was an expired piece, the satellite wouldn't care about it on audit, since the piece was meant to be expired and gone from disk anyway - I think I might have got away lightly.
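
If anyone wants to double-check the same thing on their node, one quick way is to grep the logs for audit requests against the specific piece - a sketch that assumes the container is named storagenode and audits show up as GET_AUDIT actions:

storj:~# docker logs storagenode 2>&1 | grep 'Q2CO4W4OONKJBJL7WDWQBDQGICNVOWD64XB23YT3WVLAZGGAHV5A' | grep 'GET_AUDIT'
storj:~# # no output (at least within the retained logs) means the satellite never audited this piece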

#Note: I wouldn't recommend other SNOs edit the SQLite databases on their nodes - it's a bad thing to do. Only take the risk if, like me, you don't mind if the node breaks.

I've not noticed any other issues, apart from the EU-north satellite hating me :joy:

Thanks for your interest :slight_smile:

I'm now playing with the uplink client on the US beta - it's really interesting and cool how Storj works. I'll probably be posting lots of questions on that next.

C.P.


Another fix would be to create an empty file with the right name in the right folder. The next time the collector runs, the placeholder file will be deleted. I just did it and the hourly error message is gone.
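
Roughly like this, for anyone who prefers that route - a sketch only, assuming the default Docker mount at /app/config and the blobs/<satellite folder>/<first 2 chars of piece ID>/<rest of piece ID>.sj1 layout in lowercase base32; double-check you are in the folder for the right satellite before touching anything:

storj:~# # find the 'q2' prefix folder for this piece (if more than one satellite folder has a 'q2' sub-folder, pick the one belonging to the satellite from the error)
storj:~# prefix_dir=$(find /app/config/storage/blobs -maxdepth 2 -type d -iname 'q2' | head -n 1)
storj:~# # create an empty placeholder named after the rest of the piece ID, lowercase, with the .sj1 extension
storj:~# touch "$prefix_dir/co4w4oonkjbjl7wdwqbdqgicnvowd64xb23yt3wvlazggahv5a.sj1"
storj:~# # the collector should remove it on its next hourly run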
