Fsck isn't working for NVMe drive

I am unable to successfully run fsck on my NVMe drive. The drive is used exclusively for the storagenode, so nothing else is on the disk.

$ sudo e2fsck -f /dev/nvme0n1p1
e2fsck 1.47.1 (20-May-2024)
storj3: recovering journal
Superblock needs_recovery flag is clear, but journal has data.
Run journal anyway<y>? yes to all
e2fsck: unable to set superblock flags on storj3

storj3: ********** WARNING: Filesystem still has errors **********

$ sudo e2fsck -b 512000000 /dev/nvme0n1p1
e2fsck 1.47.1 (20-May-2024)
Superblock needs_recovery flag is clear, but journal has data.
Recovery flag not set in backup superblock, so running journal anyway.
storj3: recovering journal
Superblock needs_recovery flag is clear, but journal has data.
Recovery flag not set in backup superblock, so running journal anyway.
Superblock needs_recovery flag is clear, but journal has data.
Recovery flag not set in backup superblock, so running journal anyway.
e2fsck: unable to set superblock flags on storj3

storj3: ***** FILE SYSTEM WAS MODIFIED *****

storj3: ********** WARNING: Filesystem still has errors **********

I have tried with backup superblocks, but they all fail.
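For reference, backup superblock locations can be listed without touching the filesystem; a sketch, assuming dumpe2fs from e2fsprogs:

    # Read-only: prints "Backup superblock at ..." for each block group
    sudo dumpe2fs /dev/nvme0n1p1 | grep -i superblock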

nvme0n1     259:0    0   3.6T  0 disk
└─nvme0n1p1 259:1    0   3.6T  0 part /mnt/storj

What am I doing wrong?
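One thing worth ruling out: the lsblk output above shows the partition mounted at /mnt/storj, and e2fsck must not run on a mounted filesystem. If it was mounted during the runs above, it would need to be unmounted first; a sketch:

    # e2fsck on a mounted filesystem can cause severe damage; unmount first
    sudo umount /mnt/storj
    sudo e2fsck -f /dev/nvme0n1p1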

Yes, it has a few GBs. This is my third attempt at fixing the issue. It’s a new node, so I had formatted the drive on the previous two occasions.

Why are we talking about formatting?

Just to reiterate, I need help fixing my new node, as I keep getting “file does not exist” errors. While trying to fix the issue by running fsck, I get the errors above.

Are they errors, or warnings?

There are many benign “does not exist” warnings around: if a file was deleted by a customer before its TTL expired and has already been removed via a bloom filter, it will show up as an error later when the TTL runs out.

In this case you should rerun e2fsck until it no longer says there is anything weird left in the filesystem.

It cannot write a superblock to your drive. Meaning, it either rejects writes, or accepts them and drops them on the floor.
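A quick way to check the first case, assuming blockdev from util-linux:

    # Prints 1 if the kernel has the device flagged read-only, 0 otherwise
    sudo blockdev --getro /dev/nvme0n1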

Your NVMe drive is likely dead. Becoming read-only is a safe way for most decent drives to die. What remaining endurance does it report? What’s the rest of the SMART data it reports? smartctl -a /dev/nvme0

Yes, they are failed-audit errors, hence my worry.

$ sudo smartctl -a /dev/nvme0n1
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-44-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SSD 4TB
Serial Number:                      003124
Firmware Version:                   VC2S0388
PCI Vendor/Subsystem ID:            0x10ec
IEEE OUI Identifier:                0x00e04c
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00e04c 11ca4b5553
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0054):     DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     100 Celsius
Critical Comp. Temp. Threshold:     110 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0        0       0
 1 +     4.00W       -        -    1  1  1  1        0       0
 2 +     3.00W       -        -    2  2  2  2        0       0
 3 -   0.0300W       -        -    3  3  3  3     5000   10000
 4 -   0.0050W       -        -    4  4  4  4    54000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          32%
Percentage Used:                    0%
Data Units Read:                    34,182,859 [17.5 TB]
Data Units Written:                 1,702,831 [871 GB]
Host Read Commands:                 135,414,195
Host Write Commands:                14,600,777
Controller Busy Time:               0
Power Cycles:                       69
Power On Hours:                     292
Unsafe Shutdowns:                   46
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 8 of 8 entries)
No Errors Logged

Read Self-test Log failed: Invalid Field in Command (0x002)

It’s a brand-new drive.

What brand is the drive? PCI vendor 0x10ec is Realtek, and OUI 00:E0:4C is also Realtek. Realtek is pretty new to SSD controllers, and the stuff they are not new to (e.g. ethernet controllers) is known shit. I’m wondering who made that SSD. Who is so careless as to “forget” to update the PCI IDs to their own, and probably uses whatever buggy firmware they found?
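Those IDs are taken straight from the smartctl output above; to cross-check them, something like this should work (assuming lspci from pciutils):

    # NVMe controllers appear as "Non-Volatile memory controller";
    # -nn appends the [vendor:device] IDs (10ec = Realtek)
    lspci -nn | grep -i 'non-volatile'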

If I were you I would return it and buy something else. Maybe buy something used from eBay, but made by a company with an established reputation.

It lost data: strike one. It can’t repair the filesystem: strike two. It’s out.

This is a bad idea. I would explicitly format it with a 4096-byte block size, because 512 is a lie.

And due to this, consider an SSD with PLP (power-loss protection).

It’s without a brand, which is why I got it cheap. I had never used an NVMe drive before, so I went for it.

I just used the Disks utility in Ubuntu. I will try with a 4K block size.

Just so I don’t mess this up again, the right command :point_up: would be :point_down:

sudo mkfs.ext4 -b 4096 /dev/nvme0

Would that be correct?
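If the goal is the partition from the lsblk output above, a corrected sketch might look like this (/dev/nvme0 is the controller device, not a block device, so mkfs has to target the partition):

    # Format the partition, not the controller device, with 4 KiB blocks
    sudo mkfs.ext4 -b 4096 /dev/nvme0n1p1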

When I first formatted the drive, I was able to run fsck without any issues. It gave the list of inodes and all. Now it won’t, so I am just baffled.

Have a look at eBay. I recently bought a pair of Intel DC P3600 2TB SSDs with 98% endurance remaining for under $100.

I’m pretty sure you also want to ensure the partition is aligned to 4K, but I don’t have much experience with Linux filesystems.
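parted has a check for that; assuming it is installed, something like:

    # Asks whether partition 1 is aligned to the disk's optimal I/O boundaries
    sudo parted /dev/nvme0n1 align-check opt 1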

A common way to produce a fake high-capacity drive is to alter the firmware to report a higher capacity than is physically available. The disk will work initially, until you fill the available flash; then writes wrap around to the beginning, at which point “bad things happen”.

I would never use a no-name drive, let alone pay money for one. It’s never worth it.

If you can’t return it: open it up and count how many NAND chips are there and of what capacity. You might be surprised… Then submit a dispute with your credit card provider, since there is a difference between a poor-quality product and a fake.

I’ve been a victim of oversized USB flash, but yeah, I guess NVMe drives could misreport the same way. I know there were apps (h2testw, ValiDrive, CapacityTester, etc.) to verify… but I’m not sure which are USB-only (or if they even care). Maybe f3 on Linux?
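If f3 handles NVMe the same way it handles USB sticks, a destructive whole-device probe might look like this (f3probe is f3’s block-device mode; it overwrites the drive):

    # Overwrites and probes the raw device to detect fake capacity
    sudo f3probe --destructive --time-ops /dev/nvme0n1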


I had previously performed a surface test using EaseUS Partition Master. IIRC it was progressing at ~61 GB per minute and lasted about an hour. So 4000 GB / 61 GB per minute ≈ 65 minutes, which makes it seem like it’s indeed 4 TB.

It’s not the same thing. A surface test writes a bunch of data and reads a bunch of data; it will succeed even if the disk lies about its size, as long as the real size is larger than the block being written and read. The purpose of those tools is to verify that every logical sector is writable and readable, to find defects, not deceit. (BTW, it’s quite pointless to do it on an SSD, because an SSD remaps everything all the time as part of normal operation; it made sense on HDDs.)

Dedicated tools like f3 (and it’s really disheartening that we had to develop such tools in the first place) verify that, after you have finished writing to the end of the drive, the data originally written across the whole drive is intact. To do that, they write pseudo-random data, then read it back and compare it against the same pseudo-random stream regenerated from the same seed.

You can easily do that yourself if you want: send zeroes to openssl enc with a fixed password, and the sequence of bytes on the output will be random-looking yet always the same. Then read from the drive and compare.

Edit:

This is how you can do that (note: this is obviously destructive, all data on the drive will be destroyed):

  1. Write 4TB worth of pseudo-random data to the disk

    # openssl turns an endless stream of zeroes into a reproducible
    # pseudo-random stream; /dev/nvme0 is the controller device, so the
    # data must go to the namespace block device /dev/nvme0n1
    dd if=<(openssl enc -aes-256-ctr -pass pass:"MEOW" -nosalt </dev/zero 2>/dev/null) \
       of="/dev/nvme0n1" \
       bs=1G count=4096 \
       iflag=fullblock
    
  2. Then diff the contents of the disk against the same sequence. You don’t have to read the whole disk; the first few GB should suffice, since a wrap-around overwrites the beginning first. Something like this (replace the 10485760 with whatever, up to the disk size):

    # Compare the first 10 MiB of the regenerated stream with the device
    diff <(head -c 10485760 <(openssl enc -aes-256-ctr -pass pass:"MEOW" -nosalt </dev/zero 2>/dev/null)) \
         <(head -c 10485760 /dev/nvme0n1)
    

I disagree. Point by point:

  • It was already done correctly from scratch when the drive was first put into service, and yet it still failed. Repeating the same experiment and expecting a different outcome is not wise.
  • If the drive lies about its size, freshly formatting it will make it work again, but only until the amount of data written no longer fits in the flash that actually exists. Then it will fail in exactly the same way. You still can’t trust it.
  • “Low-level formatting” was a thing in the 1990s, when disks had a mechanical stepper motor to move the heads and could format themselves from a clean state. Since then, hard drives rely on servo tracks on the media to guide the heads. These tracks are written at the factory; if they are lost, your drive is toast. When it comes to SSDs, low-level formatting is meaningless: there are no tracks, there is nothing to format. An SSD reads and writes data at addresses that mimic logical blocks, in order to be a drop-in replacement for HDDs. NVMe SSDs don’t even do that, so the term loses its meaning completely.

And the most important bit:

If the media “started to work”, it still goes to e-waste. There are no second chances. It already failed. It can never be trusted again.

To be precise, no media can be trusted; that’s one of the reasons we do backups. Working media just has the benefit of the doubt: it hasn’t failed yet, and it may work another day.
Media that has failed has proved it’s bad. Why would you trust bad media?


I agree: fake drive. I agree with everything arrogant rabbit said in his last and previous posts.
I don’t think you should waste your time.

OP should simply ask himself two questions.

  1. Was the price too good to be true, i.e. under $60-80 USD per TB for generic garbage? (SSD prices have been elevated lately, over the last 4 months.)
  2. Is it still within its return window? If so, quickly return it.
    Save yourself further headache.