On a graceful power removal, the host system sends commands to the SSD that give it sufficient time to prepare for shutdown. This allows the SSD to flush data that is in transit or in temporary buffers to the NAND flash memory. On an unexpected power loss without prior command notification, however, data in transit from the host to the NAND media, or held in temporary buffers and not yet fully committed to the NAND media, is vulnerable to loss. Unsafe power outages or shutdowns can therefore cause critical data loss, so an SSD needs an effective method of ensuring data integrity against sudden power loss.
To minimize potential data loss during unsafe power outages or shutdowns, the PBlaze4 series includes a power-fail detection circuit with a power-loss capacitor. As Fig.1 shows, the e-Fuse module constantly monitors the SSD’s supply voltage. If the voltage falls below a defined threshold, the circuit predicts that an unexpected power loss is imminent. The switch (SW) is closed and the e-Fuse shuts down, disconnecting the host power supply and switching the SSD to the backup capacitor. The capacitor, sized with sufficient capacitance, then discharges to power the SSD while it flushes data in transit or in temporary buffers to the NAND media. When SSD power is restored, the capacitor recharges.
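How much flush time a backup capacitor can provide may be estimated from the energy stored in a capacitor, E = ½CV². The following is a minimal sketch under stated assumptions: the function name and all numeric values are hypothetical and are not PBlaze4 specifications, and conversion losses are ignored.

```python
def holdup_time_s(capacitance_f, v_start, v_min, load_power_w):
    """Estimate how long a backup capacitor can power a constant load.

    Usable energy is the difference in stored energy (1/2 * C * V^2)
    between the starting voltage and the minimum voltage at which the
    drive can still operate. Conversion losses are ignored.
    """
    if load_power_w <= 0:
        raise ValueError("load power must be positive")
    if v_min > v_start:
        raise ValueError("v_min must not exceed v_start")
    usable_energy_j = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return usable_energy_j / load_power_w


# Hypothetical example values, chosen only for illustration.
t = holdup_time_s(capacitance_f=0.001, v_start=12.0, v_min=6.0, load_power_w=5.4)
print(f"hold-up time: {t * 1000:.1f} ms")
```

The estimate shows why the capacitor must be sized against the worst-case amount of buffered data: the flush has to complete within the hold-up time.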
Single-level cell (SLC) and multi-level cell (MLC) are two types of NAND flash storage, designed to store 1 and 2 bits per cell respectively. SLC has superior write speed and longevity but lower capacity. MLC provides twice the capacity of SLC at the cost of cell lifetime, although its endurance is sufficient for many applications, including enterprise storage environments. Pseudo-SLC (pSLC) is a variant of MLC that brings SLC-like speed and durability to MLC. Typical pSLC write endurance is about half of SLC’s 60,000 cycles, compared with MLC’s 3,000.
As specified in NVMe1.1, metadata is contextual information about the firmware and about a particular LBA of data. It includes information on wear leveling, error correction, translation tables, the logical-to-physical mapping of data (FTL), read/erase counts, free/bad block bitmaps, and so on. Metadata correctness is critical to system reliability, and metadata size scales with SSD capacity. Many application scenarios also require the metadata to be reconstructed quickly at boot. Taking these factors into account, pSLC is chosen to store metadata.
With a single controller, the PBlaze4 Series splits the memory array (die or LUN) into two sections: a high-reliability section that is initialized to pSLC mode and a high-capacity section that operates in MLC mode, as shown in Fig.2. Important metadata that changes frequently is stored in the pSLC partition.
To better protect metadata, the PBlaze4 Series adopts multi-copy technology, which provides metadata redundancy and improves performance. As Fig.3 illustrates, if a metadata read request to one LUN fails, it can be serviced by another LUN in the set. The metadata remains available as long as at least one LUN in the set is functioning.
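The fallback behavior of a multi-copy read can be sketched as follows. This is an illustration only, not the PBlaze4 firmware: each LUN is modeled as a Python dict, a failed LUN as `None`, and the names `read_metadata` and `MetadataReadError` are hypothetical.

```python
class MetadataReadError(Exception):
    """Raised when no copy of the requested metadata can be read."""


def read_metadata(luns, key):
    """Return (lun_index, value) from the first LUN that can serve `key`.

    `luns` is the copy set: one dict per LUN, or None for a failed LUN.
    A failed LUN, or one that lacks the key, is skipped and the next
    copy in the set is tried.
    """
    for index, lun in enumerate(luns):
        if lun is None:
            continue  # this LUN failed; fall through to the next copy
        if key in lun:
            return index, lun[key]
    raise MetadataReadError(f"no readable copy of {key!r}")


# The first LUN has failed, so the read is served by the second copy.
copy_set = [None, {"ftl": b"map-v1"}, {"ftl": b"map-v1"}]
print(read_metadata(copy_set, "ftl"))
```

The read succeeds as long as one entry in the copy set is readable, which mirrors the availability property described above.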
The PBlaze4 Series uses the latest enterprise MLC NAND. Data stored in NAND is known to become corrupted, randomly and spontaneously, because of program/read disturb, accumulated P/E cycles, and limited data retention. Bit errors also increase as NAND flash memory scales below 2xnm process technology and transitions to 3-bit-per-cell architectures. NAND therefore requires an error-correcting code (ECC) to ensure data integrity. The error correction capability (the number of bit errors that can be corrected) depends on the ECC algorithm used.
The PBlaze4 Series uses Bose-Chaudhuri-Hocquenghem (BCH) codes as its ECC algorithm. BCH codes can correct multiple bit errors and are widely used with MLC NAND flash. Their biggest advantage is that they can correct any combination of errors (burst or scattered) within the error correction capability, and they are also simple to decode and implement.
The PBlaze4 Series supports BCH error correction of 100 bits per 4KB. On the previous-generation PBlaze3 Series, the error correction capability is 43 bits per 1KB. The future PBlaze5 Series will adopt LDPC, a more powerful ECC algorithm that can correct more errors with the same number of parity bits.
The stability and reliability of NAND must be safeguarded in several ways. ECC can correct burst or scattered errors within its error correction capability, but massive bit errors such as a whole-page or even whole-block failure can only be recovered through redundant array of independent NAND (RAIN) protection, a RAID-like scheme that offers device-level data protection. Although such events are rare, they must still be considered to avoid data loss.
As in PBlaze3, Memblaze’s previous-generation SSD series, user data is distributed across the LUNs using RAID5, and the parity information for each data stripe is stored on a different LUN in the NAND. As Fig.10 shows, PBlaze4 adopts an N+1 RAID group, where N is the number of user data elements. The element labeled P represents the parity data that is generated and stored alongside the user data. It is important to understand that the RAID group size N+1 is a parameter selected according to many factors. Several group sizes were considered during design, and the final implemented value provides a good balance between performance and capacity.
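The N+1 parity idea can be illustrated with byte-wise XOR, the standard RAID5 parity operation. The sketch below is a simplified model and not the PBlaze4 implementation: the helper names are hypothetical, chunks are short byte strings, and the placement of chunks on LUNs is not modeled.

```python
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def make_stripe(data_chunks):
    """Build an N+1 stripe: N data chunks followed by one parity chunk P."""
    return list(data_chunks) + [xor_chunks(data_chunks)]


def rebuild(stripe, lost_index):
    """Recover the chunk at `lost_index` by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return xor_chunks(survivors)


# N = 3 data chunks plus parity. Losing any single chunk is recoverable.
stripe = make_stripe([b"\x01\x02", b"\x10\x20", b"\xff\x00"])
assert rebuild(stripe, 1) == b"\x10\x20"
```

A larger N lowers the capacity overhead of parity (1 chunk in N+1) but means more chunks must be read to rebuild a lost one, which is one reason the group size is a design trade-off.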
NAND flash memory has a finite number of program/erase (P/E) cycles. In addition, read disturb errors occur more easily and data retention capability declines as P/E cycles increase. Because some data is hot and some is cold, the flash blocks used most often for hot data wear out first. Once the rated P/E count is exceeded, the reliability of the cells starts to decrease and the block eventually becomes unusable (a bad block), requiring the entire block to be replaced by a spare block.
How can flash wear-out be managed? The answer is wear leveling (WL), which evens out the distribution of P/E operations across all available blocks in the flash drive, thereby maximizing the endurance of the whole SSD. There are two types of WL, static and dynamic, and PBlaze4 uses both.
As mentioned earlier, the FTL maps each Logical Block Address (LBA) to a Physical Block Address (PBA). With dynamic WL, new data is written to free blocks, and the target block is chosen based on its P/E cycle count. After the new data is written, the map entry is updated to point to the new PBA, and the original PBA holding the old data is marked as invalid. Dynamic wear leveling thus addresses repeated writes to the same blocks by redirecting new writes to different physical blocks, avoiding premature wear-out of the actively used blocks. It is important to note that only dynamic data is recycled, which avoids additional wear. Since PBlaze4 uses a global FTL (for details, refer to Technical White Paper_MemSpeed), wear is distributed more evenly.
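The dynamic WL write path described above can be sketched as follows. This is a minimal model under stated assumptions, not the PBlaze4 FTL: mapping is done at block granularity for brevity (a real FTL maps at a finer granularity), the least-worn free block is assumed to be the preferred target, and the class and method names are hypothetical.

```python
class DynamicWearLeveler:
    """Toy FTL that redirects each write to the least-worn free block."""

    def __init__(self, num_blocks):
        self.erase_counts = [0] * num_blocks
        self.free_blocks = set(range(num_blocks))
        self.invalid_blocks = set()
        self.lba_to_pba = {}

    def write(self, lba):
        """Write `lba` to a new physical block and return that block."""
        if not self.free_blocks:
            raise RuntimeError("no free blocks; reclaim invalid blocks first")
        # Choose the free block with the fewest P/E cycles (lowest index on ties).
        target = min(self.free_blocks, key=lambda b: (self.erase_counts[b], b))
        self.free_blocks.remove(target)
        old = self.lba_to_pba.get(lba)
        if old is not None:
            self.invalid_blocks.add(old)  # the old copy is now stale
        self.lba_to_pba[lba] = target     # map entry links to the new PBA
        return target

    def erase(self, block):
        """Erase an invalid block, count the P/E cycle, and free the block."""
        if block not in self.invalid_blocks:
            raise ValueError("only invalid blocks can be erased")
        self.invalid_blocks.remove(block)
        self.erase_counts[block] += 1
        self.free_blocks.add(block)


wl = DynamicWearLeveler(num_blocks=3)
first = wl.write(lba=7)   # initial write
second = wl.write(lba=7)  # rewrite lands on a different physical block
assert first != second and first in wl.invalid_blocks
```

Repeated writes to the same LBA therefore rotate through the physical blocks instead of wearing out a single one.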
What about static data, which remains unchanged for long periods of time? Static wear leveling moves static data to a new location so that the original block can be used for data that changes more frequently.
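One possible way to select a static WL move is sketched below. The policy is an assumption made for illustration and is not taken from the PBlaze4 design: cold data is moved from the least-worn block that holds it into the most-worn free block, and only when the wear gap reaches a hypothetical `threshold`.

```python
def pick_static_move(erase_counts, cold_blocks, free_blocks, threshold):
    """Return (source, destination) for one static-WL move, or None.

    source:      least-worn block currently holding static (cold) data
    destination: most-worn free block, a suitable home for rarely
                 rewritten data
    A move is proposed only if the wear gap is at least `threshold`.
    """
    if not cold_blocks or not free_blocks:
        return None
    source = min(cold_blocks, key=lambda b: (erase_counts[b], b))
    destination = max(free_blocks, key=lambda b: (erase_counts[b], -b))
    if erase_counts[destination] - erase_counts[source] < threshold:
        return None
    return source, destination


# Block 0 holds cold data and is barely worn; block 1 is free and heavily worn.
counts = [2, 50, 48, 3]
print(pick_static_move(counts, cold_blocks={0, 3}, free_blocks={1, 2}, threshold=10))
```

After the move, the lightly worn source block can be erased and returned to the free pool, where it becomes available for frequently changing data.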
We already concluded from Fig.14 that data retention capability decreases as temperature increases. In addition, circuits may be damaged by excessive heat.
A dynamic thermal throttling (TT) technique is implemented in PBlaze4. As Fig.14 illustrates, following NVMe1.1, when the first temperature threshold (user-defined) is exceeded, a critical warning event is issued to the host, and the system then reduces performance dynamically in linear steps. Performance rises again automatically after the temperature decreases.
When the second temperature threshold (an internal hard threshold) is reached, all read/write operations cease immediately to prevent data loss from overheating. The device becomes usable again after a technician checks the thermal environment.
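The two-threshold behavior can be modeled as a function from temperature to an allowed performance level. The sketch below is illustrative only: the threshold values, the number of steps, and the exact step sizes are hypothetical and do not describe the PBlaze4 firmware.

```python
def performance_fraction(temp_c, warn_c, hard_c, steps=4):
    """Map a temperature to an allowed performance fraction in [0.0, 1.0].

    At or below `warn_c`: full performance (1.0).
    Between the thresholds: performance drops in `steps` equal steps.
    At or above `hard_c`: all I/O stops (0.0).
    """
    if hard_c <= warn_c or steps < 1:
        raise ValueError("need hard_c > warn_c and steps >= 1")
    if temp_c >= hard_c:
        return 0.0
    if temp_c <= warn_c:
        return 1.0
    # Index of the equal-width temperature band the reading falls into.
    band = int((temp_c - warn_c) / (hard_c - warn_c) * steps)
    band = min(band, steps - 1)
    return 1.0 - (band + 1) / (steps + 1)


# Hypothetical thresholds: warning at 70 C, hard stop at 90 C.
for temp in (65, 72, 80, 89, 90):
    print(temp, round(performance_fraction(temp, warn_c=70, hard_c=90), 2))
```

Because the level depends only on the current temperature, performance recovers automatically as the drive cools below each band. The technician check required after a hard stop is not modeled here.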
As previously described, unsafe power outages can cause critical data loss. So, besides ensuring data integrity during a firmware upgrade through the enhanced power-failure technology, a protection scheme is also implemented by the firmware itself. Flash is also subject to data retention limits and read disturb, which is a further reason the firmware needs protection.
NVMe1.2 defines a firmware slot as a location in the controller that stores a firmware image. PBlaze4 uses multiple slots for firmware image storage. Some slots are read-only or hold a specified firmware version that must be retained in case the device needs to revert to a prior image, for example after an unexpected power loss. During an upgrade, the first step is to identify an available slot that is writable and does not hold the running firmware. This slot is chosen for the firmware download. After download, the validity of the firmware image is verified by means of a CRC and a digital signature. When the slot is marked as active, the active firmware slot switches from the slot currently in use to the slot that received the downloaded image. PBlaze4 requires a system hot reboot during firmware upgrade.
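The slot-selection and verification steps can be sketched as follows. This is a simplified model, not the PBlaze4 upgrade code: the `Slot` type and `upgrade` function are hypothetical, only the CRC check is shown (using CRC-32 from Python's `zlib` as a stand-in), and the digital-signature check and the reboot are omitted.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Slot:
    writable: bool
    image: bytes = b""


def upgrade(slots, active, image, expected_crc):
    """Download `image` into a spare slot and return the new active slot index.

    The target must be writable and must not hold the running firmware.
    If the CRC check fails, the current active slot is kept, so the
    device can still boot the previous image.
    """
    target = next(
        (i for i, s in enumerate(slots) if s.writable and i != active), None
    )
    if target is None:
        raise RuntimeError("no writable, inactive slot available")
    slots[target].image = image  # firmware download
    if zlib.crc32(slots[target].image) != expected_crc:
        return active            # verification failed: do not switch
    return target                # activation: switch to the new slot


# Slot 0 is a read-only fallback image, slot 1 holds the running firmware.
slots = [Slot(False, b"golden"), Slot(True, b"v1"), Slot(True)]
new_image = b"v2"
print(upgrade(slots, active=1, image=new_image, expected_crc=zlib.crc32(new_image)))
```

Keeping the running image untouched until the new image has been verified is what allows the device to fall back if power is lost mid-upgrade.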