MemSpeed 2.0
High Performance FTL

A characteristic of the NAND flash in an SSD is that a physical area must be erased, a whole block at a time, before it can be written again. As a result, successive writes to the same Logical Block Address (LBA) are mapped to different Physical Block Addresses (PBA) on the flash media. The flash translation layer (FTL) is the hardware/software layer responsible for this logical-to-physical mapping. The sections below describe how PBlaze4 improves FTL performance.
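The out-of-place write behaviour described above can be illustrated with a minimal sketch. It is a toy page-level mapping, not the PBlaze4 implementation; the class name `SimpleFTL` and all of its fields are hypothetical, and garbage collection is left out.

```python
class SimpleFTL:
    """Toy page-level FTL: every write lands on a fresh, already-erased page."""

    def __init__(self, num_physical_pages):
        self.l2p = {}                                 # LBA -> PBA mapping table
        self.free = list(range(num_physical_pages))   # erased pages, ready to write
        self.invalid = set()                          # stale pages waiting for erase
        self.flash = {}                               # PBA -> data (stands in for NAND)

    def write(self, lba, data):
        if not self.free:
            raise RuntimeError("no erased pages left; garbage collection needed")
        pba = self.free.pop(0)                        # pick an erased page
        old = self.l2p.get(lba)
        if old is not None:
            self.invalid.add(old)                     # old copy is marked stale, not overwritten
        self.flash[pba] = data
        self.l2p[lba] = pba                           # remap the LBA to the new location
        return pba

    def read(self, lba):
        return self.flash[self.l2p[lba]]


ftl = SimpleFTL(num_physical_pages=8)
first = ftl.write(lba=3, data=b"v1")
second = ftl.write(lba=3, data=b"v2")   # same LBA, different PBA
```

Writing LBA 3 twice leaves two physical pages in use: the newer one is referenced by the mapping table, and the older one is only marked invalid until its block can be erased.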

Global FTL

What does “Global FTL” mean? It is a new algorithm used on PBlaze4. Global FTL stands in contrast to the partitioned FTL on PBlaze3; the difference is a direct mapping between logical and physical addresses. As Fig. 2 shows, LBAs map to PBAs through a single global one-to-one mapping, so any PBA on the SSD can be selected for an LBA. This simple mapping gives straightforward access from a logical sector identifier to its physical location in the address translation table. It also increases I/O randomization, so wear is spread more evenly and the endurance of the whole SSD is maximized.

Flash Channel QoS

What is Flash Channel QoS? For an SSD, Quality of Service (QoS) is the requirement that the device complete all requests from a given application with steady, consistent performance within a specified time limit. SSD QoS is usually stated as a maximum response time at a confidence level of 99% or 99.99% (“2 nines” or “4 nines”).
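As a rough illustration of how a "2 nines" or "4 nines" figure can be derived from measured latencies, the sketch below uses the nearest-rank percentile method. The function name and the sample data are made up for this example; they are not PBlaze4 measurements.

```python
import math


def latency_at_confidence(latencies_us, confidence):
    """Return the latency that a `confidence` share of requests stays within.

    Uses the nearest-rank percentile method on the sorted samples.
    """
    ordered = sorted(latencies_us)
    rank = math.ceil(confidence * len(ordered))   # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


# Synthetic samples: most requests are fast, a few are slow outliers.
samples = [100] * 9990 + [500] * 9 + [5000]

p99 = latency_at_confidence(samples, 0.99)       # "2 nines"
p9999 = latency_at_confidence(samples, 0.9999)   # "4 nines"
```

With these samples the 99% figure is still set by the fast requests, while the 99.99% figure is set by the slow tail. This is why a higher confidence level is a stricter test of latency consistency.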

Flash channel QoS is the ability to guarantee consistent performance and reduce latency jitter. It is realized as a scheduler process.

How is flash channel QoS guaranteed?

The scheduler process guarantees a certain level of performance for the message flow. The scheduler decides which request is placed in each per-LUN command queue. How does it schedule? After receiving a batch of write, read and erase commands, the scheduler orders them according to a request priority scheme. User data has the higher priority. Because a write cache is used, the completion acknowledgement (ACK) is sent back to the upper-level application before the data has been written to flash; flushing data from the cache to NAND is transparent to the user. During the flush, the scheduler controls the message flow in the background. The prioritized requests are then transferred to the queue of the relevant target LUN. Each LUN has a limited queue depth, and the scheduler dispatches one request at a time to each per-LUN command queue.
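The scheduling idea can be sketched as follows. This is a simplified model under stated assumptions, not the PBlaze4 scheduler: the request kinds, the numeric priorities in `PRIORITY` and the class `LunScheduler` are hypothetical. The source states only that user data has the higher priority and that each LUN queue has a limited depth.

```python
import heapq
from collections import deque
from itertools import count

# Assumed priority order (lower value is served first).
PRIORITY = {"user_read": 0, "user_write": 0, "cache_flush": 1, "erase": 2}


class LunScheduler:
    def __init__(self, num_luns, queue_depth):
        self.pending = [[] for _ in range(num_luns)]        # per-LUN priority heaps
        self.lun_queues = [deque() for _ in range(num_luns)]  # per-LUN command queues
        self.queue_depth = queue_depth
        self._seq = count()                                  # FIFO tie-breaker

    def submit(self, lun, kind, payload=None):
        entry = (PRIORITY[kind], next(self._seq), kind, payload)
        heapq.heappush(self.pending[lun], entry)

    def dispatch_round(self):
        """Move at most one request per LUN into its command queue."""
        dispatched = []
        for lun, heap in enumerate(self.pending):
            if heap and len(self.lun_queues[lun]) < self.queue_depth:
                _, _, kind, payload = heapq.heappop(heap)
                self.lun_queues[lun].append((kind, payload))
                dispatched.append((lun, kind))
        return dispatched


sched = LunScheduler(num_luns=2, queue_depth=1)
sched.submit(0, "erase")
sched.submit(0, "user_read")
sched.submit(1, "cache_flush")
round1 = sched.dispatch_round()   # user_read overtakes the earlier erase on LUN 0
round2 = sched.dispatch_round()   # nothing moves: both LUN queues are full
```

The erase on LUN 0 stays pending until the LUN finishes the user read and frees a slot. This is the flow-control effect: a full LUN queue holds back lower-priority work instead of letting it pile up in front of the flash.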

Just as a crossroads without traffic lights leads to accidents and traffic jams, the absence of a scheduler process leads to congestion in the message flow. Congestion control and flow control are therefore the key mechanisms used to regulate the flow of messages to the NAND inside the device and improve performance.

Multi-core Computing

In recent years the computer industry has moved towards multiple cores for higher performance, power efficiency and compute capacity. PBlaze4 likewise implements a multi-core architecture for energy-efficient performance. The additional compute capacity comes from more processor cores as well as from advanced processor core and cache design.

The processor complex of PBlaze4 contains 16 embedded processors. With multiple cores, multiple instructions can run at the same time, which increases overall program speed.

Hardware Multi-Q

Why adopt hardware multi-Q?

The I/O performance of storage devices has accelerated from hundreds of IOPS several years ago to hundreds of thousands of IOPS today. This sharp increase is primarily due to the development of NAND flash devices. The single-core architecture originally used on the host side has become a bottleneck for overall storage system performance, which is why multi-core hosts are designed to handle tens of millions of IOPS. In addition, a queue pair per core avoids locking and ensures process integrity. As the host raises its processing speed by multiple orders of magnitude, NAND flash devices also need to improve their processing capability. Hardware multi-queue is the mechanism PBlaze4 uses to optimize performance.

How does multi-Q work?

Hardware multi-Q is tied to the multiple cores on the host. To explain how it works, it is best to start with the single-queue process from host to device.

A Submission Queue is a pre-allocated circular buffer with a fixed slot size that the host software uses to submit commands for execution by the controller. The single-queue process from host to device can be described in the following steps.

1. The host issues a new command and inserts it into the appropriate Submission Queue.
2. The host rings the doorbell to indicate to the controller that a new command has been submitted for processing.
3. After receiving the doorbell notification, the controller fetches the command in the Submission Queue from host memory for later execution.
4. The controller executes the fetched command.
5. After the command has completed execution, the controller writes a completion queue entry to the associated Completion Queue.
6. The controller optionally generates an interrupt to the host to indicate that there is a completion queue entry to process.
7. The host processes the completion queue entry in the Completion Queue.
8. Finally, the host writes the doorbell to indicate that the completion queue entry has been processed, which releases the entry.
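
The eight steps above can be walked through with a small single-threaded model. It is only a sketch: the `Ring` and `Controller` classes are hypothetical, "execution" is a stand-in string operation, and a real controller runs asynchronously from the host and fetches commands over the bus.

```python
class Ring:
    """Fixed-size circular buffer: producer advances tail, consumer advances head."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0
        self.tail = 0

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # One slot stays unused so that full and empty are distinguishable.
        return (self.tail + 1) % len(self.slots) == self.head

    def push(self, entry):
        if self.is_full():
            raise RuntimeError("queue full")
        self.slots[self.tail] = entry
        self.tail = (self.tail + 1) % len(self.slots)

    def pop(self):
        if self.is_empty():
            raise RuntimeError("queue empty")
        entry = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        return entry


class Controller:
    """Hypothetical device side of one Submission/Completion Queue pair."""

    def __init__(self, sq, cq):
        self.sq, self.cq = sq, cq
        self.interrupts = 0
        self.cq_head_seen = 0

    def ring_sq_doorbell(self):
        while not self.sq.is_empty():
            cmd = self.sq.pop()                                # step 3: fetch
            result = cmd["op"].upper()                         # step 4: stand-in for execution
            self.cq.push({"id": cmd["id"], "result": result})  # step 5: post completion
            self.interrupts += 1                               # step 6: interrupt the host

    def ring_cq_doorbell(self, new_head):
        self.cq_head_seen = new_head                           # step 8: entry released


sq, cq = Ring(4), Ring(4)
ctrl = Controller(sq, cq)
sq.push({"id": 1, "op": "read"})   # step 1: host inserts the command
ctrl.ring_sq_doorbell()            # step 2: host rings the doorbell
completion = cq.pop()              # step 7: host processes the completion entry
ctrl.ring_cq_doorbell(cq.head)     # step 8: host reports the new head
```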

A multi-core system therefore has a queue pair per core. On the hardware side, hardware multi-queue is implemented as Queue Engines. The Submission Queue Engines fetch commands from the multiple Submission Queues at very high frequency and send the fetched commands to different process units. Similarly, replies from the process units are collected by the Completion Queue Engines and sent back to the host Completion Queues.
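
The role of the Queue Engines can be sketched in software as below. The source does not say how the engines arbitrate between queues or choose a process unit, so the round-robin fetch and modulo placement here are assumptions made only for illustration, and both function names are hypothetical.

```python
from collections import deque


def submission_queue_engine(submission_queues, num_units):
    """Fetch commands from per-core queues and spread them over process units.

    Assumptions: round-robin arbitration across queues, modulo unit placement.
    """
    units = [[] for _ in range(num_units)]
    fetched = 0
    while any(submission_queues):                 # loop until every queue is drained
        for core_id, sq in enumerate(submission_queues):
            if sq:
                cmd = sq.popleft()
                units[fetched % num_units].append((core_id, cmd))
                fetched += 1
    return units


def completion_queue_engine(units, num_cores):
    """Collect replies and route each one to the originating core's queue."""
    completion_queues = [deque() for _ in range(num_cores)]
    for unit in units:
        for core_id, cmd in unit:
            completion_queues[core_id].append(f"done:{cmd}")
    return completion_queues


# Three host cores, each with its own submission queue; core 2 is idle.
sqs = [deque(["a0", "a1"]), deque(["b0"]), deque()]
units = submission_queue_engine(sqs, num_units=2)
cqs = completion_queue_engine(units, num_cores=3)
```

Because each command carries the identifier of the core that submitted it, the reply returns to that core's own Completion Queue and the cores never share a queue.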

Technical Support: support@memblaze.com

Sales Email: contact@memblaze.com