Live and let die: increasing capacity in the flash industry

Wells Fargo financial analyst Joe Quatrochi has issued a note to subscribers describing the state of play on capacity increases in the NAND industry. He’s interested because investors are curious about publicly quoted NAND industry suppliers. B&F is interested because of the technology picture it provides.

Update: Flash endurance table replaced with updated values. 20 November 2020.

Flash dies are built on circular wafers using semiconductor techniques such as vapour deposition and etching. Flash foundry operators are facing a conundrum: how do they increase capacity to meet bit demand?

There are five basic ways. One is to increase capacity per die by adding layers to the 3D NAND dies they build on their wafers. A second option is to add bits to cells – moving from, for example, 3 bits/cell (TLC) to four (QLC).

The third option is to shrink the physical cell size on a NAND die, meaning more dies per wafer. However, a cell’s endurance, its ability to store rewritten values reliably over many write cycles, decreases as the process size shrinks and as the cell bit count rises, to the point where no further progress is possible.

Indicative values from averaged industry sources.

The fourth option is for manufacturers to build more wafers, measured in wafer starts per month (wspm), which means building new fabs once a fab’s wafer making capacity is fully allocated.

The fifth option is to combine two or more of those options: increase the layer count, the bits/cell count and the wspm, and shrink the physical cell size, at the same time.

Quatrochi’s research gives an insight into the issues involved in these choices.

Layer counts

Adding layers adds process complexity and process time. Etching a hole through 64 layers is easier than etching one through 96 layers and the difficulty rises again as the layer count increases to 112, 128, 144, etc. This means that the yield of good dies from a wafer can go down as badly-etched holes render a die ineffective. Yields going down mean costs per TB of NAND on the wafer go up.

NAND Suppliers and known layer count stages.

Quatrochi mentioned an aspect ratio issue, stating: “The vertical stacking attributes of 3D NAND make it increasingly dependent on precision in the etching process to leverage higher aspect ratios, while deposition consistency continues to be more difficult.” 

He added: “Aspect Ratio continues to increase with the increase in layer count – a 96-layer device is estimated roughly 70:1 (vs. 60:1 for 64L). Continued increases in Aspect Ratio results in a number of potential issues during the deposition and etching steps, including non-uniform layers, incomplete etch (holes do not reach the bottom), bowing, twisting, and critical dimension variation between the top and bottom of the stack.”

These issues can render a die useless and lower the yield per wafer.

Adding layers, however you do it, increases process time and thus costs as well.

Quatrochi wrote: “Moves to higher layer counts result in lower wspm as processing time increases / additional steps are added. For example 128L single-stack etch has been estimated as taking 2x the time of 96L single-stack.” 

If a single machine takes 5 days to build a wafer then you can achieve 6 wspm (taking the average month as 30 days). If it takes 10 days then that halves to 3 wspm. Suppliers have to ask themselves questions such as whether adding 30 per cent more capacity per die by increasing the layer count actually gets them more output capacity if their machines take twice as long to manufacture the wafers.
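The arithmetic behind that question can be sketched in a few lines of Python. This is a minimal illustration using only the figures in the paragraph above (a 30-day month, 5 and 10 days per wafer, a 30 per cent capacity gain and a doubled process time). The function names are our own and the model assumes, simplistically, that output scales directly with wafer starts and capacity per wafer, ignoring yield.

```python
def wafer_starts_per_month(days_per_wafer, days_in_month=30):
    """Wafers a single machine can complete in a month."""
    return days_in_month / days_per_wafer


def relative_bit_output(capacity_gain, process_time_multiplier):
    """Bit output relative to the old process (1.0 means unchanged).

    capacity_gain: fractional capacity increase per wafer, e.g. 0.30
    process_time_multiplier: how much longer each wafer takes, e.g. 2.0
    """
    return (1 + capacity_gain) / process_time_multiplier


baseline_wspm = wafer_starts_per_month(5)
slower_wspm = wafer_starts_per_month(10)
net_output = relative_bit_output(0.30, 2.0)

print(f"wspm at 5 days per wafer:  {baseline_wspm:.0f}")
print(f"wspm at 10 days per wafer: {slower_wspm:.0f}")
print(f"bit output vs old process: {net_output:.2f}x")
```

On these assumptions a machine that takes twice as long to produce wafers holding 30 per cent more bits delivers about 0.65 times the bit output it did before, so the layer count increase would only pay off if the process time penalty were well below 2x or more machines were added.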

A way of dealing with manufacturing problems caused by layer count increases is to build a die from two or three separate sub-units, called strings. A 96-layer die can be made by stacking two 48-layer strings one above the other – so-called string stacking. That reduces overall hole etch depth.

String-stacking does not provide a get-out-of-jail-free card, though: “A string-stack could add as much as 30 per cent more cost with added steps.” Balancing process time and cost increases against capacity gain methods is difficult.

Factor in the possible yield of good dies per wafer going down as layer counts increase and the calculation becomes more complex. Yields tend to increase over time as the manufacturing process is tuned but it is not an exact science.
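To see how yield and the string-stacking cost penalty interact, here is a rough Python sketch. Only the 30 per cent string-stack cost uplift comes from Quatrochi's note. The yield figures (70 and 90 per cent) and the normalised wafer cost and capacity are hypothetical values picked purely for illustration, and the function names are our own.

```python
def cost_per_good_tb(wafer_cost, tb_per_wafer, yield_fraction):
    """Cost per terabyte of usable NAND, counting only good dies."""
    return wafer_cost / (tb_per_wafer * yield_fraction)


def break_even_yield(single_stack_yield, cost_multiplier):
    """Yield a costlier process needs to match the cheaper one on cost per TB."""
    return single_stack_yield * cost_multiplier


# Hypothetical inputs: wafer cost and TB per wafer normalised to 1.0.
single_stack = cost_per_good_tb(wafer_cost=1.0, tb_per_wafer=1.0, yield_fraction=0.70)
# String stacking: up to 30 per cent more cost (from the note), assumed better yield.
string_stack = cost_per_good_tb(wafer_cost=1.3, tb_per_wafer=1.0, yield_fraction=0.90)

print(f"single-stack cost per good TB: {single_stack:.3f}")
print(f"string-stack cost per good TB: {string_stack:.3f}")
print(f"string-stack break-even yield: {break_even_yield(0.70, 1.3):.2f}")
```

With these made-up numbers the string-stacked wafer needs a yield of about 91 per cent just to match a single-stack process yielding 70 per cent, which shows why the trade-off is not obvious.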

Bits per cell

Yet another complicating factor is the bits per cell count. You get declining benefits from this too, as we saw above. An SLC to MLC (2 bits/cell) change gets you a 100 per cent capacity increase. Moving to TLC (3 bits/cell) brings a 50 per cent capacity rise. But the TLC to QLC (4 bits/cell) transition means a 33 per cent step up while QLC to PLC (5 bits/cell) is a 25 per cent rise and, were it possible, a move to 6 bits/cell would bump up capacity 20 per cent.
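Those percentages follow directly from the ratio of new to old bits per cell. A short Python sketch, with a function name of our own choosing, reproduces them:

```python
def bits_per_cell_gain_percent(bits_from, bits_to):
    """Theoretical capacity gain from raising the bits stored per cell."""
    return (bits_to - bits_from) / bits_from * 100


# Each step adds one bit per cell: SLC, MLC, TLC, QLC, PLC, then 6 bits/cell.
steps = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
gains = [round(bits_per_cell_gain_percent(a, b)) for a, b in steps]

for (a, b), gain in zip(steps, gains):
    print(f"{a} to {b} bits/cell: about {gain} per cent more capacity")
```

Each extra bit adds the same absolute amount of capacity per cell, so its relative contribution keeps shrinking.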

Actual yield, TB per wafer, can differ from the theoretical gain. Quatrochi wrote: “QLC 3D NAND TB/wafer estimated at upwards of 70TB/wafer; W. Digital QLC 112-layer estimated 40 per cent higher TB/wafer than TLC 112-Layer.” That’s better than the theoretical 33 per cent increase we noted.

NAND bit recording quality deteriorates with each increase in cell bit count. QLC NAND is not the same stuff as TLC NAND. It takes longer to read and write bits and the life of a QLC cell is shorter than that of a TLC cell, as we noted above. PLC makes things worse again.

SK Hynix 128-layer wafer, chips, U.2 and ruler drives.

Over-provisioning by adding spare capacity (extra cells) to replace worn-out cells increases costs. What would be the point of replacing a QLC SSD having 20 per cent over-provisioning with a similar usable capacity PLC drive having 50 per cent over-provisioning and the same overall endurance if there was no cost benefit?
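The over-provisioning question can be put into numbers. The sketch below assumes over-provisioning is expressed as extra raw capacity relative to usable capacity and that cost tracks the number of cells needed, both simplifications of ours rather than anything in Quatrochi's note. The function name is hypothetical.

```python
def cells_per_usable_bit(bits_per_cell, over_provisioning):
    """Physical cells needed to deliver one bit of usable capacity.

    over_provisioning: spare capacity as a fraction of usable capacity,
    e.g. 0.20 for 20 per cent (an assumed definition).
    """
    return (1 + over_provisioning) / bits_per_cell


qlc = cells_per_usable_bit(bits_per_cell=4, over_provisioning=0.20)
plc = cells_per_usable_bit(bits_per_cell=5, over_provisioning=0.50)

print(f"QLC with 20 per cent over-provisioning: {qlc:.2f} cells per usable bit")
print(f"PLC with 50 per cent over-provisioning: {plc:.2f} cells per usable bit")
```

On these assumptions both drives need the same 0.30 cells per usable bit, so the PLC drive's extra bit per cell is entirely consumed by its spare capacity.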

Decreasing layer count gains

The ability to drive capacity per wafer higher by increasing layer counts brings decreasing gains. Moving from 64 to 96 layers adds a 50 per cent capacity increase. Transitioning from 96 to 128 layers adds 33 per cent more capacity. Adding 32 layers again to reach 160 layers means a 25 per cent capacity rise. Progressing to 192 layers brings a smaller 20 per cent capacity rise. Eventually the gains are outweighed by the extra processing time and yield issues.
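The same diminishing-returns arithmetic applies here. A brief Python sketch, assuming capacity per wafer scales linearly with layer count and using a function name of our own, reproduces the figures:

```python
def layer_gain_percent(layers_from, layers_to):
    """Theoretical capacity gain from a layer count increase."""
    return (layers_to - layers_from) / layers_from * 100


layer_counts = [64, 96, 128, 160, 192]
layer_gains = [
    round(layer_gain_percent(old, new))
    for old, new in zip(layer_counts, layer_counts[1:])
]

for (old, new), gain in zip(zip(layer_counts, layer_counts[1:]), layer_gains):
    print(f"{old} to {new} layers: about {gain} per cent more capacity")
```

Every step adds the same 32 layers, but against an ever larger base.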

At that point you can add bit capacity by making more wafers. A 50 per cent utilised machine can make twice as many wafers by being 100 per cent utilised. After that you need more machines, and that leads to building a new fab – ker-ching – that will be $15bn please.

Net:net

The pace of gains in SSD capacity tends to slow because of these various NAND die production issues. The effect of this can be seen in a chart showing industry shipped capacity by layer count over time:

We see that it is taking longer (in terms of quarters after launch) for 96L NAND to rise to 70 per cent of industry shipped capacity than it took 64/72L NAND. The first iteration of 100+ layer NAND looks likely to take longer still. 

The NAND and SSD industries are very fertile, technologically speaking. They are not, overall, running out of capacity increase runway but the costs and difficulties of increasing capacity are rising and cul-de-sacs are approaching, as with cell bit counts.

Ruler format drives will increase the physical space available for NAND dies. Drive and host controller error checking technology, write cycle reduction through random write avoidance, and over-provisioning will help QLC and then PLC NAND become usable in more cases. Better process technology and materials will help smaller cell sizes become practical. The outlook is positive.

Wafers and disks

Such layer-count issues do not affect disk drive manufacturers, who also use semiconductor techniques to build bit storage entities, magnetic domains, on circular wafers, called disks, which are physically smaller than NAND wafers. Disks have a 2-dimensional structure, a single layer. The HDD manufacturers’ problem is shrinking the size of the bits while keeping stored bit value stable and readable.

With the magnetic materials currently in use, bit area shrinkage reduces the number of electrons in a bit. The stability of the bit area’s magnetic field reduces towards unpredictability at room temperature as bit size diminishes.

Disk suppliers are moving to energy-assisted recording to achieve bit stability and readability at room temperatures through writing the bits to more stable recording material. This resists magnetic polarity change more strongly at room temperature. The bit area is made receptive to change through heat (HAMR) or microwave (MAMR) energy. This brings its own problems but having multiple layers is not one of them.