IO - Data Path / Balanced System

About

The data path is the chain of hardware components needed to move data from:

a storage device (generally a disk drive or a network drive)
to the CPU

It is important to understand the different transfer rates of each component of the server's disk subsystem and of the network. This information helps you to identify potential bottlenecks that can throttle your overall performance.

In the figure, data travels:

from the actual disk drive
to the embedded disk controller located on the disk drive unit (<10 Mbytes/sec),
up the Ultra Fast/Wide SCSI channel 2 at 40 Mbytes/sec,
through PCI Slot 1 on PCI Bus 1 at 133 Mbytes/sec
to the Memory subsystem (533 Mbytes/sec),
and then transferred to the CPU at a P6 system bus speed of 533 Mbytes/sec.

If one component tries to send more data than the next component can handle, there is a bottleneck. Plumbing offers an analogy: if the primary water pipe carrying water away from your basement is five inches in diameter and five two-inch pipes feed water into it, water will spill out.
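The sustained rate of the whole path can be no higher than the rate of its slowest stage, so the bottleneck can be found mechanically. A minimal sketch in Python using the rates quoted for the figure; the stage names and the `find_bottleneck` helper are illustrative, and the "<10 Mbytes/sec" of the embedded disk controller is entered as 10, which is an upper bound.

```python
# Theoretical transfer rates (Mbytes/sec) of each stage, in path order.
# The embedded disk controller is quoted as "<10", so 10 is an upper bound.
DATA_PATH = [
    ("embedded disk controller", 10),
    ("Ultra Fast/Wide SCSI channel", 40),
    ("PCI bus", 133),
    ("memory subsystem", 533),
    ("P6 system bus", 533),
]


def find_bottleneck(path):
    """Return the (name, rate) pair of the slowest stage in the path."""
    return min(path, key=lambda stage: stage[1])


name, rate = find_bottleneck(DATA_PATH)
print(f"Bottleneck: {name} at {rate} Mbytes/sec")
```

With these figures the slowest stage is the disk side of the path, far below what the buses and memory can carry.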

By working through a little arithmetic, you can avoid bottlenecks before they appear.

For example, placing two 3-channel Ultra Fast and Wide SCSI cards (theoretical aggregate maximum throughput of 2 × [3 × 40 Mbytes/sec] = 240 Mbytes/sec) into a single PCI bus can overwhelm that bus if all of the SCSI channels are active.

A single PCI bus can only support a theoretical maximum of 133 Mbytes/sec. Jamming 240 Mbytes/sec of data into it just does not work very well. If this configuration were actually implemented, you would have created a bottleneck from the start. Placing each of the 3-channel Ultra Fast and Wide SCSI cards onto its own PCI bus spreads the disk I/O activity across 266 Mbytes/sec of total aggregate PCI bus throughput.
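The same arithmetic can be written down as a small check. This is a sketch only: the function names are made up for illustration, and the rates are the theoretical maxima quoted above, not measured values.

```python
SCSI_CHANNEL_MBPS = 40   # Ultra Fast and Wide SCSI, per channel (theoretical)
PCI_BUS_MBPS = 133       # theoretical maximum of a single PCI bus


def card_throughput(channels, channel_mbps=SCSI_CHANNEL_MBPS):
    """Aggregate theoretical throughput of one multi-channel SCSI card."""
    return channels * channel_mbps


def bus_demand(cards_channels, bus_mbps=PCI_BUS_MBPS):
    """Return (demand, oversubscribed) for the cards sharing one PCI bus.

    cards_channels lists the channel count of every card on that bus.
    """
    demand = sum(card_throughput(c) for c in cards_channels)
    return demand, demand > bus_mbps


print(bus_demand([3, 3]))  # both cards on one bus
print(bus_demand([3]))     # one card per bus
```

Both cards on one bus ask for 240 Mbytes/sec against 133 available; one card per bus asks for 120 Mbytes/sec, which fits.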

The disk drive itself is the slowest link of the data path.

Example

If your system is intended to run an I/O intensive workload, then plan the system conservatively, assuming that every CPU core can process approximately 200 MB/s sustained.

For example, if you want to keep 4 CPU cores busy in such a configuration, then the entire I/O subsystem should be able to support 800 MB/s sustained for optimum performance.
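The sizing rule is a single multiplication, and it can also be inverted to see how many cores a given I/O subsystem can feed. A minimal sketch; the helper names are hypothetical and the 200 MB/s per core is the conservative planning figure from the text, not a measured value.

```python
MBPS_PER_CORE = 200  # conservative sustained rate one CPU core can process


def required_io_mbps(cores, per_core_mbps=MBPS_PER_CORE):
    """Sustained I/O throughput needed to keep `cores` cores busy."""
    return cores * per_core_mbps


def cores_kept_busy(io_mbps, per_core_mbps=MBPS_PER_CORE):
    """Number of cores a subsystem delivering `io_mbps` can keep fully busy."""
    return io_mbps // per_core_mbps


print(required_io_mbps(4))    # 800
print(cores_kept_busy(1000))  # 5
```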

The I/O throughput requirement has to be guaranteed throughout the whole hardware system (the whole data path):

the Host Bus Adapters (HBAs) in the compute nodes,
any switches you use,
and the I/O subsystem, including storage controllers and physical spindles.

The weakest link is going to limit the performance and scalability of operations in this configuration.

If you rely on storage shared with other applications, then throughput for your application is not guaranteed and you will likely see inconsistent response times for your operations.

Parallel execution is also a heavy consumer of memory. Per CPU core you should have at least 4 GB of RAM.

If you use inter-node parallel operations that span multiple nodes, then you have to size the interconnect appropriately: it is as crucial as the overall I/O capabilities. The throughput required on the interconnect for good scalability is at least equal to the throughput going to disk (use one or more 10 GigE or InfiniBand interconnects).
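As a rough illustration of that rule, the number of 10 GigE links can be estimated from the disk throughput. This sketch assumes the raw line rate of 10 Gbit/s (1250 MB/s) per link; usable throughput is lower once protocol overhead is counted, so treat the result as a lower bound. The helper name is hypothetical.

```python
import math

# 10 Gbit/s line rate expressed in MB/s, before any protocol overhead (assumption).
TEN_GIGE_MBPS = 10_000 / 8


def interconnect_links_needed(disk_io_mbps, link_mbps=TEN_GIGE_MBPS):
    """Links needed so interconnect throughput >= throughput going to disk."""
    return math.ceil(disk_io_mbps / link_mbps)


print(interconnect_links_needed(800))    # 4 cores x 200 MB/s  -> 1
print(interconnect_links_needed(3200))   # 16 cores x 200 MB/s -> 3
```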

Disk (HDD) - Striping

A single physical disk may sustain only 20-30 MB/s for large random reads. Since you need about 200 MB/s to keep a single CPU core busy (i.e. roughly 7 to 10 physical disks per core), you need a lot of physical spindles to get good performance for database operations running in parallel.
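The spindle count follows directly from the two figures. A small sketch, assuming the 200 MB/s per core and the 20-30 MB/s per disk quoted above; `disks_needed` is an illustrative helper, and the result is rounded up because a fraction of a disk cannot be bought.

```python
import math


def disks_needed(cores, per_core_mbps=200, per_disk_mbps=20):
    """Physical disks needed to feed `cores` cores at the given rates."""
    return math.ceil(cores * per_core_mbps / per_disk_mbps)


for rate in (20, 25, 30):
    one_core = disks_needed(1, per_disk_mbps=rate)
    four_cores = disks_needed(4, per_disk_mbps=rate)
    print(f"{rate} MB/s per disk: {one_core} disks per core, {four_cores} for 4 cores")
```

At 20, 25 and 30 MB/s per disk this gives 10, 8 and 7 disks per core, and 40, 32 and 27 disks for a 4-core system.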

Do not use a single 1 TB disk for your 800 GB database, because you will not get good performance running operations in parallel.

The way to utilize multiple physical spindles is to stripe across multiple disk devices through:

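a hardware solution (a striping RAID level such as RAID 5 or RAID 10; RAID 1 on its own mirrors but does not stripe)
or a software solution (for instance ASM, Automatic Storage Management, with an Oracle database).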
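To show why striping lets one large read use many spindles at once, here is a simplified round-robin (RAID 0 style) mapping of logical offsets to disks. It is a sketch under stated assumptions, not the actual layout algorithm of any RAID controller or of ASM; the 1 MiB stripe unit, the 8 disks and the function names are all chosen for illustration.

```python
MIB = 1024 * 1024


def locate(logical_offset, stripe_unit, n_disks):
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    stripe_no, within = divmod(logical_offset, stripe_unit)
    disk = stripe_no % n_disks   # stripe units are dealt out round-robin
    row = stripe_no // n_disks   # how many full rounds came before this unit
    return disk, row * stripe_unit + within


def disks_touched(offset, length, stripe_unit, n_disks):
    """Sorted list of disks that serve a read of `length` bytes at `offset`."""
    first = offset // stripe_unit
    last = (offset + length - 1) // stripe_unit
    return sorted({s % n_disks for s in range(first, last + 1)})


print(locate(0, MIB, 8))                   # (0, 0)
print(locate(MIB, MIB, 8))                 # (1, 0)
print(locate(8 * MIB, MIB, 8))             # (0, 1048576)
print(disks_touched(0, 8 * MIB, MIB, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

An 8 MiB sequential read touches all eight disks, so each spindle only has to deliver one eighth of the data.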

Documentation / Reference

NT Server and Disk Subsystem Performance