Partition Alignment - Solaris (SPARC)

For historic reasons, most modern disk drives and storage arrays claim to the host that they have a 512 byte block size, while in fact they use something larger internally - normally 4k or 8k, but possibly even larger.

If you send an IO to such a disk/array that isn't correctly aligned to the internal block size then the array will need to do some additional work to handle it - and whilst this works, it results in lower performance than you'd get if the IO were correctly aligned to the internal block size.

The most common cause of mis-aligned IO is a disk partition that starts at an offset from the beginning of the disk that isn't a multiple of the disk's block size. For example, if you're using a disk with a 4k block size, but you create a partition starting 3k into the disk, then all of the IO that you send to that partition will be unaligned with the internal 4k blocks.

Most modern OSes handle this for you by starting partitions at nice round offsets - frequently at offsets like 1MB, which is a multiple of every power-of-two block size an array is likely to use.
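As a minimal sketch of the rule described above, the following Python snippet checks whether a partition's starting offset is a whole multiple of a block size. The function name `is_aligned` and the list of block sizes are illustrative assumptions, not anything Solaris or an array provides.

```python
KIB = 1024
MIB = 1024 * KIB


def is_aligned(offset_bytes: int, block_bytes: int) -> bool:
    """Return True if offset_bytes is a whole multiple of block_bytes."""
    return offset_bytes % block_bytes == 0


# A partition starting 3k into a disk with 4k internal blocks is misaligned.
print(is_aligned(3 * KIB, 4 * KIB))  # False

# A 1MB starting offset is aligned for each of these (assumed) block sizes.
for block in (4 * KIB, 8 * KIB, 16 * KIB, 64 * KIB):
    print(block, is_aligned(1 * MIB, block))
```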

However there are still a few OSes that get this alignment very wrong - and Solaris is one of them.

Solaris SPARC Partition Table Types

Solaris SPARC supports 2 different types of partition schemes - SMI and EFI. SMI is the partition table that Solaris (and SunOS before it) has used for countless years, and is still the default for most new disks initialized under Solaris.

EFI is a newer, industry standard partition table that Solaris uses by default for disks of 2TB or larger, which SMI does not support.

SMI Alignment

SMI partition tables work in terms of sectors, heads and cylinders, with partitions being defined based on a starting and ending cylinder.

Current partition table (unnamed):
Total disk cylinders available: 41793 + 2 (reserved cylinders)

Part      Tag    Flag     Cylinders         Size            Blocks
  0 unassigned    wm       1 - 41792      499.95GB    (41792/0/0) 1048477696
  1 unassigned    wu       0                0         (0/0/0)              0
  2     backup    wu       0 - 41792      499.97GB    (41793/0/0) 1048502784
  3 unassigned    wm       0                0         (0/0/0)              0
  4 unassigned    wm       0                0         (0/0/0)              0
  5 unassigned    wm       0                0         (0/0/0)              0
  6 unassigned    wm       0                0         (0/0/0)              0
  7 unassigned    wm       0                0         (0/0/0)              0

You can check the number of cylinders, heads and sectors (frequently called the disk "geometry") for a specific disk using the 'current' command within format:

format> current
Current Disk = /dev/rdsk/c3t514F0C5294BFC500d1
ssd4: <XtremIO-XtremApp-40a0 cyl 41793 alt 2 hd 224 sec 112>

So this disk has 41793 cylinders (plus 2 extra that are unused), 224 heads, and 112 sectors - with each sector being the standard 512 bytes.

To calculate the size of each cylinder, we need to multiply the number of heads by the number of sectors, and then multiply the result by 512 bytes. So in this case, that's 224 x 112 x 512 which is 12,845,056 bytes.

Thus if we were to create a partition starting on cylinder 1 it would start at an offset of 12,845,056 bytes into the disk. If we divide this number by 4096 bytes (4k) we get 3136. As this is a whole number, it means that the partition WOULD be correctly aligned for an array using a 4k block size! Dividing by 8192 (8k) gives 1568, so once again it would be correctly aligned.
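The cylinder arithmetic above can be reproduced with a short Python sketch. The helper names `cylinder_bytes` and `partition_offset` are made up for illustration; the geometry values are the ones reported by format for the example disk.

```python
SECTOR_BYTES = 512  # block size reported by the disk to the host


def cylinder_bytes(heads: int, sectors_per_track: int) -> int:
    """Size of one cylinder in bytes: heads x sectors x 512."""
    return heads * sectors_per_track * SECTOR_BYTES


def partition_offset(start_cylinder: int, heads: int, sectors_per_track: int) -> int:
    """Byte offset from the start of the disk to the given cylinder."""
    return start_cylinder * cylinder_bytes(heads, sectors_per_track)


# Geometry from the example: hd 224 sec 112, partition starting at cylinder 1.
offset = partition_offset(1, 224, 112)
print(offset)                               # 12845056
print(offset % 4096 == 0, offset // 4096)   # True 3136
print(offset % 8192 == 0, offset // 8192)   # True 1568
```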

As an exercise, let's say that the disk we were trying to align had 221 heads and 111 sectors and we were trying to align it to a 4k block size. 221 x 111 x 512 = 12,559,872, which if we divide by 4096 gives 3066.375. As this isn't a whole number, starting at cylinder 1 would give us a misaligned partition. As each cylinder is 12,559,872 bytes, a partition starting at cylinder 2 would also be misaligned (12,559,872 x 2 / 4096 = 6132.75), as would cylinders 3 through 7 - but once we get to cylinder 8 we end up with a partition nicely aligned to our 4k block size as 12,559,872 x 8 / 4096 = 24531.

When you think about it, this shouldn't be a surprise - any partition starting at a cylinder that is a multiple of 8 will be aligned to a 4k block, as 8 x 512 = 4k. Similarly, any cylinder that is a multiple of 16 is guaranteed to be aligned to an 8k block.
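Rather than testing cylinders one at a time, the first aligned non-zero cylinder can be computed directly. This is a sketch under the assumption that the block size is a whole multiple of the 512 byte sector size; `first_aligned_cylinder` is a hypothetical helper name. It relies on the fact that cylinder c is aligned exactly when c x (sectors per cylinder) is a multiple of the number of sectors per block.

```python
from math import gcd


def first_aligned_cylinder(heads: int, sectors_per_track: int,
                           block_bytes: int, sector_bytes: int = 512) -> int:
    """Smallest non-zero cylinder whose byte offset is a multiple of block_bytes.

    Assumes block_bytes is a whole multiple of sector_bytes.
    """
    sectors_per_cylinder = heads * sectors_per_track
    sectors_per_block = block_bytes // sector_bytes
    return sectors_per_block // gcd(sectors_per_cylinder, sectors_per_block)


# The awkward geometry from the exercise: 221 heads, 111 sectors.
print(first_aligned_cylinder(221, 111, 4096))  # 8
print(first_aligned_cylinder(221, 111, 8192))  # 16

# The geometry of the example disk: every cylinder is already aligned.
print(first_aligned_cylinder(224, 112, 8192))  # 1
```

Every multiple of the returned cylinder number is aligned as well, which matches the multiple-of-8 and multiple-of-16 rule for the worst case where the sectors-per-cylinder count is odd.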

Also note that cylinder 0 is always going to be aligned - but some people prefer not to use cylinder 0 as it makes it a little too easy to accidentally overwrite the SMI partition table, which is stored at the start of cylinder 0.

XtremIO

Thankfully XtremIO makes SMI alignment easy - the disks it presents always have a geometry whose cylinder size (heads x sectors x 512 bytes) is an exact multiple of XtremIO's 8k block size. Thus no matter what starting cylinder you use, the partitions end up being correctly aligned. Unfortunately this is not true of many other arrays.

EFI Alignment

EFI (otherwise known as GPT) partition tables are always used in Solaris for disks 2TB or larger. They can also be used for smaller disks by passing the "-e" option to the format command and selecting EFI when labeling the disk.

EFI partitions are defined in terms of sectors:

partition> print
Current partition table (original):
Total disk sectors available: 21474820029 + 16384 (reserved sectors)

Part      Tag    Flag     First Sector           Size           Last Sector
  0        usr    wm                40         10.00TB            21474820055
  1 unassigned    wm                 0             0                 0
  2 unassigned    wm                 0             0                 0
  3 unassigned    wm                 0             0                 0
  4 unassigned    wm                 0             0                 0
  5 unassigned    wm                 0             0                 0
  6 unassigned    wm                 0             0                 0
  8   reserved    wm       21474820063          8.00MB            21474836446

This makes the math a little easier than for SMI, but it also means that format will get it wrong by default for many arrays (including XtremIO!)

Looking at the example above (the default partition table created by format), partition 0 starts at sector 40. As for SMI, the size of each sector will be the block size reported by the disk - almost always 512 bytes. 40 x 512 is 20,480, which is an exact multiple of 4096 bytes (20,480 / 4096 = 5) and thus this is correctly aligned for 4k, but it is NOT a multiple of 8192 bytes (20,480 / 8192 = 2.5), which means that it is NOT correctly aligned for an 8k block size!

Changing the starting sector to 48 gives us an offset of 48 x 512 = 24,576 bytes - or 24k, which is an exact multiple of 8k (3 x 8k).
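The choice of sector 48 can be derived by rounding the default starting sector up to the next block boundary. The snippet below is only a sketch; `next_aligned_sector` is a hypothetical helper, and it assumes 512 byte sectors and a block size that is a whole multiple of the sector size.

```python
def next_aligned_sector(first_sector: int, block_bytes: int,
                        sector_bytes: int = 512) -> int:
    """Round first_sector up to the next sector that sits on a block boundary."""
    sectors_per_block = block_bytes // sector_bytes
    # Ceiling division using integer arithmetic, then scale back to sectors.
    return -(-first_sector // sectors_per_block) * sectors_per_block


# format's default EFI label starts partition 0 at sector 40.
print(next_aligned_sector(40, 4096))  # 40 (already aligned for 4k)
print(next_aligned_sector(40, 8192))  # 48 (first 8k-aligned sector at or after 40)
```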

File Systems

Once you've got the partition correctly aligned, it's important that whatever you put on top of it is also aligned to the same block size.

In the case of a filesystem, this means setting the block size of the filesystem to also match the block size of the array.

The good news is that UFS uses an 8k block size on SPARC systems, so there's no additional effort required - it'll be correct by default. UFS also has a concept of a "fragment" which is smaller than or equal to a block. For filesystems greater than 1TB the fragment size will automatically be set to 8KB to match the block size, but for smaller filesystems it will default to 1KB. Unless you are planning to store a large number of very small files, it's generally recommended to increase the fragment size to 8KB by passing the "-f 8192" option to newfs when creating the filesystem.

Veritas Filesystem (VxFS) defaults to a block size of 1KB for all filesystems less than 1TB in size, increasing to 8KB for larger filesystems. Changing the block size to 8KB for all filesystems will result in a significant performance improvement over the default - again this needs to be done when creating the filesystem, which for VxFS means passing the "-o bsize=8192" option to mkfs.

You can check the block size on an existing filesystem (UFS or VxFS) using fstyp. For example:

# fstyp -v /dev/rdsk/c3t514F0C5294BFC500d1s0 | grep bsize
sbsize  2048    cgsize  8192    cgoffset 64     cgmask  0xffffffc0
bsize   8192    shift   13      mask    0xffffe000

The block size is shown as the "bsize" parameter - in this case 8192 or 8K.

For UFS, the fragment size is also available in fstyp, although here it is shown not as an actual size but as a number of fragments per block. For example:

# fstyp -v /dev/rdsk/c3t514F0C5294BFC500d1s0 | grep frag
frag    8       shift   3       fsbtodb 1

This means that there are 8 fragments per block. As we know the block size is 8K, this means that each fragment is 1k. Recreating the filesystem with an 8K fragment size changes this output to show 1 fragment per block:

# fstyp -v /dev/rdsk/c3t514F0C5294BFC500d1s0 | grep frag
frag    1       shift   0       fsbtodb 4
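The fragment size calculation described above (block size divided by fragments per block) can be sketched in Python by feeding it captured fstyp output. The function `ufs_fragment_bytes` is a hypothetical helper, and the parsing assumes the output contains "bsize" and "frag" fields laid out as in the samples shown in this article; it has not been checked against other fstyp versions.

```python
import re


def ufs_fragment_bytes(fstyp_text: str) -> int:
    """Derive the UFS fragment size from 'fstyp -v' text: bsize / frag."""
    # \b keeps 'bsize' from matching inside 'sbsize'.
    bsize = re.search(r"\bbsize\s+(\d+)", fstyp_text)
    frag = re.search(r"\bfrag\s+(\d+)", fstyp_text)
    if bsize is None or frag is None:
        raise ValueError("bsize or frag not found in fstyp output")
    return int(bsize.group(1)) // int(frag.group(1))


# Lines copied from the fstyp output shown earlier (default 1k fragments).
sample = """\
sbsize  2048    cgsize  8192    cgoffset 64     cgmask  0xffffffc0
bsize   8192    shift   13      mask    0xffffe000
frag    8       shift   3       fsbtodb 1
"""
print(ufs_fragment_bytes(sample))  # 1024
```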