Windows Data Alignment on IPF, x86, and x86-64

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/vcconwindowsdataalignmentonipfx86x86-64.asp?frame=true

Kang Su Gatlin
Microsoft Corporation

February 2003

Applies to
   Microsoft® Visual C++®
   Microsoft® Windows® XP application development
   Microsoft® Windows® Server 2003 application development

Summary: Gives developers the information needed to confront data alignment problems critical to the performance of 64- and 32-bit applications developed for the Microsoft Windows XP and Microsoft Windows Server 2003 platforms. (17 printed pages)

Contents

Introduction
What Is Data Alignment?
Why Is Alignment a Concern?
Data Alignment Exceptions and Fix-Ups
Compiler Support for Alignment
Some Quick Tips on How to Avoid Alignment Issues
What about Instruction Alignment?
Conclusion

Introduction

Intel® and AMD® have each introduced a new family of processors: the Intel Itanium® Processor Family (IPF) Architecture and the AMD x86-64 Architecture. These processors join the IA-32 Intel Architecture family in the Microsoft® Windows® desktop/server world. With Microsoft® Visual C++® and Microsoft Windows on these platforms, you can get incredible performance, but this performance is contingent upon certain programming practices. One of these practices is proper data alignment. Proper data alignment allows you to get the most out of your 64- and 32-bit applications, and on the Itanium it is often not only a matter of performance but also a matter of correctness.

In this document we explain why you should care about data alignment, what it costs if you do not, how to get your data aligned, and what to do when you can't. You'll never look at your data access the same way again.

What Is Data Alignment?

All variables have two components associated with them: 1) their value, and 2) their storage location. In this article our concern is the storage location. The storage location of a variable is also called its address, and is the integer (the mathematical term, integer, not the data type) offset in memory where the data begins. The alignment of a given variable is the largest power-of-2 value, L, where the address of the variable, A, modulo this power-of-two value is 0, that is, A mod L = 0. We will call this variable L-byte aligned. Note that when X > Y and both X and Y are power-of-two values, a variable that is X-byte aligned is also Y-byte aligned.

In Listing 1 we give a code example to illustrate where variables get stored/aligned. Don't worry if you don't understand why things are aligned where they are. You'll understand all of this by the end of the paper. We do encourage you to have fun and play with the example (reorder the local variables and class member variables and see what happens to the addresses).

Listing 1. Data alignment example

#include <stdio.h>

int main()
{
   char a;
   char b;
   class S1
   {
   public:
      char m_1;          // 1-byte element
                         // 3 bytes of padding are placed here
      int m_2;           // 4-byte element
      double m_3, m_4;   // 8-byte elements
   };
   S1 x;
   long y;
   S1 z[5];

   // %p expects a void *, so each address is cast explicitly.
   printf("b = %p\n", (void *)&b);
   printf("x = %p\n", (void *)&x);
   printf("x.m_2 = %p\n", (void *)&x.m_2);
   printf("x.m_3 = %p\n", (void *)&x.m_3);
   printf("y = %p\n", (void *)&y);
   printf("z[0] = %p\n", (void *)z);
   printf("z[1] = %p\n", (void *)&z[1]);
   return 0;
}

In Listing 2 we show the output of what Listing 1 might print. Remember that this is just what it prints on my computer. Your computer will almost certainly print different numbers. That's to be expected.

Listing 2. Output from example in Listing 1

b =       000006FBFFB8FEB1
x =       000006FBFFB8FE98
x.m_2 =    000006FBFFB8FE9C
x.m_3 =     000006FBFFB8FEA0
y =       000006FBFFB8FE90
z[0] =    000006FBFFB8FEB8
z[1] =    000006FBFFB8FED0

From the example in Listings 1 and 2, you can now see how each variable is aligned. The char, b, is aligned on a 1-byte boundary (0xB1 % 2 = 1). The class, x, is aligned on an 8-byte boundary (0x98 % 8 = 0). The member, x.m_2, is aligned on a 4-byte boundary (0x9C % 8 = 4). x.m_3 is on an 8-byte boundary, as is y. z[0] and z[1] are also 8-byte aligned (we omit the modulo math for these last variables, as it is straightforward).

If we look at the class S1, we see that the whole class has become 8-byte aligned. The packing within the class is not optimal: x.m_2 starts 4 bytes after x.m_1, yet x.m_1 is merely a 1-byte element, so 3 bytes of padding sit between the two members.

The Itanium and x86-64 compilers provide for data items of natural lengths of 1, 2, 4, 8, 10, and 16 bytes. All types are aligned on their natural lengths, except items that are greater than 8 bytes in length. Those are aligned on the next power-of-two boundary. For example, 10-byte data items are aligned on 16-byte boundaries. The x86 compiler supports aligning on boundaries of the natural lengths of 1, 2, 4, and 8 bytes.

Next we give a relatively simple way to determine the alignment of a given type: use the __alignof(type) operator. (The macro equivalent is TYPE_ALIGNMENT(type).) This operator returns the alignment requirement of the variable or type passed to it.

Stack Alignment

On both of the 64-bit platforms, the stack is 16-byte aligned. While this uses more space than is needed, it guarantees that the compiler can place all data on the stack in a way that all elements are aligned.

The x86 compiler uses a different method for aligning the stack. By default the stack is 4-byte aligned. While this is space-efficient, you can see that there are some data types that need to be 8-byte aligned, and to get good performance, 16-byte alignment is sometimes needed. The compiler can determine, on some occasions, that dynamic 8-byte stack alignment would be beneficial—notably when there are double values on the stack.

The compiler does this in two ways. First, the compiler can use link-time code generation (LTCG), when specified by the user at compile and link time, to generate the call-tree for the complete program. With this it can determine regions of the call-tree where 8-byte stack alignment would be beneficial, and it determines call-sites where the dynamic stack alignment gets the best payoff. The second way is used when the function has doubles on the stack but, for whatever reason, has not yet been 8-byte aligned. The compiler applies a heuristic (which improves with each iteration of the compiler) to determine if the function should be dynamically 8-byte aligned.

Note   A downside to dynamic 8-byte stack alignment, with respect to performance, is that frame pointer omission (/Oy) effectively gets turned off. Register EBP must be used to reference the stack with dynamic 8-byte stack and thus can't be used as a general register in the function.

Structure and Union Layout

The layout with respect to alignment in structures and unions is dependent on a few simple rules. We can break structure and union alignment into two components: inter-structure/union alignment and intra-structure alignment. (There is no intra-union alignment.)

Inter-structure/union alignment is the simpler case. The rule here is that the compiler aligns the structure with the largest alignment requirement of any of the members of the structure. Unions follow the rule that the union is aligned based on the alignment requirement of the first member (lexically) of the union.

Intra-structure alignment works on the principle that the compiler aligns members at their natural boundaries. It does this through padding, inserting as much padding as necessary up to the padding limit. The padding limit is set by the compilation switch /Zpn. The default for this switch is /Zp8.

The programmer can use the #pragma pack at the point of declaration of the structure to also set the padding limit from that point in the translation unit onward. That is, it doesn't affect structures declared prior to the #pragma pack. Access to structure members that are packed may result in access to data that is unaligned. The compiler inserts the fix-up code for these members, which means that the access won't result in an exception, but it will result in slower and more bloated code. (The fix-up code and exception may not make sense yet, but you'll understand it by the end of this article.)

The padding limits (#pragma pack and /Zpn) should be used with care. Unless most of your work consists of simply moving data without reading or writing particular elements, or you're space constrained, the tradeoffs involved with using padding limits that violate the alignment rules usually don't work in the programmer's favor.

Why Is Alignment a Concern?

Okay, so now you know what it means for a variable to be aligned. Why do we care about alignment? Well as you may have guessed, the reason is performance, and on the Itanium platform, the reason is correctness as well, due to the way misalignment is handled. Now the question is, why? What is the underlying reason that we care about alignment? Certainly no computer architect arbitrarily decided to make our lives difficult. No, but these alignment issues are in fact a remnant of architectural tradeoffs made by computer architects.

On most modern RISC-based designs, data can only be accessed at the boundary defined by the natural length of the data being requested. This fills the destination register with the data of that length. The implication of this is that the computer gets data in natural-length chunks from addresses that are a multiple of the natural length. This further implies that reading data from addresses that are not a multiple of the natural length will be problematic (it may slow down or crash the application).

For example, a 32-bit computer with a word boundary starting at 0 can load data from bytes at locations 0 to 3 in one load, or 4 to 7 in one load, or 40 to 43 in one load, but NOT 2 to 5 in one load (as bytes 2 to 5 span two words). This means that if the computer actually needed to retrieve the 32-bit value from locations 2 to 5, it would have to retrieve the data from 0 to 3 and also the data from 4 to 7, and then perform some operations to extract and shift the bytes that it needs. Depending on the computer system, either the operating system or the compiler does this for you. If they don't, the hardware can raise an exception (and you don't want that to happen; worst case, it could crash the application). When the software bails you out, this requires not only some extra logic but also extra memory accesses. For many applications on modern computers, the memory system is the performance bottleneck, so making extra memory requests can be very costly. In the particular example of this paragraph, it takes two memory accesses to get the 32-bit value from 2 to 5, rather than the one memory access it would take to get a 32-bit value from an aligned address. See Figure 1, as a visual representation might help to make more sense of this potentially tricky topic.

Figure 1. A graphic representation of loading bytes at addresses 2 to 5

Figure 1 shows: a) loading the first word (bytes 0 to 3); b) extracting bytes 2 to 3 from the loaded word; c) loading the second word; and d) extracting the first two bytes from the second loaded word and appending it to the previously extracted bytes.

This notion of data alignment extends beyond the word-size of the given computer architecture, and up the memory hierarchy through the multiple levels of cache, translation lookaside buffer, and pages. Each of these, like the 32-bit words, has an associated unit chunk size. Caches have cache lines that are on the order of 32 to 128 bytes. Pages go from 1024 bytes to megabytes in size. This is all done to make our programs perform more efficiently. We just need to know how to deal with it when it bites us.

Data Alignment Exceptions and Fix-Ups

The obvious way to deal with alignment issues is to avoid them, but in the real world, that isn't always possible. To help generate correct programs, Microsoft Visual C++ and Microsoft Windows have some mechanisms to help the programmer. These don't come without some performance impact, but they do assist in rapid development and/or porting of applications.

The first question that comes to mind might be, "What if I violate the alignment restrictions?" That is, what happens if I generate an alignment fault? Well, a few things can happen, and none of them are good.

In Windows, an application program that generates an alignment fault will raise an exception, EXCEPTION_DATATYPE_MISALIGNMENT. On the Itanium, by default, the operating system (OS) makes this exception visible to the application, and an exception handler might be useful in these cases. If you do not set up a handler, your program will hang or crash. In Listing 3 we provide an example that shows how to catch the EXCEPTION_DATATYPE_MISALIGNMENT exception.

Listing 3. Code to catch alignment exception on Itanium

#include <windows.h>
#include <stdio.h>
#include <string.h>

// Exception filter: handle only data misalignment faults.
int mswindows_handle_hardware_exceptions(DWORD code)
{
   printf("Handling exception\n");
   if (code == STATUS_DATATYPE_MISALIGNMENT)
   {
      printf("misalignment fault!\n");
      return EXCEPTION_EXECUTE_HANDLER;
   }
   else
      return EXCEPTION_CONTINUE_SEARCH;
}

int main()
{
   __try {
      // 16 bytes so that the 8-byte read at offset 3 stays inside the buffer.
      char temp[16];
      memset(temp, 0, sizeof(temp));
      double *val;
      val = (double *)(&temp[3]);   // deliberately misaligned pointer
      printf("%lf\n", *val);
   }
   __except(mswindows_handle_hardware_exceptions(GetExceptionCode())) {}

   return 0;
}

The application can change the behavior of the alignment fault from the default to one where the alignment fault is fixed up. This is done with the Win32 API call SetErrorMode, with the SEM_NOALIGNMENTFAULTEXCEPT flag set. This allows the OS to handle the alignment fault, but at considerable performance cost. Two things to note: 1) this is on a per-process basis, so each process should set this before the first alignment fault, and 2) SEM_NOALIGNMENTFAULTEXCEPT is sticky; that is, if this bit is ever set in an application through SetErrorMode, it can never be reset for the duration of the application (inadvertently or otherwise).

On the x86 architecture, the operating system does not make the alignment fault visible to the application. On this platform you'll also suffer performance degradation on an unaligned access, but it will be significantly less severe than on the Itanium, as the hardware makes the multiple memory accesses needed to retrieve the unaligned data.

On the x86-64 architecture, alignment exceptions are disabled by default, and the fix-ups are done by the hardware. The application can enable alignment exceptions by setting a couple of register bits, in which case the exceptions will be raised unless the user has the operating system mask the exceptions with SEM_NOALIGNMENTFAULTEXCEPT. (For details, see the AMD x86-64 Architecture Programmer's Manual Volume 2: System Programming.)

With that said, there are situations on the x86 and x86-64 platform where unaligned access will generate a general-protection exception. (Note that these are general-protection exceptions and not alignment-check exceptions.) This is when the misalignment occurs on a 128-bit type, specifically SSE/SSE2-based types.

In some experimental runs, with the code in Listing 4 (we used 9,000,000 iterations, with 0 and 3 offset representing aligned and unaligned respectively), we saw that on a slower Pentium III (731MHz, running Microsoft® Windows® XP Professional), the program with the unaligned access runs about 3.25 times slower than the program with the aligned access. On a faster Pentium IV (2.53GHz, running Windows XP Professional) the program with an unaligned access runs about 2 times slower than the program with the aligned access.

This is definitely not the type of performance hit you want to take. Unfortunately, it gets even worse on the Itanium Processor Family. With the same test, running on an Itanium 2 at 900MHz with Microsoft® Windows® Server 2003 (but only for 90,000 iterations due to how long the test takes to run), the unaligned program runs 459 times slower! As you can see, unaligned access in an inner loop can devastate the performance of your application.

So even with the OS fix-up, which prevents your application from crashing, one should avoid unaligned access.

Listing 4. Example code sample to compare OS fix-up unaligned vs. aligned

#include <stdio.h>
#include <stdlib.h>
#include <sys/timeb.h>
#include <time.h>
#include <windows.h>

#ifdef _WIN64
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif

int main(int argc, char* argv[])
{
   // Ask the OS to fix up alignment faults instead of raising exceptions.
   SetErrorMode(GetErrorMode() | SEM_NOALIGNMENTFAULTEXCEPT);
   UINT iters, offset;
   if(argc < 2)
      iters = 9000000;
   else
      iters = atoi(argv[1]);

   if(argc < 3)
      offset = 0;
   else
      offset = atoi(argv[2]);

   // UINT is 64 bits wide under _WIN64, so cast to match the %u format.
   printf("iters = %u, offset = %u\n", (unsigned)iters, (unsigned)offset);

   double *dest, *origsource;
   double *source;
   dest = new double[128];
   origsource = new double[150];

   // A nonzero byte offset (for example 3) makes source misaligned.
   source = (double *)((UINT)origsource + offset);
   printf("dest = %p  source = %p\n", (void *)dest, (void *)source);

   LARGE_INTEGER startCount, endCount, freq;
   QueryPerformanceFrequency(&freq);
   QueryPerformanceCounter(&startCount);

   for (UINT x = 0; x < iters; x++)
      for(UINT i = 0; i < 128; ++i)
         dest[i] = source[i];

   QueryPerformanceCounter(&endCount);
   printf("elapsed time = %lf\nTo keep stuff from being optimized %lf\n",
    (double)(endCount.ENDPART-startCount.ENDPART)/freq.ENDPART, dest[75]);
   delete[] origsource;
   delete[] dest;
   return 0;
}

Compiler Support for Alignment

Sometimes, through explicit syntax, the compiler can help with these alignment issues. In this section we give a few extensions that you can use in the source code to either minimize the cost of unaligned access or to help ensure aligned access.

__unaligned keyword

As we stated earlier, the compiler by default will align data on their natural boundaries. Most of the time this is sufficient and there won't be a problem, but there can be situations where an alignment issue will exist with no clear way to work around it (or it would take too much effort to do so).

When you, the programmer, can determine statically which variables might be accessed on unaligned boundaries, you can specify these variables as being unaligned with the __unaligned keyword (the macro equivalent is UNALIGNED). This keyword is useful in that the compiler will insert the code to access the variable on an unaligned boundary, and it won't fault. It does this by inserting extra code that will finesse its way around the unaligned boundary—but this does not come for free. These extra instructions will slow your code down, plus increase the code size. Unfortunately, these extra instructions are generated even in places where it might be provable that the data is aligned! So use this keyword with care.

We can modify the program of Listing 4 by using the __unaligned keyword in a variable declaration. In this example we change the declaration of source to the following:

 __unaligned double *source;

This program will now run correctly on the Itanium even if you don't enable the operating system to fix up the alignment faults, although it will suffer some performance degradation. This is still better than having your program crash or suffer the severe performance penalty of the OS fix-up. (Keep in mind that, as noted earlier, the compiler inserts code to handle misaligned access even where it is provable that the data is aligned. The OS only goes into its fix-up code when an exception occurs, and exceptions only occur when a misaligned access actually happens.)

In Figure 2 we have a chart that gives the running time on an Itanium 2 for the example program of Listing 4 when using various data access methods. The program executes fastest when the data is aligned and the __unaligned keyword is not used. It runs next fastest when the data is aligned, but the __unaligned keyword is used. (Recall that if you use the __unaligned keyword, you pay a performance penalty even if your data is aligned.) You run slightly slower if you use the __unaligned keyword on unaligned data. Lastly, you run much slower if you access unaligned data, but have set SetErrorMode with SEM_NOALIGNMENTFAULTEXCEPT.

Figure 2. Comparative runtimes of test program to illustrate effect of different types of accesses. Note that the y-axis is on a log10 scale.

__declspec(align(#))

So we've dealt with the problem of a variable that you know is going to have unaligned access, but what about when you have a variable and you'd like it to be allocated on a boundary that is different than its natural boundary? For example, when using SSE2 instructions, you may want to align your operands on a 16-byte boundary, or you may want to align certain variables on cache-line boundaries. __declspec(align(#)) is made for such purposes (where # is a power of two). In Listing 5 we give an example of its use.

Listing 5. Code demonstrating how __declspec(align(#)) works

#include <stdio.h>

class ClassA {
public:
   char d1;
   __declspec(align(256)) char d2;
   double d3;
};

int main()
{
   __declspec(align(32)) double a;
   double b;
   __declspec(align(512)) char c;
   ClassA d;

   // sizeof yields a size_t, so cast it for %d; print addresses with %p.
   printf("sizeof(a) = %d, address(a) = %p\n", (int)sizeof(a), (void *)&a);
   printf("sizeof(b) = %d, address(b) = %p\n", (int)sizeof(b), (void *)&b);
   printf("sizeof(c) = %d, address(c) = %p\n", (int)sizeof(c), (void *)&c);
   printf("sizeof(d) = %d, address(d.d2) = %p\n", (int)sizeof(d), (void *)&d.d2);
   return 0;
}

The output might look something like the below (taken from my computer):

sizeof(a) = 8, address(a) = 12fde0
sizeof(b) = 8, address(b) = 12fdd8
sizeof(c) = 1, address(c) = 12fa00
sizeof(d) = 512, address(d.d2) = 12f900

Note the sizeof of the class. The sizeof value for any structure/class is the offset of the final member, plus that member's size, rounded up to the nearest multiple of the largest member alignment value or the whole structure/class alignment value, whichever is greater. (This definition is taken from MSDN's entry on align.)

The CRT and Intrinsics

__declspec(align) is a useful tool, but it cannot align dynamic data off of the heap. For this the C runtime library (CRT) gives a set of aligned memory allocation routines. These are listed below (and come with <malloc.h>):

  • void *_aligned_malloc(size_t size, size_t alignment)
  • void *_aligned_offset_malloc(size_t size, size_t alignment, size_t offset)
  • void _aligned_free(void *aligned_block)
  • void *_aligned_realloc(void *aligned_block, size_t size, size_t alignment)
  • void *_aligned_offset_realloc(void *aligned_block, size_t size, size_t alignment, size_t offset)

See Data Alignment on MSDN for more information on these routines.

One of the best ways to get performance is to use code that programmers have spent a lot of time tuning. The supplied CRT memory routines (strncpy, memcpy, memset, memmove, and so on) are a great example of this. The CRT routines are hand-coded routines (often in assembly) that are tuned to the particular architecture and that align the source and destination so that, for large moves, the costs of the unaligned accesses are minimized.

Alternatively, the user can use the /Oi flag or the #pragma intrinsic(functions) pragma, which enables generation of intrinsics. (Note that the /Oi flag is implied by the /O2 flag.) Intrinsics are inlined routines emitted by the compiler that are generally not as well tuned as the assembly language CRT routines. They do avoid the overhead of the function call at the additional cost of code bloat. It's also worth noting that using /Oi or #pragma intrinsic is a suggestion to the compiler, and the compiler is free to emit intrinsics or the CRT routines. Looking at the assembly code is a good way to determine which was generated.

The IPF compiler will also use type information to assist in expanding the inline intrinsics. The compiler will examine the types of pointers to the source and destination addresses, and from this will infer the alignment of these addresses. If the pointer types are not correct, you might take an alignment exception or the program will run slower (with the dreaded OS fix-ups).

In Listing 6 we give code to show the effects of aligned versus unaligned accesses on code that uses the compiler intrinsics for memcpy or the CRT assembly language hand-tuned routines. To use the CRT assembly language hand-tuned routines, make sure to insert the #pragma function(function) pragma.

Listing 6. Code to demonstrate the effect of intrinsic and CRT routines on aligned vs. unaligned accesses

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <windows.h>

#ifdef _WIN64
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif

#pragma function(memcpy) // comment out this line for intrinsic generation.

int main(int argc, char *argv[])
{
   if (argc < 4) {
      printf("usage: %s iters size offset\n", argv[0]);
      return 1;
   }
   int iters1 = atoi(argv[1]);
   int size1 = atoi(argv[2]);
   int offset = atoi(argv[3]);
   char *source, *origsource = (char *)_aligned_malloc(size1, 8);
   char *dest, *origdest = (char *)_aligned_malloc(size1, 8);
   // A nonzero offset makes both pointers misaligned.
   source = (char *)((UINT)origsource + offset);
   dest = (char *)((UINT)origdest + offset);
   LARGE_INTEGER startCount, endCount, freq;
   QueryPerformanceFrequency(&freq);
   QueryPerformanceCounter(&startCount);
   for(int i = 0; i < iters1; ++i)
      memcpy(dest, source, size1-offset);
   QueryPerformanceCounter(&endCount);

   printf("&source = %p \t &dest = %p\n", (void *)source, (void *)dest);
   // dest[1] is a char, so it is printed with %d rather than %lf.
   printf("elapsed time = %lf\nTo keep stuff from being optimized %d\n",
    (double)(endCount.ENDPART-startCount.ENDPART)/freq.ENDPART, dest[1]);
   // Free the pointers returned by _aligned_malloc, not the offset copies.
   _aligned_free(origsource);
   _aligned_free(origdest);
   return 0;
}

Figure 3. The time to perform a memcpy using aligned vs. unaligned data and CRT vs. intrinsic routines on a Pentium III

Figures 3 and 4 show the relative performance of each of the four configurations on memcpys of various sizes, on a Pentium III and an Itanium 2 computer respectively. We generated this data with the code from Listing 6 using the following parameters:

exename 1000000 size offset

Where 8 ≤ size ≤ 4096 and 0 ≤ offset ≤ 1.

On the Pentium III, for aligned copies, it doesn't matter too much whether you use the CRT or the intrinsic version. But for large unaligned copies, using the CRT version is a big win. On the Itanium 2 we only compare the CRT versions, as the compiler almost always uses the CRT versions, even when the programmer specifies /Oi or #pragma intrinsic. In Figure 4 we compare unaligned versus aligned CRT calls. You can clearly see that using aligned data results in better performance. The lesson here isn't subtle at all.

Figure 4. The time to perform a memcpy using aligned vs. unaligned data with CRT routines on an Itanium 2

Some Quick Tips on How to Avoid Alignment Issues

If you're short on time, and just want a quick section to refer to, you've found the right place. Here are some quick tips to help deal with data alignment related issues:

  1. When casting from an aligned pointer P1 to a pointer P2, where the TYPE_ALIGNMENT(P1) < TYPE_ALIGNMENT(P2), you must ensure that all accesses are properly aligned. Using P2 to dereference addresses originally pointed to by P1 may result in an alignment fault. But if TYPE_ALIGNMENT(P1) > TYPE_ALIGNMENT(P2), then P2 is fine to dereference all elements, element-wise, that it points to.
  2. Do not pack structures unless you're sure the space savings is a win, for example, if you're simply transporting the structure around and never accessing individual members.
  3. Understand what boundaries you need to align data on. Not having your alignment high enough can lead to alignment problems, but setting the alignment too high can lead to data bloat.

What about Instruction Alignment?

Well you're almost to the end of this article, and some of you may be wondering, "You've talked about data alignment, what about instruction alignment? Aren't instructions also stored in memory?" The answer is, instruction alignment is also an issue, but one not covered in this article, as most programmers don't have to deal with it at all. Instruction alignment is mostly an issue for compiler writers. The one type of general-purpose programmer who might still care about instruction alignment would be the assembly-language programmer, especially if he or she is not using an assembler.

Conclusion

Hopefully, you will now feel confident that you know the ins and outs of data alignment when you sit down to do Windows development. This article has covered how to avoid many data-alignment faults, what to do when they are inevitable, and also the various costs associated with them. This knowledge will be useful for all Windows development, but will prove especially useful when porting code from x86 to Itanium, where data alignment plays a front-and-center role. In the end, the result will be faster, more reliable code.

About the Author

Kang Su Gatlin is a Program Manager at Microsoft in the Visual C++ group. He received his PhD from UC San Diego. His focus is on high-performance computation and optimization.
