
Hyper-Threading and Multi-Core

Threads

Consider the problem of cooking for a big dinner party. Each dish has its own recipe. You could follow the instructions in one recipe until that one dish is done, then set it aside and start the next dish. Unfortunately, it would take several days to cook the dinner, and everything would come out cold. Fortunately, there are long periods of time when something sits in the oven, and while it is cooking you can prepare one or two other things.

A sequence of instructions to do one thing is called a "recipe" in the kitchen, and a "thread" in computer programming. A computer user intuitively understands the behavior of threads when running several programs on the screen, or when listening to an MP3 file in the background while typing a letter into the word processor. Even a single program can make use of threads. A web browser has a separate thread for every file or image you are downloading, and it may assign a separate thread to decode each image or banner ad that appears on the screen when you visit the New York Times web site.
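As a rough illustration of one program using several threads, here is a minimal Python sketch. It is not how any real browser is written. The file names and the fetch and fetch_all helpers are made up for this example, and a short sleep stands in for waiting on the network.

```python
import threading
import time

def fetch(name, seconds, results):
    """Pretend to download one file; the sleep stands in for network delay."""
    time.sleep(seconds)
    results[name] = "done"

def fetch_all(files):
    """Start one thread per file, then wait for all of them to finish."""
    results = {}
    workers = [
        threading.Thread(target=fetch, args=(name, seconds, results))
        for name, seconds in files.items()
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results

# Hypothetical files, each taking 0.05 seconds to "download".
files = {"page.html": 0.05, "photo.jpg": 0.05, "banner.gif": 0.05}
start = time.perf_counter()
results = fetch_all(files)
elapsed = time.perf_counter() - start
print(results)
print(f"elapsed: {elapsed:.2f} seconds")
```

Because the three waits overlap, the elapsed time should be close to the longest single download rather than the sum of all three.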

Some short operations have a very high priority. For example, a pot of rice you just started has to be checked every 30 seconds or so to see if it has come to a full boil. At that point the heat can be turned down, the pot can be covered, and now you can forget it for 15 minutes. However, if you don't check it regularly at first, it will boil over, make a mess on the stove, and you have to start over.

Computer programs also assign a priority to their threads. As with cooking, high priority can only be assigned to trivial tasks that can be accomplished in almost no time at all. Just as a kitchen has to have timers, and a beep when the microwave is done, so the operating system has to have support for program threads and the ability to connect them to timers and to events signaled when data arrives from the network or another device.
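The timer and event support described above can be sketched with Python's standard threading module. This is only an illustration: the names boiled and rice_reaches_boil are invented for the example, and the 0.05 second delay is arbitrary. Python's threading module does not expose thread priorities, so the sketch shows only the timer and the event.

```python
import threading

# The event plays the role of the beep from the kitchen timer.
boiled = threading.Event()

def rice_reaches_boil():
    """Runs on the timer's own thread when the delay expires."""
    boiled.set()

# Ask for a callback after 0.05 seconds (an arbitrary delay for the example).
timer = threading.Timer(0.05, rice_reaches_boil)
timer.start()

# The main thread sleeps here and uses no CPU until the event is signaled,
# or until the 2 second safety timeout expires.
signaled = boiled.wait(timeout=2.0)
timer.join()

print("turn down the heat" if signaled else "timed out")
```

The waiting thread does not poll in a loop. The operating system wakes it when the event is set, which is the software equivalent of the cook being called back by the beep.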

In the kitchen, each task you perform has its own set of tools. To chop carrots, you need a knife and a cutting board. To take something from the oven, you need oven mittens. It takes some small amount of time to set down what you are doing and change. If you don't change, you will find it is very difficult to cut carrots while wearing oven mittens.

Each thread in the computer stores its status and data in the CPU chip. To switch threads, the operating system has to take this data out of the CPU, store it away, and load up data for the other thread. Switching from one thread to another takes a few hundred instructions, but this is not a problem when the CPU can execute billions of instructions a second while a hard drive or network performs only about 30 operations per second. The overhead of thread switching for I/O is trivial.
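The claim that this overhead is trivial can be checked with a little arithmetic. The sketch below assumes 300 instructions for "a few hundred", one billion instructions per second as a low figure for "billions", and one thread switch per I/O operation. These specific values are assumptions chosen for the example, not measurements.

```python
SWITCH_COST_INSTRUCTIONS = 300               # assumed value for "a few hundred"
CPU_INSTRUCTIONS_PER_SECOND = 1_000_000_000  # assumed low figure for "billions"
IO_OPERATIONS_PER_SECOND = 30                # figure used in the text

def switch_overhead_fraction(switch_cost, switches_per_second, cpu_rate):
    """Fraction of the CPU's capacity spent on thread switching."""
    return switch_cost * switches_per_second / cpu_rate

# Assume one thread switch for every I/O operation.
fraction = switch_overhead_fraction(
    SWITCH_COST_INSTRUCTIONS,
    IO_OPERATIONS_PER_SECOND,
    CPU_INSTRUCTIONS_PER_SECOND,
)
print(f"switching overhead: {fraction:.6%} of the CPU")
```

Under these assumptions the switches consume 9,000 instructions out of a billion each second, which is about nine millionths of the processor's capacity.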

If it is a big complicated dinner that one person can simply not get done in time, you need some help. Specific tasks can be assigned to different people. The threads don't change. The bread is still cooked the same way whether there is one person in the kitchen or two. With two people, however, one can chop carrots while the other peels potatoes.

A single core CPU runs only one thread at a time. However, the CPU runs so fast that it can switch threads hundreds of times a second. From the point of view of a human user, many different programs are running at once. If the human could speed up to match the speed of the computer, he would see that only one thing is running at a time. However, many current computers support more than one CPU core. Each core is an independent CPU and can run any thread in the system. When threads are ready to run, the operating system will assign one thread to every available CPU core. Now two or more threads are actually running concurrently.
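The difference between taking turns on one core and truly running on two can be shown with a toy model. This is not how a real operating system scheduler works. The task and run_round_robin names are invented for the example, and each "tick" stands for one time slice.

```python
from collections import deque

def task(name, steps):
    """A pretend thread that needs `steps` time slices to finish."""
    for i in range(1, steps + 1):
        yield f"{name}:{i}"

def run_round_robin(tasks, cores=1):
    """On each tick, up to `cores` ready tasks each run for one slice."""
    ready = deque(tasks)
    timeline = []
    while ready:
        # Pick as many ready tasks as there are cores.
        running = [ready.popleft() for _ in range(min(cores, len(ready)))]
        tick = []
        for current in running:
            step = next(current, None)
            if step is not None:
                tick.append(step)
                ready.append(current)  # not finished, back of the queue
        if tick:
            timeline.append(tick)
    return timeline

one_core = run_round_robin([task("A", 2), task("B", 2)], cores=1)
two_cores = run_round_robin([task("A", 2), task("B", 2)], cores=2)
print(one_core)   # the two tasks alternate, one per tick
print(two_cores)  # both tasks advance in the same tick
```

With one core the timeline needs four ticks and only one task appears in each. With two cores the same work finishes in two ticks because both tasks advance together.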

Hyper-Threading

As has already been noted, memory delay has become an important problem for computer performance. When an instruction requires data that is in second level cache, it may have to wait a cycle or two. During this time, the CPU will look for other instructions that do not depend on the result of the blocked instruction and execute them out of order. However, out of order execution is at best good for a dozen instructions. When an instruction needs data from DDR DRAM, it will be blocked for a length of time during which the CPU could have run hundreds of instructions.

In 2002, Intel tried to address this memory delay problem with a trick called Hyper-Threading. When the operating system assigns a CPU core to a thread, it has to load the registers and status of the thread into the CPU. Later on it has to save the registers and status back into memory before assigning a different thread to the core. With Hyper-Threading, each CPU core appears to be two separate processors capable of running two different threads. However, the only hardware that is actually duplicated is the set of registers and status flags. The core can really only execute one of the two threads at a time. It can switch between the two sets of registers belonging to the two different threads almost instantly.

The core spends half its time running one thread and half running the other. However, if an instruction in one thread blocks waiting for data in main memory, then after all the out of order instructions have been processed for this thread, the core can switch over to the other thread and process its instructions. While a few dozen instructions in the blocked thread may not depend on the results generated by the blocked instruction, by definition every instruction in another thread is independent of all the instructions in the blocked thread. Hyper-Threading gives the core something else to do while waiting for data from main memory.
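A toy cycle count shows why this helps. The model below is a deliberate simplification and not a description of real Intel hardware: each thread is a string of operations, "C" is a compute instruction that takes one cycle, "M" is a load that blocks its thread for an assumed miss penalty, and the core issues at most one instruction per cycle. Out of order execution is ignored, and the function names are invented for the example.

```python
def cycles_one_at_a_time(threads, miss_penalty):
    """Run each thread to completion before starting the next one."""
    return sum(
        1 + (miss_penalty if op == "M" else 0)
        for thread in threads
        for op in thread
    )

def cycles_hyperthreaded(threads, miss_penalty):
    """Each cycle, issue one instruction from the first thread not blocked."""
    position = [0] * len(threads)  # next instruction index for each thread
    ready_at = [0] * len(threads)  # cycle at which each thread is unblocked
    cycle = 0
    while any(p < len(t) for p, t in zip(position, threads)):
        for i, thread in enumerate(threads):
            if position[i] < len(thread) and ready_at[i] <= cycle:
                op = thread[position[i]]
                position[i] += 1
                if op == "M":
                    # The thread waits for main memory; others may still run.
                    ready_at[i] = cycle + 1 + miss_penalty
                break
        cycle += 1  # if every thread was blocked, this cycle was idle
    return max(cycle, max(ready_at))

threads = ["CCMCC", "CCMCC"]  # two short threads, one memory miss each
penalty = 10                  # assumed stall, in cycles
serial = cycles_one_at_a_time(threads, penalty)
overlapped = cycles_hyperthreaded(threads, penalty)
print(serial, overlapped)
```

In this model the two threads take 30 cycles when run one after the other and 18 cycles when the core switches to the second thread while the first is blocked. The saving comes entirely from overlapping one thread's computation with the other thread's memory wait.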

Hyper-Threading was dropped from the first generation of multicore processors. A second real core is better than a second pretend core. However, the next generation, the Nehalem or Core i7 processors, has both Hyper-Threading and multiple cores.

This creates problems and opportunities for new generations of smarter operating systems. A four-core Nehalem processor allows two threads to be assigned to every core. If four threads happen to be ready to execute at this moment, how should they be assigned to the hardware? You might think that a real core is always better than a simulated Hyper-Threading core, so the four program threads should each be assigned to a real core. Certainly that provides the best performance, and it would have been the obvious choice a few years back. However, in the modern low power world there are other considerations.

The operating system takes a long term view of processor load. While sometimes there may happen to be four threads ready to execute, most of the time there are only one or two, and often no program is ready to execute. When the system is lightly loaded, the OS makes a global optimization decision to stop sending work to one or more cores and allow the CPU chip to power them down in "sleep" mode. After this decision, the system will occasionally have to queue up a thread waiting for one of the reduced number of active cores to become available. However, remember that the end user is a human who will not notice if a thread occasionally has to wait a millionth of a second longer than it would have had all the cores been running.

If the load increases to the point that delays might become important, then the OS will begin to send work to the previously sleeping core and the CPU will have to power it up. The OS makes this decision based on the total load observed over seconds, not the occasional congestion that lasts a few microseconds.
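The idea of deciding on sustained load rather than momentary spikes can be sketched as a moving average. This is an invented illustration and not the policy of any real operating system. The CoreParkingPolicy class, the window length, and the rule of rounding the average to the nearest whole core are all assumptions made for the example.

```python
from collections import deque

class CoreParkingPolicy:
    """Toy policy: keep as many cores awake as the average recent load needs."""

    def __init__(self, total_cores, window):
        self.total_cores = total_cores
        self.samples = deque(maxlen=window)  # only the most recent samples

    def observe(self, ready_threads):
        """Record one load sample and return how many cores to keep active."""
        self.samples.append(ready_threads)
        average = sum(self.samples) / len(self.samples)
        # Always keep one core awake, and never ask for more than exist.
        return max(1, min(self.total_cores, round(average)))

policy = CoreParkingPolicy(total_cores=4, window=10)

# Light load: one ready thread at each of nine samples.
light = [policy.observe(1) for _ in range(9)]

# A single brief spike of four ready threads does not wake more cores.
after_spike = policy.observe(4)

# Sustained heavy load eventually brings all four cores back.
heavy = [policy.observe(4) for _ in range(9)]

print(light[-1], after_spike, heavy[-1])
```

In this run the policy keeps one core active through the light load and the single spike, then returns to four cores once the heavy load has filled the whole window.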

Nehalem offers Hyper-Threading so the OS can decide for short periods of time to load four program threads into two active cores of the CPU while the other two cores are in power down sleep mode. Vista and Windows 7 then promise to make smart decisions to optimize performance and power savings at the same time.

Co(re)ordination

Two programs are running on your computer. While they mostly do different things, they may both store data on the same disk and they both display output on the same screen. Internally, the operating system must coordinate their concurrent access to shared resources. At the hardware level, each CPU core must coordinate access to memory and to the I/O devices.

Even today Intel is stuck with an old design that dates back to the days when multiple processor cores required multiple CPU chips. The chips coordinated their activity through the single Northbridge chip on the mainboard.

Years ago AMD came up with a better design. The Opteron and Athlon 64 chips had a processor core, memory controller, I/O bus, and (on server models) a connection to other CPU chips. All of these components were connected to each other through a shared component called the XBar. The XBar served inside the chip the same purpose as an Ethernet connecting different computers in a room. Any component could talk to any component.

So when multicore systems became popular, AMD could simply add a second core to the same XBar. Now there were two cores, one memory controller, one I/O bus, and maybe some connections to other chips.

Intel doesn't have an XBar today, so two processor cores in the same chip cannot talk directly to each other, but must use the front side bus to talk to the Northbridge chip to then come back through the same front side bus to talk to the other core.

In late 2008, the "Nehalem" generation of CPU chips from Intel will adopt the AMD design and will come with their own embedded memory controller and XBar component. This will probably end the last big technical advantage that AMD now has.

 Copyright 1998, 2007 PCLT -- Introduction to PC Hardware -- H. Gilbert
