Not so long ago, the headline characteristic of a desktop processor, and a reasonable proxy for its performance, was its clock speed. Some of our readers probably even remember the times when this figure, rather than an abstract model number, was central to CPU marking. Since then, however, much has changed. It turned out that pushing processor frequencies above 3-4 GHz causes serious heat-dissipation problems, so progress in clock speed stalled, and manufacturers began increasing the number of cores instead. High performance in such chips comes not so much from executing instructions at a high rate as from the presence of several identical computing cores capable of processing data in parallel.
However, developing the multi-core concept requires serious technological effort: increasing the number of cores multiplies the transistor budget, and is therefore possible only alongside major breakthroughs in manufacturing-process miniaturization. As a result, we have arrived at a situation where the difference in speed between flagship processors of different generations is determined for the most part only by changes in their microarchitecture. Indeed, we have seen no tangible progress in clock frequency since roughly 2003, and the core count of top desktop processors has not grown since 2009. Add to this the fact that new microarchitectures have lately taken only timid steps toward improving per-core performance, and a rather dull picture emerges: the state of the desktop processor market can fairly be described as stagnant.
Fortunately, this is understood not only by consumers but also by the developers themselves. That is why, for example, Intel's latest high-performance CPU, the Core i7-5960X, received not the six computing cores that had become standard for this class, but eight. In a number of cases this really did raise the speed of upper-price-bracket desktop systems to a fundamentally new level. Still, expanding the set of computing resources by one third may strike some enthusiasts as an insufficient measure, especially since Intel can in fact make chips with far more cores. They do not reach desktops, but for the server market the company offers models with 10, 12, 14, and 15 cores, and, from the first quarter of this year, with 16 and even 18 cores. It is therefore not surprising that there is a small layer of users who ignore formal positioning and build their desktop computers around such monstrously multi-core processors. Moreover, configurations with such chips exist among commercially available computers: for example, one variant of the Mac Pro workstation can be equipped with a 12-core Xeon-class CPU.
But however tempting the idea of building a desktop system around a many-core processor may look, it should be understood that its real performance may turn out significantly lower than expected. The fact is that not all algorithms can benefit from deep parallelization across many cores, and there is a whole layer of real-world tasks for which the difference between a CPU with eight cores and one with, say, twelve will be insignificant. Nor should we forget that concentrating more cores in a processor die increases heat generation, forcing further limits on the frequencies of many-core CPUs. As a result, using such chips may in a number of situations be not merely useless but even harmful.
With all this in mind, we decided to conduct an experiment and see whether common desktop workloads have grown into the capabilities that modern processors with more than eight cores can provide. As it happens, a 12-core Xeon E5 v3 processor of the new generation, based on the Haswell-EP design, found its way into our lab. We decided to assemble a desktop system around it, similar to the flagship platforms built on the senior Haswell-E chips, and see in practice whether this brings any benefit.
⇡ # CPU cores and performance scalability
Before turning to the tests, a few words should be said about how the effect of increasing the number of cores can be estimated even without practical testing. It is fairly obvious that chips with a high core count are effective only when the tasks being solved on the computer can be parallelized. So before acting on the formula “I need more cores,” you need to make sure that all the resources of such a processor can actually be loaded with useful work; only in this case can one speak of positive performance scaling. For example, if your applications cannot create more than four parallel threads, a processor with more than four cores will be useful only if you run two or more copies of such programs simultaneously.
However, even with applications capable of unlimited multithreading, the performance gain may be unimpressive. The fact is that any algorithm, even the most amenable to parallelization, contains parts that must be executed sequentially. This factor stands in the way of unbounded performance scaling and is described by the so-called Amdahl's law. Formulated by the American computer architect Gene Amdahl in 1967, the principle states that when a task is divided into several parts, the total time of its execution on a parallel system cannot be less than the time needed to execute its longest sequential fragment. This means that the speedup obtained by parallelizing a program across a number of processor cores is limited by the time it takes to execute its sequential portions.
The idea of Amdahl's law is easiest to explain with a practical example: if 90 percent of the code lends itself to ideal parallelization while the remaining 10 percent can only run single-threaded, then the maximum achievable gain is a tenfold increase in the program's speed, no matter how many cores the processor has. That is why multi-core architectures are most effective in situations where the share of sequential code is minimal. Increasing the clock frequency, by contrast, remains a more effective and universal way of improving performance.
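This ceiling is easy to check with a few lines of arithmetic. The sketch below (plain Python; the `amdahl_speedup` helper is our own illustration, not anything from Intel's tooling) evaluates Amdahl's law for a program whose code is 90 percent parallelizable:

```python
def amdahl_speedup(p, n):
    """Speedup on n cores for a program in which a fraction p of the work
    parallelizes perfectly; the serial fraction (1 - p) never speeds up."""
    return 1.0 / ((1.0 - p) + p / n)

# With 90% parallel code the speedup creeps toward, but never reaches, 10x:
for cores in (8, 64, 1024):
    print(cores, round(amdahl_speedup(0.90, cores), 2))
```

Even at 1024 cores the result stays just under 10x, exactly the 1 / 0.1 ceiling dictated by the 10 percent serial portion.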
In the following illustration we present graphs, calculated theoretically according to Amdahl's law, of program speedup for parallel-code fractions from 50 to 95 percent on processors with core counts from 1 to 20.
Looking at the graph, it is easy to spot an important regularity. As the number of cores in the processor grows, the resulting performance gain gradually diminishes. That is, if a six-core processor delivers, say, a fourfold improvement over a single-core one, this does not mean that a 12-core CPU applied to the same problem will deliver an eightfold speedup. Moreover, raising the core count above eight only makes sense for problems in which the share of parallelizable code exceeds 80 percent. Otherwise, the manifold complication (and corresponding rise in price) of the CPU yields only a small performance gain, no more than about 20 percent.
Calculations of this kind are why Intel decided to cap the core count of desktop processors at eight. In most common tasks solved by desktop computers, the share of parallelizable code does not exceed 80 percent, so using a CPU with a server-grade arsenal of cores is simply impractical. And for the bulk of gaming applications, where even with multithreading optimizations the share of parallel code rarely exceeds 60 percent, processors with four cores are sufficient. Deepening the multi-core approach further can add only 15-20 percent to performance, which can hardly be called a worthy return on multiplying the number of computing cores several times over.
All the theoretical calculations above describe the ideal case: they do not take into account that the cores share common resources, such as the cache and RAM, and have a limited channel for interaction. In real life, therefore, performance can scale even worse than the graphs show. Moreover, beyond a certain core count, performance may even decline. It should also be mentioned that increasing the complexity of a processor in most cases entails some reduction in its peak clock frequency. So there is clearly no reason to pin blind hopes on many-core CPUs. The formula “more cores means faster” does not always work, and desktop workloads are not a particularly favorable environment for such scaling. In other words, increasing the core count is quite a specialized way of improving processor speed, and it is not suitable for every situation.
⇡ # Xeon E5 v3 and Core i7-5xx0: what’s the difference
The first Haswell-EP series, the result of moving server multi-core CPUs to the Haswell microarchitecture, appeared simultaneously with Haswell-E last fall. These chips, officially named Xeon E5 v3, are based on 22 nm semiconductor dies and are designed for boards with one or more LGA2011-3 sockets, which are not compatible with the LGA2011 sockets used in server platforms of past generations. The main advantages of Haswell-EP over its predecessors Sandy Bridge-EP and Ivy Bridge-EP are a higher maximum core count along with larger cache capacity, as well as reduced heat generation. In addition, Xeon E5 v3 processors gained an integrated voltage regulator and support for AVX2 instructions, and inherited a number of other improvements that raise per-core performance through microarchitectural optimizations.
When we got acquainted with the high-performance LGA2011-3 desktop platform and the Haswell-E family, we noted that it was obtained by adapting Haswell-EP server processors for desktop systems. This is what creates the family ties between them that allow Xeon E5 v3 class processors to be installed in motherboards based on the Intel X99 chipset. There is nothing strange about such compatibility: similar half-server, half-desktop configurations could be built from processors and platforms of the previous generation, and nothing changed with the arrival of the new microarchitecture. Haswell-E and Haswell-EP use the same processor socket, and all that Xeon E5 v3 needs to work in a desktop PC is BIOS support, which board makers usually add, at least in their top-end products.
Nevertheless, Xeon E5 v3 and the Core i7-5xx0 for LGA2011-3 cannot simply be treated as identical. These processors have important differences: the desktop design looks noticeably simplified next to Haswell-EP, and the senior Xeon E5 v3 models, with more than eight cores, can boast a more advanced internal layout. In particular, Intel prepared three fundamentally different dies for the Haswell-EP server processors, but only the simplest of them is used in Haswell-E.
The junior Xeon E5 v3 models, with up to eight cores inclusive, as well as the desktop Haswell-E chips, are based on a 354 mm² die containing 2.6 billion transistors. Structurally, this die consists of two rows of processor cores with a shared third-level cache between them. The cores and cache blocks are joined by a single bidirectional ring bus, to which are attached a quad-channel memory controller with DDR4 SDRAM support, a PCI Express 3.0 bus controller, and a QPI interface, the latter needed for building multiprocessor configurations.
Xeon E5 v3 processors with 10 to 12 cores use a more massive semiconductor die, comprising 3.84 billion transistors and occupying 492 mm². In it the computing cores are arranged in three rows, and the cache is split into two areas: one between the first two rows of cores, the other along the outer edge of the third row. Switching and data exchange here rely on two peer ring buses. One links the first two rows of cores and their cache; the second serves the third row and the outlying cache area. To exchange data between the ring buses, an additional element appears in the processor: a buffering switch. This scheme, called cluster-on-die, reduces the load on the ring buses and provides faster core-to-core interaction and better cache throughput under multithreaded loads. Moreover, in this version of the design the memory controller is divided into two parts, spread across the two ring buses. In effect, Xeon E5 v3 processors with 10 or more computing cores have not one quad-channel but two dual-channel DDR4 SDRAM memory controllers.
The most complex version of the Haswell-EP die, intended for processors with 14 to 18 cores, is more convoluted still. Its area is 662 mm², and it contains 5.69 billion transistors. Here the cores are arranged in four rows, with two areas of third-level cache laid out between them. As in the previous case, these elements are united by two ring buses joined into a single whole by a buffering switch. The DDR4 memory controller is again divided into two dual-channel parts, and the PCIe and QPI controllers are assigned to whichever ring bus serves the smaller number of cores.
The arrangement with two independent bidirectional ring buses and a buffering switch appears in Haswell-EP for the first time. Earlier, in the multi-core Sandy Bridge-EP and Ivy Bridge-EP processors, the cores were linked by three unidirectional ring buses passing through different groups of cores. That scheme was simpler and managed without any switching, but it proved inefficient in heavy multithreaded tasks, where traffic on the ring buses grew sharply and could lead to unwanted core stalls.
One of the key features of LGA2011-3 processors is support for the new DDR4 SDRAM memory type, with higher operating frequencies and lower voltage. Both the server and the desktop CPU variants are compatible with such memory, but the DDR4 controllers in the server Xeon E5 v3 have slightly different capabilities from those in Core i7-5xx0 class processors. For server CPUs the amount of supported memory is critical, so they work with registered modules (RDIMMs) and load-reduced modules (LRDIMMs). As a result, while a desktop Haswell-E allows up to 64 GB of memory in eight unbuffered modules, Haswell-EP server processors can be equipped with an array of twelve LRDIMMs totaling up to 768 GB, or the same number of RDIMMs totaling 384 GB. At the same time, the speed of such modules, as on the desktop platform, can reach DDR4-2133 mode. Thus, taking into account the quad-channel architecture of the DDR4 SDRAM controller in Haswell-EP, the maximum memory subsystem throughput is 68 GB/s per processor.
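The 68 GB/s figure follows directly from the channel arithmetic; a quick sketch (plain Python, numbers taken from the paragraph above) shows where it comes from:

```python
# Peak memory bandwidth of a quad-channel DDR4-2133 controller.
channels = 4
transfers_per_s = 2133 * 10**6   # DDR4-2133: 2133 million transfers/s per channel
bytes_per_transfer = 8           # each channel is 64 bits wide

peak_bytes = channels * transfers_per_s * bytes_per_transfer
print(round(peak_bytes / 10**9, 1), "GB/s")  # ~68.3 GB/s per processor
```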
Another difference between Haswell-E and Haswell-EP is QPI bus support. In server platforms this bus is used for interprocessor links, so it is absent from desktop CPU models. The server Xeon E5 v3 has an active QPI 1.1 controller which, incidentally, implements two links running at 9.6 GT/s, a rate 20 percent higher than that of the interprocessor bus in Sandy Bridge-EP and Ivy Bridge-EP.
⇡ # Test processor: Xeon E5-4650 v3
Actually, when it comes to using powerful server processors in desktop configurations, the Xeon E5-2600 v3 series, with up to 18 cores and formally aimed at dual-processor configurations, would suit the purpose better. But we did not get to choose: for our experiments Intel provided a Xeon E5-4650 v3, intended for quad-processor systems. For a desktop platform with a single processor socket, however, this makes almost no difference. Yes, unlike the Xeon E5-26xx v3, the Xeon E5-4650 v3 costs somewhat more, but cost is not the subject of this article.
The Xeon E5-4650 v3 is a mid-range server Haswell-EP. It carries 12 computing cores with Hyper-Threading support, has a 30 MB L3 cache, and a nominal clock speed of 2.0 GHz. Note that this CPU's frequency is noticeably lower than that of the desktop eight-core Core i7-5960X; that is the price paid for the increased core count. Another factor is that the thermal envelope of the Xeon E5-4650 v3, intended for use in rack-mounted multiprocessor configurations, is limited to a fairly conservative 105 W. The modest nominal frequency is partly offset by Turbo mode, thanks to which the processor can boost to 2.6 GHz under light loads and to 2.3 GHz with all 12 cores loaded.
Strange as it may seem, such a typically server-grade processor works in a desktop motherboard without any problems. Although Intel has specialized chipsets for Haswell-EP systems, such as the Intel C612, server LGA2011-3 processors feel perfectly at home in ordinary desktop boards based on the Intel X99 chipset. We checked our test Xeon E5-4650 v3 in the ASUS X99-Deluxe and the ASUS Rampage V Extreme; in both cases no obstacles arose, and the compatibility lists on the ASUS website explicitly promise Xeon E5 v3 support in Intel X99 motherboards. Moreover, as practice showed, Haswell-EP gets along easily with ordinary unbuffered DDR4 SDRAM, so a desktop platform with a Xeon E5 v3 series processor actually requires no special server-class components. In other words, if you decide to build a desktop system around a multi-core server CPU, no extra spending is needed; you will only have to shell out for the processor itself. For example, prices for 12-core Haswell-EP chips start at $1,500.
Above we described in detail the advantages of Xeon E5 v3 over Core i7-5xx0, chief among them the increased number of computing cores and the more sophisticated internal interconnect. But server processors also have clear downsides that will certainly upset enthusiasts. Desktop Haswell-E chips are overclocking-friendly and have unlocked multipliers, allowing the frequencies of the cores, cache, and memory to be varied at will. Server processors nip such liberties in the bud. Raising the frequency above the values provided by Turbo Boost technology is impossible, and clocking the memory in modes beyond DDR4-2133 is not allowed either. This means that any overclocking of the Xeon E5 v3 can be written off entirely.
Thus, although the idea of using a Xeon E5 v3 processor in a desktop platform initially looked very promising, the deeper we dig into the subject, the more arguments against it we encounter. And while no outright obstacles to building such a server-desktop hybrid are apparent, it may simply be impractical. But let us not get ahead of ourselves: before drawing any conclusions, let's look at the test results.