The Skylake generation is being announced in a somewhat unusual way. Intel first introduced the Core i7-6700K and Core i5-6600K, the top desktop models for enthusiasts, and even started selling them, but postponed the release of mainstream processors for desktops and other market segments until somewhat later. Given this schedule, Intel saw no reason to discuss the details of the new microarchitecture when the overclocker flagships were announced, so the first chips arrived without any account of what had changed inside them.
At the IDF 2015 session in San Francisco, however, Intel decided to fill in the gaps. Unfortunately, the account was not as complete as it should have been: the developers promised to disclose everything only with the server Xeon processors. Still, the published information is enough to form a general idea of how Skylake differs from Haswell and Broadwell at the microarchitectural level. In this article we will try to lay out the basic facts, in effect supplying our tests of the Skylake-S family with the theoretical basis they were missing.
It has long been held that the main priority in designing new microarchitectures is to reduce power consumption and improve performance per watt. At first glance, little has changed with Skylake: the Israeli engineering team that spent the past five years developing this microarchitecture started from the same premises. There is an important nuance, however. In designing this generation, the developers tried not only to reduce consumption but also to account for the fact that these processors would go into end products with very different thermal packages, from 4.5 W to 95 W. The approach assumed from the outset that the new microarchitecture should fit equally well into highly economical and high-performance designs. In other words, starting with Skylake, Intel has decided to move gradually away from its previous strategy, in which economy came first in the development of new microarchitectures and the same energy-efficient design served as the basis not only for mobile processors but also for desktop and server parts.
However, covering a wide range of power consumption and thermal packages was by no means the engineers' only goal. The growing market for ultraportable devices imposes other conditions that processor design has to take into account. The size of the chips and their supporting circuitry, for example, matters a great deal. It is only logical that economical 4.5-watt processors intended for tablets, together with their system logic and motherboard, should be as small and light as possible. Size and weight therefore became one more reference point for development and, like power consumption, had to be scalable, so that the Skylake microarchitecture could fit seamlessly into both ultraportables and desktop computers.
And these goals were met. Mobile versions of Skylake, built as a system-on-chip, turned out noticeably smaller than Haswell and Broadwell. Power optimization, in turn, made it possible to raise the performance of energy-efficient chips within the same thermal packages as before. All of this is, in itself, a fairly ordinary sign of technological progress. What is striking is the scale of the changes. The power consumption of Skylake processors of different classes can now differ by a factor of 20, and their physical size by a factor of four. Intel engineers also take credit for a noticeable improvement in the energy efficiency of the new generation of systems-on-chip, reaching 40 to 60 percent in typical multimedia tasks and at idle.
In general, although Intel speaks of Skylake as a universal microarchitecture, the main beneficiaries of its introduction are, as has become traditional, chips for ultraportables. Modern systems-on-chip, for example, need support for specific buses and interfaces: smartphones and tablets make active use of eMMC and SDXC devices for storage, the CSI interface for connecting a camera, and the like. Many of these interfaces are now built directly into the Skylake chipsets, which can share a single chip with the processor. Most interesting of all, such changes have reached the processor itself. A new fixed-function block has appeared directly in the Skylake microarchitecture: an image signal processor.
It supports up to four cameras with 10 Mp resolution, two of which can be active at the same time, and provides hardware signal processing that covers both simple video capture (at 1080p60 or 2Kp30) and advanced functions such as face recognition, panoramas, HDR image construction and so on.
⇡ # Basic microarchitecture
Do not think, however, that in creating such a universal processor Intel forgot about improving the basic microarchitecture. Skylake belongs to the "tock" phase of development, after all, so fairly serious changes reached the computing cores themselves. True, the design principle in force since Haswell still applies: only those solutions that improve energy efficiency are accepted. That explains well why the basic architecture now changes at a much slower pace. For Skylake, the upshot is that the gain in IPC (the number of instructions executed per cycle) over the previous generation is modest.
In fact, most of the changes come from widening the execution pipeline. The basic principles of how Skylake works have not changed compared with previous processors: x86 instructions are still decoded into RISC-like micro-instructions. Unlike its predecessors, however, Skylake more often reaches the point where six micro-instructions are executed simultaneously, that is, where it works at maximum efficiency.
To achieve this, the branch prediction blocks were improved and the capacity for out-of-order execution was increased. There is no structural rework involved, however. All the improvements come from simply deepening the internal buffers. The out-of-order execution window, for example, grew from 192 instructions in Haswell to 224 in Skylake. Other buffers have grown in the same way, so Skylake can work on more code at once. Data buffers were enlarged, the handling of page misses and L2 cache misses was accelerated, and Hyper-Threading Technology became more efficient thanks to larger reservation stations.
Interesting changes affected the prefetch unit, whose aggressiveness was actually reduced this time. Experience has shown that prefetching too many instructions can harm energy efficiency. The engineers therefore chose to save energy, which can then be spent on other stages of the execution pipeline or simply on a higher clock frequency.
Unfortunately, Intel did not go into detail about changes at the heart of the microarchitecture, the execution units. We were not even told whether the number of execution ports has changed since Haswell. Company specialists do say, however, that deep rework has sped up the execution of FPU instructions. Intel also disclosed that instructions of the AES family have been accelerated. The performance gain for common encryption algorithms should be up to 33 percent in CBC mode and up to 17 percent in GCM mode.
It must be said that Intel's account of Skylake contained a rather curious revelation: the server and client processors built on it will differ. One example of such a difference is well known: the server versions support AVX-512 instructions, which will not be implemented in other processors. The same may turn out to be true of some other extensions. In other words, when the server versions of Skylake reach the market, this microarchitecture may reveal some new sides of itself.
But the client processors were not left without instruction-set innovations either. They gain Intel SGX (Software Guard Extensions). The instructions in this set let an application create an isolated, protected region of memory for its execution, inaccessible to any other process or device. An application that handles critical information can thus protect its code and data from software and hardware attacks and intrusions. Intel separately emphasizes that the protected region holds up even against ITC-class hardware debuggers.
Significant changes in the Skylake microarchitecture are also concentrated at a higher level: in how the processor blocks interact with one another and handle data. First of all, the L2 cache operates differently. Its associativity has been halved compared with Haswell and Broadwell, which increased its speed, and miss handling now incurs less delay.
But the more significant innovations lie even further from the processor cores. Skylake received a faster ring bus, which connects all the processor cores, the L3 cache, the memory controller, the graphics core and the system agent. According to the developers, the maximum bandwidth of the ring has doubled. It can still operate in the old, slower mode, however, depending on the usage scenario. A slower bus reduces power consumption and heat output, but the desktop versions of Skylake use the fast mode. In those Skylake variants where the emphasis is on performance rather than energy efficiency, the higher ring throughput also speeds up the third-level cache.
The changes to the caching system also affected the eDRAM buffer, which has been fitted to some high-performance processor models since Haswell. In Skylake, Intel plans to broaden the use of eDRAM, and a whole set of optimizations has been made with that in mind. In Haswell and Broadwell processors, the additional eDRAM buffer, housed in a separate Crystalwell die next to the processor die, could only coexist with an L3 cache of 1.5 MB per core. In that arrangement eDRAM acted as a 128 MB fourth-level cache holding data evicted from the L3 cache. In Skylake this structure is broken: processor configurations with eDRAM can now have an L3 cache of 2 MB per core, and the eDRAM itself comes in different capacities, from 64 to 128 MB.
To mark the changes, Intel even came up with a new name for eDRAM: Memory Side Cache. The basic idea is as follows. Until now, eDRAM was attached directly to the L3 cache and received the data that no longer fit there. In the new processors, eDRAM interacts not with the processor cache but with the memory controller. This means two things. First, eDRAM is now logically decoupled from the processor, which no longer has to maintain its coherence. Second, eDRAM can now cache absolutely any data going to system memory, including data the operating system has marked as non-cacheable and even data exchanged with, for example, PCI Express devices or the graphics core.
These improvements look very interesting, but Skylake variants with eDRAM aimed at traditional desktop systems will apparently not exist. So only users of mobile systems will feel the benefits of the new scheme.
⇡ # New approaches to saving energy
Like it or not, we have to return to energy efficiency once again. The drive to save energy also shaped the design of Skylake processors. Here both traditional approaches were developed further and some fundamentally new ideas appeared.
First of all, recall that the processor no longer contains an integrated voltage converter. It was removed for reasons of economy, with an eye to the most energy-efficient CPUs with a thermal package of about 4.5 W. Incidentally, Intel intends to bring the converter back into the processor in a future microarchitecture, though not in all versions.
The second, fairly obvious innovation is that Intel engineers divided the processor into more parts that can be switched off independently when idle. This now extends even to individual execution units. In Skylake, for example, even the 256-bit execution units responsible for AVX2 instructions can be powered down when they have nothing to do.
None of this is a new approach, however; similar techniques have been used in one form or another for a long time. Skylake does have one genuinely revolutionary innovation, though: Speed Shift technology, the essence of which is that the processor manages its own energy-saving states.
Modern processors can usually switch their frequency between the nominal state and turbo mode on their own, that is, without involving the operating system. The transition to economical states with reduced voltage and frequency, however, requires the direct participation of the OS. The OS issues the commands to lower the frequency after first consulting the firmware to find out which modes are available. As a result, switching to any economical state is a whole sequence of steps that takes considerable time. Leaving such states is even worse: the processor has to inform the operating system of what has happened, and the system then has to process that event.
Speed Shift gives the processor more independence. It does remain subordinate to the operating system, which can move it to a lower frequency, for example to save energy. But the processor now takes over the routine switching of energy-saving states completely, which significantly improves response time and makes it possible to enter and leave energy-saving modes within about a millisecond. Reacting faster to changing conditions should serve the goal of saving energy on the one hand and can also benefit performance on the other. In other words, Skylake processors with Intel Speed Shift will be able to choose the most suitable operating frequency for the load they are given, and state switching will be both more accurate and faster.
It should be noted that Speed Shift also takes into account one more aspect that developers had run into earlier. Lowering the frequency to reduce power consumption does not always have the expected effect. The problem is that once the frequency falls below a certain threshold, consumption drops much more slowly because leakage currents play a larger role. In some energy-saving modes it is therefore more efficient to raise the processor frequency, execute the necessary code quickly, and then put the processor to sleep. This is the strategy used in Skylake, which introduces special algorithms that periodically put the processor into deep energy-saving states and then wake it to handle pending low-priority tasks.
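The trade-off described here can be illustrated with a toy energy model. This is only a sketch under stated assumptions, not Intel's algorithm: the function name `energy_joules`, the leakage power, the voltage floor and the frequency list are all hypothetical values chosen for illustration, and sleep is assumed to cost nothing.

```python
def energy_joules(freq_ghz, work_gcycles=1.0, leak_w=0.5,
                  v_min=0.6, f_floor_ghz=1.0):
    """Energy needed to finish a fixed amount of work at one frequency (toy model)."""
    # Assumption: voltage tracks frequency above f_floor_ghz but cannot
    # drop below v_min, so below the floor only the frequency keeps falling.
    voltage = v_min * max(1.0, freq_ghz / f_floor_ghz)
    dynamic_w = freq_ghz * voltage ** 2      # dynamic power ~ f * V^2
    busy_s = work_gcycles / freq_ghz         # time the core has to stay awake
    # Leakage is paid for as long as the core is awake; sleeping is free here.
    return (dynamic_w + leak_w) * busy_s


freqs = [0.4, 0.6, 0.8, 1.0, 1.5, 2.0, 3.0]
energies = {f: energy_joules(f) for f in freqs}
best_freq = min(energies, key=energies.get)

for f in freqs:
    print(f"{f:.1f} GHz: {energies[f]:.2f} J")
print("lowest energy at", best_freq, "GHz")
```

In this model, running below the voltage floor costs more energy for the same work, because the core stays awake longer while leaking. That is the reason for finishing quickly and then sleeping.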
Speed Shift looks like an interesting and timely solution, but unfortunately it needs support from the operating system. At the moment only Windows 10 provides it. No other OS, including the various flavors of Linux and Android, supports Speed Shift yet. Intel promises, however, that this problem will be solved in time.
It should be added that Intel is also working on fixed-function processor blocks that help save energy. We will say more about the Skylake graphics core below, but it is worth recalling that encoding and decoding video with Quick Sync rather than on the processor cores is a good opportunity to save energy. In Skylake this block has gained new functions, and the computing cores are now no longer needed even for decoding H.265/HEVC content. The open API that Intel offers for Quick Sync lets software developers make active use of this technology.
⇡ # Graphics grows
The role of the graphics cores built into processors grows every year. This is due not so much to their rising 3D performance as to the fact that integrated GPUs keep taking on new functions, such as parallel computing or encoding and decoding multimedia content. The Skylake graphics core is no exception. Intel assigns it to the next, ninth generation, which means there are many surprises in it. It is worth starting, however, with the fact that the GPU in Skylake, like its predecessors, keeps the traditional modular design. We are thus again dealing with a whole family of solutions of different classes: using the new generation of building blocks, Intel can assemble GPUs that differ radically in performance. This scalability is nothing new in itself, but Skylake increases not only the maximum performance but also the number of available graphics core variants.
The Skylake graphics core can be built from one or several modules, each of which usually includes three sections. Each section combines eight execution units, which do the bulk of the graphics processing, and also contains basic blocks for working with memory and texture samplers. In addition to the execution units grouped into modules, the graphics core contains a non-modular part responsible for fixed-function geometry processing and certain multimedia functions.
At the topmost level of the hierarchy, the Skylake graphics core is very similar to the core implemented in Broadwell. However, if you delve into the details, it is not difficult to find noticeable changes.
First, the non-modular part is now placed in a separate power domain, so its frequency can be set, and it can be put to sleep, independently of the execution units. This means, for example, that when working with Quick Sync, which is implemented precisely by the non-modular units, most of the GPU can be disconnected from power to reduce consumption. In addition, independent control of the non-modular part's frequency makes it possible to match its performance more closely to the needs of the graphics core modules.
Second, whereas the Broadwell graphics core could be built from only one or two modules, giving 24 or 48 execution units (energy-efficient and budget processors could use a single module with some sections disabled, giving fewer than 24 units), Skylake can use from one to three modules.
Thanks to this, in addition to the usual GT1, GT2 and GT3 configurations, the Skylake family will offer an even more powerful GT4 core with 72 execution units.
The peak performance of the execution units themselves has not changed in Skylake, however: each unit can perform up to 16 32-bit operations per clock. It can also run 7 computational threads simultaneously and has 128 general-purpose registers of 32 bytes each.
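The per-unit figures above translate into thread and register capacity with a little arithmetic. The following is a rough sketch, assuming that the 128 registers are provided per thread (the text does not say whether they are per thread or per unit); the function names are hypothetical.

```python
THREADS_PER_EU = 7           # from the text
REGISTERS_PER_THREAD = 128   # assumption: the 128 registers are per thread
REGISTER_BYTES = 32          # from the text


def register_file_bytes_per_eu() -> int:
    """Register storage of one execution unit under the per-thread assumption."""
    return THREADS_PER_EU * REGISTERS_PER_THREAD * REGISTER_BYTES


def threads_in_flight(eu_count: int) -> int:
    """Hardware threads that a GPU with eu_count execution units can hold."""
    return eu_count * THREADS_PER_EU


for eus in (24, 48, 72):
    kib = eus * register_file_bytes_per_eu() // 1024
    print(f"{eus} EUs: {threads_in_flight(eus)} threads, {kib} KB of registers")
```

If the registers were instead shared by all seven threads of a unit, the register totals would be seven times smaller, so the assumption matters for the numbers printed.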
Third, the GT3 and GT4 core variants can additionally be equipped with an eDRAM buffer of 64 or 128 MB respectively, giving the GT3e and GT4e versions. Broadwell processors came with only one version of eDRAM, with a capacity of 128 MB. In Skylake this additional buffer has not only changed how it works, becoming a "memory side cache", but has also gained some flexibility in configuration. Its physical implementation stays the same, however: it will be a separate 22-nm die mounted on the processor substrate next to the main chip.
The arrival in Skylake of a cut-down 64 MB eDRAM chip should broaden the use of GT3e graphics. Broadwell and Haswell processors with the additional buffer were expensive and intended exclusively for high-performance laptops and desktop systems. The smaller eDRAM die should make possible more affordable Skylake variants with a powerful GPU that can be used, for example, in ultrabooks.
According to the data available so far, the Skylake graphics core will exist in six versions, numbered in the 500 series:
- HD Graphics 510 – GT1: one module, 12 execution units;
- HD Graphics 515 – GT1.5: one module, 18 execution units;
- HD Graphics 530 – GT2: one module, 24 execution units;
- HD Graphics 535 – GT3: two modules, 48 execution units;
- Iris Graphics 540 – GT3e: two modules, 48 execution units and a 64 MB eDRAM buffer;
- Iris Pro Graphics 580 – GT4e: three modules, 72 execution units and a 128 MB eDRAM buffer.
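The modular structure behind this list can be written down as data. The sketch below assumes the layout described earlier (three sections per module, eight units per section) and uses a hypothetical `GpuConfig` class to show what fraction of the full module is enabled in each version.

```python
from dataclasses import dataclass

SECTIONS_PER_MODULE = 3   # from the text: a module usually has three sections
EUS_PER_SECTION = 8       # from the text: eight execution units per section


@dataclass(frozen=True)
class GpuConfig:
    name: str
    tier: str
    modules: int
    eus: int
    edram_mb: int = 0

    @property
    def full_eus(self) -> int:
        """Execution units the modules would have with nothing disabled."""
        return self.modules * SECTIONS_PER_MODULE * EUS_PER_SECTION

    @property
    def enabled_fraction(self) -> float:
        return self.eus / self.full_eus


CONFIGS = [
    GpuConfig("HD Graphics 510", "GT1", 1, 12),
    GpuConfig("HD Graphics 515", "GT1.5", 1, 18),
    GpuConfig("HD Graphics 530", "GT2", 1, 24),
    GpuConfig("HD Graphics 535", "GT3", 2, 48),
    GpuConfig("Iris Graphics 540", "GT3e", 2, 48, edram_mb=64),
    GpuConfig("Iris Pro Graphics 580", "GT4e", 3, 72, edram_mb=128),
]

for cfg in CONFIGS:
    print(f"{cfg.tier:6} {cfg.eus:2}/{cfg.full_eus} units enabled, "
          f"eDRAM {cfg.edram_mb} MB")
```

Only the two smallest versions use a partly disabled module; every version from GT2 upward enables all units of its modules.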
While increasing the power of the graphics core, Intel also took great care that it has enough memory bandwidth even in configurations without additional eDRAM. On the one hand, Skylake has an updated memory controller that can now work with DDR4 SDRAM, whose frequency and bandwidth are noticeably higher than those of DDR3 SDRAM. On the other hand, the GPU has gained a new technology, Lossless Render Target Compression. Its essence is that all data sent between the GPU and system memory, which also serves as video memory, is compressed beforehand, which relieves the memory bus. The algorithm uses lossless compression, and the compression ratio can reach 2:1. Although any compression requires additional computing resources, Intel engineers claim that Lossless Render Target Compression raises the performance of the integrated GPU in real games by 3 to 11 percent.
Some other improvements in the graphics core also deserve mention. For example, the cache in each GPU module was enlarged from 512 to 768 KB. Thanks to this, and to optimization of the module architecture, the developers achieved an almost twofold improvement in fill rate, which made it possible not only to raise GPU performance with full-screen anti-aliasing enabled but also to add 16x MSAA to the supported modes.
Full support for 4K resolutions has long been one of the main targets for the graphics built into Intel processors. It is with this in mind that Intel keeps increasing GPU performance. But another part also needs improving: the display outputs. It is no surprise that, like Broadwell processors, the Skylake graphics core supports 4K output at 60 Hz via DisplayPort 1.2 or Embedded DisplayPort 1.3, at 24 Hz via HDMI 1.4, and at 30 Hz via Intel Wireless Display or the Miracast wireless protocol. Skylake adds partial support for HDMI 2.0 to this list, which makes 4K resolutions available at 60 Hz. True, this requires an additional DisplayPort ↔ HDMI 2.0 adapter. On the other hand, an HDMI 2.0 signal can also be carried over Thunderbolt 3 in systems that have the appropriate controller.
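The refresh-rate limits listed above follow from simple bandwidth arithmetic. The sketch below is only an approximation: it counts raw pixel data at 24 bits per pixel and ignores blanking intervals, and the link payload figures are commonly cited values that do not come from this article, so they should be treated as assumptions.

```python
# Approximate usable video payload per link, in Gbit/s.
# Assumption: commonly cited figures, not taken from the article.
LINK_PAYLOAD_GBPS = {
    "HDMI 1.4": 8.16,
    "HDMI 2.0": 14.4,
    "DisplayPort 1.2": 17.28,
}


def required_gbps(width, height, refresh_hz, bits_per_pixel=24):
    """Raw pixel data rate in Gbit/s, ignoring blanking overhead."""
    return width * height * refresh_hz * bits_per_pixel / 1e9


def fits(link, width, height, refresh_hz):
    """True if the raw pixel rate is within the link's approximate payload."""
    return required_gbps(width, height, refresh_hz) <= LINK_PAYLOAD_GBPS[link]


for link in LINK_PAYLOAD_GBPS:
    for hz in (24, 30, 60):
        verdict = "fits" if fits(link, 3840, 2160, hz) else "does not fit"
        print(f"4K at {hz} Hz over {link}: {verdict}")
```

Under these assumptions a 4K stream at 60 Hz needs roughly 12 Gbit/s, which is more than HDMI 1.4 can carry but within reach of DisplayPort 1.2 and HDMI 2.0.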
As before, the GPU of Skylake processors can drive three displays simultaneously.
It is no surprise that, with the growing popularity of new video formats, the Skylake graphics core has expanded its hardware encoding and decoding capabilities. The Quick Sync engine can now encode and decode H.265/HEVC content with 8-bit color depth, and, with the help of the GPU's execution units, it can also decode H.265/HEVC video with 10-bit color. Fully hardware-based encoding in the JPEG and MJPEG formats has been added as well.
However, Skylake graphics belong to a new, ninth generation not only because of the changes listed so far. The main reason is that substantial changes were made to the supported graphics APIs. At present the GPU of the new processors is compatible with DirectX 12, OpenGL 4.4 and OpenCL 2.0, and later, as the graphics driver matures, future versions OpenCL 2.x and OpenGL 5.x will be added to the list, along with support for the low-level Vulkan framework. It is also worth mentioning that the new GPU implements full memory coherence with the processor, which makes Skylake a true APU: its graphics and computing cores can work on the same task at the same time using shared data.
At the same time, the Skylake graphics core can offer genuinely good compute performance. Running at 1.15 GHz, one GPU module delivers peak performance of 442 GFLOPS. This means that GT4 versions of the Skylake graphics core will have theoretical performance of about 1.15 TFLOPS, which not only significantly exceeds the capabilities of any integrated graphics that existed until now but also approaches the figures of discrete graphics cards such as the GeForce GTX 750 or GeForce GTX 950M.
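These peak figures can be reproduced from the numbers given earlier in the article: 24 execution units per module and up to 16 32-bit operations per unit per clock. The helper `peak_gflops` below is a hypothetical name used for illustration. The sketch also shows that three modules at 1.15 GHz come out above the quoted 1.15 TFLOPS; the quoted figure matches 72 units at 1.0 GHz, which is an inference and not something the source states.

```python
OPS_PER_CLOCK_PER_EU = 16   # from the text
EUS_PER_MODULE = 24         # from the text: three sections of eight units


def peak_gflops(eu_count, clock_ghz, ops_per_clock=OPS_PER_CLOCK_PER_EU):
    """Theoretical peak in GFLOPS: units x operations per clock x GHz."""
    return eu_count * ops_per_clock * clock_ghz


one_module = peak_gflops(EUS_PER_MODULE, 1.15)             # about 442 GFLOPS
gt4_at_same_clock = peak_gflops(3 * EUS_PER_MODULE, 1.15)  # about 1325 GFLOPS
gt4_at_1ghz = peak_gflops(3 * EUS_PER_MODULE, 1.0)         # 1152 GFLOPS

print(f"one module at 1.15 GHz:    {one_module:.0f} GFLOPS")
print(f"three modules at 1.15 GHz: {gt4_at_same_clock:.0f} GFLOPS")
print(f"three modules at 1.00 GHz: {gt4_at_1ghz:.0f} GFLOPS")
```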