Testing L1TF patches: Virtual Desktops on VMware ESX
Last week Intel announced three new severe vulnerabilities in its processors that allow unauthorized access to data in the L1 cache. They have been named L1 Terminal Fault, or L1TF for short. By now I assume most IT admins are aware of this, but the performance impact remains a mystery. That’s why we took it upon ourselves to get you this information as soon as possible. Please note that these are initial findings. As we do more research and get more results, we will publish updates.
|CVE-2018-3615||L1 Terminal Fault-SGX|
|CVE-2018-3620||L1 Terminal Fault-OS/SMM|
|CVE-2018-3646||L1 Terminal Fault-VMM|
In this article I’m going to focus on the third variant (CVE-2018-3646), as I expect it to have the biggest impact on the scalability and performance of virtual desktop environments such as VMware Horizon View and Citrix XenApp / XenDesktop. For those of you on AMD CPUs there is good news: they do not seem to be affected.
How can this leak be exploited? Simply put, a malicious virtual machine (VM) running on a CPU core can read the L1 data cache and thereby access privileged information of another VM that is running on the same core at the same time. This is possible because Intel processors share the physically addressed L1 data cache across both logical processors of a Hyperthreading-enabled core. When patching is not an option, a quick way to mitigate this could be to disable Hyperthreading, although that might have a significant impact on performance or cause capacity issues and is therefore discouraged by VMware: “Disabling Intel Hyperthreading in firmware/BIOS (or by using VMkernel.Boot.Hyperthreading) after applying vSphere updates and patches is not recommended and precludes potential vSphere scheduler enhancements and mitigations that will allow the use of both logical processors. As such, disablement of hyperthreading to mitigate the Concurrent-context attack vector will introduce unnecessary operational overhead as hyperthreading may need to be re-enabled in the future.”
I have conducted my tests on VMware ESXi 6.5.0 Update 2 (Build 9298722), and my preliminary research focuses on the VMkernel.Boot.HyperthreadingMitigation setting, which restricts the simultaneous use of logical processors from the same hyperthreaded core as necessary to mitigate the vulnerability. This is the most reliable way to prevent exploits, as all virtual machines are treated as untrusted siblings. This level of security is needed, for example, in cloud desktop environments and in high-security environments such as financial institutions or hospitals. Meanwhile my colleague Tom is performing these tests on Citrix XenServer, but more on that later.
According to VMware it is safe to patch vCenter and/or the ESXi hosts, as the mitigation is disabled by default. This is a great way to get ready for the next step: researching capacity issues. Naturally I am using the industry-standard load testing solution Login VSI to simulate users on my environment. To start, I installed the patch but left VMkernel.Boot.HyperthreadingMitigation at its default setting: False. A friendly message notifies me of this setting after the update is complete.
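For those who want to check or flip the setting from the ESXi shell rather than the UI, here is a minimal sketch that only builds the relevant esxcli command lines and does not run anything. The option name and flags are as I recall them from VMware's KB article 55806, so treat them as an assumption and verify them against your own build. The helper names `query_command` and `set_command` are hypothetical.

```python
# Sketch only: builds (does not execute) esxcli commands for the
# hyperthreadingMitigation kernel setting.
# ASSUMPTION: option name and flags follow VMware KB 55806 as recalled;
# verify on your own ESXi build before use.

SETTING = "hyperthreadingMitigation"


def query_command() -> list:
    """Command that lists the current value of the setting."""
    return ["esxcli", "system", "settings", "kernel", "list", "-o", SETTING]


def set_command(enabled: bool) -> list:
    """Command that enables or disables the mitigation."""
    value = "TRUE" if enabled else "FALSE"
    return ["esxcli", "system", "settings", "kernel", "set",
            "-s", SETTING, "-v", value]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them by hand.
    print(" ".join(query_command()))
    print(" ".join(set_command(True)))
```

As far as I know the host needs a reboot before a changed value takes effect, so plan this together with your patch window.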
As my previous tests were on Server 2016, I decided to start measuring the impact on that platform, as it would save some setup time. Although I am testing bare RDSH machines, I expect the relative impact to be similar on Citrix XenApp and Horizon shared session hosts, but make sure to validate this in your own environment.
I deployed 6 Windows Server 2016 machines with 4 vCPUs and 55GB of memory, resulting in an environment that could run 196 users before VSImax was hit. VSImax is the maximum number of users that can work on an environment before performance becomes a bottleneck. Interestingly, enabling HTMitigation did not impact performance much. At first this had us wondering, but discussions with performance experts quickly led us to conclude that the number of VMs and vCPUs simply allowed the hypervisor to work out a schedule in which cores were not shared.
|Windows Server 2016 (Default)||196||700||1677|
|Windows Server 2016 with HTM enabled||193||694||1678|
So we changed the configuration to 8 Server 2016 machines, each with 32GB of memory and 6 vCPUs. With HTMitigation enabled, this slightly lowered VSImax to 186 users.
|Windows Server 2016 (Default)||196||700||1677|
|Windows Server 2016 with HTM enabled||186||694||1532|
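To put the two Server 2016 results side by side, the VSImax figures from the tables above can be turned into a relative change. This is a small sketch of that arithmetic. The function name `relative_change` is my own, and the labels simply restate the VM and vCPU counts given in the text.

```python
def relative_change(baseline: float, measured: float) -> float:
    """Return the change from baseline to measured as a percentage."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (measured - baseline) / baseline * 100.0


# VSImax (default, HTM enabled) taken from the two Server 2016 tables.
runs = {
    "6 VMs x 4 vCPUs": (196, 193),
    "8 VMs x 6 vCPUs": (196, 186),
}

for name, (default, mitigated) in runs.items():
    print(f"{name}: {relative_change(default, mitigated):+.1f}% VSImax")
# 6 VMs x 4 vCPUs: -1.5% VSImax
# 8 VMs x 6 vCPUs: -5.1% VSImax
```

The same helper can be pointed at your own before and after VSImax numbers.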
All right, knowing this it was time to step up the game and switch to Windows 10. I started out with a fresh copy of build 17134.1 (1803) with no further Windows updates and gave it a spin. In our lab we deployed 180 VMs, all equipped with 2GB of memory and 2 vCPUs, and kicked off a test. As you can see, VSImax drops approximately 20%.
|Windows 10 with HTMitigation enabled||110||1004||1963|
As you can see, the hit to performance and density is significant. There are nuances, however: different operating systems, newer (or older) CPUs and of course the applications and infrastructure in your own environment will influence the exact impact. In addition, the impact of the L1TF patch seems to depend heavily on your configuration. When using RDS machines in an efficient way (so that Hyperthreading is not heavily relied upon), the patch has minimal impact. However, we do see a bigger impact when you do rely on Hyperthreading, as with VDI.
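One rough way to see why the configurations behave so differently is to compare how many vCPUs each one provisions. The sketch below does only that arithmetic. `Config` and `vcpus_per_core` are hypothetical names for illustration, and since this article does not state the physical core count of the test host, that number is left as a parameter for you to fill in. Keep in mind that provisioned vCPUs are not the same as concurrent CPU demand, so this is an indication and not a prediction.

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    vms: int
    vcpus_per_vm: int

    @property
    def total_vcpus(self) -> int:
        """Total number of vCPUs provisioned across all VMs."""
        return self.vms * self.vcpus_per_vm


# VM counts and vCPU sizes as described in the tests above.
CONFIGS = [
    Config("Server 2016, 6 VMs", 6, 4),
    Config("Server 2016, 8 VMs", 8, 6),
    Config("Windows 10, 180 VMs", 180, 2),
]


def vcpus_per_core(config: Config, physical_cores: int) -> float:
    """Provisioned vCPUs per physical core (not per logical processor)."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return config.total_vcpus / physical_cores


for cfg in CONFIGS:
    print(f"{cfg.name}: {cfg.total_vcpus} vCPUs provisioned")
# Server 2016, 6 VMs: 24 vCPUs provisioned
# Server 2016, 8 VMs: 48 vCPUs provisioned
# Windows 10, 180 VMs: 360 vCPUs provisioned
```

Calling `vcpus_per_core(cfg, n)` with your own host's physical core count `n` gives a feel for how often the scheduler would have wanted to place vCPUs on sibling logical processors.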
Please note that these are the first results and we still have many questions remaining, so updates are to be expected. If you’d like more info, feel free to reach out, or if you’d like to test your own environment: download your trial of Login VSI today.