Project update 9 of 15
In this update we explore the performance of virtualized Linux guests on an OpenPOWER Linux host with QEMU. Several tests are run, and all yield a somewhat surprising result — virtual machines actually provide a performance boost compared to native execution when the host SMT is set to 1! We suspect this is due to native host scheduling problems, but this also implies that there is considerable untapped potential latent within these OpenPOWER machines.
For all tests below, we use a Firestone reference server with dual 8-core 190 W CPUs, 4 Centaur memory buffers, and 256 GB RAM. While the absolute numbers will change on a Talos machine, the proportions should be nearly identical when comparing native execution to the two virtualized modes.
OpenPOWER machines under KVM/QEMU have two separate virtualization modes available, "Hypervisor" (kvm-hv) and "Problem" (kvm-pr). The hypervisor mode uses the native virtualization extensions of POWER7 and greater CPUs, and provides the best possible performance of any virtualization mode on POWER systems. However, this mode is limited to the host CPU generation or the prior CPU generation, and furthermore cannot be used from inside another virtual machine. In comparison, problem mode executes the virtual machine completely in user mode by utilizing the problem handlers of the POWER architecture, and emulates privileged instructions where needed. This virtualization mode can be used on any PPC / POWER hardware, can emulate any PPC / POWER CPU type or generation, and is suitable for nested virtualization, but carries a variable performance penalty based on workload.
One final variable is that POWER machines can be set to different SMT (Simultaneous MultiThreading) modes. POWER8 CPUs natively support 8 simultaneous threads (SMT 8), but some workloads (e.g. QEMU) require the native SMT support to be disabled (SMT 1). As a result, we benchmark the native SMT 8 performance alongside the native SMT 1 performance for direct comparison. It is hoped that over time, as QEMU on POWER matures further, this limitation can be removed.
Building on our previous kernel compilation tests, we ran timed compile tests on several native and virtualized configurations. As before, a snapshot of the Linux kernel source tree was pulled and compiled for POWER using the stock Debian configuration. The compilation took place entirely within a dedicated tmpfs mount. The command used to compile was:
time make -j<core count>
| | Native (SMT 8, 128 cores) | Native (SMT 1, 16 cores) | Virtualized HV (SMT 1, 16 cores) | Virtualized PR (SMT 1, 16 cores) |
|---|---|---|---|---|
| Wall Time | 4m15.934s | 23m33.949s | 7m13.634s | 20m37.722s |
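To put the wall times on a common scale, the short sketch below converts each `time` stamp to seconds and expresses every configuration as a speedup over the native SMT 1 run. The helper names (`wall_seconds`, `speedups`) are our own and purely illustrative; only the four wall times come from the table.

```python
import re

# Wall times copied from the table above.
WALL_TIMES = {
    "Native (SMT 8, 128 cores)": "4m15.934s",
    "Native (SMT 1, 16 cores)": "23m33.949s",
    "Virtualized HV (SMT 1, 16 cores)": "7m13.634s",
    "Virtualized PR (SMT 1, 16 cores)": "20m37.722s",
}


def wall_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '4m15.934s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized wall time: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def speedups(baseline: str) -> dict:
    """Return baseline_time / time for every configuration (higher is faster)."""
    base = wall_seconds(WALL_TIMES[baseline])
    return {name: base / wall_seconds(stamp) for name, stamp in WALL_TIMES.items()}


if __name__ == "__main__":
    for name, factor in speedups("Native (SMT 1, 16 cores)").items():
        print(f"{name}: {factor:.2f}x")
```

With native SMT 1 as the baseline, the kvm-hv guest comes out a little over three times faster on the same 16 cores, while kvm-pr is only marginally ahead of native.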
Also building on our previous memory bandwidth tests, we ran STREAM benchmarks on all four configurations. The command used to run the benchmark was:
OMP_NUM_THREADS=<core count> ./stream
Function Best Rate MB/s Avg time Min time Max time
Copy: 32822.1 0.026917 0.019499 0.041350
Scale: 35293.1 0.027499 0.018134 0.035020
Add: 45206.4 0.025632 0.021236 0.031831
Triad: 43533.4 0.025338 0.022052 0.029733
Function Best Rate MB/s Avg time Min time Max time
Copy: 47365.7 0.014247 0.013512 0.015773
Scale: 51871.6 0.013212 0.012338 0.014527
Add: 58472.5 0.018189 0.016418 0.027140
Triad: 60131.6 0.016697 0.015965 0.018448
Function Best Rate MB/s Avg time Min time Max time
Copy: 36221.7 0.019791 0.017669 0.022151
Scale: 32368.9 0.020795 0.019772 0.022489
Add: 38326.4 0.026114 0.025048 0.027989
Triad: 38551.0 0.026241 0.024902 0.027209
Function Best Rate MB/s Avg time Min time Max time
Copy: 34471.4 0.022185 0.018566 0.026645
Scale: 32199.6 0.022841 0.019876 0.028890
Add: 37231.0 0.029773 0.025785 0.035886
Triad: 39228.5 0.027465 0.024472 0.034118
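For readers unfamiliar with STREAM output: each "Best Rate" is the number of bytes a kernel moves per iteration divided by its minimum time. Copy and Scale touch two arrays, Add and Triad touch three, and elements are 8-byte doubles. The sketch below reproduces the first table's rates from its "Min time" column. The array length is not stated in this update; 40,000,000 elements is our inference from the reported numbers and should be treated as an assumption.

```python
BYTES_PER_ELEMENT = 8  # STREAM uses double precision by default
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}
ASSUMED_ARRAY_LENGTH = 40_000_000  # inferred from the results, not stated in the update

# (reported best rate in MB/s, reported min time in s) from the first table above
FIRST_TABLE = {
    "Copy": (32822.1, 0.019499),
    "Scale": (35293.1, 0.018134),
    "Add": (45206.4, 0.021236),
    "Triad": (43533.4, 0.022052),
}


def best_rate_mb_s(kernel: str, min_time_s: float, n: int = ASSUMED_ARRAY_LENGTH) -> float:
    """Bytes moved in one pass of the kernel, divided by the fastest pass, in MB/s."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * n
    return 1.0e-6 * bytes_moved / min_time_s


if __name__ == "__main__":
    for kernel, (reported, min_time) in FIRST_TABLE.items():
        print(f"{kernel}: reported {reported:.1f}, recomputed {best_rate_mb_s(kernel, min_time):.1f}")
```

Because the rate depends only on the minimum time, a single fast iteration sets the headline number; the "Avg time" and "Max time" columns give a better feel for run-to-run variation.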
Given the rather odd results shown above, a more comprehensive systemwide open-source benchmark was sought. Unix Bench gives detailed information on the speed of various system calls, process spawning, etc. and we ran this benchmark on all four of the test system configurations.
========================================================================
BYTE UNIX Benchmarks (Version 5.1.3)
System: alsvidr: GNU/Linux
OS: GNU/Linux -- 4.8.0-trunk-powerpc64le -- #1 SMP Debian 4.8.4-1~exp1 (2016-10-23)
Machine: ppc64le (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
up 1:26, 4 users, load average: 0.87, 0.52, 0.33; runlevel 2016-11-10
------------------------------------------------------------------------
128 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 28248374.7 lps (10.0 s, 7 samples)
Double-Precision Whetstone 3969.4 MWIPS (9.7 s, 7 samples)
Execl Throughput 1226.4 lps (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 593518.2 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 157303.0 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 1914860.8 KBps (30.0 s, 2 samples)
Pipe Throughput 1406112.2 lps (10.0 s, 7 samples)
Pipe-based Context Switching 157185.1 lps (10.0 s, 7 samples)
Process Creation 6354.3 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 3976.9 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 1312.8 lpm (60.0 s, 2 samples)
System Call Overhead 1459471.8 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 28248374.7 2420.6
Double-Precision Whetstone 55.0 3969.4 721.7
Execl Throughput 43.0 1226.4 285.2
File Copy 1024 bufsize 2000 maxblocks 3960.0 593518.2 1498.8
File Copy 256 bufsize 500 maxblocks 1655.0 157303.0 950.5
File Copy 4096 bufsize 8000 maxblocks 5800.0 1914860.8 3301.5
Pipe Throughput 12440.0 1406112.2 1130.3
Pipe-based Context Switching 4000.0 157185.1 393.0
Process Creation 126.0 6354.3 504.3
Shell Scripts (1 concurrent) 42.4 3976.9 937.9
Shell Scripts (8 concurrent) 6.0 1312.8 2188.1
System Call Overhead 15000.0 1459471.8 973.0
========
System Benchmarks Index Score 1003.9
------------------------------------------------------------------------
128 CPUs in system; running 128 parallel copies of tests
Dhrystone 2 using register variables 474697222.9 lps (10.1 s, 7 samples)
Double-Precision Whetstone 196647.6 MWIPS (9.7 s, 7 samples)
Execl Throughput 4955.9 lps (29.5 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 323175.8 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 80165.2 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 1178499.2 KBps (30.1 s, 2 samples)
Pipe Throughput 38865990.2 lps (10.2 s, 7 samples)
Pipe-based Context Switching 4280137.2 lps (10.0 s, 7 samples)
Process Creation 69295.5 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 10687.0 lpm (60.3 s, 2 samples)
Shell Scripts (8 concurrent) 1066.3 lpm (64.8 s, 2 samples)
System Call Overhead 2407036.9 lps (10.3 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 474697222.9 40676.7
Double-Precision Whetstone 55.0 196647.6 35754.1
Execl Throughput 43.0 4955.9 1152.5
File Copy 1024 bufsize 2000 maxblocks 3960.0 323175.8 816.1
File Copy 256 bufsize 500 maxblocks 1655.0 80165.2 484.4
File Copy 4096 bufsize 8000 maxblocks 5800.0 1178499.2 2031.9
Pipe Throughput 12440.0 38865990.2 31242.8
Pipe-based Context Switching 4000.0 4280137.2 10700.3
Process Creation 126.0 69295.5 5499.6
Shell Scripts (1 concurrent) 42.4 10687.0 2520.5
Shell Scripts (8 concurrent) 6.0 1066.3 1777.2
System Call Overhead 15000.0 2407036.9 1604.7
========
System Benchmarks Index Score 4019.7
========================================================================
BYTE UNIX Benchmarks (Version 5.1.3)
System: alsvidr: GNU/Linux
OS: GNU/Linux -- 4.8.0-trunk-powerpc64le -- #1 SMP Debian 4.8.4-1~exp1 (2016-10-23)
Machine: ppc64le (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
up 6:56, 5 users, load average: 0.91, 1.38, 1.00; runlevel 2016-11-09
------------------------------------------------------------------------
16 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 29813165.1 lps (10.0 s, 7 samples)
Double-Precision Whetstone 4052.8 MWIPS (9.7 s, 7 samples)
Execl Throughput 1236.7 lps (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 624721.1 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 161424.4 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 1967152.4 KBps (30.0 s, 2 samples)
Pipe Throughput 1471144.3 lps (10.0 s, 7 samples)
Pipe-based Context Switching 181574.9 lps (10.0 s, 7 samples)
Process Creation 9996.8 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 4032.4 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 1456.7 lpm (60.0 s, 2 samples)
System Call Overhead 1498750.4 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 29813165.1 2554.7
Double-Precision Whetstone 55.0 4052.8 736.9
Execl Throughput 43.0 1236.7 287.6
File Copy 1024 bufsize 2000 maxblocks 3960.0 624721.1 1577.6
File Copy 256 bufsize 500 maxblocks 1655.0 161424.4 975.4
File Copy 4096 bufsize 8000 maxblocks 5800.0 1967152.4 3391.6
Pipe Throughput 12440.0 1471144.3 1182.6
Pipe-based Context Switching 4000.0 181574.9 453.9
Process Creation 126.0 9996.8 793.4
Shell Scripts (1 concurrent) 42.4 4032.4 951.0
Shell Scripts (8 concurrent) 6.0 1456.7 2427.9
System Call Overhead 15000.0 1498750.4 999.2
========
System Benchmarks Index Score 1088.8
------------------------------------------------------------------------
16 CPUs in system; running 16 parallel copies of tests
Dhrystone 2 using register variables 469625912.8 lps (10.0 s, 7 samples)
Double-Precision Whetstone 64079.7 MWIPS (9.7 s, 7 samples)
Execl Throughput 4840.5 lps (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 458129.4 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 122260.4 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 1928127.3 KBps (30.0 s, 2 samples)
Pipe Throughput 23057509.5 lps (10.0 s, 7 samples)
Pipe-based Context Switching 1414615.2 lps (10.0 s, 7 samples)
Process Creation 75094.7 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 14131.7 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 1587.8 lpm (60.4 s, 2 samples)
System Call Overhead 3684855.9 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 469625912.8 40242.2
Double-Precision Whetstone 55.0 64079.7 11650.9
Execl Throughput 43.0 4840.5 1125.7
File Copy 1024 bufsize 2000 maxblocks 3960.0 458129.4 1156.9
File Copy 256 bufsize 500 maxblocks 1655.0 122260.4 738.7
File Copy 4096 bufsize 8000 maxblocks 5800.0 1928127.3 3324.4
Pipe Throughput 12440.0 23057509.5 18535.0
Pipe-based Context Switching 4000.0 1414615.2 3536.5
Process Creation 126.0 75094.7 5959.9
Shell Scripts (1 concurrent) 42.4 14131.7 3333.0
Shell Scripts (8 concurrent) 6.0 1587.8 2646.4
System Call Overhead 15000.0 3684855.9 2456.6
========
System Benchmarks Index Score 3908.1
========================================================================
BYTE UNIX Benchmarks (Version 5.1.3)
System: libreoffice-build-vm: GNU/Linux
OS: GNU/Linux -- 4.8.0-trunk-powerpc64le -- #1 SMP Debian 4.8.4-1~exp1 (2016-10-23)
Machine: ppc64le (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
up 3 min, 1 user, load average: 0.22, 0.06, 0.02; runlevel 2016-11-09
------------------------------------------------------------------------
16 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 29740611.7 lps (10.0 s, 7 samples)
Double-Precision Whetstone 4044.5 MWIPS (9.7 s, 7 samples)
Execl Throughput 2065.3 lps (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 492491.6 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 130002.0 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 1608499.5 KBps (30.0 s, 2 samples)
Pipe Throughput 1521715.9 lps (10.0 s, 7 samples)
Pipe-based Context Switching 165060.7 lps (10.0 s, 7 samples)
Process Creation 4405.1 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 5817.9 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 2778.9 lpm (60.0 s, 2 samples)
System Call Overhead 1619580.0 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 29740611.7 2548.5
Double-Precision Whetstone 55.0 4044.5 735.4
Execl Throughput 43.0 2065.3 480.3
File Copy 1024 bufsize 2000 maxblocks 3960.0 492491.6 1243.7
File Copy 256 bufsize 500 maxblocks 1655.0 130002.0 785.5
File Copy 4096 bufsize 8000 maxblocks 5800.0 1608499.5 2773.3
Pipe Throughput 12440.0 1521715.9 1223.2
Pipe-based Context Switching 4000.0 165060.7 412.7
Process Creation 126.0 4405.1 349.6
Shell Scripts (1 concurrent) 42.4 5817.9 1372.2
Shell Scripts (8 concurrent) 6.0 2778.9 4631.5
System Call Overhead 15000.0 1619580.0 1079.7
========
System Benchmarks Index Score 1094.4
------------------------------------------------------------------------
16 CPUs in system; running 16 parallel copies of tests
Dhrystone 2 using register variables 465404814.5 lps (10.0 s, 7 samples)
Double-Precision Whetstone 63812.2 MWIPS (9.7 s, 7 samples)
Execl Throughput 15151.1 lps (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 384508.0 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 87708.3 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 1554224.1 KBps (30.0 s, 2 samples)
Pipe Throughput 23429940.5 lps (10.0 s, 7 samples)
Pipe-based Context Switching 2449227.9 lps (10.0 s, 7 samples)
Process Creation 25233.1 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 49705.7 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 7294.8 lpm (60.1 s, 2 samples)
System Call Overhead 3708419.3 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 465404814.5 39880.4
Double-Precision Whetstone 55.0 63812.2 11602.2
Execl Throughput 43.0 15151.1 3523.5
File Copy 1024 bufsize 2000 maxblocks 3960.0 384508.0 971.0
File Copy 256 bufsize 500 maxblocks 1655.0 87708.3 530.0
File Copy 4096 bufsize 8000 maxblocks 5800.0 1554224.1 2679.7
Pipe Throughput 12440.0 23429940.5 18834.4
Pipe-based Context Switching 4000.0 2449227.9 6123.1
Process Creation 126.0 25233.1 2002.6
Shell Scripts (1 concurrent) 42.4 49705.7 11723.0
Shell Scripts (8 concurrent) 6.0 7294.8 12158.1
System Call Overhead 15000.0 3708419.3 2472.3
========
System Benchmarks Index Score 4881.2
========================================================================
BYTE UNIX Benchmarks (Version 5.1.3)
System: libreoffice-build-vm: GNU/Linux
OS: GNU/Linux -- 4.8.0-trunk-powerpc64le -- #1 SMP Debian 4.8.4-1~exp1 (2016-10-23)
Machine: ppc64le (unknown)
Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
up 0 min, 1 user, load average: 0.88, 0.28, 0.10; runlevel 2016-11-10
------------------------------------------------------------------------
16 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 29598703.7 lps (10.0 s, 7 samples)
Double-Precision Whetstone 4029.0 MWIPS (9.7 s, 7 samples)
Execl Throughput 249.5 lps (29.4 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 35533.1 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 9273.0 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 144204.7 KBps (30.0 s, 2 samples)
Pipe Throughput 43923.1 lps (10.0 s, 7 samples)
Pipe-based Context Switching 10920.2 lps (10.0 s, 7 samples)
Process Creation 594.1 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 1078.5 lpm (60.1 s, 2 samples)
Shell Scripts (8 concurrent) 316.6 lpm (60.1 s, 2 samples)
System Call Overhead 32725.4 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 29598703.7 2536.3
Double-Precision Whetstone 55.0 4029.0 732.6
Execl Throughput 43.0 249.5 58.0
File Copy 1024 bufsize 2000 maxblocks 3960.0 35533.1 89.7
File Copy 256 bufsize 500 maxblocks 1655.0 9273.0 56.0
File Copy 4096 bufsize 8000 maxblocks 5800.0 144204.7 248.6
Pipe Throughput 12440.0 43923.1 35.3
Pipe-based Context Switching 4000.0 10920.2 27.3
Process Creation 126.0 594.1 47.1
Shell Scripts (1 concurrent) 42.4 1078.5 254.4
Shell Scripts (8 concurrent) 6.0 316.6 527.7
System Call Overhead 15000.0 32725.4 21.8
========
System Benchmarks Index Score 127.2
------------------------------------------------------------------------
16 CPUs in system; running 16 parallel copies of tests
Dhrystone 2 using register variables 464272669.8 lps (10.0 s, 7 samples)
Double-Precision Whetstone 63585.0 MWIPS (9.7 s, 7 samples)
Execl Throughput 1195.5 lps (29.8 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 179139.5 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 42037.4 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 713253.6 KBps (30.0 s, 2 samples)
Pipe Throughput 676627.3 lps (10.0 s, 7 samples)
Pipe-based Context Switching 125727.5 lps (10.1 s, 7 samples)
Process Creation 2225.8 lps (30.1 s, 2 samples)
Shell Scripts (1 concurrent) 3361.9 lpm (60.2 s, 2 samples)
Shell Scripts (8 concurrent) 412.6 lpm (61.1 s, 2 samples)
System Call Overhead 504498.4 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 464272669.8 39783.4
Double-Precision Whetstone 55.0 63585.0 11560.9
Execl Throughput 43.0 1195.5 278.0
File Copy 1024 bufsize 2000 maxblocks 3960.0 179139.5 452.4
File Copy 256 bufsize 500 maxblocks 1655.0 42037.4 254.0
File Copy 4096 bufsize 8000 maxblocks 5800.0 713253.6 1229.7
Pipe Throughput 12440.0 676627.3 543.9
Pipe-based Context Switching 4000.0 125727.5 314.3
Process Creation 126.0 2225.8 176.6
Shell Scripts (1 concurrent) 42.4 3361.9 792.9
Shell Scripts (8 concurrent) 6.0 412.6 687.6
System Call Overhead 15000.0 504498.4 336.3
========
System Benchmarks Index Score 825.4
As before, the highest performance is attained within the kvm-hv virtual machine, which still exceeds native performance. The kvm-pr virtual machine performs far worse than expected, only reaching 11.6% of the kvm-hv single-copy index score in these kernel-operation-heavy tests.
The results do shed some light on the performance increase inside a kvm-hv virtual machine, however. It appears that system call overhead is greatly reduced inside the kvm-hv virtual machine as compared to native execution, including execl(), and this would easily explain the observed results for the timed compilation tests. Furthermore, disabling SMT produces a puzzling, massive drop in timed compile performance, but this drop is not reflected in the Unix Bench results above. Overall, these test results hint that the Linux kernel may not be properly tuned for native execution, and that our prior benchmarks on the campaign page and in the updates are likely significantly under-reporting OpenPOWER’s true performance limits. We will be forwarding these results to IBM for further analysis and hopefully a fix that unlocks more of OpenPOWER’s true potential!
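For anyone who wants to dig into where the gap comes from, a per-test comparison is more telling than the overall index. The sketch below divides the 16-copy kvm-hv results by the 16-copy native SMT 1 results for the process- and kernel-heavy tests. The selection of tests and the helper name are our own choices; the numbers are copied from the output above.

```python
# 16-copy raw results on the native SMT 1 host (index score 3908.1)
NATIVE_SMT1 = {
    "Execl Throughput": 4840.5,
    "Pipe-based Context Switching": 1414615.2,
    "Process Creation": 75094.7,
    "Shell Scripts (1 concurrent)": 14131.7,
    "Shell Scripts (8 concurrent)": 1587.8,
    "System Call Overhead": 3684855.9,
}

# 16-copy raw results inside the kvm-hv guest (index score 4881.2)
KVM_HV = {
    "Execl Throughput": 15151.1,
    "Pipe-based Context Switching": 2449227.9,
    "Process Creation": 25233.1,
    "Shell Scripts (1 concurrent)": 49705.7,
    "Shell Scripts (8 concurrent)": 7294.8,
    "System Call Overhead": 3708419.3,
}


def guest_to_host_ratios(guest: dict, host: dict) -> dict:
    """Return guest / host for each test; values above 1.0 favor the guest."""
    return {name: guest[name] / host[name] for name in host}


if __name__ == "__main__":
    ratios = guest_to_host_ratios(KVM_HV, NATIVE_SMT1)
    for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
        print(f"{name}: {ratio:.2f}x")
```

Sorting by ratio makes it easy to see which tests favor the guest most strongly and which, such as Process Creation, actually run slower inside it.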