Hello. I have been benchmarking matrix inversion on several devices I use for civil engineering finite element jobs, which are usually Fortran based. Last year I got my fastest PC, an i9-14900K, but the inversion benchmark was unusually erratic, up and down, even though the CPU was not throttling. So I became curious and tried a DGEMM benchmark that I generated with AI assistance. I compiled with the Intel(R) oneAPI Compiler for applications running on Intel(R) 64, Version 2025.0.4 Build 20241205. If the code is wrong, please comment, since it was AI-generated.
The code is fortran_dgemm2.f90, compiled with: ifx fortran_dgemm2.f90 /Qipo /Qmkl:parallel /O3 /Qxhost
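In outline, the benchmark does the following; this is a simplified sketch for reviewers (the attached fortran_dgemm2.f90 is the authoritative version, so details may differ):

program dgemm_bench
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer :: m, n, k, t0, t1, rate
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  real(dp) :: alpha, beta, secs, gflops
  k = 5000; alpha = 1.0_dp; beta = 0.0_dp
  do m = 2000, 14000, 2000
     n = m
     allocate(a(m,k), b(k,n), c(m,n))
     call random_number(a); call random_number(b); c = 0.0_dp
     call system_clock(t0, rate)
     ! C = alpha*A*B + beta*C; the DGEMM implementation (MKL or
     ! OpenBLAS) is whatever BLAS library gets linked in.
     call dgemm('N', 'N', m, n, k, alpha, a, m, b, k, beta, c, m)
     call system_clock(t1)
     secs   = real(t1 - t0, dp) / real(rate, dp)
     gflops = 2.0_dp * m * n * k / secs / 1.0d9
     print '(I6," | ",F10.4," | ",F10.6)', m, gflops, secs
     deallocate(a, b, c)
  end do
end program dgemm_bench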
The results were:
C:\temp\dgemm>fortran_dgemm2.exe
Fortran DGEMM Benchmark (k = 5000 , using BLAS DGEMM - MKL expected)
Kompiler: Intel Fortran (ifort)
Timer: SYSTEM_CLOCK (Wall-Clock Time)
---------------------------------------------
m=n | GFLOPS | Time (s)
---------------------------------------------
2000 | 800.0000 | 0.050000
4000 | 683.7607 | 0.234000
6000 | 787.7462 | 0.457000
8000 | 87.6352 | 7.303000
10000 | 70.5169 | 14.181000
12000 | 78.9733 | 18.234000
14000 | 85.9084 | 22.815000
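(For reference, the GFLOPS column is consistent with the standard DGEMM operation count of 2*m*n*k floating-point operations: at m=n=2000 and k=5000 that is 2*2000*2000*5000 = 4.0e10 flops, and 4.0e10 / 0.050 s = 800 GFLOPS, matching the first row.)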
As you can see, performance slows down sharply after matrix size 8000. The photo below is from a roofline run, another benchmark that combines STREAM and DGEMM; it shows the same symptom.
I also tried a C version; it gives the same values.
C:\temp\dgemm>cplusdgemm.exe
C DGEMM Benchmark (k = 5000, using CBLAS - MKL expected with /Qmkl)
Kompiler: Intel C++ (icl)
--------------------------------------------------
m(=N) | n(=M) | GFLOPS | Time (s)
--------------------------------------------------
2000 | 2000 | 775.7606 | 0.051562
4000 | 4000 | 689.0291 | 0.232211
6000 | 6000 | 790.1226 | 0.455626
8000 | 8000 | 95.1756 | 6.724411
Only half of the threads were used.
Then I downloaded ifx and OpenBLAS and rebuilt:
C:\temp\dgemm>ifx fortran_dgemm2.f90 /Qipo /Qopenmp ..\gcc\libopenblas.dll.a -o fortran_dgemm2_openblas.exe
Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2025.0.4 Build 20241205
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
C:\temp\dgemm>copy ..\gcc\libopenblas.dll
1 file(s) copied.
C:\temp\dgemm>fortran_dgemm2_openblas.exe
Fortran DGEMM Benchmark (k = 5000 , using BLAS DGEMM - MKL expected)
Kompiler: Intel Fortran (ifort)
Timer: SYSTEM_CLOCK (Wall-Clock Time)
---------------------------------------------
m=n | GFLOPS | Time (s)
---------------------------------------------
2000 | 606.0606 | 0.066000
4000 | 629.9213 | 0.254000
6000 | 666.6667 | 0.540000
8000 | 330.4078 | 1.937000
10000 | 338.0663 | 2.958000
12000 | 334.1067 | 4.310000
14000 | 332.0908 | 5.902000
16000 | 333.3333 | 7.680000
18000 | 333.2648 | 9.722000
20000 | 333.8898 | 11.980000
---------------------------------------------
Clearly, above 8000 the GFLOPS are half of the values at smaller matrix sizes, and the i9 also worked at half the threads: it used about 16 threads even though I set OMP_NUM_THREADS=32 for the 8000 matrix.
I was unsure where this thread belongs: the processor forum, the Intel Fortran forum, or the MKL forum. Please move it to wherever it fits best.
1. Please check the DGEMM code I used; did I do something wrong?
2. Why does DGEMM get only half the threads for bigger matrices on the i9? Do I have a faulty i9-14900K?
3. MKL seems much slower here than OpenBLAS; maybe an update is needed.
4. Is this symptom (lower DGEMM values at big matrix sizes) actually common?
Regards to the oneAPI team.
Hi,
The DGEMM performance issue you encountered is mainly due to the following reasons:
Hybrid-architecture thread scheduling. By default, MKL tries to use all available cores (P-cores + E-cores), but the single-thread performance of the E-cores is low. In large, memory-intensive tasks, resource contention between P-cores and E-cores (for the shared LLC and memory bandwidth, for instance) causes performance degradation.
The thread scheduling of the operating system and the BLAS library may not intelligently distinguish between P-cores and E-cores, so some threads get assigned to E-cores, which reduces overall efficiency.
When large matrix operations are performed, memory bandwidth becomes the main bottleneck. Adding E-cores does not add bandwidth; it may instead increase bandwidth pressure through contention, degrading performance.
We recommend trying to bind threads to the P-cores through affinity settings (such as KMP_AFFINITY) to further improve consistency; a minimal example follows.
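For instance, assuming the stock i9-14900K layout where the 16 P-core hardware threads are OS procs 0-15 (as the KMP_AFFINITY=verbose map later in this thread confirms):

set MKL_NUM_THREADS=16
set KMP_AFFINITY=granularity=fine,proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],explicit

The explicit proclist keeps every worker thread on a P-core so none land on the E-cores.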
Regards,
Ruqiu
Ask on the MKL forum; they are a kinder crowd of people.
You could try /Qmkl:sequential.
You could also try
set OMP_NUM_THREADS=1
and then 2, 4, 8, 16.
Also try
set KMP_AFFINITY=verbose
and run, then check the processor binding. You are probably using hyperthreads by default. See the above for setting OMP_NUM_THREADS to the number of physical cores (1/2 the PROCESSORS).
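For example, such a sweep can be run in one go from an interactive cmd prompt (a hypothetical one-liner; use %%t instead of %t inside a .bat file):

for %t in (1 2 4 8 16) do ( set "OMP_NUM_THREADS=%t" & fortran_dgemm2.exe )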
The program is actually a single-threaded DGEMM benchmark; there are no OpenMP directives inside it, so any parallelism (including hyperthreading) comes from the parallel BLAS/LAPACK library. That is why I had not set any OMP_NUM_THREADS variable.
After setting KMP_AFFINITY=verbose:
C:\temp\dgemm>set OMP_NUM_THREADS
Environment variable OMP_NUM_THREADS not defined
C:\temp\dgemm>set KMP_AFFINITY=verbose
C:\temp\dgemm>fortran_dgemm2a.exe
Fortran DGEMM Benchmark (k = 5000 , using BLAS DGEMM - MKL expected)
Kompiler: Intel Fortran (ifort)
Timer: SYSTEM_CLOCK (Wall-Clock Time)
---------------------------------------------
m=n | GFLOPS | Time (s)
---------------------------------------------
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 0-31
OMP: Info #217: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #157: KMP_AFFINITY: 32 available OS procs
OMP: Info #159: KMP_AFFINITY: Nonuniform topology
OMP: Info #288: KMP_AFFINITY: topology layer "NUMA domain" is equivalent to "socket".
OMP: Info #288: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #288: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
OMP: Info #288: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #192: KMP_AFFINITY: 1 socket x 12 L2 caches/socket x 4 cores/L2 cache x 2 threads/core (24 total cores)
OMP: Info #194: KMP_AFFINITY: Intel(R) Hybrid Technology core type detected: 8 Intel(R) Core(TM) processor cores.
OMP: Info #195: KMP_AFFINITY: 8 with core efficiency 1.
OMP: Info #194: KMP_AFFINITY: Intel(R) Hybrid Technology core type detected: 16 Intel Atom(R) processor cores.
OMP: Info #195: KMP_AFFINITY: 16 with core efficiency 0.
OMP: Info #219: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #172: KMP_AFFINITY: OS proc 0 maps to socket 0 L2 cache 0 core 0 thread 0 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 1 maps to socket 0 L2 cache 0 core 0 thread 1 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 2 maps to socket 0 L2 cache 8 core 1 thread 2 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 3 maps to socket 0 L2 cache 8 core 1 thread 3 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 4 maps to socket 0 L2 cache 16 core 2 thread 4 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 5 maps to socket 0 L2 cache 16 core 2 thread 5 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 6 maps to socket 0 L2 cache 24 core 3 thread 6 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 7 maps to socket 0 L2 cache 24 core 3 thread 7 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 8 maps to socket 0 L2 cache 32 core 4 thread 8 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 9 maps to socket 0 L2 cache 32 core 4 thread 9 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 10 maps to socket 0 L2 cache 40 core 5 thread 10 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 11 maps to socket 0 L2 cache 40 core 5 thread 11 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 12 maps to socket 0 L2 cache 48 core 6 thread 12 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 13 maps to socket 0 L2 cache 48 core 6 thread 13 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 14 maps to socket 0 L2 cache 56 core 7 thread 14 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 15 maps to socket 0 L2 cache 56 core 7 thread 15 (Intel(R) Core(TM) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 16 maps to socket 0 L2 cache 64 core 8 thread 16 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 17 maps to socket 0 L2 cache 64 core 9 thread 17 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 18 maps to socket 0 L2 cache 64 core 10 thread 18 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 19 maps to socket 0 L2 cache 64 core 11 thread 19 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 20 maps to socket 0 L2 cache 72 core 12 thread 20 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 21 maps to socket 0 L2 cache 72 core 13 thread 21 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 22 maps to socket 0 L2 cache 72 core 14 thread 22 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 23 maps to socket 0 L2 cache 72 core 15 thread 23 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 24 maps to socket 0 L2 cache 80 core 16 thread 24 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 25 maps to socket 0 L2 cache 80 core 17 thread 25 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 26 maps to socket 0 L2 cache 80 core 18 thread 26 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 27 maps to socket 0 L2 cache 80 core 19 thread 27 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 28 maps to socket 0 L2 cache 88 core 20 thread 28 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 29 maps to socket 0 L2 cache 88 core 21 thread 29 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 30 maps to socket 0 L2 cache 88 core 22 thread 30 (Intel Atom(R) processor)
OMP: Info #172: KMP_AFFINITY: OS proc 31 maps to socket 0 L2 cache 88 core 23 thread 31 (Intel Atom(R) processor)
OMP: Info #255: KMP_AFFINITY: pid 556 tid 25368 thread 0 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 5504 thread 1 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 14536 thread 2 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 4832 thread 3 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 13408 thread 4 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 11824 thread 5 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 24152 thread 6 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 2740 thread 7 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 19424 thread 8 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 23416 thread 9 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 25944 thread 10 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 11580 thread 11 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 25916 thread 12 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 23204 thread 13 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 18720 thread 14 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 25556 thread 15 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 25848 thread 16 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 21248 thread 17 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 10128 thread 18 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 19876 thread 19 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 23704 thread 20 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 14356 thread 21 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 11656 thread 22 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 556 tid 17844 thread 23 bound to OS proc set 0-31
2000 | 754.7170 | 0.053000
4000 | 680.8511 | 0.235000
6000 | 791.2088 | 0.455000
8000 | 103.9467 | 6.157000
10000 | 77.2380 | 12.947000
12000 | 81.6975 | 17.626000
14000 | 123.4801 | 15.873000
16000 | 101.5510 | 25.209000
18000 | 80.9474 | 40.026000
20000 | 103.9555 | 38.478000
---------------------------------------------
Using MKL sequential, which makes it single-threaded:
C:\temp\dgemm>fortran_dgemm_mklseq.exe
Fortran DGEMM Benchmark (k = 5000 , using BLAS DGEMM - MKL expected)
Kompiler: Intel Fortran (ifort)
Timer: SYSTEM_CLOCK (Wall-Clock Time)
---------------------------------------------
m=n | GFLOPS | Time (s)
---------------------------------------------
2000 | 80.8081 | 0.495000
4000 | 11.5382 | 13.867000
6000 | 12.7240 | 28.293000
8000 | 10.8894 | 58.773000
10000 | 11.2529 | 88.866000
12000 | 12.0197 | 119.803000
I will post results with OMP_NUM_THREADS and the BLAS thread-count variables later, after my work calculations finish.
In the original post I had not set OMP_NUM_THREADS, since the code itself has no OpenMP directives and was not compiled with /Qopenmp. I guess it is a memory bandwidth problem. Setting MKL_NUM_THREADS mostly solves it: MKL then behaves slightly better than OpenBLAS, without the big drop.
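Side note: the same cap can also be applied from inside the program through MKL's service API instead of the environment. A sketch (not what I ran here):

! call once before the first DGEMM; requires linking against MKL
call mkl_set_num_threads(16)   ! cap MKL at the 16 P-core hardware threads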
1. When using /Qmkl:sequential it is very slow.
C:\temp\dgemm>fortran_dgemm_mklseq.exe
Fortran DGEMM Benchmark (k = 5000 , using BLAS DGEMM - MKL expected)
Kompiler: Intel Fortran (ifort)
Timer: SYSTEM_CLOCK (Wall-Clock Time)
---------------------------------------------
m=n | GFLOPS | Time (s)
---------------------------------------------
2000 | 80.8081 | 0.495000
4000 | 11.5382 | 13.867000
6000 | 12.7240 | 28.293000
8000 | 10.8894 | 58.773000
10000 | 11.2529 | 88.866000
12000 | 12.0197 | 119.803000
2. With OMP_NUM_THREADS set to 1, then 2 and so on up to 32, the results are the same for every value; 4 worker threads (tids) were activated. Still a bit slow.
C:\temp\dgemm>set set OMP_NUM_THREADS=1
C:\temp\dgemm>fortran_dgemm_mklpar.exe
Fortran DGEMM Benchmark (k = 5000 , using BLAS DGEMM - MKL expected)
Kompiler: Intel Fortran (ifort)
Timer: SYSTEM_CLOCK (Wall-Clock Time)
---------------------------------------------
m=n | GFLOPS | Time (s)
---------------------------------------------
[KMP_AFFINITY topology map identical to the first verbose run above; omitted. Thread bindings for this run:]
OMP: Info #255: KMP_AFFINITY: pid 18708 tid 18980 thread 0 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 18708 tid 18920 thread 1 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 18708 tid 18664 thread 3 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 18708 tid 2324 thread 2 bound to OS proc set 0-31
2000 | 330.5785 | 0.121000
4000 | 342.6124 | 0.467000
6000 | 121.5395 | 2.962000
8000 | 121.5344 | 5.266000
10000 | 119.5172 | 8.367000
12000 | 119.9300 | 12.007000
14000 | 119.4466 | 16.409000
16000 | 119.8053 | 21.368000
18000 | 119.4206 | 27.131000
20000 | 120.2212 | 33.272000
---------------------------------------------
C:\temp\dgemm>set set OMP_NUM_THREADS=2
C:\temp\dgemm>fortran_dgemm_mklpar.exe
Fortran DGEMM Benchmark (k = 5000 , using BLAS DGEMM - MKL expected)
Kompiler: Intel Fortran (ifort)
Timer: SYSTEM_CLOCK (Wall-Clock Time)
---------------------------------------------
m=n | GFLOPS | Time (s)
---------------------------------------------
[KMP_AFFINITY topology map identical to the first verbose run above; omitted. Thread bindings for this run:]
OMP: Info #255: KMP_AFFINITY: pid 4632 tid 25972 thread 0 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 4632 tid 15468 thread 1 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 4632 tid 23156 thread 2 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 4632 tid 20744 thread 3 bound to OS proc set 0-31
2000 | 330.5785 | 0.121000
4000 | 338.9831 | 0.472000
6000 | 119.4030 | 3.015000
8000 | 119.7381 | 5.345000
10000 | 119.5457 | 8.365000
12000 | 120.1001 | 11.990000
14000 | 119.7020 | 16.374000
16000 | 119.5982 | 21.405000
18000 | 119.6543 | 27.078000
20000 | 119.8358 | 33.379000
---------------------------------------------
3. With MKL_NUM_THREADS=32, 32 MKL worker threads (tids) were requested and the sudden drop happened again. What happens is the same as not setting MKL_NUM_THREADS at all, as in the original post; it seems to match MKL's automatic default.
C:\temp\dgemm>set MKL_NUM_THREADS=32
C:\temp\dgemm>fortran_dgemm_mklpar.exe
Fortran DGEMM Benchmark (k = 5000 , using BLAS DGEMM - MKL expected)
Kompiler: Intel Fortran (ifort)
Timer: SYSTEM_CLOCK (Wall-Clock Time)
---------------------------------------------
m=n | GFLOPS | Time (s)
---------------------------------------------
[KMP_AFFINITY topology map identical to the first verbose run above; omitted. Thread bindings for this run:]
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 7884 thread 0 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 19136 thread 1 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 15804 thread 2 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 25680 thread 12 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 22948 thread 4 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 18804 thread 5 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 15896 thread 6 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 21340 thread 7 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 22216 thread 8 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 19012 thread 10 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 18980 thread 11 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 24160 thread 3 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 20196 thread 9 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 26160 thread 13 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 11064 thread 14 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 9540 thread 15 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 18748 thread 16 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 22952 thread 17 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 15452 thread 18 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 11540 thread 19 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 22420 thread 20 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 15744 thread 21 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 2260 thread 22 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 25628 tid 2608 thread 23 bound to OS proc set 0-31
2000 | 800.0000 | 0.050000
4000 | 686.6953 | 0.233000
6000 | 803.5714 | 0.448000
8000 | 92.2988 | 6.934000
10000 | 79.2142 | 12.624000
12000 | 85.3182 | 16.878000
14000 | 92.2179 | 21.254000
16000 | 99.9492 | 25.613000
18000 | 89.7557 | 36.098000
20000 | 88.6152 | 45.139000
---------------------------------------------
But with MKL_NUM_THREADS=16 it is a little faster than OpenBLAS; a drop still happens after 8000, but it is not as severe.
C:\temp\dgemm>set MKL_NUM_THREADS=16
C:\temp\dgemm>fortran_dgemm_mklpar.exe
Fortran DGEMM Benchmark (k = 5000 , using BLAS DGEMM - MKL expected)
Kompiler: Intel Fortran (ifort)
Timer: SYSTEM_CLOCK (Wall-Clock Time)
---------------------------------------------
m=n | GFLOPS | Time (s)
---------------------------------------------
[KMP_AFFINITY topology map identical to the first verbose run above; omitted. Thread bindings for this run:]
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 21928 thread 0 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 15400 thread 2 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 18944 thread 1 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 11552 thread 3 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 20044 thread 6 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 15668 thread 7 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 20352 thread 4 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 4784 thread 5 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 16272 thread 8 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 21092 thread 9 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 2892 thread 10 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 25412 thread 11 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 5736 thread 12 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 14276 thread 13 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 13620 thread 14 bound to OS proc set 0-31
OMP: Info #255: KMP_AFFINITY: pid 20916 tid 24036 thread 15 bound to OS proc set 0-31
2000 | 714.2857 | 0.056000
4000 | 658.4362 | 0.243000
6000 | 659.3407 | 0.546000
8000 | 418.3007 | 1.530000
10000 | 417.8855 | 2.393000
12000 | 418.1185 | 3.444000
14000 | 417.8214 | 4.691000
16000 | 417.2099 | 6.136000
18000 | 417.7411 | 7.756000
20000 | 419.1554 | 9.543000
---------------------------------------------
So for Intel P-core/E-core hybrids, the best approach seems to be setting MKL_NUM_THREADS to the total number of P-core hardware threads. What I still don't get:
1. Why is throughput so high for small matrices and then drops suddenly once the matrix gets big? Is it memory bandwidth? On older chips, GFLOPS usually grows with matrix size until it flattens out at the CPU's maximum compute throughput; here it is high at small sizes and then slows down. Since it happens with both OpenBLAS and MKL, it looks like the CPU rather than the software, even though the size of the slowdown varies between BLAS implementations. Is this normal, or is my reasoning flawed?
2. The drop at bigger matrices is quite significant: ~419 GFLOPS, while for comparison my 3900X runs at 490-500 GFLOPS on an 18000x18000 matrix, starting from 344 GFLOPS at 2000x2000. Is that normal?
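For scale: at m=n=20000 with k=5000, the three double-precision matrices occupy roughly 0.8 GB (A) + 0.8 GB (B) + 3.2 GB (C), about 4.8 GB in total, far beyond the i9-14900K's 36 MB LLC, so the blocked GEMM kernels have to stream panels from DRAM, which at least makes the bandwidth explanation plausible.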
The performance drop as matrix size increases is primarily due to cache limits and memory bandwidth; it is a normal phenomenon.
Regarding the difference between the Intel and AMD results: can you share more of your observations (or performance data) for the i9-14900K vs. the 3900X? It would also be helpful to send us your sample code as well as your test methods.
Hi Mr. Ruqiu,
I was testing the i9-14900K for consistency in matrix inversion using DGESV and DGETRF/DGETRI, and I observed performance inconsistencies.
So I first checked DGEMM performance. (Update: the matrix-inversion routines DGESV and DGETRF/DGETRI are, by contrast, fairly consistent.)
This Fortran program benchmarks the performance of a double-precision general matrix multiplication (DGEMM) operation, C = α*A*B + β*C, by calling an external BLAS (Basic Linear Algebra Subprograms) library. The Fortran code itself does not contain any OpenMP directives and is serial from its own perspective. The parallelism in my benchmarks comes from the external BLAS library (Intel MKL or OpenBLAS) that provides the DGEMM implementation. These libraries are, to my understanding, highly optimized and internally threaded. The program is attached here as fortran_dgemm2.f90.
I compiled the benchmark using the Intel(R) Fortran Compiler (ifx), Version 2025.0.4 Build 20241205 (from Intel oneAPI), with /Qmkl:parallel, and also tested against OpenBLAS for comparison. My initial tests left MKL_NUM_THREADS and OMP_NUM_THREADS unset, and I then varied these settings in subsequent trials, as suggested by Mr. Ron Green.
I was quite astonished last week by the results when MKL_NUM_THREADS and OMP_NUM_THREADS were unset, the configuration I usually start with. When throughput plummeted from ~800 GFLOPS at small matrix sizes to 80-110 GFLOPS at large ones, I thought my processor was defective.
C:\temp\dgemm>fortran_dgemm2.exe
Fortran DGEMM Benchmark (k = 5000 , using BLAS DGEMM - MKL expected)
Kompiler: Intel Fortran (ifort)
Timer: SYSTEM_CLOCK (Wall-Clock Time)
---------------------------------------------
m=n | GFLOPS | Time (s)
---------------------------------------------
2000 | 800.0000 | 0.050000
4000 | 683.7607 | 0.234000
6000 | 787.7462 | 0.457000
8000 | 87.6352 | 7.303000
10000 | 70.5169 | 14.181000
12000 | 78.9733 | 18.234000
14000 | 85.9084 | 22.815000
My i9-14900K exhibited a stark difference in performance based on matrix size and thread configuration.
- Small Matrices (m=n <= 6000): Achieved very high GFLOPS (peak ~816 GFLOPS at m=n=2000) when OMP_NUM_THREADS was unset or set to 32 (and MKL_NUM_THREADS was unset or 32).
- Large Matrices (m=n >= 8000): In these same high-thread configurations (unset/unset, 32/unset, 32/32), performance dramatically dropped to ~80-110 GFLOPS. This indicates a severe bottleneck or scheduling issue when attempting to utilize all threads (including E-cores) for larger, more memory-intensive problems.
- Interestingly, even when I explicitly set OMP_NUM_THREADS or MKL_NUM_THREADS to 32 (to utilize all logical cores) during the tests where performance plummeted for large matrices, I observed that CPU utilization often appeared to be only around half of the available capacity, suggesting that perhaps only 16 threads were being effectively utilized or were not being saturated.
- Configurations that effectively limited the number of threads used by MKL to around the P-core count (e.g., OMP_NUM_THREADS=16 with MKL_NUM_THREADS unset, or MKL_NUM_THREADS=16 regardless of OMP_NUM_THREADS being 16 or 32) showed much more consistent and stable performance for larger matrices, albeit at a lower level (around ~414 GFLOPS).
- When thread settings were left unset (or set to utilize all 32 threads) on the i9-14900K, performance on large matrices plummeted, making it far less effective than the 3900X for these scenarios.
- The i9-14900K with OpenBLAS (ifx), when P-cores are targeted (OMP=16), provides a stable ~397 GFLOPS.
- The i9-14900K's hybrid P-core/E-core architecture presents a challenge for this DGEMM benchmark, especially with large matrices when attempting to use all available threads. The E-cores might not be contributing effectively, or could be causing contention, leading to the observed performance degradation.
- Therefore, for hybrid processors like the i9-14900K, it appears that MKL_NUM_THREADS and/or OMP_NUM_THREADS must be explicitly set to achieve consistent (or, for large matrices, optimal) MKL performance at this time. A consolidated sketch of the settings follows.
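Putting that together (MKL_NUM_THREADS=16 is the value measured above; OPENBLAS_NUM_THREADS is the analogous OpenBLAS variable for pthread builds, while OpenMP builds of OpenBLAS honor OMP_NUM_THREADS instead; Intel's affinity suggestion can be layered on top, though I did not benchmark it separately):

set MKL_NUM_THREADS=16
fortran_dgemm2.exe

set OPENBLAS_NUM_THREADS=16
fortran_dgemm2_openblas.exe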