I have an application that is 100% Delphi code. It is a 64-bit Windows console application with a workload manager and a fixed number of workers. Each worker is a thread created at startup. The threads never terminate; each one pulls work from its own queue, which the workload manager populates.
This appears to work just fine.
What I am finding, however, is that on a 16-core system processing takes around 90 minutes (there are 2,000,000+ workloads, and each does database work). When I went from 16 to 32 cores, performance dropped! There is no database contention; essentially, the DB is waiting for things to do.
Each thread has its own DB connection. Each thread's queries use only that thread's connection.
I replaced the Delphi memory manager with ScaleMM2, which made a big improvement, but I am still at a loss as to why increasing the core count reduces performance.
When the app has 256 threads on 32 cores, total CPU use is at 80%.
When the app has 256 threads on 16 cores, total CPU use is at 100% (which is why I wanted to add cores), and yet with more cores it got slower :-(
I have applied as much of the advice as I can understand to the code base.
That is: functions do not return strings, arguments are passed as Const, and "shared" data is protected with small critical sections (actually using Multi-Read Exclusive-Write). I currently do not assign processor affinity. I have read conflicting advice on using it, so it is not there today (it would be trivial to add).
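To show what I mean by the Const advice, here is a minimal sketch (the function names are made up for illustration and are not from my code base). As I understand it, a by-value string parameter makes the compiler add a reference-count increment and decrement around the call, and those are LOCK-prefixed instructions; a Const parameter avoids them.

```delphi
program ConstArgSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical example: by-value string parameter.
// The compiler adds a reference-count increment on entry and a decrement
// on exit (LOCK-prefixed for heap strings), plus a hidden try/finally frame.
function LengthByValue(S: string): Integer;
begin
  Result := Length(S);
end;

// Hypothetical example: Const string parameter.
// No reference-count traffic and no hidden try/finally frame.
function LengthByConst(const S: string): Integer;
begin
  Result := Length(S);
end;

var
  Key: string;
begin
  // Built at run time, so it is a heap-allocated, reference-counted string.
  Key := 'WORKLOAD-' + IntToStr(42);
  Writeln(LengthByValue(Key));
  Writeln(LengthByConst(Key));
end.
```

Both calls return the same value; the difference is only in the hidden reference counting, which matters when many threads touch the same string instance.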
Questions, slanted by my suspicion that the issue is thread contention:
How do I confirm that thread contention is the issue? Are there tools available specifically for identifying this type of contention?
How can I determine what is using "heap" and what is not, to further reduce contention there?
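For the first question, one thing I could try myself is to instrument the locks I know about. The following is only a sketch under assumptions: `TCountingLock` is a hypothetical wrapper (not something in my code or in the RTL) that uses `TCriticalSection.TryEnter` to count how often a thread finds the lock already held. It would not reveal hidden locks in the RTL or memory manager, only how hot my own critical sections are.

```delphi
program LockContentionSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.SyncObjs;

type
  // Hypothetical wrapper: counts acquisitions and contended acquisitions.
  TCountingLock = class
  private
    FLock: TCriticalSection;
    FAcquires: Int64;
    FContended: Int64;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Enter;
    procedure Leave;
    property Acquires: Int64 read FAcquires;
    property Contended: Int64 read FContended;
  end;

constructor TCountingLock.Create;
begin
  inherited Create;
  FLock := TCriticalSection.Create;
end;

destructor TCountingLock.Destroy;
begin
  FLock.Free;
  inherited;
end;

procedure TCountingLock.Enter;
begin
  // TryEnter fails when another thread holds the lock: count it, then block.
  if not FLock.TryEnter then
  begin
    TInterlocked.Increment(FContended);
    FLock.Enter;
  end;
  Inc(FAcquires); // safe without atomics: the lock is held here
end;

procedure TCountingLock.Leave;
begin
  FLock.Leave;
end;

var
  Lock: TCountingLock;
  Threads: array[0..3] of TThread;
  Shared: Int64;
  I: Integer;
begin
  Lock := TCountingLock.Create;
  try
    Shared := 0;
    for I := Low(Threads) to High(Threads) do
    begin
      Threads[I] := TThread.CreateAnonymousThread(
        procedure
        var
          N: Integer;
        begin
          for N := 1 to 100000 do
          begin
            Lock.Enter;
            try
              Inc(Shared); // the "shared data" being protected
            finally
              Lock.Leave;
            end;
          end;
        end);
      Threads[I].FreeOnTerminate := False; // so WaitFor and Free are safe
      Threads[I].Start;
    end;
    for I := Low(Threads) to High(Threads) do
    begin
      Threads[I].WaitFor;
      Threads[I].Free;
    end;
    Writeln(Format('acquires=%d contended=%d', [Lock.Acquires, Lock.Contended]));
  finally
    Lock.Free;
  end;
end.
```

A high ratio of contended to total acquisitions on a given lock would point at that lock; a low ratio everywhere would suggest the contention is somewhere I am not looking.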
Insights, guidance, pointers would be appreciated.
Can provide relevant code areas ... if I knew what was relevant.
Procedure TXETaskWorkloadExecuterThread.Enqueue(Const Workload: TXETaskWorkload);
// protect your own queue
Procedure TXETaskManager.Enqueue(Const Workload: TXETaskWorkload);
If FWorkloadCount >= FMaxQueueSize Then Begin
FWorkloadCount := 0;
// round-robin the queue
If FNextThread >= FWorkerThreads Then Begin
FNextThread := 0;
Function TXETaskWorkloadExecuterThread.Dequeue(Var Workload: TXETaskWorkload): Boolean;
Workload := Nil;
Result := False;
If FNextWorkload < FWorkloads.Count Then Begin
Workload := FWorkloads[FNextWorkload];
If Workload Is TXETaskWorkLoadSynchronize Then Begin
Result := True;
End Else Begin
FNextWorkload := 0;
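Since the fragments above are incomplete, here is a stripped-down, self-contained sketch of the same pattern: one lock-protected queue per worker, filled round-robin by the manager. All names here (`TWorkerQueue`, `QueueCount`, and so on) are hypothetical, workloads are reduced to integers, and this is not the production code.

```delphi
program PerThreadQueueSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.SyncObjs, System.Generics.Collections;

type
  // Hypothetical stand-in for one worker's private queue.
  TWorkerQueue = class
  private
    FLock: TCriticalSection;
    FItems: TQueue<Integer>;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Enqueue(const Item: Integer);
    function Dequeue(var Item: Integer): Boolean;
  end;

constructor TWorkerQueue.Create;
begin
  inherited Create;
  FLock := TCriticalSection.Create;
  FItems := TQueue<Integer>.Create;
end;

destructor TWorkerQueue.Destroy;
begin
  FItems.Free;
  FLock.Free;
  inherited;
end;

procedure TWorkerQueue.Enqueue(const Item: Integer);
begin
  FLock.Enter; // protect your own queue
  try
    FItems.Enqueue(Item);
  finally
    FLock.Leave;
  end;
end;

function TWorkerQueue.Dequeue(var Item: Integer): Boolean;
begin
  FLock.Enter;
  try
    Result := FItems.Count > 0;
    if Result then
      Item := FItems.Dequeue;
  finally
    FLock.Leave;
  end;
end;

const
  QueueCount = 3; // assumption: three workers, for illustration only
var
  Queues: array[0..QueueCount - 1] of TWorkerQueue;
  NextQueue, I, Item, Drained: Integer;
begin
  for I := 0 to QueueCount - 1 do
    Queues[I] := TWorkerQueue.Create;
  try
    // Manager side: round-robin the workloads across the queues.
    NextQueue := 0;
    for I := 0 to 6 do
    begin
      Queues[NextQueue].Enqueue(I);
      NextQueue := (NextQueue + 1) mod QueueCount;
    end;
    // Worker side (run inline here): drain each queue until it is empty.
    for I := 0 to QueueCount - 1 do
    begin
      Drained := 0;
      while Queues[I].Dequeue(Item) do
        Inc(Drained);
      Writeln(Format('queue %d drained %d items', [I, Drained]));
    end;
  finally
    for I := 0 to QueueCount - 1 do
      Queues[I].Free;
  end;
end.
```

In this shape the only lock a worker shares with anyone is the one on its own queue, and only the manager ever competes for it.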
Thanks for all the comments. Clarifications.
This system/VM has nothing else on it. The executable in question is the only thing using the CPU. Single-threaded performance is linear. I have simply made this a divide-and-conquer problem. Say I have 5,000,000 cars to park, and 30 drivers with 30 different parking lots. I can tell each driver to wait for the other drivers to finish parking, but that will be slower than telling all 30 drivers to park cars concurrently.
Profiling in single-threaded mode shows nothing that would cause this. I have seen mention on this board of Delphi multi-core performance "gotchas" (mostly related to string handling and LOCK).
The DB is essentially saying that it is bored and waiting for things to do. I have checked with a copy of Intel's VTune. Generally speaking, it says ... locks. But I cannot find out where. What I have is pretty simple to my mind, and the current locked areas are necessary and small. What I cannot see is locks that might be happening due to other things, like strings causing a lock, or thread 1 causing some issue on the main process by accessing its data (even though that access is protected by a critical section).
Continuing to research. Thanks again for the feedback/ideas.