Welcome Guest! To enable all features please Login or Register.

Notification

Icon
Error

NCrunch hangs during Build with Engine Core Load 97%
Kelon
#1 Posted : Tuesday, August 3, 2021 10:30:00 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 10/14/2016(UTC)
Posts: 9
Location: Switzerland

Thanks: 1 times
Was thanked: 2 time(s) in 2 post(s)
After several months of living with this, I've inspected this issue a bit deeper.

Issue:
Having a large solution with about 150 projects and a developer machine with 16 cores and thus NCrunch with heavy multithreading, I've got a blocker many times. Asking my team members it seemed to be dependent on how many cores are used for the MSBuild building process.
Especially in the 'Run impacted tests automatically'-Mode it gets rather common:
Opening the solution or change any impacted file, the build starts running. After a short time, the CPU goes back to idle, while NCrunch states a high Engine Core Load, usually 97%. No further tests will be executed. The usual solution to this was to manually kill nCrunch.EngineHost, wipe some of the cache, etc and reenable NCrunch.

Research:
I've logged the whole build pipeline and found nothing special. But while debugging the EngineHost, besides many UnauthorizedAccessExceptions, one Exception happened quite before the blocker:
Code:
nCrunch.Compiler.Cil.Utility.CilResolutionFailedException
  HResult=0x80131500
  Message=The expected type with identifier '10241287978610504889' was not found in this assembly
  Source=nCrunch.Compiler
  StackTrace:
   at nCrunch.Compiler.Cil.Types.CilExportedTypeTable.GetTypeById(UInt64 typeId)

Unfortunately this is the whole stack trace.

After some configuration fiddling, this is what I found out:

  • This only happens with massive multithreaded build, setting "Max number of processing threads" to "1" solves the problem, setting it to below 4 loosens it. So this is definitively a thread safety issue.
  • But more important: Setting "Impact detection mode" to "WatchText" solves the problem even without limiting the max number of processing threads. So this issue belongs to the "CompareIL" change detection mode.


Quick Fix for all users who currently have this issue:
Configuration -> Performance -> Set "Impact detection mode" to "WatchText"

Bugfix request:
The CilResolutionFailedException seems to happen whenever "Impact detection mode" is set to "CompareIL/default" and projects are built in parallel, losing their Type identifier. This seems to be a thread-safety issue or unhandled exception. It would be great to get rid of this build blocker whatever the solution might be.

Thanks for your support. :)
Remco
#2 Posted : Tuesday, August 3, 2021 12:55:16 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
Thanks for sharing this issue in so much detail. I appreciate how hard it must have been to extract this exception.

I agree that this looks like a race condition in the type resolver. I've reviewed the code involved and have a theory about what could be causing this. Unfortunately, I don't have a way to test it yet as it looks like the scope of the race is very narrow. I'd like to try implementing a codefix and arranging a build for you later this week. If the fix doesn't work, I might need to try and build some targeted tests that can surface the issue under load with heavy churning. I'll let you know as soon as I have the build available.
Kelon
#3 Posted : Tuesday, August 3, 2021 1:06:28 PM(UTC)
Rank: Newbie

Groups: Registered
Joined: 10/14/2016(UTC)
Posts: 9
Location: Switzerland

Thanks: 1 times
Was thanked: 2 time(s) in 2 post(s)
I think the easiest way to reproduce this exception for development is to change GetTypeById...) to always throw the exception or randomly throw it - even in a small solution.

But I'm glad to help and retest it with a build. :)
Remco
#4 Posted : Monday, August 9, 2021 7:52:32 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
Kelon
#5 Posted : Tuesday, August 10, 2021 8:48:53 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 10/14/2016(UTC)
Posts: 9
Location: Switzerland

Thanks: 1 times
Was thanked: 2 time(s) in 2 post(s)
Thanks :)
Installed, building up some more stress testing...
Kelon
#6 Posted : Wednesday, August 11, 2021 8:45:53 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 10/14/2016(UTC)
Posts: 9
Location: Switzerland

Thanks: 1 times
Was thanked: 2 time(s) in 2 post(s)
After a while of testing in CompareIL/default Mode and Run impacted tests:
It has gotten better - no exception thrown, no more hang during build & first full run.

But I still get the EngineHost hanging, especially in "Run pinned tests"-Mode even without having any pinned tests.

Attached to the EngineHost-Process, it locks on:
Code:
nCrunch.Core.ProcessManagement.ExternalProcessManager.StopUsingProcessExternalProcess, ProcessPoolType)

at the call of lock/Monitor.Enter
Remco
#7 Posted : Thursday, August 12, 2021 12:14:53 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
A deadlock! Are you able to spot the other side of the lock? (i.e. any other threads inside ExternalProcessManager?)

Also, are you using MSTest for any of your tests?
Kelon
#8 Posted : Thursday, August 12, 2021 2:18:59 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 10/14/2016(UTC)
Posts: 9
Location: Switzerland

Thanks: 1 times
Was thanked: 2 time(s) in 2 post(s)
All tests are running with NUnit3.

No, I havent found any other sides yet. Currently I've been only the default VS Debugger which only shows IL. But I'll keep my eyes open to gather more information.
Kelon
#9 Posted : Monday, August 16, 2021 2:39:57 PM(UTC)
Rank: Newbie

Groups: Registered
Joined: 10/14/2016(UTC)
Posts: 9
Location: Switzerland

Thanks: 1 times
Was thanked: 2 time(s) in 2 post(s)
Ok, I've got a snapshot in JetBrains dotTrace. Its quite small - 26MB zipped. Tell me if you need it.

It seems to lock on CoreDispatcher -> ClearProcesses -> TerminateProcess
while a StopUsingProcess in a Worker is already running, but tries to call TerminateProcess
Remco
#10 Posted : Tuesday, August 17, 2021 1:07:58 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
Kelon;15604 wrote:

It seems to lock on CoreDispatcher -> ClearProcesses -> TerminateProcess
while a StopUsingProcess in a Worker is already running, but tries to call TerminateProcess


Thanks! This helps a great deal.

I've examined the code involved, and interestingly, this doesn't actually look like a deadlock. The code here has the potential to hang in one of two places:

1. The O/S is failing to fully terminate/cleanup the instance of a taskhost (i.e. nCrunch.TestHost.*.exe), and NCrunch is stuck in a loop waiting for the .EXE to terminate. I haven't seen this happen on my side under NCrunch, but it may depend on what the code under test is doing at the time the O/S tries to terminate the process. I would expect that there are critical service calls that the O/S must complete before the process can fully terminate. An example could be something like a background thread that runs during your test run that calls into some deep service routine that never comes out. It might be possible to figure this out by attaching a debugger to any nCrunch.TestHost.*.exe processes and examining what they're doing at the time the engine hangs.

2. You have an MSTest test that is holding file locks onto files copied to its internal MSTest sandbox directory, and this is happening on a large enough scale that the timeouts of 15 seconds per task instance are long enough to cause the perception of a hang. As you've mentioned you're not using MSTest, I don't think this applies to you. Just thought I'd mention it as it IS still a potential hang point.

It should be possible to examine both of the above cases by turning on your logging with a verbosity of Medium. When NCrunch tries to terminate a process, it logs the message 'Process XXXX is terminating'. The XXXX is the ID of the Windows process. If the log messages for the thread involved suddenly stop after this message, then you can be sure its the process with the specified ID that is failing to complete termination and is causing the hang.

Something that could be worth trying is to turn off the 'Terminate test runners on complete' setting. This won't resolve any problems caused by processes failing to terminate, but it will change the life cycle involved by pushing the termination later in the sequence and reducing the number of threads involved. It might provide you with a workaround.
Kelon
#11 Posted : Thursday, August 19, 2021 5:38:25 PM(UTC)
Rank: Newbie

Groups: Registered
Joined: 10/14/2016(UTC)
Posts: 9
Location: Switzerland

Thanks: 1 times
Was thanked: 2 time(s) in 2 post(s)
Thanks for this reply.

I've tried 'Terminate test runners on complete':false
After several hours of coding, NCrunch stayed stable! :)

Your guess #1 sounds possible - I haven't debugged it yet. I'll try to find the responsible O/S handle... probably one of the IL Code generators. (Well, hopefully not!)

But until then, this helped me a lot!
1 user thanked Kelon for this useful post.
Remco on 8/20/2021(UTC)
Users browsing this topic
Guest
Forum Jump  
You cannot post new topics in this forum.
You cannot reply to topics in this forum.
You cannot delete your posts in this forum.
You cannot edit your posts in this forum.
You cannot create polls in this forum.
You cannot vote in polls in this forum.

YAF | YAF © 2003-2011, Yet Another Forum.NET
This page was generated in 0.073 seconds.
Trial NCrunch
Take NCrunch for a spin
Do your fingers a favour and supercharge your testing workflow
Free Download