Welcome Guest! To enable all features please Login or Register.

Notification

Icon
Error

NCrunch console never exiting
ecargo
#1 Posted : Tuesday, November 17, 2020 11:23:38 PM(UTC)
Rank: Newbie

Groups: Registered
Joined: 8/17/2015(UTC)
Posts: 3
Location: New Zealand

Hi,

I'm running an NCrunch console test run, but it never completes. The following output was the tail of the logs copied from the terminal at 12:18:

Code:
[10:24:39.5298-Core-117] Checking for tasks to launch (1 out of 5 concurrent tasks are active)
[10:24:39.5298-Core-117] Stretch targets only = False
[10:24:39.5298-Core-117] Time critical tasks only = False
[10:24:39.5298-Core-117] Resources already in use = [Resources:X=, I=Debugger;Test Runner]
[10:24:39.5298-Core-117] Test filter = True
[10:24:39.5298-Core-117] Local processing allowed = True
[10:24:39.5298-Core-117] Capabilities =
[10:24:39.5298-Core-117] Publishing Event: [ProcessingQueueCapacityAvailableEvent]
[10:24:39.5298-Core-117] Event [ProcessingQueueCapacityAvailableEvent] is being published on thread CoreThread to subscriber: TestPipelineManager.☻
[10:24:39.5298-Core-117] Event [ProcessingQueueCapacityAvailableEvent] is being processed on Core thread with subscriber: TestPipelineManager.☻
[10:24:39.5338-Core-82] Event [TaskProcessedEvent] is being processed on Core thread with subscriber: StatusIndicatorSynchroniser.☻
[10:24:39.5338-Core-82] Event [NullEvent] is being processed on Core thread with subscriber: BufferedEventSubscription`1.engageTimer
[10:24:39.5528-?-80] Process with id 2424 has exited
[10:24:40.0453-Core-72] Event [♣] is being processed on Core thread with subscriber: EngineStateNotifier.☻
[10:24:40.0533-Core-72] Publishing Event: [EngineStateNotificationEvent:4 tests are queued for execution.  Monitoring 14117 tests (154 failing), with 11 tests ignored]
[10:24:40.0533-Core-72] Publishing Event: [StaticEngineEventsClearedEvent]
[10:24:40.0533-Core-72] Event [NullEvent] is being processed on Core thread with subscriber: BufferedEventSubscription`1.engageTimer
[10:24:40.4704-Core-84] Event [TestDataUpdatedEvent] is being processed on Core thread with subscriber: MetricsTreeSynchroniser.☻
[10:24:40.4704-Core-84] Event [NullEvent] is being processed on Core thread with subscriber: BufferedEventSubscription`1.engageTimer[10:26:39.5382-?-70] Publishing Event: [☼:nCrunch.Client.Model.LineMapDeallocator.♥]
[10:26:39.5382-?-70] Event [☼:nCrunch.Client.Model.LineMapDeallocator.♥] is being published on thread CoreThread to subscriber: ☻.☻
[10:26:39.5392-Core-56] Event [☼:nCrunch.Client.Model.LineMapDeallocator.♥] is being processed on Core thread with subscriber: ☻.☻


It appears that there are workers free, but 4 tests aren't being executed. I have an NCrunch timeout of 10 seconds configured, so it can't be a long-running test that's the problem. How can I figure out what's preventing these tests from running?

The timeout was output here:

Code:
[09:26:04.6333-Core-7] Default test timeout = '10000'
Remco
#2 Posted : Tuesday, November 17, 2020 11:30:45 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
Hi, thanks for sharing this problem.

Quote:
[10:24:39.5298-Core-117] Checking for tasks to launch (1 out of 5 concurrent tasks are active)


This particular message indicates that there is a task hanging or running well over time. It might be possible to identify the specific task by increasing your log verbosity, but the easiest option may simply be to take out the hammer: Wait for the hang, then open task manager and kill every NCrunch BuildHost or TestHost process running on the machine. When you've killed the task, the engine should in theory kick up an error and complete the run. You'll then have more detailed reporting to help identify which task is hanging.

The test timeout handling in NCrunch is pretty resilient these days, but it's still possible for test environments to circumvent it if they are particularly unstable. It's quite possible you have a test hanging in a manner that the engine can't interrupt with timeout enforcement.
ecargo
#3 Posted : Wednesday, November 18, 2020 1:31:25 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 8/17/2015(UTC)
Posts: 3
Location: New Zealand

Hi Remco,

I killed the child processes and inspected the timeline, which made it quite obvious which tests were holding up the engine. Do you have documentation on how NCrunch timeouts work so that I can try to figure out why they didn't take effect in this case?

Thanks,
Bart
Remco
#4 Posted : Wednesday, November 18, 2020 8:34:58 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
Hi Bart,

NCrunch enforces timeouts in two different ways:

1. ThreadAbortException: NCrunch hits a hanging test thread with ThreadAbortException, causing it to fail its execution and report a stack trace. This is not performed under .NET Core (which does not support ThreadAbortException).
2. Process Kill: If the above doesn't work within the expected timeframe or it isn't supported, the test process will send an IPC message to the NCrunch engine requesting that the engine kill the test process.

That's the theory at least. In practice, it's not so simple:

- Because of the way that async blocks are used heavily through the more modern versions of NUnit and Xunit, it can be difficult to effectively pin down all the threads that are involved in the execution of tests within the process. It the test framework is executing threads that we cannot effectively manage, a safe abort may be unknowable and impossible. The process can end up in an inconsistent or unstable state.

- At a system level, ThreadAbortException is dangerous. Threads can be interrupted doing all kinds of unexpected things. MS tried to move .NET away from ThreadAbortException for good reason. However, we still prefer it as this approach increases the likelihood that we can obtain useful information about why the test was timing out (i.e. stack trace, trace output, etc).

- Due to the mechanics of integration with the test frameworks, there are race conditions that exist with timeout enforcement that are simply impossible for us to completely eliminate.

- Regardless of which of our two approaches is being used, the test process must have a certain level of structural stability to be able to enforce a timeout. For example, a process suffering from heap corruption may be unable to make the necessary allocations to be able to send a message to the hosting engine to trigger the kill command.

In the vast majority of cases, we can safely abort a test run and return a useful result indicating where the timeout happened. But this is not guaranteed. Test code is capable of calling into unmanaged APIs which could be doing any number of unknowable things where a ThreadAbortException or Process Kill could leave things in an inconsistent state. We do our best here but perfection is impossible. I recommend making sure you still have processes in place where you can recover your systems in the event of a failed timeout enforcement, and where possible try to write your code in a way that reduces the risk of timeouts/hanging in general.
bartj
#5 Posted : Monday, December 7, 2020 5:28:52 PM(UTC)
Rank: Member

Groups: Registered
Joined: 12/4/2013(UTC)
Posts: 26
Location: New Zealand

Thanks: 2 times
Was thanked: 3 time(s) in 3 post(s)
Thanks for the detailed explanation.

While in an ideal world, tests would never time out (and of course they would always pass), in the messy world of legacy software, tests will not complete from time to time - particularly when under heavy concurrent load. This concurrency is part of what we use NCrunch for. We currently use NUnit single threaded on the CI server because it is reliable and works well, however we want to be able to complete a console test run locally leveraging NCrunch's faster execution times through concurrency.

Could you please explain what we might be doing that could prevent the NCrunch process from killing a child process? This seems like something that is always possible, unless the process has frozen somehow in the Windows kernel. I also haven't seen any logging that would suggest that the kill was attempted and failed. Any hints on what I should be looking for in the logs?
Remco
#6 Posted : Tuesday, December 8, 2020 3:24:12 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
bartj;15161 wrote:

While in an ideal world, tests would never time out (and of course they would always pass), in the messy world of legacy software, tests will not complete from time to time - particularly when under heavy concurrent load. This concurrency is part of what we use NCrunch for. We currently use NUnit single threaded on the CI server because it is reliable and works well, however we want to be able to complete a console test run locally leveraging NCrunch's faster execution times through concurrency.


Understood! When you think about it, the act of running tests over untested code is all about trying to surface these kinds of issues. If our testing systems can pick up the issues, then we can keep them out of production environments. It's unfortunate that the stresses we inflict with our testing can also affect the stability of the environment we use for this testing, and as such the act of testing our software in this manner is a constant battle against intermittent problems that can be hard to pin down. If you haven't used it already, I strongly recommend making use of the churn mode feature to try and selectively trigger these sorts of problems in a desktop session where they may be easier to analyse. It's no fun when the CI blows up or hangs for an issue that only appears when you are least able to inspect it.

bartj;15161 wrote:

Could you please explain what we might be doing that could prevent the NCrunch process from killing a child process? This seems like something that is always possible, unless the process has frozen somehow in the Windows kernel. I also haven't seen any logging that would suggest that the kill was attempted and failed. Any hints on what I should be looking for in the logs?


Unfortunately, I have no way to answer this question, as to do so would involve random speculation on the workings of your code and the functioning of your entire environment.

In terms of trying to flush out and understand these problems, here are my suggestions:
- Churn mode is a great way to make intermittent problems consistent. Use it regularly. Churn selective parts of your test suite that are suspected to contain intermittent issues.
- You can actually implement your own timeout handling inside your test code. For example, when the test starts, try starting a background thread that will make a call to System.Diagnostics.Debugger.Launch() if the test thread doesn't reach the end of the test in an allowable time. By doing this, you can interrupt your own code in the process of a timeout (before NCrunch attempts any enforcement), and have a good chance of understanding why it happened.
- Turn on the terminate test runners on complete setting, then let the run continue until it hangs. The only TestHost processes that remain are categorically hung, because otherwise they would be terminated. You can hook the VS debugger onto these processes to examine their state and try to understand what happened.
- Try to build resilience into your test architecture. Add levels of exception handling and recovery. Instead of launching background threads directly from your production code, implement custom thread dispatchers that can be controlled by the tests, then track the background threads so you can identify and act on situations where these threads don't behave as they should. Your tests are clearly not just isolated unit tests, they are big, functional, complex and very valuable. Treat their operation with suspicion so that you can be more aware when race conditions appear and the code doesn't behave as it should.
- NCrunch.Framework.IsolatedAttribute is your friend. It works absolute magic when executing tests that might leave the process in an inconsistent state and affect tests executed later in the run.
bartj
#7 Posted : Thursday, December 10, 2020 12:39:06 AM(UTC)
Rank: Member

Groups: Registered
Joined: 12/4/2013(UTC)
Posts: 26
Location: New Zealand

Thanks: 2 times
Was thanked: 3 time(s) in 3 post(s)
Thanks, Remco.

I think I mustn't have worded my question clearly enough. It was in two parts:

1. I don't see any logging from NCrunch saying that it's tried and failed to kill my process. How do I know that NCrunch has in fact tried and failed to kill the process as opposed to not even trying for some unknown reason, potentially a bug in NCrunch?

2. Given that I have encountered very few scenarios where a process cannot be killed (e.g. when stuck in the kernel), and the fact that I can kill them manually, why can these child processes not be killed when the timeout is reached?
Remco
#8 Posted : Thursday, December 10, 2020 1:45:41 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
bartj;15170 wrote:

1. I don't see any logging from NCrunch saying that it's tried and failed to kill my process. How do I know that NCrunch has in fact tried and failed to kill the process as opposed to not even trying for some unknown reason, potentially a bug in NCrunch?


In the current latest release, the logging information captured from a process is only returned after the call into the test framework is completed. This means that you won't have the log data from the hung processes appearing in the log system.

We have a new system we are dogfooding at the moment that involves the processes sharing a common log using memory mapped files. This is expected to improve log handling for this sort of thing. However, I'm not optimistic this will help much with your problem as there is a good chance the state of your process is preventing timeout enforcement from properly taking place.

bartj;15170 wrote:

2. Given that I have encountered very few scenarios where a process cannot be killed (e.g. when stuck in the kernel), and the fact that I can kill them manually, why can these child processes not be killed when the timeout is reached?


As described, the logic to enforce a timeout is housed within the test process itself and is subject to whatever is happening in that process. If the process is halted, unstable, corrupted, or generally misbehaving, then this can prevent the call to terminate the process from taking place. It is presently not possible for the engine to arbitrarily enforce a timeout without advise from the test process, as it is the test process that has control over the execution run and coordination with the framework.

If you believe you have encountered a bug within NCrunch that prevents timeout enforcement from occurring when it should reasonably be expected to happen, we'll need very detailed information (and likely a code sample solution) to have any chance of reproducing and resolving such a problem. One way or another, you'll need to delve into the hung processes to understand what is happening here. I can't presently provide you with a direct answer or fix that will solve this for you. Sorry.
bartj
#9 Posted : Thursday, December 10, 2020 5:10:11 PM(UTC)
Rank: Member

Groups: Registered
Joined: 12/4/2013(UTC)
Posts: 26
Location: New Zealand

Thanks: 2 times
Was thanked: 3 time(s) in 3 post(s)
Having some visibility over what's happening in the process will be great. The shared memory approach is one that I've used in the past, and it worked really well for allowing real-time visibility into the current state of a process by an external observer.

I'll reply again on this thread if I get a chance to reproduce the issue with a small example, although I'd imagine that will be a time consuming process. I'm almost 100% certain that we're not corrupting .NET or the test runner process in a way that no code is able to be run.
Users browsing this topic
Guest
Forum Jump  
You cannot post new topics in this forum.
You cannot reply to topics in this forum.
You cannot delete your posts in this forum.
You cannot edit your posts in this forum.
You cannot create polls in this forum.
You cannot vote in polls in this forum.

YAF | YAF © 2003-2011, Yet Another Forum.NET
This page was generated in 0.073 seconds.
Trial NCrunch
Take NCrunch for a spin
Do your fingers a favour and supercharge your testing workflow
Free Download