Console Tool never exits
CsabaFabian
#1 Posted : Monday, October 10, 2016 8:17:52 AM(UTC)
This issue appeared very recently and is reproducible both on our build servers and locally. NCrunch _seems_ to run all the tests, but then it just hangs for an hour until the build system kills it and fails the build.

Reproduced with 2.27.0.1 and 2.27.0.3.
Remco
#2 Posted : Monday, October 10, 2016 11:14:22 AM(UTC)
Hi, thanks for sharing this issue.

Does this happen with every execution run? Are you using distributed processing?
CsabaFabian
#3 Posted : Monday, October 10, 2016 1:47:00 PM(UTC)
Yes, we are using distributed processing. It happens with _nearly_ every run.

In fact, I thought I had fixed the issue earlier. I made sure that I didn't have both .NET 4.6.1 and .NET 4.6.2 projects, and then I had one successful run. However, it was the exception, not the rule. I've had at least 5 hanging runs since.
Remco
#4 Posted : Monday, October 10, 2016 9:59:37 PM(UTC)
Let's see what the log file can tell us!

Try:

NCrunch.exe -LogVerbosity Detailed > MyLog.txt

The log file will be very big, but the tail end of it could be very revealing. Look out for log messages at the end of the log related to 'EngineWorkCompletionMonitor'. This is a component responsible for monitoring the engine and informing it when the end-to-end run is complete. The messages it emits can often be useful in identifying where the engine has hung.
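If the file is too large to scan by eye, a quick throwaway program like the sketch below can pull those messages out of the tail of the log. This is not an NCrunch utility, just an illustration; the class name is invented and "MyLog.txt" is simply whatever file you redirected the output into.

// A quick throwaway filter (illustration only, not part of NCrunch) that pulls
// the EngineWorkCompletionMonitor messages out of the tail of a large log.
using System;
using System.IO;
using System.Linq;

class FilterEngineMonitorMessages
{
    static void Main(string[] args)
    {
        string logPath = args.Length > 0 ? args[0] : "MyLog.txt";

        var matches = File.ReadLines(logPath)
            .Where(line => line.Contains("EngineWorkCompletionMonitor"))
            .ToList();

        // The last few entries usually explain why the engine still thinks
        // the run is incomplete.
        foreach (var line in matches.Skip(Math.Max(0, matches.Count - 20)))
            Console.WriteLine(line);
    }
}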
CsabaFabian
#5 Posted : Thursday, October 13, 2016 11:42:11 AM(UTC)
Sorry for the delay. I'll give it a go right now...
CsabaFabian
#6 Posted : Thursday, October 13, 2016 12:52:11 PM(UTC)
I am seeing lots of lines like these:
Line 179653: [14:25:17.5946-Core-120] Event [TaskProcessedEvent] is being published on thread CoreThread to subscriber: EngineWorkCompletionMonitor.
Line 179678: [14:25:17.6102-Core-120] Event [TaskProcessedEvent] is being processed on Core thread with subscriber: EngineWorkCompletionMonitor.
Line 179679: [14:25:17.6102-Core-120] EngineWorkCompletionMonitor: Rejecting work complete state because local tasks are still in progress
Line 180270: [14:26:27.3276-Core-49] Event [TaskProcessedEvent] is being published on thread CoreThread to subscriber: EngineWorkCompletionMonitor.
Line 180295: [14:26:27.3276-Core-49] Event [TaskProcessedEvent] is being processed on Core thread with subscriber: EngineWorkCompletionMonitor.
Line 180296: [14:26:27.3276-Core-49] EngineWorkCompletionMonitor: Rejecting work complete state because local tasks are still in progress
CsabaFabian
#7 Posted : Thursday, October 13, 2016 1:21:51 PM(UTC)
I noticed in the log that the console tool was unsuccessfully trying to connect to two Windows 7 nodes.
- One had restarted during the night (we run the nodes with an active session, not as services, since we run UI tests on them); I restarted it.
- The other Win7 node is not yet configured, so it isn't running, but it was still present in the solution file; I removed it.

After performing the above, I'm getting the following in the log:
Line 128420: [15:15:04.533-Core-11] Event [TaskProcessedEvent] is being processed on Core thread with subscriber: EngineWorkCompletionMonitor.
Line 128421: [15:15:04.533-Core-11] EngineWorkCompletionMonitor: Rejecting work complete state because local tasks are still in progress
Line 128593: [15:15:07.1382-Core-7] Event [TaskProcessedEvent] is being published on thread CoreThread to subscriber: EngineWorkCompletionMonitor.
Line 128618: [15:15:07.1382-Core-7] Event [TaskProcessedEvent] is being processed on Core thread with subscriber: EngineWorkCompletionMonitor.
Line 128619: [15:15:07.1382-Core-7] EngineWorkCompletionMonitor: Rejecting work complete state because tasks exist that can be processed on grid
CsabaFabian
#8 Posted : Thursday, October 13, 2016 2:40:34 PM(UTC)
I also realized that another node was out of disk space. So I ran an experiment using only 3 of the 12 nodes, the ones without any issues, and NCrunch finished nicely!

So I must ask: if a node configured in the solution file has run out of disk space, or isn't running in the first place, shouldn't it simply be ignored and the processing done on the other available nodes?
CsabaFabian
#9 Posted : Thursday, October 13, 2016 3:12:04 PM(UTC)
I ran another test, and indeed this seems to be the case: if a node is listed in the solution file and it isn't running, the NCrunch console tool never finishes, and the progress log entries seem to indicate that there is exactly one test that never completes.

This makes it difficult to maintain a larger number of nodes, as ALL of them have to be 100% functional at all times for continuous integration to run. Is there any workaround?
Remco
#10 Posted : Thursday, October 13, 2016 10:02:49 PM(UTC)
This is a good catch and I can see how this would happen. When a client has work in the queue that it thinks can be processed by a grid node but the grid node is not working properly, the client will wait until the grid node becomes available and executes the work before it considers the run to be complete.

The key point of interest here is that the client should only do this if there is unprocessed work in the queue. So even if there are grid nodes offline or in some kind of half-state, the client should only sit waiting if there is work that only these nodes can process.
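To make that a bit more concrete, here is a rough paraphrase in code of the waiting behaviour I'm describing. This is NOT NCrunch's actual implementation; every type and member name is invented purely for illustration.

// Rough paraphrase of the described behaviour (illustration only).
using System.Collections.Generic;
using System.Linq;

class GridNode
{
    public bool IsHealthy;                                        // connected and able to accept work
    public HashSet<string> Capabilities = new HashSet<string>();  // e.g. "Win7", "Win10"
}

class TestTask
{
    public bool IsProcessed;
    public HashSet<string> RequiredCapabilities = new HashSet<string>();
}

static class RunCompletionSketch
{
    // True when the run cannot finish yet: some unprocessed task exists whose
    // only eligible nodes are offline or stuck, so the client sits waiting
    // for those nodes instead of declaring the run complete.
    public static bool IsStuckWaitingOnUnhealthyNodes(
        IEnumerable<TestTask> queue, IReadOnlyCollection<GridNode> nodes)
    {
        return queue.Where(t => !t.IsProcessed)
                    .Any(task => nodes
                        .Where(n => n.Capabilities.IsSupersetOf(task.RequiredCapabilities))
                        .All(n => !n.IsHealthy));
    }
}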

Did the nodes that were offline have any special capabilities required by tests in your solution? I'm wondering why the unprocessed tasks weren't simply handed over to nodes that were working correctly.
CsabaFabian
#11 Posted : Friday, October 14, 2016 5:47:10 AM(UTC)
The short answer is "no".

This is the setup of my last test:
- 3 nodes online, one each on Win7, Win8.1 and Win10 (the OS specified as a capability)
- 1 node offline (NCrunch not running on the machine), on Win7
- the test suite, which has lots of tests distributed by capability, run using the console tool

The result is that the tool hangs waiting for one test to finish.
I suspect this could be a bug introduced in 2.27.x, but I am not sure. We also had fewer nodes before.
Remco
#12 Posted : Friday, October 14, 2016 8:52:26 AM(UTC)
It's possible that this is actually an error handling problem on the node itself. If a node picked up some work from the client, then bugged out and failed to 'return' the work in any kind of state, the client would wait forever while the node just sat in its failed state.

Do you have any record of the out-of-disk-space error that was kicked up by your grid node? I'm wondering where in the chain this happened... it's possible that it hit a spot where the error wasn't handled the way it should have been.

v2.27 did include a change to the tracking of grid processing, making the client wait for 60 seconds before declaring the work 'done' if a grid node was disconnected. I don't think this change is responsible for the issue you've encountered - it's probably something that went wrong on the node itself.
CsabaFabian
#13 Posted : Friday, October 14, 2016 9:00:27 AM(UTC)
Remco;9293 wrote:
Do you have any record of the out of disk space error that was kicked up by your grid node?


Sorry, I don't. I know it's silly of me, but I overwrote it afterwards and fixed the out-of-space condition. However, I am still struggling with this, so I think it's a more general problem.

Right now I have 11 functional nodes (out of the 12 in total), only these 11 are in the config, and NCrunch still hangs. I have the log files; I'll look at them now.
CsabaFabian
#14 Posted : Friday, October 14, 2016 9:32:33 AM(UTC)
Remco, I wouldn't post the log file publicly but I would be happy enough to send it to you privately, if you would like that?
Remco
#15 Posted : Friday, October 14, 2016 10:44:59 AM(UTC)
CsabaFabian;9295 wrote:
Remco, I wouldn't post the log file publicly but I would be happy enough to send it to you privately, if you would like that?


That would really be great. The log file from the grid node that has hung would also be really useful if you're able to dig it up. You can zip up the logs and send them through the contact form.
Remco
#16 Posted : Friday, October 14, 2016 10:57:13 PM(UTC)
Thanks for sending through the log.

According to the log, the console tool is not terminating the run because there is a task still being processed.

This could be one of the following:

1. A test that is taking an extraordinarily long time to execute. Although NCrunch does have timeout settings, the method of enforcing these timeouts is not 100% reliable. Sometimes, if a test is stalled inside certain APIs or critical code blocks, NCrunch cannot interrupt it and it may hang forever (see the sketch after this list).

2. A critical failure on a remote grid node that has caused it to 'swallow' the task without reporting it as failed or completed. I presently don't know of any scenarios in which this can happen, but I'm sure it's far from impossible. If you're having trouble with disk space on your nodes then this is definitely suspect.

3. A test task with a long aggregate execution time (i.e. the tests within it collectively just take a long time to run). I'm going to assume this isn't the case for you, because if it were, you would likely just wait patiently for the run to complete rather than take the time to report the problem here.
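To illustrate point 1, here is a purely hypothetical sketch (an assumption about why a timeout might not fire, not a statement about NCrunch internals): a managed thread abort cannot interrupt a thread while it is blocked inside unmanaged code, so a test stuck in a native call can hang indefinitely.

// Hypothetical sketch: a test body that blocks forever inside a native wait,
// where no managed timeout/abort can unwind it.
using System;
using System.Runtime.InteropServices;
using System.Threading;

static class StuckTestSketch
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern uint WaitForSingleObject(IntPtr handle, uint milliseconds);

    // Imagine this being the body of a test: it waits on a handle that is
    // never signalled, inside unmanaged code.
    public static void BlocksInNativeCode()
    {
        using (var neverSignalled = new ManualResetEvent(false))
        {
            const uint INFINITE = 0xFFFFFFFF;
            WaitForSingleObject(
                neverSignalled.SafeWaitHandle.DangerousGetHandle(), INFINITE);
        }
    }
}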

There are two ways to get a better handle on what is happening here:

1. Enable logging on the grid nodes. If a node is dying silently because of internal problems, this should show up in its own log.

2. Try doing some end-to-end runs with NCrunch over this solution using the NCrunch VS plugin instead of the console tool. Ensure you use the grid nodes. You can then use the NCrunch UI to analyse issues and identify which tests are stuck. In theory, if there is a grid or test processing issue that affects the console tool run, it should also affect NCrunch when it runs in VS.
CsabaFabian
#17 Posted : Saturday, October 15, 2016 1:47:30 PM(UTC)
Hi Remco, thanks for looking into it.

#1 and #3 cannot be the case as far as I can see. The usual run time for the test suite is 3-5 minutes.

I took your advice, did a run inside Visual Studio, and turned on detailed logging on the nodes.

The one test shown as hanging in the UI is called The_contact_list.shows_the_list_of_contacts. The logs show that in reality the test exited successfully. I sent you the log from that node through the contact form.

I'm curious to see the results of the investigation.

Csaba
Remco
#18 Posted : Sunday, October 16, 2016 12:09:01 AM(UTC)
Hi Csaba,

Thanks for sending through the node logs for this run.

I wasn't able to find any data in the log that showed this test finished its execution on the grid node. According to the log, the test execution started at 15:14:45 and the thread never came back.

I suspect that something is causing this test to hang. Does the test interact with anything outside of its own process? Is there any multithreaded code or complex conditions in the test that might cause a hang? Does the test run correctly if you execute it locally without the grid involved?
CsabaFabian
#19 Posted : Sunday, October 16, 2016 10:39:09 PM(UTC)
Promising news. There is a good chance that this issue is "gone".

Some of the tests run against a virtual file system driver, and I realized that one of the nodes still had an old version of this driver. I fixed this, and since then (10+ runs) I have had no hanging builds. The manual runs (also 10+) completed successfully too.

It is still a mystery how the two things are connected (the test hanging in the log I sent you was NOT using this driver and was NOT running on the faulty node). The best theory I could come up with is that perhaps the user-mode part of the driver spun up a foreground thread that never exited, and thus the test runner process could not exit either (a sketch of what I mean is below)...
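Purely hypothetical code, not the driver's actual implementation, just to show the mechanism: any foreground thread that never finishes keeps a .NET process alive, so the runner cannot terminate even after all tests have completed.

// Hypothetical sketch of a driver client keeping the runner process alive.
using System.Threading;

static class DriverClientSketch
{
    public static void StartWorker()
    {
        var worker = new Thread(() =>
        {
            while (true)              // never returns
                Thread.Sleep(1000);   // e.g. polling the driver's user-mode service
        });

        // Without this line the thread is a foreground thread, and the runner
        // process will not exit until it finishes - i.e. never.
        worker.IsBackground = true;
        worker.Start();
    }
}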

So fingers crossed...

Csaba
Remco
#20 Posted : Sunday, October 16, 2016 11:15:35 PM(UTC)
Good to hear that the build isn't hanging now, though as you've mentioned, this doesn't completely add up. I have a feeling that we haven't seen the last of this problem.

If you see the problem start appearing again, it may be worth trying to recreate it in a debug session so you can figure out where it's hanging. If the hanging patterns are too erratic to get a debugger on them, adding some logic to drop trace logs into a file may help to pinpoint exactly where in the code the hangups are.
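For example, something as simple as the sketch below is usually enough (the class name and output path are just placeholders):

// Drop timestamped markers into a file from the suspect code, then check which
// marker was the last one written before the hang.
using System;
using System.IO;

static class HangTrace
{
    static readonly object Gate = new object();

    public static void Mark(string message)
    {
        lock (Gate)
        {
            File.AppendAllText(@"C:\temp\hang-trace.log",
                $"{DateTime.Now:HH:mm:ss.fff} [thread {Environment.CurrentManagedThreadId}] {message}{Environment.NewLine}");
        }
    }
}

Calling HangTrace.Mark(...) at the start and end of the suspect operations will show the last point reached before things stopped moving.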