Grid Server keeps disconnecting
MatthewSteeples
#1 Posted : Sunday, July 23, 2017 8:52:14 PM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 10/28/2014(UTC)
Posts: 142
Location: United Kingdom

Thanks: 7 times
Was thanked: 19 time(s) in 17 post(s)
Just updated all of our grid nodes and clients to 3.10 and we're getting an error while building the project which causes the client and server to disconnect (which then restarts the process on reconnection, which fails again).

The log shows the following:

Code:
[16:52:55.8884-?-15] Ceasing to send messages because of an error (was the connection closed?): System.IO.IOException: Unable to write data to the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
   at System.Net.Sockets.Socket.EndSend(IAsyncResult asyncResult)
   at System.Net.Sockets.NetworkStream.EndWrite(IAsyncResult asyncResult)
   --- End of inner exception stack trace ---
   at System.Net.Sockets.NetworkStream.EndWrite(IAsyncResult asyncResult)
   at nCrunch.Core.Grid.Connectivity.GridMessageSender.(IAsyncResult )
[16:52:55.9294-NodeProcessor-18] The build task runner process has been terminated.
[16:52:55.9304-NodeProcessor-4] The task runner process has been terminated.


It happens on more than one client and on all of the grid nodes. We've left the defaults (so haven't re-enabled compression). I'm not sure what else I can pass on to help diagnose this, so if there's anything, let me know.
Remco
#2 Posted : Monday, July 24, 2017 12:52:47 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
Hi, thanks for sharing this problem.

What level of consistency do you see with this problem? Does it happen every 60 seconds or so? Does the grid server manage to run any tests?

Is there any chance you could submit a bug report after it happens to you? v3.10 included some reworking of the timeout system. If the timeout system is not working correctly, then the bug report log file might show this.
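
To answer the "every 60 seconds or so?" question from a log file, a minimal Python sketch is shown here. It assumes the line prefix looks like the excerpt earlier in the thread (`[HH:MM:SS.ffff-name-id]`); reading the second and third fields as a thread name and thread id is a guess, and the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Prefix as seen in the excerpt, e.g. "[16:52:55.8884-NodeProcessor-18] ...".
# Interpreting the 2nd and 3rd fields as thread name / thread id is an assumption.
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\.(\d+)-(.*?)-(\d+)\]\s*(.*)$")

# Text that marks a disconnect in the quoted log.
MARKER = "Ceasing to send messages"


def parse_time(hms: str, frac: str) -> timedelta:
    """Convert 'HH:MM:SS' plus fractional digits into an offset from midnight."""
    t = datetime.strptime(hms, "%H:%M:%S")
    return timedelta(hours=t.hour, minutes=t.minute,
                     seconds=t.second + float("0." + frac))


def disconnect_times(lines):
    """Return the time of day of every line that contains the disconnect marker."""
    times = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and MARKER in m.group(5):
            times.append(parse_time(m.group(1), m.group(2)))
    return times


def intervals_seconds(times):
    """Gaps between consecutive disconnects (does not handle a log spanning midnight)."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Feeding it the lines of a node log and printing `intervals_seconds(disconnect_times(lines))` should show whether the gaps cluster around a fixed value (which would point at a timeout) or vary.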
MatthewSteeples
#3 Posted : Monday, July 24, 2017 1:56:02 PM(UTC)
Just submitted a bug report. Not sure how helpful it will be, as the problem looks to be in the grid node (and I'm not sure how to submit a bug report for that).

It happens every 3-4 minutes. Basically NCrunch loads the snapshots and starts building and running tests before crashing. Because of our project structure, we have some test projects that don't depend on the whole solution being built (which is why it manages to run some tests first). When it resumes after crashing, it starts building everything again (and gets stuck in that loop).

I would say it looks more like a crash than a timeout. Looks like it happens consistently in the same place.

Remco
#4 Posted : Tuesday, July 25, 2017 12:18:40 AM(UTC)
Thanks for sending through the logs. This is quite puzzling.

The logs show the problem as occurring on the grid node. When attempting to send a routine message through the socket, the node gets hit with an exception bubbling up from down the stack. It's as though the connection was forcefully closed from outside the application/service, though in actual fact, the connection is still open because a later message can still be sent through.

This really doesn't make much sense to me. Despite heavy use of the same socket code, I can't reproduce the error at all. There don't appear to be any race conditions involved and there is no more specific error information. It just explodes ...

I'm wondering if we can try shuffling things in your environment a bit to try and figure out a way to reproduce the problem. The first thing worth trying is to re-enable data compression on the grid node(s) and clients. To do this, you need to turn on the 'CompressGridDataOverNetwork' global configuration setting in the NCrunch client. On the grid node server, you'll need to use a registry editor to turn this on (it's not listed in the configuration tool). Go to "HKEY_LOCAL_MACHINE\SOFTWARE\Remco Software\NCrunch Grid Node", create a new key with the name 'CompressDataOverNetwork' and value 'True'. Make sure you restart the node after you've done this.

This will effectively take you back to v3.9 in terms of the data exchanged, with the exception of timeout handling (which doesn't appear to be a problem in this case).
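
For scripting the node-side change across several machines, a rough Python sketch follows. The key path and name come from the paragraph above; writing it as a string (`REG_SZ`) value via the standard Windows `reg add` command is an assumption, since the post only says the value is 'True'. The function name and the `--apply` switch are hypothetical.

```python
import subprocess
import sys

# Key path and value name are taken from the post above.
KEY = r"HKLM\SOFTWARE\Remco Software\NCrunch Grid Node"
VALUE_NAME = "CompressDataOverNetwork"


def build_reg_add_command(enabled: bool = True, value_type: str = "REG_SZ"):
    """Build the argument list for 'reg add'.

    REG_SZ (a string value) is an assumption; the post only gives the value 'True'.
    """
    return [
        "reg", "add", KEY,
        "/v", VALUE_NAME,
        "/t", value_type,
        "/d", "True" if enabled else "False",
        "/f",  # overwrite an existing value without prompting
    ]


if __name__ == "__main__":
    cmd = build_reg_add_command()
    # Show the command so it can be reviewed before anything is changed.
    print(" ".join(f'"{a}"' if " " in a else a for a in cmd))
    # Only touch the registry when explicitly asked to, and only on Windows.
    if sys.platform == "win32" and "--apply" in sys.argv:
        subprocess.run(cmd, check=True)  # needs an elevated prompt
```

Without `--apply` the script only prints the command. The node still needs restarting afterwards, as noted above.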

The next thing to try is to see if the problem appears when running the grid node locally on your own machine. Try installing the grid node on your workstation, running ncrunch.gridnode.console.exe, then connecting to the loopback connection (localhost). It would be interesting to see whether the problem still appears without the complexity of the network involved.
MatthewSteeples
#5 Posted : Tuesday, July 25, 2017 10:23:17 AM(UTC)
Hi Remco,

The problem still occurs with compression enabled, and when the grid node is running locally on 127.0.0.1.

There are no issues whatsoever when running within Visual Studio; everything builds and runs fine. It's just the grid nodes that have an issue.
Remco
#6 Posted : Tuesday, July 25, 2017 11:45:11 AM(UTC)
Do you have any other solutions that you can test the grid with? I'm wondering if this problem may in some way be data related.
MatthewSteeples
#7 Posted : Tuesday, July 25, 2017 12:14:50 PM(UTC)
I've tried a couple of other ones and they seem to load fine. I don't have any others that are the scale of this one (2000+ tests, 300+kloc)

I don't know if this shows in the logs at all, but there's a possibility that it's not transferring all of the files across. Whenever it resets, it always has the same number of files to transfer, and while that number can be seen counting up on other projects, it doesn't seem to do any counting for this one.
MatthewSteeples
#8 Posted : Tuesday, July 25, 2017 12:26:39 PM(UTC)
Just discovered that this problem is only on 2 of our 3 active grid nodes. Trying to work out what's different about the one that works
Remco
#9 Posted : Tuesday, July 25, 2017 11:11:01 PM(UTC)
MatthewSteeples;10834 wrote:
Just discovered that this problem is only on 2 of our 3 active grid nodes. Trying to work out what's different about the one that works


Thanks, I'd be really interested to know this. I'm still at a bit of a loss here on how to reproduce this problem. Any extra data you can give me to help narrow this down would be of huge help.
MatthewSteeples
#10 Posted : Wednesday, July 26, 2017 4:42:46 PM(UTC)
Afraid I've not been able to find any differences yet. Same build of Windows 10, same build of Visual Studio, both on SSDs, have cleared out both NCrunch Snapshot folders.

The only thing it could be is that the two that don't work have _previously_ had preview builds of VS installed. They're due a re-install anyway, so hopefully that will fix it.

One thing I hadn't noticed before today was that I got this exception (not sure whether it's related):

Code:
System.NullReferenceException: Object reference not set to an instance of an object.
   at nCrunch.Core.Processing.RuntimeTaskFacilitator.()
   at nCrunch.Core.Processing.RuntimeTaskFacilitator.PrepareForProcessing()
   at nCrunch.Core.Processing.TestExecutionTaskLogic.PrepareForProcessing()
   at nCrunch.Common.ErrorHandler.DoWithErrorHandling(Action action, Object context)
Remco
#11 Posted : Wednesday, July 26, 2017 11:07:17 PM(UTC)
The strangeness of this behaviour suggests a possibility that something is going wrong deep down in the stack. If you haven't already, I'd suggest making sure all the Windows updates and .NET Framework updates have been installed on the grid nodes that are showing the problem.

The exception you've quoted above won't be related to the loss of connectivity. This looks to be a fairly internal problem related to a testing task and is in no way connected to the network code. Usually the error handling will kick this back and the engine should continue.
Grendil
#14 Posted : Sunday, July 30, 2017 6:19:53 PM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 3/18/2017(UTC)
Posts: 54
Location: United States of America

Thanks: 22 times
Was thanked: 11 time(s) in 10 post(s)
I'm seeing what might be a similar issue. In the Distributed Processing window I'll see exclamation marks appear next to node names, and when I go and look at the error tab I'll see "Object reference not set to an instance of an object" as the logged error there. If I click it I see
Code:
System.NullReferenceException: Object reference not set to an instance of an object.
   at nCrunch.GridNode.ServerNode..()
   at nCrunch.Common.ErrorHandler.DoWithErrorHandling(Action action, Object context)

I haven't yet spotted that error in the Node server logs (set to Detailed), so perhaps it's coming from the client?

Possibly related: I'm also seeing some confusing feedback which seems like things are getting out of step between my dev machine and the nodes. At times all tasks in the processing queue are "Pending" (none "Processing"), while the node itself looks idle except for some Ping messages in its log. Other times the node status is "Online" while it's actively processing tasks, which I had thought would correspond with a "Processing" status. And still other times nodes have become stuck "Negotiating", and I've ended up power cycling the VM to clear that. (I believe I first tried merely restarting the service, but that didn't always resolve it.) It has also worked flawlessly for much of the time, so it's been a challenge to sort out why it stops. These nodes are new VMs our IT staff spun up for this purpose. I'm wondering if we're getting low-level failures in the virtual environment stack, but I don't know how to pin down the issue.
Remco
#15 Posted : Sunday, July 30, 2017 10:16:54 PM(UTC)
Hi Grendil,

The problems you've described I think are quite different in nature to Matthew's. The exception you've provided looks to be thrown in the server side processing code. I don't think that this error is network related. Probably you've managed to find a flaw in this code that may be surfaced by a certain sequence of actions or possibly the structure of your solution. How often do you see this error? Has it started occurring recently, or has it been going on for some time?

I may have a more concrete explanation for the problem of nodes being stuck in 'Negotiating'. When NCrunch runs tests on the grid node, it spawns a number of child processes (i.e. nCrunch.TestHost, nCrunch.BuildHost, etc). Under later versions of VS, these child processes can indirectly spawn new child processes through the tool stack (for example, building a project always starts up VBCSCompiler.exe). These third-tier child processes don't have a lifespan under NCrunch's control, but are often still considered to be children of the grid node root process in the Windows process hierarchy. If a process is suddenly terminated or restarted without completely closing down open sockets, Windows will usually close the sockets itself when the process terminates. However, there is an exception to this if the process being terminated has children that are still active, in which case Windows keeps the sockets reserved for some reason and anything remotely connecting to them gets a 'ghost' connection.

I've yet to find a clean way to handle the above scenario. The grid node does clean up its socket connections on termination, but sometimes a forceful restart may kick in before this happens. Keep an eye out for this and see if it might be the cause of some of your problems.
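
As a first-pass check when a node looks stuck, a small Python sketch can tell "connection refused" apart from "something accepted the connection". The function name is hypothetical and the port has to be whatever the grid node is configured to listen on. Note that "open" only means the TCP handshake completed; a 'ghost' socket of the kind described above would presumably also report "open", so this narrows the problem down without proving the node is healthy.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        # create_connection performs the TCP handshake; the socket is closed on exit.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is listening on the port (the "actively refused" case).
        return "refused"
    except socket.timeout:
        # No answer within the timeout, e.g. packets dropped by a firewall.
        return "timeout"
    except OSError as exc:
        # Any other socket-level failure, e.g. an unresolvable host name.
        return f"error: {exc}"
```

For example, `probe("localhost", port)` run on the node itself and then from the client machine gives a quick comparison of what each side sees.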
Grendil
#16 Posted : Monday, July 31, 2017 12:22:56 AM(UTC)
Thanks Remco for the continued top notch support. I much appreciate your prompt replies and I regret sending so many questions your way.
Remco;10894 wrote:
The problems you've described I think are quite different in nature to Matthew's. The exception you've provided looks to be thrown in the server side processing code. I don't think that this error is network related. Probably you've managed to find a flaw in this code that may be surfaced by a certain sequence of actions or possibly the structure of your solution. How often do you see this error? Has it started occurring recently, or has it been going on for some time?

We are just getting started with trying out the node farm and I've been seeing it since we started. It does seem to happen more often after we've first had a few successful test runs. We're not yet at the point of continuous testing, but have instead been mostly manually kicking off SpecFlow test suite trials and using NCrunch to parallelize the load. In terms of possibly unusual usage, those tests run for a minute or longer, and use Selenium to drive Firefox. This does all seem to work though.

Part of why I thought it might be network related is because today I've been working via VPN, and my connection has seemed slow and slightly unstable. At the same time I've been seeing that Object Not Set error pop up on all of the nodes (but not simultaneously) even early on while they were still first "Initializing". The prior occurrences however were while I was on the office LAN.

Remco;10894 wrote:
I may have a more concrete explanation for the problem of nodes being stuck in 'Negotiating'. When NCrunch runs tests on the grid node, it spawns a number of child processes (i.e. nCrunch.TestHost, nCrunch.BuildHost, etc). Under later versions of VS, these child processes can indirectly spawn new child processes through the tool stack (for example, building a project always starts up VBCSCompiler.exe). These third-tier child processes don't have a lifespan under NCrunch's control, but are often still considered to be children of the grid node root process in the Windows process hierarchy. If a process is suddenly terminated or restarted without completely closing down open sockets, Windows will usually close the sockets itself when the process terminates. However, there is an exception to this if the process being terminated has children that are still active, in which case Windows keeps the sockets reserved for some reason and anything remotely connecting to them gets a 'ghost' connection.

I've yet to find a clean way to handle the above scenario. The grid node does clean up its socket connections on termination, but sometimes a forceful restart may kick in before this happens. Keep an eye out for this and see if it might be the cause of some of your problems.

Ah, this makes some sense. Indeed we're on VS 2017. I don't mind some kind of additional workaround process to mitigate that issue, but what should we do? Is there anything better than restarting the VM when this happens? Is there any way to restrict our usage to lessen the chance we trigger it? It would be much better if we didn't have to restart the NCrunch service, but could instead manually kill the 'ghost' somehow.
Remco
#17 Posted : Tuesday, August 1, 2017 5:15:17 AM(UTC)
Thanks for the extra details and for sending through your server logs. With the data you've given me, I've managed to reproduce the server-side exception problem and I have implemented a fix for it. I'll try to get you a build including this fix ASAP.
Remco
#18 Posted : Wednesday, August 2, 2017 12:51:45 AM(UTC)
Grendil
#19 Posted : Thursday, August 3, 2017 11:54:34 PM(UTC)
I am trying out the new build. I updated my VS to use it and confirmed the new version in the About box. I updated a few nodes to use it, and confirmed their installed version in Add/Remove Programs. But now my VS gets a connection failure (target machine actively refused) when hitting the upgraded nodes. The nodes I didn't upgrade continue to connect fine.
Remco
#20 Posted : Friday, August 4, 2017 3:04:35 AM(UTC)
Do you receive the same result when trying to connect to your local machine running the grid node using ncrunch.gridnode.console.exe? This build doesn't have much changed over v3.10. Mostly it's just a null check to avoid the NRE. There must be another variable here somewhere.
Grendil
#21 Posted : Friday, August 4, 2017 4:35:00 PM(UTC)
Remco;10920 wrote:
Do you receive the same result when trying to connect to your local machine running the grid node using ncrunch.gridnode.console.exe? This build doesn't have much changed over v3.10. Mostly it's just a null check to avoid the NRE. There must be another variable here somewhere.


Ah, I thought we had already rebooted the VMs after the install, but we had not. So Windows was perhaps keeping that ghost connection open, preventing connection to the service. In any case, rebooting the node VMs resolved this.
MatthewSteeples
#12 Posted : Monday, August 7, 2017 7:30:47 PM(UTC)
Remco;10859 wrote:
The strangeness of this behaviour suggests a possibility that something is going wrong deep down in the stack. If you haven't already, I'd suggest making sure all the Windows updates and .NET Framework updates have been installed on the grid nodes that are showing the problem.

The exception you've quoted above won't be related to the loss of connectivity. This looks to be a fairly internal problem related to a testing task and is in no way connected to the network code. Usually the error handling will kick this back and the engine should continue.


All devices are fully up to date in terms of Windows Updates (which now include .NET Framework updates) and Visual Studio editions. I got a different error today in the Distributed Processing pane:

Code:
System.InvalidOperationException: Process has exited, so the requested information is not available.
   at System.Diagnostics.Process.EnsureState(State state)
   at System.Diagnostics.Process.get_WorkingSet64()
   at nCrunch.Core.ProcessManagement.ExternalProcessManager.StopUsingProcess(ExternalProcess process, ProcessPoolType processType)
   at nCrunch.Core.BuildManagement.BuildProcessLauncher.(Action`1 , ProcessorArchitecture , GridClientId , BuildSystemParameters , IList`1 )
   at nCrunch.Core.BuildManagement.BuildProcessLauncher.BuildComponentInExternalProcess(ComponentBuildParameters parameters, VisualStudioVersion vsVersion, GridClientId client, IList`1 customEnvironmentVariables)
   at nCrunch.Core.BuildManagement.BuildEnvironment.Build(SnapshotComponent snapshotComponentToBuild, IList`1 referencedComponents, GridClientId gridClientId, IList`1 customEnvironmentVariables, IPlatformBuildExtender extender)
   at nCrunch.Core.Processing.BuildTaskLogic.DoProcessTaskAndReturnSuccessFlag()
   at nCrunch.GridNode.NodeTaskProcessor..()
   at nCrunch.Common.ErrorHandler.DoWithErrorHandling(Action action, Object context)


Not sure whether that helps at all.