Grid Server keeps disconnecting
Remco
#13 Posted : Monday, August 07, 2017 11:14:43 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 4,872

Thanks: 649 times
Was thanked: 756 time(s) in 721 post(s)
MatthewSteeples;10933 wrote:

All devices are fully up to date in terms of Windows Updates (which now includes .NET Frameworks) and Visual Studio editions. Got a different error today in the Distributed Processing Pane


I'm certain this is something else. I don't think this is likely to be a real issue here. It looks like the task runner process was unexpectedly terminated (which can happen in the event of stack overflows, unstable test code, etc). I don't think it's worth chasing this particular problem unless you're seeing it disturbingly often and it seems to be having an effect on the engine.


MatthewSteeples
#22 Posted : Thursday, August 10, 2017 4:47:56 PM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 10/28/2014(UTC)
Posts: 48
Location: United Kingdom

Thanks: 3 times
Was thanked: 3 time(s) in 3 post(s)
Is there anything we can do to send any more debug information through? This is becoming quite an issue for us now, and we're going to have to investigate rolling back if we can't get it resolved soon :(
Remco
#23 Posted : Thursday, August 10, 2017 8:22:59 PM(UTC)
MatthewSteeples;10955 wrote:
Is there anything we can do to send any more debug information through? This is becoming quite an issue for us now, and we're going to have to investigate rolling back if we can't get it resolved soon :(


It may be worth giving the build below a try:

http://downloads.ncrunch.net/NCrunch_GridNodeServer_3.11.0.2.msi
http://downloads.ncrunch.net/NCrunch_GridNodeServer_3.11.0.2.zip

This build contains some extra error handling and fallback code to deal with a couple of problems discovered by Grendil. I don't think you're experiencing the same problem, as your issue seems to be connectivity related rather than a functional issue. But it seems like it would be worth a try. The protocol used in this build is compatible with v3.10 clients.

Unfortunately I haven't been able to reproduce the random disconnection issue nor have I been able to find any potential cause for it through code review. Because the problem itself is occurring down the stack (i.e. in sockets) there is no further error information to capture and analyse.

You mentioned earlier that you'd identified a pattern, with the problem appearing on some nodes but not others. Does this still seem to be the case?
MatthewSteeples
#24 Posted : Thursday, August 10, 2017 10:02:44 PM(UTC)
Sadly I wasn't able to find out what's different about the nodes. I verified versions of installed software, Windows, NCrunch, how it had been installed, etc., but nothing stood out.

Same problem on 3.11.0.2, although I've got a few more log messages and a potential query. You can see at 22:55:22.5025 that it tries sending a 3 MB message, and then at 22:55:22.5055 (0.003 seconds later) it tries sending another message. Could it be that the socket is busy when it's sending the message, or does the 22:55:22.5045 line indicate it was flushed? My network isn't fast enough to do 3 MB in 0.002 seconds!

[22:55:22.4525-Core-50] Sending message 'nCrunch.Core.Grid.Messages.NodeBuildCompletedMessage' to remote client at address: 192.168.2.2:37982
[22:55:22.5025-Core-50] Sending non-file based message nCrunch.Core.Grid.Messages.NodeBuildCompletedMessage with size 3013921
[22:55:22.5025-Core-50] Writing 3013921 bytes to open connection
[22:55:22.5045-Core-50] Event [NodeTaskProcessedEvent:Building Ledgerscope.Core] is being processed on Core thread with subscriber: NodeWorkRequester.
[22:55:22.5045-Core-50] Requesting more work from the client
[22:55:22.5045-?-49] Queued data was sent
[22:55:22.5045-Core-50] Sending message '[NodeWorkRequest:CurrentSnapshotVersion=1,AvailableWorkers=4,ResourcesInUse=[Resources:X=, I=]]' to remote client at address: 192.168.2.2:37982
[22:55:22.5055-Core-50] Sending non-file based message [NodeWorkRequest:CurrentSnapshotVersion=1,AvailableWorkers=4,ResourcesInUse=[Resources:X=, I=]] with size 35
[22:55:22.5055-Core-50] Writing 35 bytes to open connection
[22:55:22.5065-?-49] Ceasing to send messages because of an error (was the connection closed?): System.IO.IOException: Unable to write data to the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
at System.Net.Sockets.Socket.EndSend(IAsyncResult asyncResult)
at System.Net.Sockets.NetworkStream.EndWrite(IAsyncResult asyncResult)
--- End of inner exception stack trace ---
at System.Net.Sockets.NetworkStream.EndWrite(IAsyncResult asyncResult)
at nCrunch.Core.Grid.Connectivity.GridMessageSender.(IAsyncResult )
[22:55:22.5075-?-49] Handling network error by disposing of connection to 192.168.2.2:37982
[22:55:22.5075-?-49] Closing connection to 192.168.2.2:37982 using CloseConnectionMessage
[22:55:22.5075-?-49] Sending non-file based message nCrunch.Core.Grid.Messages.CloseConnectionMessage with size 8
[22:55:22.5075-?-49] Writing 8 bytes to open connection
[22:55:22.7596-?-49] Queued data was sent
[22:55:22.7606-?-49] Cleaning up connection to 192.168.2.2:37982
[22:55:22.7616-?-49] Handling closure of connection to 192.168.2.2:37982
[22:55:22.7616-?-49] Server-side connection to 192.168.2.2:37982 has been lost
[22:55:22.7616-?-49] Publishing Event: [NetworkServerDisconnectedEvent:192.168.2.2:37982]
[22:55:22.7616-?-49] Event [NetworkServerDisconnectedEvent:192.168.2.2:37982] is being published on thread CoreThread to subscriber: NodeActivityTracker.
[22:55:22.7626-?-49] Event [NetworkServerDisconnectedEvent:192.168.2.2:37982] is being published on thread CoreThread to subscriber: GridNodeServer.
[22:55:22.7626-?-49] Event [NetworkServerDisconnectedEvent:192.168.2.2:37982] is being published on thread CoreThread to subscriber: NodeWorkRequester.
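To put numbers on the timing question, here is a minimal Python sketch that parses the timestamp prefix from lines like those in the excerpt above and works out the gaps. The regular expression and the helper names (`parse_line`, `seconds_between`) are assumptions made for illustration and are not part of NCrunch; the three log lines are copied from the excerpt, with the error line shortened.

```python
import re
from datetime import datetime

# Matches the "[HH:MM:SS.ffff-thread-id] message" prefix seen in the excerpt.
# This pattern is an assumption based on those lines, not a documented format.
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2}\.\d+)-[^\]]*\]\s*(.*)$")

def parse_line(line):
    """Return (timestamp, message), or None if the line has no timestamp prefix."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # e.g. stack-trace continuation lines
    stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
    return stamp, match.group(2)

def seconds_between(first_line, second_line):
    """Elapsed seconds between two timestamped log lines."""
    first, _ = parse_line(first_line)
    second, _ = parse_line(second_line)
    return (second - first).total_seconds()

write_started = "[22:55:22.5025-Core-50] Writing 3013921 bytes to open connection"
queue_sent = "[22:55:22.5045-?-49] Queued data was sent"
send_failed = "[22:55:22.5065-?-49] Ceasing to send messages because of an error"

flush_gap = seconds_between(write_started, queue_sent)
failure_gap = seconds_between(write_started, send_failed)

# Throughput the link would need if "Queued data was sent" meant the bytes had arrived.
implied_gbit_per_s = 3013921 * 8 / flush_gap / 1e9

print(f"write -> 'Queued data was sent': {flush_gap:.4f} s")
print(f"write -> error: {failure_gap:.4f} s")
print(f"implied throughput: {implied_gbit_per_s:.1f} Gbit/s")
```

The implied rate comes out at roughly 12 Gbit/s, which supports the reading that the "Queued data was sent" line cannot mean the 3 MB had already crossed the network.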
Remco
#25 Posted : Friday, August 11, 2017 2:15:47 AM(UTC)
MatthewSteeples;10961 wrote:

Same problem on 3.11.0.2, although I've got a few more log messages and a potential query. You can see at 22:55:22.5025 that it tries sending a 3 MB message, and then at 22:55:22.5055 (0.003 seconds later) it tries sending another message. Could it be that the socket is busy when it's sending the message, or does the 22:55:22.5045 line indicate it was flushed? My network isn't fast enough to do 3 MB in 0.002 seconds!


The 3 MB transfer here only involves writing the data into the socket's buffer, which is why it appears to complete so quickly. But I do find it very interesting that the disconnection always seems to happen at this point. This is why I thought it would be worth turning the compression back on, to see if using different data might make any difference.

How fast is your connection?
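
As a rough illustration of the buffering point, the Python sketch below uses a local socket pair in place of the grid node and client (an assumption for demonstration only; NCrunch itself is .NET and its internals are not shown here). It shows that a send call can return as soon as the operating system has accepted the bytes, before the peer has read anything.

```python
import socket

# A connected pair of local sockets stands in for the grid node and the client.
node_side, client_side = socket.socketpair()

payload = b"x" * 1024  # small enough to fit in the send buffer on typical platforms

# send() returns once the bytes have been copied into the OS buffer;
# the peer has not called recv() yet at this point.
accepted = node_side.send(payload)
print(f"send() accepted {accepted} bytes before the peer read anything")

# The reported send-buffer size hints at how much can be queued this way.
send_buffer = node_side.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print(f"SO_SNDBUF reports {send_buffer} bytes")

# Only now does the peer drain the data.
received = b""
while len(received) < accepted:
    received += client_side.recv(4096)

node_side.close()
client_side.close()
```

A completed write therefore says little about link speed; it only shows that the local buffer had room for the data.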
Remco
#26 Posted : Friday, August 11, 2017 5:10:47 AM(UTC)
I'm working on a theory at the moment that perhaps the size of the message is blowing the socket buffer. 3 MB is unusually large for a build result message, so this is probably quite a big project. Turning off instrumentation for this project might even work around the problem.

There may be a code change I can make to limit the size of the chunks in which these kinds of messages are written into the buffer. I'll let you know as soon as I have a working implementation.
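
The general idea of bounding how much is handed to the socket at once can be sketched in Python as follows. This is not the NCrunch change itself, which is not shown in this thread; `iter_chunks`, `write_in_chunks`, and the 64 KiB default are hypothetical choices for illustration.

```python
def iter_chunks(payload, chunk_size):
    """Yield successive slices of payload, each at most chunk_size bytes long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    view = memoryview(payload)  # slicing a memoryview avoids copying the payload
    for start in range(0, len(view), chunk_size):
        yield view[start:start + chunk_size]

def write_in_chunks(write, payload, chunk_size=64 * 1024):
    """Pass payload to write() one bounded chunk at a time; return the chunk count."""
    count = 0
    for chunk in iter_chunks(payload, chunk_size):
        write(chunk)  # e.g. a socket's sendall, which returns once the chunk is handed to the OS
        count += 1
    return count

# Usage with an in-memory stand-in for the connection.
written = []
message = bytes(3013921)  # same size as the build result message in the log
chunks = write_in_chunks(lambda chunk: written.append(bytes(chunk)), message)
print(f"{chunks} chunks, largest {max(len(c) for c in written)} bytes")
```

Passing the writer as a callable keeps the sketch testable without a real connection; with a real socket, `sock.sendall` could be supplied instead.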
Remco
#27 Posted : Sunday, August 13, 2017 10:51:28 AM(UTC)
MatthewSteeples
#28 Posted : Tuesday, August 15, 2017 5:11:50 PM(UTC)
Hi Remco,

That does appear to do something. One of the nodes that previously restarted now seems to keep running. I'll keep an eye on it and let you know if anything happens.

Thanks,
Matthew
MatthewSteeples
#29 Posted : Tuesday, August 22, 2017 9:58:22 PM(UTC)
It seems we're still getting this error on some build nodes. I'll enable logging again and see if anything stands out.
MatthewSteeples
#30 Posted : Sunday, August 27, 2017 11:06:17 AM(UTC)
Not sure if it's related, but I've just realised one configuration difference we have between some of the gridnodes (and I bet it's rare for anyone to be using it elsewhere). We make use of https://github.com/Stack...Exchange.Precompilation to precompile our cshtml views into the dlls. This is a build step that runs as part of the MSBuild/VS process on the web project.

We've been doing this for a while, but because it's a slow process we'd configured it (by environment variable) to run on only one node (as that's enough to flag up a failure). I've disabled this now and the node doesn't seem to reset as often.

I don't know whether it throws out a lot of output or something, or makes drastic changes elsewhere, but hopefully that's another clue!
Remco
#31 Posted : Sunday, August 27, 2017 11:11:35 AM(UTC)
This is an interesting component, though I don't think it's likely to be responsible for the disconnections, unless it's doing so indirectly as you've suggested (i.e. through trace output size). Do the disconnections still seem to happen consistently after a large message, as observed earlier? I'd be interested to know if the buffering adjustment I implemented has had any effect at all.
MatthewSteeples
#32 Posted : Tuesday, September 05, 2017 1:33:56 PM(UTC)
It would appear that this program is producing a significant amount of trace output, all along the lines of:

WARNING - ..\..\..\..\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\MSBuild\15.0\Bin\Microsoft.Common.CurrentVersion.targets (1987, 5): MSB3245: Could not resolve this reference. Could not locate the assembly "System.Console, Version=4.0.1.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a, processorArchitecture=MSIL". Check to make sure the assembly exists on disk. If this reference is required by your code, you may get compilation errors.

and

WARNING - ..\..\..\..\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise\MSBuild\15.0\Bin\Microsoft.Common.CurrentVersion.targets (1987, 5): MSB3243: No way to resolve conflict between "System.Diagnostics.FileVersionInfo, Version=4.0.2.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" and "System.Diagnostics.FileVersionInfo, Version=4.0.1.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a, processorArchitecture=MSIL". Choosing "System.Diagnostics.FileVersionInfo, Version=4.0.2.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" arbitrarily.
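
To gauge how much of the trace output these warnings account for, a small Python sketch like the one below can tally them by code. The `WARNING - ` prefix and the `MSB` code pattern are taken from the two examples above; the `summarise_warnings` helper and the shortened sample lines are hypothetical and only illustrate the approach.

```python
import re
from collections import Counter

# MSBuild warning codes look like "MSB3245"; the "WARNING - " prefix follows the examples above.
WARNING_RE = re.compile(r"^WARNING - .*?\b(MSB\d{4})\b")

def summarise_warnings(lines):
    """Return (count per warning code, total UTF-8 bytes of warning text)."""
    counts = Counter()
    total_bytes = 0
    for line in lines:
        match = WARNING_RE.match(line)
        if match:
            counts[match.group(1)] += 1
            total_bytes += len(line.encode("utf-8"))
    return counts, total_bytes

# Shortened stand-ins for real trace lines (illustrative only).
sample = [
    "WARNING - Microsoft.Common.CurrentVersion.targets (1987, 5): MSB3245: Could not resolve this reference.",
    "WARNING - Microsoft.Common.CurrentVersion.targets (1987, 5): MSB3243: No way to resolve conflict.",
    "WARNING - Microsoft.Common.CurrentVersion.targets (1987, 5): MSB3245: Could not resolve this reference.",
    "Build succeeded.",
]
counts, total_bytes = summarise_warnings(sample)
for code, count in counts.most_common():
    print(f"{code}: {count}")
print(f"warning text: {total_bytes} bytes")
```

Run over a real trace log, the byte total gives a rough sense of whether the warnings alone could explain an unusually large build result message.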
Remco
#33 Posted : Wednesday, September 06, 2017 1:37:34 AM(UTC)
Setting the log verbosity to Summary may prevent these warnings from being processed into the logs. It's still not clear to me why you are experiencing disconnection problems. Considering the buffering fix didn't have any effect for you, I have a stronger feeling that the connection instability is being caused by something external to NCrunch.