Grid Server keeps disconnecting
Remco
#13 Posted : Monday, August 07, 2017 11:14:43 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 4,717

Thanks: 622 times
Was thanked: 725 time(s) in 690 post(s)
MatthewSteeples;10933 wrote:

All devices are fully up to date in terms of Windows Updates (which now includes .NET Frameworks) and Visual Studio editions. Got a different error today in the Distributed Processing Pane


I'm certain this is something else, and I don't think it's likely to be a real issue here. It looks like the task runner process was unexpectedly terminated (which can happen in the event of stack overflows, unstable test code, etc.). I don't think it's worth chasing this particular problem unless you're seeing it disturbingly often and it seems to be having an effect on the engine.


MatthewSteeples
#22 Posted : Thursday, August 10, 2017 4:47:56 PM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 10/28/2014(UTC)
Posts: 38
Location: United Kingdom

Thanks: 3 times
Was thanked: 3 time(s) in 3 post(s)
Is there anything we can do to send any more debug information through? This is becoming quite an issue for us now, and we're going to have to investigate rolling back if we can't get it resolved soon :(
Remco
#23 Posted : Thursday, August 10, 2017 8:22:59 PM(UTC)
MatthewSteeples;10955 wrote:
Is there anything we can do to send any more debug information through? This is becoming quite an issue for us now, and we're going to have to investigate rolling back if we can't get it resolved soon :(


It may be worth giving the build below a try:

http://downloads.ncrunch.net/NCrunch_GridNodeServer_3.11.0.2.msi
http://downloads.ncrunch.net/NCrunch_GridNodeServer_3.11.0.2.zip

This build contains some extra error handling and fallback code to deal with a couple of problems discovered by Grendil. I don't think you're experiencing the same problem, as your issue seems to be connectivity related rather than a functional issue. But it seems like it would be worth a try. The protocol used in this build is compatible with v3.10 clients.

Unfortunately I haven't been able to reproduce the random disconnection issue nor have I been able to find any potential cause for it through code review. Because the problem itself is occurring down the stack (i.e. in sockets) there is no further error information to capture and analyse.

You mentioned earlier that you'd identified a pattern, with the problem appearing on some nodes but not others. Does this still seem to be the case?
MatthewSteeples
#24 Posted : Thursday, August 10, 2017 10:02:44 PM(UTC)
Sadly I wasn't able to find out what's different about the nodes. I verified the versions of installed software, Windows and NCrunch, how it had been installed, etc., but nothing stood out.

Same problem on 3.11.0.2, although I've got a few more log messages and a potential query. You can see at 22:55:22.5025 that it tries sending a 3 MB message, and then at 22:55:22.5055 (0.003 seconds later) it tries sending another message. Could it be that the socket is busy when it's sending the message, or does the 22:55:22.5045 line indicate it was flushed? My network isn't fast enough to do 3 MB in 0.002 seconds!

[22:55:22.4525-Core-50] Sending message 'nCrunch.Core.Grid.Messages.NodeBuildCompletedMessage' to remote client at address: 192.168.2.2:37982
[22:55:22.5025-Core-50] Sending non-file based message nCrunch.Core.Grid.Messages.NodeBuildCompletedMessage with size 3013921
[22:55:22.5025-Core-50] Writing 3013921 bytes to open connection
[22:55:22.5045-Core-50] Event [NodeTaskProcessedEvent:Building Ledgerscope.Core] is being processed on Core thread with subscriber: NodeWorkRequester.
[22:55:22.5045-Core-50] Requesting more work from the client
[22:55:22.5045-?-49] Queued data was sent
[22:55:22.5045-Core-50] Sending message '[NodeWorkRequest:CurrentSnapshotVersion=1,AvailableWorkers=4,ResourcesInUse=[Resources:X=, I=]]' to remote client at address: 192.168.2.2:37982
[22:55:22.5055-Core-50] Sending non-file based message [NodeWorkRequest:CurrentSnapshotVersion=1,AvailableWorkers=4,ResourcesInUse=[Resources:X=, I=]] with size 35
[22:55:22.5055-Core-50] Writing 35 bytes to open connection
[22:55:22.5065-?-49] Ceasing to send messages because of an error (was the connection closed?): System.IO.IOException: Unable to write data to the transport connection: The I/O operation has been aborted because of either a thread exit or an application request. ---> System.Net.Sockets.SocketException: The I/O operation has been aborted because of either a thread exit or an application request
at System.Net.Sockets.Socket.EndSend(IAsyncResult asyncResult)
at System.Net.Sockets.NetworkStream.EndWrite(IAsyncResult asyncResult)
--- End of inner exception stack trace ---
at System.Net.Sockets.NetworkStream.EndWrite(IAsyncResult asyncResult)
at nCrunch.Core.Grid.Connectivity.GridMessageSender.(IAsyncResult )
[22:55:22.5075-?-49] Handling network error by disposing of connection to 192.168.2.2:37982
[22:55:22.5075-?-49] Closing connection to 192.168.2.2:37982 using CloseConnectionMessage
[22:55:22.5075-?-49] Sending non-file based message nCrunch.Core.Grid.Messages.CloseConnectionMessage with size 8
[22:55:22.5075-?-49] Writing 8 bytes to open connection
[22:55:22.7596-?-49] Queued data was sent
[22:55:22.7606-?-49] Cleaning up connection to 192.168.2.2:37982
[22:55:22.7616-?-49] Handling closure of connection to 192.168.2.2:37982
[22:55:22.7616-?-49] Server-side connection to 192.168.2.2:37982 has been lost
[22:55:22.7616-?-49] Publishing Event: [NetworkServerDisconnectedEvent:192.168.2.2:37982]
[22:55:22.7616-?-49] Event [NetworkServerDisconnectedEvent:192.168.2.2:37982] is being published on thread CoreThread to subscriber: NodeActivityTracker.
[22:55:22.7626-?-49] Event [NetworkServerDisconnectedEvent:192.168.2.2:37982] is being published on thread CoreThread to subscriber: GridNodeServer.
[22:55:22.7626-?-49] Event [NetworkServerDisconnectedEvent:192.168.2.2:37982] is being published on thread CoreThread to subscriber: NodeWorkRequester.
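
The timing question above can be checked with a little arithmetic. The sketch below (Python, purely illustrative) takes the two timestamps and the message size from the log and works out the throughput that would be needed if the 3 MB had really crossed the network in that window. The 1 Gbit/s link speed is an assumption for comparison only; the actual network speed is not stated in the thread.

```python
from datetime import datetime


def parse_log_time(stamp: str) -> datetime:
    # Log timestamps look like HH:MM:SS.ffff; %f accepts 1-6 fractional digits.
    return datetime.strptime(stamp, "%H:%M:%S.%f")


# Values copied from the log excerpt above.
write_started = parse_log_time("22:55:22.5025")   # "Writing 3013921 bytes to open connection"
queue_reported_sent = parse_log_time("22:55:22.5045")  # "Queued data was sent"
message_size_bytes = 3013921

elapsed_seconds = (queue_reported_sent - write_started).total_seconds()

# Throughput needed if the whole message had left the machine in that window.
implied_bytes_per_second = message_size_bytes / elapsed_seconds
implied_gbit_per_second = implied_bytes_per_second * 8 / 1e9

# ASSUMPTION: a 1 Gbit/s link, used only as a point of comparison.
assumed_link_bits_per_second = 1e9
seconds_on_assumed_link = message_size_bytes * 8 / assumed_link_bits_per_second

print(f"elapsed: {elapsed_seconds:.4f} s")
print(f"implied rate: {implied_gbit_per_second:.1f} Gbit/s")
print(f"time on assumed 1 Gbit/s link: {seconds_on_assumed_link * 1000:.1f} ms")
```

The implied rate comes out at roughly 12 Gbit/s, while the same message would need about 24 ms on the assumed 1 Gbit/s link, so the "Queued data was sent" line cannot mean the data had already reached the other machine.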
Remco
#25 Posted : Friday, August 11, 2017 2:15:47 AM(UTC)
MatthewSteeples;10961 wrote:

Same problem on 3.11.0.2, although I've got a few more log messages and a potential query. You can see at 22:55:22.5025 that it tries sending a 3 MB message, and then at 22:55:22.5055 (0.003 seconds later) it tries sending another message. Could it be that the socket is busy when it's sending the message, or does the 22:55:22.5045 line indicate it was flushed? My network isn't fast enough to do 3 MB in 0.002 seconds!


The 3 MB "transfer" in the log is really just the data being written into the socket's buffer, not the data arriving at the other end. But I do find it very interesting that the disconnection always seems to happen at this point. This is why I thought it would be worth turning the compression back on, to see if using different data might make any difference.
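
As a rough illustration of the buffering point, the sketch below uses Python's standard `socket` module rather than the .NET `NetworkStream` calls shown in the stack trace, so it is an analogy and not NCrunch's code. It shows a send call returning successfully even though the peer has not read anything yet, because the bytes have only been copied into the operating system's buffer. The payload size is an arbitrary small value chosen so that it fits in the buffer.

```python
import socket

# A connected pair of sockets inside one process, standing in for node and client.
sender, receiver = socket.socketpair()
try:
    # Size of the OS send buffer for this socket (platform dependent).
    send_buffer_size = sender.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)

    # Small illustrative payload (an assumption, not a value from the thread).
    payload = b"x" * 1024

    # sendall() returns once the OS has accepted the bytes.
    # At this point the receiver has not called recv() at all.
    sender.sendall(payload)
    send_returned_before_any_read = True

    # Only now does the peer actually read the data out of the buffer.
    received = b""
    while len(received) < len(payload):
        chunk = receiver.recv(4096)
        if not chunk:
            break
        received += chunk
finally:
    sender.close()
    receiver.close()

print(f"send buffer size reported by the OS: {send_buffer_size} bytes")
print(f"bytes read afterwards: {len(received)}")
```

A successful write therefore says nothing about delivery, which would fit the log showing "Queued data was sent" only 2 ms after a 3 MB write.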

How fast is your connection?
Remco
#26 Posted : Friday, August 11, 2017 5:10:47 AM(UTC)
I'm working on a theory at the moment that perhaps the size of the message is blowing the socket buffer. 3 MB is unusually large for a build result message, so this is probably quite a big project. Turning off instrumentation for this project might even work around the problem.

There may be a code change I can make to limit the size of these kinds of messages by chunking them into the buffer. I'll let you know as soon as I have a working implementation.
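
For readers wondering what such chunking could look like, here is a minimal sketch in Python. It is not the change that was made in NCrunch (that code is not shown in this thread); the function names and the 64 KB chunk size are hypothetical and chosen only for illustration. The idea is to hand the socket one bounded slice at a time instead of a single 3 MB write.

```python
import socket
import threading

# ASSUMPTION: 64 KB per write is an illustrative value, not NCrunch's setting.
CHUNK_SIZE = 64 * 1024


def send_in_chunks(sock: socket.socket, data: bytes, chunk_size: int = CHUNK_SIZE) -> int:
    """Write data to the socket in bounded slices; return the number of writes."""
    view = memoryview(data)  # slicing a memoryview avoids copying each chunk
    writes = 0
    for offset in range(0, len(view), chunk_size):
        sock.sendall(view[offset:offset + chunk_size])
        writes += 1
    return writes


def receive_exactly(sock: socket.socket, total: int, sink: bytearray) -> None:
    """Read until `total` bytes have arrived or the peer closes the connection."""
    remaining = total
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            break
        sink.extend(chunk)
        remaining -= len(chunk)


sender, receiver = socket.socketpair()
message = bytes(3013921)  # same size as the build result message in the log
received = bytearray()

# Read on a separate thread so the sender is never blocked on a full buffer.
reader = threading.Thread(target=receive_exactly, args=(receiver, len(message), received))
reader.start()
writes_made = send_in_chunks(sender, message)
reader.join()
sender.close()
receiver.close()

print(f"sent {len(message)} bytes in {writes_made} writes")
```

Each write here is capped at the chunk size, so no single call has to push the entire message into the socket buffer at once.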
Remco
#27 Posted : Sunday, August 13, 2017 10:51:28 AM(UTC)
MatthewSteeples
#28 Posted : Tuesday, August 15, 2017 5:11:50 PM(UTC)
Hi Remco,

That does appear to do something. One of the nodes that previously restarted now seems to keep running. I'll keep an eye on it and let you know if anything happens.

Thanks,
Matthew