Grid Node Service keeps stopping
Phonesis
#1 Posted : Monday, May 23, 2016 9:16:57 AM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 4/14/2016(UTC)
Posts: 32
Location: United Kingdom

Was thanked: 3 time(s) in 3 post(s)
We have three grid machines set up. Two of these seem stable and rarely go down. The third, however, is dying once every few days or so. It seems like the grid node service is stopping and cannot be restarted automatically. I have to go to Windows Services and either restart it manually or kill it and start it (sometimes it gets stuck in the Stopping state).

The only thing that seems different is that this is a Windows Server 2012 R2 machine with 4 GB of RAM. The others are running a different Windows version, I think (Windows 7).

Any chance there's a compatibility issue? Any logs I can check/enable?
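
The manual "kill it and start it" step described above could be scripted roughly as follows. This is only a sketch under assumptions: it shells out to the standard Windows `sc` and `taskkill` tools, it takes the service name `NCrunchGridService` from a later post in this thread, and the helper names `parse_pid` and `force_restart` are made up for illustration. It needs an elevated prompt on the node itself.

```python
import re
import subprocess

SERVICE = "NCrunchGridService"  # service name as mentioned later in this thread


def parse_pid(queryex_output):
    """Extract the PID field from `sc queryex` output (0 means no process)."""
    match = re.search(r"^\s*PID\s*:\s*(\d+)", queryex_output, re.MULTILINE)
    if match is None:
        raise ValueError("no PID line found in sc queryex output")
    return int(match.group(1))


def force_restart(service=SERVICE, run=subprocess.run):
    """Kill the service's process if it still has one, then start the service."""
    result = run(["sc", "queryex", service], capture_output=True, text=True)
    pid = parse_pid(result.stdout)
    if pid != 0:
        # /F forces termination, which is what a hung "Stopping" service needs.
        run(["taskkill", "/PID", str(pid), "/F"], check=True)
    run(["sc", "start", service], check=True)
```

The start command may fail if Windows has not yet noticed that the process is gone, so a short wait or a retry between the two steps may be needed in practice.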
Remco
#2 Posted : Monday, May 23, 2016 10:12:25 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 6,986

Thanks: 931 times
Was thanked: 1257 time(s) in 1170 post(s)
Hi, thanks for sharing this issue.

At present there are no known problems that can cause this.

Do you, by chance, have Visual Studio installed on the system running the defective node? If so, I'm wondering if you might like to try attaching a debugger to the grid node service when it locks up. The list of running threads (and relevant stack traces) may give us some useful information about it.

Something also worth trying is turning on logging on the grid node using the grid node configuration tool. Logging can consume quite a large amount of disk space over time, but the bottom of the log file may be quite revealing if we can get the node logging at its point of failure.
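
Since the interesting part is the bottom of a potentially very large log file, a small helper along these lines may be handy for pulling out just the last few hundred lines. It is a sketch only: the log path is a hypothetical placeholder (use whatever location the grid node configuration tool is set to), and UTF-8 is assumed for the encoding.

```python
from collections import deque
from pathlib import Path


def tail(path, lines=200):
    """Return the last `lines` lines of a text file.

    The file is read from start to end, but only the most recent
    `lines` lines are kept in memory at any time.
    """
    with Path(path).open("r", encoding="utf-8", errors="replace") as handle:
        return list(deque(handle, maxlen=lines))


if __name__ == "__main__":
    # Hypothetical path; replace with the node's actual log file.
    for line in tail(r"C:\path\to\gridnode.log", lines=200):
        print(line, end="")
```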
Phonesis
#3 Posted : Monday, May 23, 2016 11:08:30 AM(UTC)
Thanks Remco. VS is installed so will enable the debugger for it and also enable logging in the tool. What level of logging verbosity shall I set it for? Summary, high detail, detailed?
Remco
#4 Posted : Monday, May 23, 2016 11:24:34 AM(UTC)
Detailed, if possible. This will then include all the build and messaging logs. If the disk consumption is too high to keep up, even Summary might be interesting. If the node is blowing up because of an exception in the network handling, even Summary will record the error.
Phonesis
#5 Posted : Monday, May 23, 2016 11:39:06 AM(UTC)
Ok great. Detailed is enabled. Unfortunately can't actually run VS on that machine though but hopefully the logging will be enough. Will reply here again if/when the service dies again and attach the log.
Remco
#6 Posted : Monday, May 23, 2016 11:42:45 AM(UTC)
Thanks! Hopefully the log will tell us something useful.
Phonesis
#7 Posted : Wednesday, May 25, 2016 12:19:31 PM(UTC)
Hi Remco, I haven't managed to get a log of this occurring yet, but I have implemented a script that runs in the background on the machines we use and polls the NCrunchGridService, ensuring its status is Running.

If it is no longer running, the script will attempt to start it again. This is using the Microsoft ServiceController class.
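
For readers who want to try the same idea, here is a rough sketch of such a watchdog. Note that it is not the script described in this post: it swaps the .NET ServiceController class for the Windows `sc` command-line tool called from Python, and the function names and the 60-second interval are assumptions chosen for illustration.

```python
import re
import subprocess
import time

SERVICE = "NCrunchGridService"


def parse_state(query_output):
    """Return the state name (e.g. 'RUNNING') from `sc query` output."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", query_output, re.MULTILINE)
    return match.group(1) if match else "UNKNOWN"


def check_once(service=SERVICE, run=subprocess.run):
    """Start the service if it is stopped; return the state that was observed."""
    result = run(["sc", "query", service], capture_output=True, text=True)
    state = parse_state(result.stdout)
    if state == "STOPPED":
        run(["sc", "start", service], capture_output=True, text=True)
    return state


def watch(interval_seconds=60.0):
    """Poll forever, printing a timestamped line whenever the state is not RUNNING."""
    while True:
        state = check_once()
        if state != "RUNNING":
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {SERVICE} was {state}")
        time.sleep(interval_seconds)
```

Printing the observed state with a timestamp gives a record of when each stop happened, which can then be lined up against the node's own log.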

Earlier, the script detected that the service had stopped running on a machine and restarted it OK. However, even though it was restarted and its status changed to Running again, the machine was not found by the NCrunch Distributed Processing screen in VS or on the grid controller machine. It seems a full reboot is the only solution right now for the machine to come back online.

Any ideas why this is occurring? Should a simple restart of the service work in theory?
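
Because a status of Running evidently does not guarantee that the node is reachable, one extra check that might help narrow this down is whether the node still accepts TCP connections after a restart. The sketch below assumes nothing about NCrunch's protocol; it only tests that a connection can be opened. The host name and port in the usage comment are placeholders, and the real port is whatever the node is configured to listen on.

```python
import socket


def node_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example (placeholder values):
#   node_reachable("grid-node-3", configured_port)
```

If the service reports Running but this returns False from the controller machine, the problem is more likely at the listening socket or network level than in the service lifecycle itself.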
Remco
#8 Posted : Wednesday, May 25, 2016 12:59:34 PM(UTC)
I think we really need to get more information about the state of the service after it's crashed before we can draw any conclusions about what may be happening with it. The NCrunch code running in the service is quite thorough with its error handling, so it is rather troubling that the service stops responding entirely when this happens (even to the ServiceController). Do you see anything interesting in your Windows Event Viewer?
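
As an alternative to scrolling through the Event Viewer by hand, the System log can be filtered for Service Control Manager events that report an unexpected service termination (event IDs 7031 and 7034). The following is a sketch that wraps the standard `wevtutil` tool; the helper names are made up for illustration, and the output still has to be scanned for the NCrunch service by eye.

```python
import subprocess

# 7031 and 7034 are the Service Control Manager events logged when a
# service terminates unexpectedly.
CRASH_QUERY = (
    "*[System[Provider[@Name='Service Control Manager'] "
    "and (EventID=7031 or EventID=7034)]]"
)


def build_command(count=20):
    """Build a wevtutil call that returns the newest `count` matching events."""
    return [
        "wevtutil", "qe", "System",
        f"/q:{CRASH_QUERY}",
        f"/c:{count}",
        "/rd:true",  # newest events first
        "/f:text",   # human-readable output instead of XML
    ]


def recent_service_crashes(count=20, run=subprocess.run):
    """Run the query and return wevtutil's text output."""
    result = run(build_command(count), capture_output=True, text=True)
    return result.stdout
```

The Application log is also worth a look, since unhandled exceptions in a .NET process are normally recorded there rather than in the System log.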
Phonesis
#9 Posted : Wednesday, May 25, 2016 2:19:38 PM(UTC)
Nothing of note in the event viewer. Have enabled logging on all our machines now so will let you know if I catch anything.