Grid task capacity and task queue
Marcello
#1 Posted : Tuesday, June 27, 2023 12:57:42 PM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 9/16/2019(UTC)
Posts: 70
Location: Italy

Thanks: 1 times
Was thanked: 4 time(s) in 4 post(s)
I have a grid node with 48 logical CPU cores, with Task Capacity set to 48. I noticed that when a colleague is running tests on this grid and all 48 task slots are in use, I have to wait until all of those tests finish before my projects start to build on the NCrunch node.
Is this by design, or is there a configuration that lets the grid share its task capacity between me and my colleagues?

This is sometimes a problem because I cannot use the NCrunch grid until my colleague's tests have finished.

Any idea or suggestion would be appreciated, thanks.

Currently, we are using NCrunch v4.17.0.7.
Remco
#2 Posted : Tuesday, June 27, 2023 11:24:37 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,123

Thanks: 957 times
Was thanked: 1286 time(s) in 1193 post(s)
Hi, thanks for sharing this problem.

The grid node is intended to follow a round-robin pattern when requesting tests to execute from clients. This means that normally the capacity should split more or less fairly between the clients.
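
For readers following along, here is a minimal sketch of what a round-robin split of task slots could look like. It is an illustration only and not NCrunch's actual scheduler: the names `RoundRobinSketch` and `Allocate`, the client names, and the queue sizes are all hypothetical, and it assumes a single allocation pass rather than slots being refilled as tests finish.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RoundRobinSketch
{
    // Hands out up to `capacity` task slots by cycling through the clients in
    // name order, skipping any client whose queue is already empty.
    public static List<string> Allocate(Dictionary<string, Queue<string>> queues, int capacity)
    {
        var assigned = new List<string>();
        var clients = queues.Keys.OrderBy(name => name, StringComparer.Ordinal).ToList();

        while (assigned.Count < capacity && queues.Values.Any(q => q.Count > 0))
        {
            foreach (var client in clients)
            {
                if (assigned.Count == capacity) break;
                if (queues[client].Count > 0)
                    assigned.Add(client + ":" + queues[client].Dequeue());
            }
        }
        return assigned;
    }

    public static void Main()
    {
        // Hypothetical workload: one client with a long queue, one with a short queue.
        var queues = new Dictionary<string, Queue<string>>
        {
            ["colleague"] = new Queue<string>(Enumerable.Range(1, 100).Select(i => "test" + i)),
            ["me"] = new Queue<string>(Enumerable.Range(1, 10).Select(i => "test" + i)),
        };

        // 48 matches the Task Capacity mentioned in the first post.
        var assigned = Allocate(queues, 48);
        foreach (var group in assigned.GroupBy(a => a.Split(':')[0]))
            Console.WriteLine(group.Key + ": " + group.Count());
    }
}
```

In this model the short queue is fully served before the long queue takes the remaining slots. In a real node a slot stays occupied until its task finishes, so very long-running tests would delay how quickly a fair split becomes visible.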

Is this a new problem you've encountered in v4.17? Or have you seen it before?

Do the tests involved have a particularly long execution time?
Marcello
#3 Posted : Wednesday, June 28, 2023 7:25:57 AM(UTC)
Hi Remco,

No, we noticed it with 4.16 as well.

I've just checked the execution time of our unit tests and I see there is one that is taking 11 minutes (for sure we need to change/improve it!).

For the rest, the maximum execution time is 2m50s.

https://imgur.com/XD1dFDJ
Remco
#4 Posted : Wednesday, June 28, 2023 12:21:56 PM(UTC)
Thanks for these extra details.

I've done some additional testing on my end to check whether we broke anything here with v4.17, but all seems to be well. All of our test nodes are balancing their requests between clients.

When you are connected and waiting for the busy remote node to run your tests, do you see, in the NCrunch corner popup window, the black 'Grid Shared' tasks that the node is running on behalf of others?

Also, is the pattern around the node waiting for these tests very consistent? Does it also affect your colleague if you get to the node first?
Marcello
#5 Posted : Wednesday, June 28, 2023 1:20:36 PM(UTC)
Hi Remco,

Could you please take a look at this video: https://www.screencast.com/t/rzayHyHF

After 10 minutes, I am still unable to run tests on the NCrunch node. Specifically, at 05:57, after the NCrunch snapshot has been initialized, I attempted to reload a couple of projects to simulate the reported issue. As you can see, when my colleagues run their tests, no assembly is being built yet.

It's possible that we have misconfigured something, but please let me know what information you need in order to assist us in improving the use of NCrunch.
Remco
#6 Posted : Thursday, June 29, 2023 12:58:55 AM(UTC)
Thanks for sharing this video.

It looks like I'm going to need to see the log files from the grid node over the time period where it isn't cycling the work correctly between clients.

It should be possible to retrieve these from the node if you turn on the 'Log to file' setting. Make sure you set the 'Logging Verbosity' to 'Medium'.

If you can grab the logs over the period of time where the node is appearing to malfunction, please ZIP them up and send them through to me. I'm hopeful that the logs will reveal the reason for this behaviour.

Something else I think you should check is whether the node is suffering from CPU starvation. In your video, the (re)connection and synchronisation between client and node seemed to be very sluggish while the node was busy with other tasks. It's possible the node is having trouble keeping up. Note that I don't believe this is related to the work allocation issue that you've demonstrated.
Marcello
#7 Posted : Friday, June 30, 2023 7:18:05 AM(UTC)
Hi Remco,

I just sent you a zip with log files captured with Medium verbosity.

The scenario was the following:

1. My colleague Simone had already initialized NCrunch on his machine.
2. I opened the solution and NCrunch started initializing. All fine.
3. I clicked 'Reload and Rebuild' for the projects named Autodesk and nothing happened. See, for example, the log file NODE 2023-06-30 08-51-24.656.
4. I tried to start a test but nothing happened; it remained in the queue with the other tasks on my machine.
5. I asked my colleague Antonio to start NCrunch on his machine. Loading and building completed fine for him, but my tasks were still queued and there was nothing I could do :( See, for example, the file NODE 2023-06-30 08-57-23.35.

This short video shows the log files being created while my tasks remain stuck in the queue :(
https://www.screencast.com/t/h1X6EtN3bm

I hope this helps you to find out what is going wrong here :crossfingers

P.S. I forgot to say that the NCrunch grid node machine was not under heavy load during these operations, as shown in this picture:
https://imgur.com/zjLXJ4y
Marcello
#8 Posted : Friday, June 30, 2023 7:41:17 AM(UTC)
Hi again Remco,

Maybe I found a clue.

We have projects with multiple target frameworks (net472, net6.0-windows). If I ignore the net6 targets in the NCrunch configuration, the issue disappears. The same happens if I ignore net472 and leave only net6.
https://imgur.com/HRbakNr

Let me know your thoughts, please.
Remco
#9 Posted : Saturday, July 1, 2023 4:11:28 AM(UTC)
Thanks for the extra details.

I'm looking into this now and I have a few things I'd like to try. I'll update you in the coming week when I have more information.
Remco
#10 Posted : Wednesday, July 5, 2023 4:41:18 AM(UTC)
Thanks for your patience with this issue.

I've managed to identify some issues around task allocation to grid nodes that were most likely introduced in the v4.17 build.

We're working on solutions for these and I will let you know as soon as I have a fixed build available for you to try.
Remco
#11 Posted : Thursday, July 6, 2023 4:26:39 AM(UTC)
Marcello
#12 Posted : Monday, July 10, 2023 10:14:57 AM(UTC)
Hi Remco,

I tried build 4.18.0.1 but the issue is still there :(

After waiting for NCrunch to build all the projects (both net472 and net6) and running a couple of tests (both net472 and net6), I clicked 'Reload and Rebuild' for the projects named Autodesk and nothing happened, as shown in the image below.

https://imgur.com/xmMWH2J

Let me know if I can do something more to help you investigate the issue, thanks.
Remco
#13 Posted : Monday, July 10, 2023 12:52:56 PM(UTC)
Sorry to hear this is still happening. Have you noticed any change in the behaviour of the issue with the new build? I was able to reproduce the problem as you described it and was fairly certain I'd fixed it. I'm wondering if there is another, more evasive issue hiding in there.
Marcello
#14 Posted : Tuesday, July 11, 2023 9:51:52 AM(UTC)
As far as I can see, if I now make a code change in the Autodesk project, all the projects are built correctly and I can run the tests, but I confirm that when I do 'Reload and Rebuild' nothing happens.

I recorded another video showing it. It's played at 2x speed. On the left, you can see the Task Manager of the NCrunch grid server.
https://www.screencast.com/t/hTPuyg68

At minute 1:30 I start running some tests, before changing the code in the Autodesk project and later reloading the project.

Let me know your thoughts, thanks

P.S.
While I was writing this post, I noticed that Visual Studio was unresponsive, so I recorded another video showing it: https://www.screencast.com/t/3IQ9mfAspzt
I had to kill the NCrunch process on my machine and enable it again to get things working. I don't know if this is helpful for this issue.
Remco
#15 Posted : Wednesday, July 12, 2023 1:20:13 AM(UTC)
Thanks for sharing these videos - They are such a huge help in understanding these issues.

It's good to see in the video that the node is now properly sharing its resources between the clients (seen in that it picks up tests when you queue them up). This means that the initial reported problem is now solved.

The lack of response to the reloading of projects is strange. This should be going through the same handling process as other changes to the queue, and I note that the node still responds as expected to normal build and test requests. I haven't been able to reproduce this behaviour on my side, so I suspect there may be a race condition or something in the environment that causes it to surface in some cases but not others. Can I safely assume that other members of your team are experiencing the same behaviour?

I believe this problem is most likely on the client side of the connection. Would it be possible for you to submit an NCrunch bug report immediately after you've reloaded a project and the node has failed to pick it up? I'm hopeful the log in the report will yield some useful clues here.

Not related: I would recommend turning off the xunit parallel execution for that group of tests kicking up errors on the node. You may be able to do this with an #if !NCRUNCH condition. These sorts of problems can destabilise the TestHost runner responsible for the tests involved. You might find that the results you get for these tests are nonsensical under NCrunch as long as xunit is running them over multiple threads.

Regarding the VS hang issue, that's not a fun one. It's interesting that the IDE did eventually recover. If you get this happening regularly, would you mind loading another VS instance in the background and attaching its debugger to the first instance, then breaking into the process and grabbing a call stack for the main thread? If it's something NCrunch is doing then I'd quite like to fix it.
Marcello
#16 Posted : Wednesday, July 12, 2023 2:22:44 PM(UTC)
You're welcome, I'm glad to hear that the videos were helpful. Sometimes videos or pictures are better than words :)

I've already sent the NCrunch bug report as you asked. I confirm that the same behavior can be replicated by my colleagues.

Regarding parallel execution, do you mean I should add #if !NCRUNCH condition to this attribute?
https://imgur.com/KI4cNyE

I can't reproduce the VS hang. If it happens again, I will try to get the stack trace.
Remco
#17 Posted : Thursday, July 13, 2023 1:47:05 AM(UTC)
Marcello;16751 wrote:

I've already sent the NCrunch bug report as you asked. I confirm that the same behavior can be replicated by my colleagues.


Thanks for sending through the report. I've managed to trace the issue all the way up to the UI. I had to look very closely at your example video before I picked up on what was happening here. When you reload the projects, you are selecting both multi-targeted variants of the same project and telling the engine to reload both of them (which makes sense as I expect you want them both reloaded). In the engine we actually have logic in place to automatically reload both projects if either is reloaded. Because you're choosing to reload both at once, this ends up with each of the two projects being reloaded twice at the same time. This creates downstream race conditions that destabilise the engine.

I'm going to introduce a fix to the UI so that we handle this properly instead of falling over downstream, but for the time being, I recommend reloading only one of the two projects at a time. Note that this will also reload both of them, so it should be faster for you :)
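
As an illustration of the kind of handling described here, the sketch below collapses a selection of multi-targeted variants into one reload request per project file. It is not NCrunch's code: `ProjectVariant`, `ReloadSketch`, `ProjectsToReload`, and the file name `Autodesk.csproj` are hypothetical, and it assumes, as described above, that reloading any one variant also reloads its siblings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of one multi-targeted variant as selected in the UI.
public sealed class ProjectVariant
{
    public ProjectVariant(string projectPath, string targetFramework)
    {
        ProjectPath = projectPath;
        TargetFramework = targetFramework;
    }

    public string ProjectPath { get; }
    public string TargetFramework { get; }
}

public static class ReloadSketch
{
    // Returns each project file once, however many of its variants were selected,
    // so that no project is asked to reload twice at the same time.
    public static List<string> ProjectsToReload(IEnumerable<ProjectVariant> selection)
    {
        return selection
            .Select(variant => variant.ProjectPath)
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    public static void Main()
    {
        // Both target frameworks of the same (hypothetical) project are selected.
        var selection = new[]
        {
            new ProjectVariant("Autodesk.csproj", "net472"),
            new ProjectVariant("Autodesk.csproj", "net6.0-windows"),
        };

        foreach (var path in ProjectsToReload(selection))
            Console.WriteLine("Reload once: " + path);
    }
}
```

With both variants selected, the sketch issues a single reload for the project file, which mirrors the manual workaround of reloading only one of the two variants.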

Marcello;16751 wrote:

Regarding parallel execution, do you mean I should add #if !NCRUNCH condition to this attribute?
https://imgur.com/KI4cNyE


Correct :)
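
The attribute in the linked screenshot is not reproduced in this thread, so the following is only a sketch of the pattern being confirmed, assuming the project uses xunit v2's assembly-level `CollectionBehavior` attribute. NCrunch defines the `NCRUNCH` compilation symbol for its own builds. The thread count of 4 is a hypothetical value, and the `#else` branch that explicitly disables parallelization goes slightly beyond what was discussed above.

```csharp
using Xunit;

#if !NCRUNCH
// Builds made outside NCrunch keep the team's parallel settings (hypothetical values).
[assembly: CollectionBehavior(CollectionBehavior.CollectionPerClass, MaxParallelThreads = 4)]
#else
// Builds made by NCrunch ask xunit to run test collections one at a time.
[assembly: CollectionBehavior(DisableTestParallelization = true)]
#endif
```

If only a specific group of tests is affected, it may be enough to apply the condition to that test assembly alone and leave the others unchanged.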
Marcello
#18 Posted : Thursday, July 13, 2023 7:30:24 AM(UTC)
Hi Remco,

Thanks for the details. I confirm that reloading only one project did the trick, it works like a charm :)

Let me know if you will release a new build with the fix, I'll be glad to test it of course.
Remco
#19 Posted : Thursday, July 13, 2023 1:25:16 PM(UTC)
Marcello;16754 wrote:

Thanks for the details. I confirm that reloading only one project did the trick, it works like a charm :)


Thanks for confirming this :)

Marcello;16754 wrote:

Let me know if you will release a new build with the fix, I'll be glad to test it of course.



We're making final changes to v4.18 and will probably push this public next week. The fix will be included.
Marcello
#20 Posted : Tuesday, July 25, 2023 2:06:08 PM(UTC)
I tried v4.18.0.1 and the issue is still there.

Please let me know which build will include the fix, thanks.