Distributed Node Heuristics
9swampy
#1 Posted : 2 years ago
Rank: Member

Groups: Registered
Joined: 8/8/2016(UTC)
Posts: 13
Location: United Kingdom

Hi,

Most of the team (at the company I've just joined) has stopped using NCrunch, primarily due to load times, and they've reverted to just CI feedback.

I've got NCrunch running fairly well on a huge solution with an equally huge unit test run time. Once it's up and running it's not too much of a resource hog, but getting to the baseline takes way too long. I'd hoped NCrunch distribution would speed that up a lot...

They've NOT previously used node servers. Those have worked well for me before, albeit on much smaller solutions with a lot less coverage (another story), so that's what I'm trying.

On this huge solution, though, I'm not seeing the nodes kick in and offload much of the work. Fair, the nodes are seriously underpowered compared to the local machines, but during the initial local build the locals are 100% busy while the nodes I've configured jump in and out once in a blue moon and pick up a few tests. If I can evidence that load balancing offloads work and speeds up reaching the initial baseline, then I'll have a case.

In the several minutes I've been writing this post, local has processed 30k-odd of 56k-odd unit tests, but I only noticed the node kick in a couple of times, for a few tests and a few seconds each.

Fair, I've just noticed I'm missing a dependency on the node (.NET 4.8), which I'll go fix, but more than half of the aforementioned tests would have been runnable.

TL;DR - is there any documentation on the heuristics of how the load gets spread, or any configuration I should adjust first?

KR J
Remco
#2 Posted : 2 years ago
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,165

Thanks: 964 times
Was thanked: 1296 time(s) in 1202 post(s)
Hi, thanks for posting.

Regarding the long startup times that have been causing issues for you, can you provide any more information on which activities during startup seem to be taking a long time? Broadly, there are 3 different things the engine needs to do before it can run tests:
1. Load the projects that are open in VS (progress is described by the 'Loading projects' indication shown on any of the major tool windows)
2. Build the projects and analyse their assemblies for tests (happens once the engine has loaded everything, but needs to be done before tests can be executed. This is shown with all the build/analysis tasks in the processing queue).
3. Bootstrap the engine and load the cache file (described by all the tasks displayed on the loading indicator except for the loading of projects)

Possibly it's a combination of all 3, but knowing whether any of the above is more significantly contributing to the problem will help us with our own performance optimisation.

Regarding the nodes not taking much of the load off local, do you notice this more on the initial run-through of the engine, or does it seem to be a pattern across your entire NCrunch session?

The engine doesn't usually initiate connections to the nodes until it has completed its bootstrap sequence. Because there is some synchronisation that needs to happen before a node can service any tasks, this can mean that the nodes will lag a bit behind local in their initial responsiveness.

The integration with grid nodes is structured as a pull-based system. Once a node is synchronised and ready for work, it will send a request to a connected client to provide it with work. On receiving this request, the client examines the contents of its processing queue and will send through a set of tasks that the node is able to process. Because this transfer of tasks involves network overhead, the offloading of work onto a node is always slower and less responsive than the client's local processor (which is able to pull tasks straight out of memory). To be able to complete a task, the node also needs to be able to transfer all results (including coverage and trace data) over the network. The faster the connection, the less of an issue this generally is.
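
To make the pull model above concrete, here is a minimal sketch in Python. It is not NCrunch's implementation: the `Task` and `Client` types, the `requires`/`capabilities` sets and the `batch_size` parameter are all hypothetical names chosen for illustration. It only shows the general shape: the node asks, and the client answers with queued tasks that the node is able to process.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    # Hypothetical: capabilities a node must have to run this task.
    requires: frozenset = frozenset()


@dataclass
class Client:
    """Owns the processing queue and only hands out work when a node asks."""
    queue: deque = field(default_factory=deque)

    def request_work(self, capabilities: set, batch_size: int) -> list:
        """Answer a node's pull request with tasks that node can process."""
        batch, skipped = [], deque()
        while self.queue and len(batch) < batch_size:
            task = self.queue.popleft()
            if task.requires <= capabilities:
                batch.append(task)
            else:
                skipped.append(task)  # stays queued for local or another node
        # Put skipped tasks back at the front, preserving their order.
        self.queue.extendleft(reversed(skipped))
        return batch


client = Client(deque([
    Task("UnitTestA"),
    Task("UnitTestB", frozenset({"net48"})),  # needs something the node lacks
    Task("UnitTestC"),
]))

# A synchronised node pulls work, declaring what it can run.
batch = client.request_work({"net6"}, batch_size=2)
print([t.name for t in batch])         # ['UnitTestA', 'UnitTestC']
print([t.name for t in client.queue])  # ['UnitTestB']
```

In this sketch a node with a missing dependency still receives the tasks it can run, while the rest stay in the client's queue. Every pull is a network round trip, which is why a remote node responds more slowly than the local processor.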

One potential issue that can arise is related to the capacity of the engine itself to coordinate and process tasks. In your use case, the engine needs to coordinate and process the results of 56k tests over the space of a single pass. That's a lot of tests, and each one has its own set of coverage data that needs to be merged and mapped into a local database. Because all this data is generated using background runners (and grid nodes), it's possible for the background runners to outrun the engine's ability to coordinate the work and process results in a timely way.

It's possible to track this by keeping your cursor over the NCrunch spinner in the corner of your IDE, where you'll be able to see the core engine load and the number of tasks being processed. If the core load is sitting at or near 100% but the number of tasks being processed is not near the max of the bar, this means the engine is overloaded and adding more capacity (in terms of grid nodes) is not likely to be beneficial. In this situation, because the nodes are sitting down the end of a network connection, they tend to have less priority than the local processor and are more likely to be underutilised. Unfortunately, we don't have any firm guidelines on how much load the engine can handle, as this is extremely variable depending on the environment and the characteristics of your solution. If you have a vast number of fast executing tests with each covering a reasonable amount of code, you are more likely to hit the limits of the engine sooner than if your tests are chunkier and more isolated.
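
The rule of thumb above can be written down as a small check. This is only a sketch of the reasoning, not anything NCrunch exposes: the function name, the labels it returns and both thresholds are assumptions, and the inputs are whatever you read off the spinner tooltip by eye.

```python
def diagnose(core_load_pct: float, tasks_running: int, max_tasks: int,
             load_threshold: float = 95.0, fill_threshold: float = 0.9) -> str:
    """Classify a reading of core engine load vs. tasks being processed.

    Both thresholds are illustrative guesses, not documented values.
    """
    if max_tasks <= 0:
        raise ValueError("max_tasks must be positive")
    engine_saturated = core_load_pct >= load_threshold
    bar_full = tasks_running / max_tasks >= fill_threshold
    if engine_saturated and not bar_full:
        # Engine is busy coordinating and merging results, yet cannot keep
        # the runners fed: more grid nodes are unlikely to help.
        return "engine-bound"
    if bar_full:
        # All available task slots are in use: extra capacity may help.
        return "capacity-bound"
    return "inconclusive"


print(diagnose(core_load_pct=100, tasks_running=6, max_tasks=16))
print(diagnose(core_load_pct=60, tasks_running=16, max_tasks=16))
```

The first reading is the overloaded case described above, where adding nodes is not likely to be beneficial. The second suggests the engine still has headroom and the task slots are the limit.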

In terms of getting a general overview of your run, I highly recommend doing an export of the Timeline report using the export button on the Tests Window after you've completed a full pass of all your tests.
talbrecht
#3 Posted : 2 years ago
Rank: Member

Groups: Registered
Joined: 5/10/2019(UTC)
Posts: 20
Location: Germany

Thanks: 8 times
Was thanked: 3 time(s) in 3 post(s)
Hi,

small hints from a satisfied NCrunch user:

1. Upgrade your network (if needed/possible). In the past we had 100 Mbit and large solutions took a while to load for testing. Nowadays our developer network is 1 Gbit, switched, with 10 Gbit towards the (grid) servers. The upload time is now negligible; only the first build time is left.
2. Offload as much as possible to your grid servers by adapting the CPU and fast track settings accordingly. Often I just let 2 CPU cores run locally for fast track, and the tests must run on the grid.
3. Make use of custom engine modes to streamline your NCrunch experience and improve the feedback loop. For example, initially I let all tests run to be sure that everything is OK, and then switch to a custom engine mode that tests only the module(s) I'm currently working on, so that feedback is almost instant. Finally, I run all tests again before committing to ensure nothing has been broken accidentally.
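
To put rough numbers on hint 1, here is a back-of-the-envelope calculation in Python. The 500 MB payload is a made-up figure, not a measurement from our setup, and the function ignores protocol overhead and latency, so treat the results as best-case lower bounds.

```python
def transfer_seconds(payload_megabytes: float, link_megabits_per_s: float) -> float:
    """Ideal transfer time: megabytes * 8 bits per byte / link speed."""
    return payload_megabytes * 8 / link_megabits_per_s


payload_mb = 500  # hypothetical size of a solution snapshot sent to a node
for label, mbit in [("100 Mbit", 100), ("1 Gbit", 1000), ("10 Gbit", 10000)]:
    print(f"{label}: {transfer_seconds(payload_mb, mbit):.1f} s")
# 100 Mbit: 40.0 s
# 1 Gbit: 4.0 s
# 10 Gbit: 0.4 s
```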

Best regards and have a nice day,
Thomas
1 user thanked talbrecht for this useful post.
Remco on 5/31/2022(UTC)
9swampy
#4 Posted : 2 years ago
Rank: Member

Groups: Registered
Joined: 8/8/2016(UTC)
Posts: 13
Location: United Kingdom

Just to update: thanks for the pointers to the Timeline reports. They helped me work out what was going on. They didn't get me to a satisfactory outcome yet, but I've made progress.

@talbrecht: you weren't completely off-piste, but I'd already worked out things were a lot worse over a remote VPN connection. That's a whole other conversation, but I can confirm the local network's not the biggest problem when I'm co-located. There's more in your comments that would be worth getting back to, but...

Our local machines have NVMe drives; the VM running the node doesn't get the same local disk performance, oddly enough. Initial load on the node is just not on the same planet. Off the company network I've had good success following the recommendation to run caches on a RAM disk; I'm pushing for clearance to do the same in the company, pending InfoSec...

Not a dead end, just pending... will report back one day...

Remco
#5 Posted : 2 years ago
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,165

Thanks: 964 times
Was thanked: 1296 time(s) in 1202 post(s)
Handling huge solutions is a challenge for any tool, especially one like NCrunch. That's not saying we can't handle it, but rather that we're always hungry for more information on where the performance of the engine may be falling short. I'm quite interested in what you've learned from the timeline report as this might help us to prioritise future optimisation.

There's an update going out today that may improve things for you. It optimises the unpacking of coverage data when the engine is initialising. When lots of coverage points are involved, this significantly improves engine start times.