Welcome Guest! To enable all features please Login or Register.

Notification

Icon
Error

Why won't tests run on distributed node?
MarcChu
#1 Posted : Wednesday, June 8, 2022 3:28:55 PM(UTC)
Rank: Member

Groups: Registered
Joined: 9/26/2014(UTC)
Posts: 22

Thanks: 3 times
Was thanked: 1 time(s) in 1 post(s)
I'm trying to determine why certain projects won't run on my distributed grid nodes.

I have certain test projects that will build and run just fine on my local machine. However, in the NCrunch Processing Queue window, these projects will always display as "Pending" for any other server, for both the "Build Assembly" and "Analyse Assembly" tasks, and they never proceed to "Processing".

If I were to select a single test from one of these projects and attempt to run it on a selected server, it would show that it's queued to run, but it will never run.

I've confirmed that these projects don't have any Required Capabilities. They *should* be able to run on the other servers.

I'm not quite sure where to look to solve this problem. I'm not sure if this is an environment issue, or if there is something in the particular .csproj file that is causing this, or if there is some sort of NCrunch configuration setting (either general or project-specific).
Remco
#2 Posted : Wednesday, June 8, 2022 11:32:15 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 967 times
Was thanked: 1298 time(s) in 1203 post(s)
Hi, thanks for sharing this.

Can you try running with the node using a fresh project/solution built from one of the VS project templates? This will help to establish whether the problem is related to the configuration in your solution or whether it's something specific to the node.

Also, in the Distributed Processing Window, can you check what the reported Task Capacity is for the troublesome node?
MarcChu
#3 Posted : Thursday, June 9, 2022 1:06:02 PM(UTC)
Rank: Member

Groups: Registered
Joined: 9/26/2014(UTC)
Posts: 22

Thanks: 3 times
Was thanked: 1 time(s) in 1 post(s)
Well, all of my nodes have been able to run some subset of the tests. So, while I haven't yet done a new project from a template to test this specific thing, have little reason to doubt that those tests would successfully build and run.

The problem I'm having is with only a select few test projects in my solution (which translates to about 2 dozen out of quite possibly 100+). Many of these projects are integration test projects, and they have Required Capabilities such that I don't expect or want them to run on the nodes (although I'm not clear on whether those projects should still be showing up on the Processing Queue page as Pending tasks). So, I'm not too concerned about those.

So I feel like this must have something to do with properties specific to those projects. One thing I do know about the projects that are troublesome is that some of them use SpecFlow features. But as far as I know, the dependency for building these features into test files should be self-contained within the projects using a Nuget package.

I've been turning on as many logging settings as I can find in NCrunch in some attempt to see what logic it's using to decide what to build and on what servers, but I haven't yet found any output that gives me any hints as to why these few projects won't move on from Pending.

Of my 3 distributed nodes, they have task capacities of 24, 8, and 2.
Remco
#4 Posted : Thursday, June 9, 2022 11:45:39 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 967 times
Was thanked: 1298 time(s) in 1203 post(s)
Thanks for sharing these extra details.

Something that should help here is to turn on the "Required Capabilities" column in the Processing Queue Window (right click the column headers). This will show you which of the tasks have capabilities assigned to them and therefore won't be runnable on the nodes. Because NCrunch's configuration system is fairly complex, this will help you to identify projects that may have been assigned capabilities accidentally.

It's worth considering that when a project has a required capability, the node will be unable to build it and also unable to build any other project that depends on it. This means that assigning a capability to a project at the bottom of your dependency tree can cause the node to avoid building any projects in the solution.

It is normal behaviour for the processing queue to contain tasks that cannot be processed by nodes due to their required capabilities.
MarcChu
#5 Posted : Thursday, June 16, 2022 2:03:44 PM(UTC)
Rank: Member

Groups: Registered
Joined: 9/26/2014(UTC)
Posts: 22

Thanks: 3 times
Was thanked: 1 time(s) in 1 post(s)
Okay, yes, it seems that the test projects in question were themselves depending on test projects that had Required Capabilities defined. Now that I've refactored to get rid of those dependencies, the tests are building and running as expected. Thank you.


I do have another issue that's somewhat unrelated, but that's been vexing me for awhile.

Very often, when my tests are running, an entire fixture of tests will run in a single NCrunch process (which means they'll run sequentially). This could result in dozens or even hundreds of tests using only a single process and thread, even though I've got the capacity to run dozens in parallel, whether on my local server or on distributed nodes.

This has the unwelcome effect of having to wait potentially several minutes before getting any feedback on what is happening with the tests in that particular fixture.

If I select a subset of the tests in those fixtures, the same thing happens, they'll run in a single process/thread. However, if I then select another subset of those tests and run them, then those tests will run in a parallel in a separate process.

I'm trying to figure out if this behavior is due to something that I've done, and if I can somehow modify my settings or my tests to prevent this from happening, and to instead run these tests in parallel. What I can say is that these tests don't have any "Exlusively Uses" defined for them.
Remco
#6 Posted : Friday, June 17, 2022 12:13:52 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 967 times
Was thanked: 1298 time(s) in 1203 post(s)
It's great to hear that's fixed the capabilities problem.

The problem you're describing with the batching has some similarities with another problem a user is currently reporting.

Basically, NCrunch uses the last reported passing execution time of a test to determine how it should batch the test for parallelization. Ideally, we'd want to place each test in its own 'batch' so that we have perfect concurrency. Unfortunately, the architecture of test frameworks doesn't really make this possible, as the overhead that goes with setting up the execution would make such an approach extremely inefficient. So what NCrunch does is group together large numbers of tests with very low execution times and run these inside a single batch/task. For this to work, NCrunch's expected execution time of each test needs to be fairly accurate. If a test has a normal execution time of 5 minutes but NCrunch thinks it will take 80ms, then it'll be batched with lots of other tests and the overall efficiency of the run with be degraded.

For NCrunch to lose track of a test's normal execution time, something must be causing this to happen. Is it possible you have something in your code that might trigger a rapid artificial pass of a long running test? Or do your tests have an extremely variable execution time? If you're having ongoing problems with this, there must be some kind of event repeating that is causing NCrunch to lose accurate track of the normal execution times of your tests.

I recommend turning on the 'Expected Execution Time' column in the Tests Window. This is the time that NCrunch uses when batching tests, so it's worth keeping an eye on it when trying to establish what is causing this problem.

You can also force NCrunch to place tests in their own batch. A test labelled with IsolatedAttribute will always run in its own task. The batching logic also factors in ExclusivelyUsesAttribute and InclusivelyUsesAttribute, as it isn't possible for tests within a batch to make different use of resources. If you have tests that are consistently long running, you can use these attributes to work around the problem.
MarcChu
#7 Posted : Friday, June 17, 2022 2:17:53 PM(UTC)
Rank: Member

Groups: Registered
Joined: 9/26/2014(UTC)
Posts: 22

Thanks: 3 times
Was thanked: 1 time(s) in 1 post(s)
I'm not seeing any big differences in expected vs. actual execution times. But if I'm reading you correctly, then this could be working as intended if my fixture simply has a LOT of fast executing tests. Can you quantify "very low execution times" for me?

As an example, I had one task run 286 tests. Some of these tests are sub-second in duration, most of them seem to be under 3 seconds, a select few of them will be up to about 20 seconds. I see that if I run one of the sub-second tests in isolation, it'll take 5+ seconds due to the boot-strapping/one-time code.

So it certainly makes sense to batch a lot of these together into a single task. But that many? Even at an average execution time of only a second, it will take nearly 5 minutes before I get any feedback on that set of tests. (As it is, total execution times for fixtures like this ends up at around 10-15 minutes).

As an aside, I just accidentally chose to run my entire suite of tests ("Run all tests visible here" with all passing and failing tests showing). I canceled out after a minute after seeing what I'd done, but in looking at the completed tasks, they seem to have been batched into more digestible chunks. Usually, I'm only allowing impacted tests to run, and/or running visible failing tests.

So, does this behavior make sense?
MarcChu
#8 Posted : Friday, June 17, 2022 6:39:56 PM(UTC)
Rank: Member

Groups: Registered
Joined: 9/26/2014(UTC)
Posts: 22

Thanks: 3 times
Was thanked: 1 time(s) in 1 post(s)
OOOOH, hm, it seems that I've identified a use case where this happens: if an entire test fixture fails for some reason (in this case static setup for the test fixture), then every individual test in the fixture will fail very quickly. Upon the next attempt to run those failing tests, they'll all be batched in the same task, because the engine thinks they should be very fast executing tests.

I don't think this is the only situation where I run into this, but this is at least one of the situations. Is there any way I can mitigate this?

I might suggest that, if a test goes from passing to failing, then it probably makes more sense that the expected execution time should be based on the last time the test passed.
Remco
#9 Posted : Friday, June 17, 2022 11:43:59 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 967 times
Was thanked: 1298 time(s) in 1203 post(s)
MarcChu;16176 wrote:
OOOOH, hm, it seems that I've identified a use case where this happens: if an entire test fixture fails for some reason (in this case static setup for the test fixture), then every individual test in the fixture will fail very quickly. Upon the next attempt to run those failing tests, they'll all be batched in the same task, because the engine thinks they should be very fast executing tests.


I would say this seems like a viable bug. It isn't supposed to work that way. Can you confirm which test framework you're using? (also which version).

The batching needs to consider the execution time of the test fixture too. For example, if a test fixture has setup code that takes 30 seconds to run, this increases the likelihood that NCrunch will try to batch the tests within the fixture together, as otherwise we need to run the expensive fixture setup multiple times over a test run. Sadly there often isn't a perfect way to parallelize a complex test suite. It all comes down to trade-offs, but the behaviour you've described on the child execution times being reset is definitely wrong and I'd like to look closer at it.
MarcChu
#10 Posted : Wednesday, June 22, 2022 2:39:21 PM(UTC)
Rank: Member

Groups: Registered
Joined: 9/26/2014(UTC)
Posts: 22

Thanks: 3 times
Was thanked: 1 time(s) in 1 post(s)
I'm using NUnit 3.13.2.

I can confirm that the tests that seem to cause the biggest problems don't have very expensive setup (relative to other tests that run fine). Actually, I just did a test run (they ran automatically as they were impacted by code changes). Three different fixtures ran, as three different tasks, containing 285, 291, and 286 tests. The Expected Processing Times on these tasks was 00:08:09.666, 00:08:23.093, and 00:14:57.182, respectively. The length of those expected times, alone, I would expect the engine to cause to be broken up into many parallel tasks. The interesting thing about these tasks is that the "Actual Processing Time" seemed to be WAY off, 00:40:24.300, 00:41:36.638, and 00:41:34.934. I wasn't keeping a sharp eye on the clock, but I think they must have finished in something much closer to the expected times. Also, the Expected vs Actual for the individual tests within those tasks seemed to match for most if not all.

I guess I should mention that we've built up a somewhat complex test infrastructure over the years, with TestBase<T>, UnitTest<T>, and Integration<Test> abstract base classes, which define abstract/virtual methods for overriding, and which get NUnit attribute decorations, for the purpose of centralizing structure and logic, and easily sharing behavior. Most every test fixture we have inherits from one of these base classes. This is just to point out that so many of the tests are doing a lot of the same setup logic, yet this problem only seems to affect a few fixtures.

The particular fixtures that are exhibiting this problem are themselves derived classes of a particular abstract base class that defines most of the actual test methods, which is why the test counts are so similar. This structure mirrors the structure of the production code classes, which have very similar structures, function, and constraints. So every tests defined in the base class is performed 3 times, once for each derived class. I wonder if this structure is somehow affecting how NCrunch is deciding to batch the tests within.

Another potential complication is that many of these tests are parameterized with a ValueSourceAttribute. These tests are checking, e.g., whether the expected behavior is exhibited for the current user when he has a specific role in the system, and so some of these tests could be run an upward of 30 times apiece. (This is much of the cause for those large total test counts.) I do expect that NCrunch may have to run a single parameterized test method up to X times in a single task. However, I definitely don't expect them all to run in a single task.

So there are some extra details to chew on. BTW, these are the same test fixtures that I mentioned in my last post about having failed quickly in a prior run. This time, all of these tests had succeeded in their prior run, so that's not the problem here (and may not be a problem at all).
Remco
#11 Posted : Wednesday, June 22, 2022 10:57:38 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 967 times
Was thanked: 1298 time(s) in 1203 post(s)
Those expected execution times are definitely way off. Do you have much in the way of load-on-demand behaviour through your codebase? I'm wondering how consistent the execution times for your tests are in general. It would be interesting if you are able to provide an export from the Tests Window containing the Expected Processing Times and Actual Processing Times (you can send this through the NCrunch contact form) if it's something you're able to submit.

Something that would be interesting to examine is the execution time for one of your typical tests in a new process vs. an already executing one. If you choose a test in your codebase where the expected time is way out, try running this test alone using NCrunch and making note of the execution time. Then go through the Advanced context menu for the test and choose to run it again in an existing process. Take note of the execution time. Is it very different? My suspicion is that the execution time of your tests may be heavily dependent on their execution sequence. Be sure to also check the execution times attached to the fixture (as these are just as relevant).

Also, just to check.. Are you making use of NCrunch.Framework.AtomicAttribute at all?
MarcChu
#12 Posted : Tuesday, June 28, 2022 5:15:48 PM(UTC)
Rank: Member

Groups: Registered
Joined: 9/26/2014(UTC)
Posts: 22

Thanks: 3 times
Was thanked: 1 time(s) in 1 post(s)
I can export the NCrunch Tests window for you. But it appears that there is never any difference between the "Last Execution Time" and "Expected Execution Time" for any individual test. Also, I never see any execution times for fixtures on this window that seem to make any sense. I would expect them to be the sum of the individual tests, but they're always just values in the range of a few seconds, i.e. a value even lower than many of the individual test times.

It's on the NCrunch Processing Queue window that I get results that look to be more accurate, although, again, that window shows greatly differing results for Expected vs. Actual Processing Times.



The above is the result of running all of the tests in a particular project (I just clicked on the project name and Run selected test(s) in new process). This project contains 7 different fixtures, with a total of 973 tests, according to NCrunch. Three of those fixtures are as described in my previous post, inheriting from a single base class. Happily, most of those tests were well distributed, not just in running in different tasks, but also distributed across many nodes. However, one of those tasks contains 285 tests (what looks like an entire fixture). It was expected that this fixture would take 52 minutes to complete. It was accurately reported as taking about 30. The result of this was that most of the tests completed within probably 5-10 minutes. Then, a single task/thread spun and ran for an additional 20 minutes before I could receive any feedback on any of those 285 tests, while 40+ potential processes sat idle. I can confirm that all of these tests were passing previously, so there should probably not be much concern that the times were calculated to be artificially low (as a result of incorrectly fast-failing in a previous run).

Now, independent of whether it's correct that those 285 tests actually took 52 minutes to complete in a previous run, I would in no way expect that the engine might decide that we would want to run all of those tests sequentially, in a single task, especially if it thought it would take that long. Also, this is one of those 3 fixtures that inherits from the base test fixture. The other two fixtures had their 285+ tests distributed among different tasks and different nodes (although that's often not the case). So it doesn't seem like it could be the structure of those test classes that could be the cause of this, as they all share the same one. Also, these are pretty strictly "unit tests", in that database, file system, etc. are stubbed with in-memory implementations.

If we actually care about what the expected and actual processing times are: this is a single test run on a single project. Usually, these tests will be running with probably a few thousand other tests, including integration tests that will be hitting the file system, database, ElasticSearch index, and other slow infrastructure. So I could potentially see the system load alone being responsible for the tests taking much longer when all run together.

So...maybe the execution times are correct? I'm not sure it matters. From what I'm seeing in this test run, I respectfully submit that there may just be something in the batching logic that needs refining.

I do not use AtomicAttribute anywhere.
Remco
#13 Posted : Wednesday, June 29, 2022 12:34:46 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 967 times
Was thanked: 1298 time(s) in 1203 post(s)
Thanks for sharing these extra details.

I've taken another look at the way the batching logic works in the context of the use case you've described above. I think I understand what is happening here.

Basically, the batching is designed to watch for situations where a fixture may have a slower (more than 2 seconds) set up or tear down time. In such a situation, it will try to cram the tests together in the same batch to avoid needing to run the fixture setup/teardown multiple times over a cycle. This works well for 'typical' scenarios where you might have a slow setup fixture with a dozen odd tests, but it is rather disastrous when you have a 3 second fixture with 285 slow running tests. So basically the structure of your test suite has surfaced a serious efficiency problem in the batching.

The good news is that this shouldn't be too hard to fix (in NCrunch). We basically just need to adjust the threshold so that rather than using a fixed time (like 2 seconds), we instead consider the time taken by the fixture in proportion to its tests.

I'll let you know when the fix is available. I suspect the variability in your execution times may be due to CPU overload, in which case lowering the number of execution threads may reduce the occurrence of this problem until I can get you the fix.
Remco
#14 Posted : Wednesday, June 29, 2022 2:31:09 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 967 times
Was thanked: 1298 time(s) in 1203 post(s)
1 user thanked Remco for this useful post.
MarcChu on 7/1/2022(UTC)
MarcChu
#15 Posted : Friday, July 1, 2022 2:29:25 PM(UTC)
Rank: Member

Groups: Registered
Joined: 9/26/2014(UTC)
Posts: 22

Thanks: 3 times
Was thanked: 1 time(s) in 1 post(s)
Yes! That seems to have improved things greatly. It will take some time to determine whether this looks well solved, but I'm definitely optimistic, given the initial results.

Also, I initially thought there might still be a problem with re-running fast-failing tests all being directed to a single task, but now I don't seem to be seeing that behavior.
Remco
#16 Posted : Saturday, July 2, 2022 12:21:43 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 967 times
Was thanked: 1298 time(s) in 1203 post(s)
MarcChu;16183 wrote:
Yes! That seems to have improved things greatly. It will take some time to determine whether this looks well solved, but I'm definitely optimistic, given the initial results.


Excellent! Thanks for confirming this :)

MarcChu;16183 wrote:
Also, I initially thought there might still be a problem with re-running fast-failing tests all being directed to a single task, but now I don't seem to be seeing that behavior.


Keep an eye out. If you can find a way to reproduce other pipeline issues, I'll be happy to take a closer look.
Users browsing this topic
Guest
Forum Jump  
You cannot post new topics in this forum.
You cannot reply to topics in this forum.
You cannot delete your posts in this forum.
You cannot edit your posts in this forum.
You cannot create polls in this forum.
You cannot vote in polls in this forum.

YAF | YAF © 2003-2011, Yet Another Forum.NET
This page was generated in 0.130 seconds.
Trial NCrunch
Take NCrunch for a spin
Do your fingers a favour and supercharge your testing workflow
Free Download