Welcome Guest! To enable all features please Login or Register.

Notification

Icon
Error

TeamCity distributed NCrunch is very slow
robert-j-engdahl
#1 Posted : Monday, June 4, 2018 1:31:28 PM(UTC)
Rank: Member

Groups: Registered
Joined: 1/26/2017(UTC)
Posts: 13
Location: Denmark

Thanks: 4 times
We went from PowerShell xunit console invocation on 8 cores @ 8 minutes to NCrunch grid with three nodes each with 18 assigned cores @ 7 minutes to 22 minutes, depending on the load.

Our log is full of entries like

[GridMessageSender-113] Resetting ping

but they don't need to have anything to do with the performance.

Any suggestions on what we should do?
Remco
#2 Posted : Monday, June 4, 2018 10:38:23 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
Hi, thanks for sharing this issue.

The problem with the resetting ping issue is caused by a diagnostic message that accidentally went out with the release build. Probably, this may have some impact on performance, though I haven't measured it in detail yet. I recommend turning your logging off to see if this makes any difference. This problem is a regression in v3.17.

Did you also try using this configuration with a pre-3.17 version of NCrunch? I'm not sure if you're describing a regression in existing behaviour or a performance problem with distributed processing in general.

An 8 minute build is fairly fast, so I'd be willing to bet that quite a large portion of this time is actually spent on things outside the test run, such as building the projects and synchronising with the grid nodes. Due to the limitations of the build system, the building of projects is duplicated over the nodes redundantly. The major advantage of NCrunch distributed processing is to provide higher throughput when executing tests. How many projects and tests do you have in the solution, and what is the normal execution time for each test? Are there any outlier tests with a long running time?
robert-j-engdahl
#3 Posted : Tuesday, June 5, 2018 5:49:37 AM(UTC)
Rank: Member

Groups: Registered
Joined: 1/26/2017(UTC)
Posts: 13
Location: Denmark

Thanks: 4 times
I can't tell if this is a regression. We just tried out NCrunch with TeamCity to see if it would work. We are flying blind here.

I understand the diagnostic message will go away with the next upgrade. We look forward to it, and there might have been a minor release since we upgraded the server.

Just to clarify: The 8 minute build is *without* NCrunch and the unstable build time (between 7 and 22 minutes) is *with* NCrunch. To our surprise there was no improvement at all from distributing our CI (it improved our developer machine testings times tremendously). I understand the build cannot be speeded up from distributing, because there is a chance everyone needs to build everything anyway.

Looking further at the logs a surprising amount of time is spend doing so-called clean-up (like 3 minutes or so).

Our solution has 11000+ unit tests and most of them run in a mater of ms, but a few takes 20'ish seconds. We have about 256 projects in our solution, 72 of which are test assemblies.
Remco
#4 Posted : Tuesday, June 5, 2018 6:05:00 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
The benefits of distributed processing on end-to-end runs generally tend to favour solutions with a large number of medium to slow running tests. By this, I mean tests with a running time of 10 seconds up to about 10 minutes. When working with large test suites of very fast running tests, the overhead required to scale the execution of the tests can easily exceed the performance benefits of running them in parallel.

In your case, having 256 projects would probably give you a very good experience with NCrunch in the UI, because its build optimisations would allow it to very quickly progress to running tests after changes are made to the code. However, this would work against you when using the console tool with distributed processing, because every node in the grid would need to build each of those 256 projects before it could run all of the tests. The first time each project is built on any machine, NCrunch also needs to create a workspace for it, which can be quite I/O intensive when large files are involved.

When you mean work is being spent on 'clean-up', can you share any more details on this? Is this something that NCrunch is doing?
robert-j-engdahl
#5 Posted : Tuesday, June 5, 2018 6:17:01 AM(UTC)
Rank: Member

Groups: Registered
Joined: 1/26/2017(UTC)
Posts: 13
Location: Denmark

Thanks: 4 times
I remembered wrongly. The clean-up notifications are like

[Core-20] Cleaning up workspace: C:\Users\administrator.ACCU\AppData\Local\NCrunch\3748\25

and they amount to 20'ish seconds. We only see them when ncrunch is configured to also build and test on the TeamCity server. I expect the nodes are doing something similar, but it doesn't make it into our logs.

If you say distributed NCrunch is good at longer running tests, perhaps we can just increase the batch size? We didn't find an option to do that, but that would probably work great.
Remco
#6 Posted : Tuesday, June 5, 2018 9:12:00 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
NCrunch doesn't have an option to directly control the batching of tests, outside of AtomicAttribute.

Though in all honesty, given the structure of your solution I don't think you're going to make any big wins here. A build that runs in 8 minutes on a solution of that size does not give many options as far as optimisation is concerned. However, it's worth considering that the distributed processing will make your build system much more scalable. As the solution continues to grow and new tests are added, the extra capacity of your grid will likely perform wonders in preventing the build running time from increasing much beyond its current point.
robert-j-engdahl
#7 Posted : Tuesday, June 5, 2018 9:23:46 AM(UTC)
Rank: Member

Groups: Registered
Joined: 1/26/2017(UTC)
Posts: 13
Location: Denmark

Thanks: 4 times
It appears the AtomicAttribute can be applied to assemblies as well. So it would be relatively easy for us to test if that improves speed (there are still 72 of them, so there is something to be parallelized). I'll keep you posted on how that goes.

If that works wonders, I would suggest a default-distribution-batch-size option for anatomic tests.
Remco
#8 Posted : Tuesday, June 5, 2018 11:16:09 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
robert-j-engdahl;12281 wrote:
It appears the AtomicAttribute can be applied to assemblies as well. So it would be relatively easy for us to test if that improves speed (there are still 72 of them, so there is something to be parallelized). I'll keep you posted on how that goes.


It can, but unfortunately it likely won't have the effect you're after. When applied at assembly level, the engine behaves as though its been applied to every fixture within the assembly. So every fixture becomes atomic, meaning that you can still have multiple batches for the same assembly (for example, maybe 1 for each fixture).
robert-j-engdahl
#9 Posted : Wednesday, June 6, 2018 9:30:15 AM(UTC)
Rank: Member

Groups: Registered
Joined: 1/26/2017(UTC)
Posts: 13
Location: Denmark

Thanks: 4 times
Ok,

That is consistent with the results I can measure. Then I'd like to make a feature-request for a default-batch-size option, because I fear there is a huge overhead in the batch communication.

I still think the distributed NCrunch builds are insanely unstable in terms of build-times, compared to invoking xunit directly. Here is are the build times from our different branches (they are not many tests appart), and the distribution of build-times is just very wide. We must be doing something wrong.



Before we tried out NCrunch (a week ago), these would be 7 minutes for a lone build on an idle build-node and up to 12 minutes if other build nodes were bussy (Each build node is a virtual machine on the same server).
Remco
#10 Posted : Wednesday, June 6, 2018 10:51:41 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
The 'batching' of tests is very dynamic, as NCrunch will try to put faster running tests into bigger batches. You'll find that if you have 1000 tests with a running time of <1ms, they'll usually all land in the same task. Slow running tests will generally end up in small batches, and very often can actually be in their own task. This usually relieves the need for micro-managing this sort of thing. It does also mean that there is no set 'batch size' inside the engine itself.

If the nodes are being shared with other clients or build agents, this could cause huge variations in the availability of their resources during builds. I'd suggest taking a look at the logs to see if you can determine where the time is being spent.

I also recommend looking into the behaviour of the nodes themselves. If the nodes are doing other work or if there is other software running on them that might consume resources, they may give inconsistent execution times. Watch out for Windows Defender especially - we've seen this eat up our own nodes seemingly at random.

It occurs to me that we really lack an effective way to troubleshoot performance like this, because all we have are the summary execution times and a giant log file full of noise. A more visual display depicting where the time is spent would be much more useful. I'd like to do some work in this area and get back to you when I have a better way to help you drill into this data.
robert-j-engdahl
#11 Posted : Wednesday, June 6, 2018 11:06:30 AM(UTC)
Rank: Member

Groups: Registered
Joined: 1/26/2017(UTC)
Posts: 13
Location: Denmark

Thanks: 4 times
That sounds awesome. Flying blind is not fun. I'd like to see some measurement of the communication overhead and how much time is spend waiting for NCrunch nodes to become ready.

A lot of different software is involved in this setup, so a trouble shooting guide would also be quite nice (but then again, we have the forums).

It is great that you have auto-batch size already. I agree setting a default would be redundant. How will tests with unknown execution times be handled in that setup, and could [AutoData] cause known tests to seem unkown?
Remco
#12 Posted : Wednesday, June 6, 2018 11:52:48 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
robert-j-engdahl;12295 wrote:

It is great that you have auto-batch size already. I agree setting a default would be redundant. How will tests with unknown execution times be handled in that setup, and could [AutoData] cause known tests to seem unkown?


By default, NCrunch considers a test to have an execution time of 1 second. So this would greatly reduce the efficiency of the test run if NCrunch doesn't have any execution history for your tests at the start of the run. You've probably already done this, but do check to make sure that you have the NCrunch cache file storage path set to an absolute path on your build server, so the test execution times can be shared across runs.

AutoData is a risk factor if it's creating tests with unstable parameter data. I think we did eventually find a fix for this with the developers of this library, but I recommend checking in your own NCrunch UI to make sure that you're not seeing new tests appearing all the time while you work.

I also agree that the documentation and general UX in this area of NCrunch is not very good. The console tool has mostly been considered a side-feature. It's only really been this year that we've had a renewed focus on improving it and bringing it up to a higher standard. I think that having a way to graphically represent the behaviour of the engine during the test run would be a huge help in understanding where the problems are.
Remco
#13 Posted : Wednesday, August 1, 2018 8:23:54 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
To follow up with this issue, we've introduced a major new feature in the just released v3.18 that creates a detailed timeline report of tasks executed by the console tool. This should be an effective tool for troubleshooting CI performance issues.
Kirlu
#14 Posted : Thursday, August 2, 2018 11:04:02 AM(UTC)
Rank: Member

Groups: Registered
Joined: 6/8/2017(UTC)
Posts: 10
Location: Democratic Republic of the Congo

Was thanked: 2 time(s) in 2 post(s)
We have updated to latest Ncrunch console and Grid Server and we see an issue.
It seems that it will only use one Ncrunch Test Host on each machine while running on TeamCity using Console.

What configuration causes this behavior:
Quote:
<GlobalConfiguration>
<Settings>
<CPUCoresAssignedToNCrunch>1,2,3,4,5,6,7,8</CPUCoresAssignedToNCrunch>
<CPUCoresAssignedToVisualStudio>1,2,3,4,5,6,7,8</CPUCoresAssignedToVisualStudio>
<FastLaneThreads>1</FastLaneThreads>
<GridReconnectionDelayInSeconds>0</GridReconnectionDelayInSeconds>
<GridServerReferencesForComputer>
<Value>askvision:80&gt;</Value>
<Value>jnyvision:80&gt;</Value>
</GridServerReferencesForComputer>
<MaxNumberOfProcessingThreads>0</MaxNumberOfProcessingThreads>
<MaxTestRunnerProcessesToPool>3</MaxTestRunnerProcessesToPool>
<SystemConfigured>True</SystemConfigured>
<TerminateTestRunnerTasksOnExecutionComplete>False</TerminateTestRunnerTasksOnExecutionComplete>
<UseSimplifiedSettingsWhereAvailable>True</UseSimplifiedSettingsWhereAvailable>
<NCrunchCacheStoragePath>c:\ncrunch</NCrunchCacheStoragePath>
</Settings>
<EngineModes>
<EngineMode>
<Name>TeamCity Continues Integration</Name>
<Settings>
<TestsToExecuteAutomatically>(DoesNotHaveCategory 'TeamCityIgnore')</TestsToExecuteAutomatically>
</Settings>
</EngineMode>
</EngineModes>
</GlobalConfiguration>



Kirlu
#15 Posted : Thursday, August 2, 2018 11:34:27 AM(UTC)
Rank: Member

Groups: Registered
Joined: 6/8/2017(UTC)
Posts: 10
Location: Democratic Republic of the Congo

Was thanked: 2 time(s) in 2 post(s)
We added the following parameter to our configuration fixing the error:
Quote:
<AllowParallelTestExecution>True</AllowParallelTestExecution>
Remco
#16 Posted : Thursday, August 2, 2018 11:37:49 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
Kirlu;12486 wrote:
We added the following parameter to our configuration fixing the error:
Quote:
<AllowParallelTestExecution>True</AllowParallelTestExecution>


That would be the one! I'm glad you managed to find this!
Jogge
#17 Posted : Monday, August 6, 2018 6:09:03 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 8/6/2018(UTC)
Posts: 2
Location: Denmark

We have this configuration for our TeamCity Build Agent:

Code:
<GlobalConfiguration>
  <Settings>
    <CPUCoresAssignedToNCrunch>0,1,2,3,4,5,6,7</CPUCoresAssignedToNCrunch>
    <CPUCoresAssignedToVisualStudio></CPUCoresAssignedToVisualStudio>
    <FastLaneThreads>1</FastLaneThreads>
    <GridReconnectionDelayInSeconds>10</GridReconnectionDelayInSeconds>
    <GridServerReferencesForComputer>
      <Value>askvision:80&gt;</Value>
      <Value>pgpvision:80&gt;</Value>
      <Value>jnyvision:80&gt;</Value>
      <Value>jlsvision:80&gt;</Value>
    </GridServerReferencesForComputer>
    <DisabledNodes>
      <Value>(local):0</Value>
    </DisabledNodes>
    <MaxNumberOfProcessingThreads>8</MaxNumberOfProcessingThreads>
    <MaxTestRunnerProcessesToPool>1</MaxTestRunnerProcessesToPool>
    <AllowParallelTestExecution>True</AllowParallelTestExecution>
    <SystemConfigured>True</SystemConfigured>
    <TerminateTestRunnerTasksOnExecutionComplete>False</TerminateTestRunnerTasksOnExecutionComplete>
    <UseSimplifiedSettingsWhereAvailable>True</UseSimplifiedSettingsWhereAvailable>
    <NCrunchCacheStoragePath>c:\ncrunch</NCrunchCacheStoragePath>
  </Settings>
  <EngineModes>
    <EngineMode>
      <Name>TeamCity Continues Integration</Name>
      <Settings>
        <TestsToExecuteAutomatically>(DoesNotHaveCategory 'TeamCityIgnore')</TestsToExecuteAutomatically>
      </Settings>
    </EngineMode>
  </EngineModes>
</GlobalConfiguration>


This following screenshot is from the new awesome NCrunch Timeline Report:

NCrunch timeline report

As you can see ASKVISION has just started building when all the other Grid Nodes already has completed all tasks. Why is ASKVISION not stopped when everything else is completed? And why does each Grid Node build everything and not only the parts dependend from the test projects the Grid Node is assigned to test?
Remco
#18 Posted : Monday, August 6, 2018 7:33:09 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,177

Thanks: 968 times
Was thanked: 1298 time(s) in 1203 post(s)
Hi, thanks for posting!

Quote:

As you can see ASKVISION has just started building when all the other Grid Nodes already has completed all tasks. Why is ASKVISION not stopped when everything else is completed? And why does each Grid Node build everything and not only the parts dependend from the test projects the Grid Node is assigned to test?


It looks to me like ASKVISION is performing an extensive synchronisation with the grid client at the start of the run. Each grid node has on it a number of stored snapshots representing solutions that have been previously transferred from grid clients over to the server for processing. These snapshots are retained between sessions, so that once the files have been copied over, it's less likely that they'll need to be copied again.

The management of snapshots on a grid node is a complex thing. This is because solutions can often undergo large changes (i.e. branch switches), and grid nodes are often also shared by multiple grid clients. Sometimes it becomes necessary to create a new snapshot on a grid node when a client connects, because no suitable existing snapshot is available to service the client. This can result in a lengthy synchronisation involving many files being copied or transferred between snapshots on the grid node.

So it's quite possible that you won't see this again. If you do, it may be that your solution storage data limit is set too low on the grid node.

Regarding the building of projects on each node, this is due to the design of the distributed processing. NCrunch cannot run tests without having first built all projects required in the runtime environment. This means that build tasks are duplicated over the grid, with each node responsible for building its own artifacts.

It is technically possible for such artifacts to be built on only one machine and then be transferred to others (thereby reducing resource requirements), but such a system comes with its own complexities and limitations. We have not yet implemented anything like this in NCrunch's distributed processing. So you'll generally find you get better results when using bigger chunkier grid nodes as opposed to a larger number of small nodes.

When NCrunch chooses which projects to build, it will automatically build all projects in the solution regardless of which ones contain tests. This is for two reasons:
1. It is difficult to know which projects actually contain tests without having a compiled assembly to interrogate. The discovery of tests is dependent on a project first being built.
2. Although it is used primarily as a test runner, NCrunch is also a build tool. If you have a project in your solution that fails to build, NCrunch will report the failure.

It's possible to exclude certain projects from being built on grid nodes by using the capabilities system. If you have projects in your solution that are not required for execution of your tests and you aren't interested in seeing the results of their build in the NCrunch output, excluding them using capabilities will improve the performance of your build. Another option would be creating a different solution file that contains only your test projects and using this for the target of the NCrunch console tool run.


Edit: I've also just noticed that the build tasks on ASKVISION took much longer to run than on the other nodes. How does the hardware of this machine compare to the other machines in your grid? Is it possible the machine was tied up with some kind of system related task? We've personally experienced some issues like this with Windows Defender.

Another Edit: You have a very good an valid question of why the run wasn't terminated when all the tests had finished executing. The honest answer is that when using NCrunch inside VS, you wouldn't want this to happen because you'd likely still be making changes and new tests would need to be executed. Under a CI scenario, you would want the run to terminate as early as possible. We haven't implemented such a consideration into the console tool yet (which is basically just a headless client). Probably it's something we should look at :)
Users browsing this topic
Guest
Forum Jump  
You cannot post new topics in this forum.
You cannot reply to topics in this forum.
You cannot delete your posts in this forum.
You cannot edit your posts in this forum.
You cannot create polls in this forum.
You cannot vote in polls in this forum.

YAF | YAF © 2003-2011, Yet Another Forum.NET
This page was generated in 0.139 seconds.
Trial NCrunch
Take NCrunch for a spin
Do your fingers a favour and supercharge your testing workflow
Free Download