TeamCity distributed NCrunch is very slow
robert-j-engdahl
#1 Posted : Monday, June 04, 2018 1:31:28 PM(UTC)
We went from invoking the xunit console runner from PowerShell on 8 cores (about 8 minutes) to an NCrunch grid with three nodes, each with 18 assigned cores, which takes between 7 and 22 minutes depending on the load.

Our log is full of entries like

[GridMessageSender-113] Resetting ping

but they may not have anything to do with the performance.

Any suggestions on what we should do?
Remco
#2 Posted : Monday, June 04, 2018 10:38:23 PM(UTC)
Hi, thanks for sharing this issue.

The resetting ping entries come from a diagnostic message that accidentally went out with the release build. This may have some impact on performance, though I haven't measured it in detail yet. I recommend turning your logging off to see if this makes any difference. This problem is a regression in v3.17.

Did you also try using this configuration with a pre-3.17 version of NCrunch? I'm not sure if you're describing a regression in existing behaviour or a performance problem with distributed processing in general.

An 8-minute build is fairly fast, so I'd be willing to bet that quite a large portion of this time is actually spent on things outside the test run, such as building the projects and synchronising with the grid nodes. Due to limitations of the build system, the building of projects is duplicated redundantly across the nodes. The major advantage of NCrunch distributed processing is to provide higher throughput when executing tests. How many projects and tests do you have in the solution, and what is the normal execution time for each test? Are there any outlier tests with a long running time?
robert-j-engdahl
#3 Posted : Tuesday, June 05, 2018 5:49:37 AM(UTC)
I can't tell if this is a regression. We just tried out NCrunch with TeamCity to see if it would work. We are flying blind here.

I understand the diagnostic message will go away with the next upgrade. We look forward to it, and there might have been a minor release since we upgraded the server.

Just to clarify: the 8-minute build is *without* NCrunch and the unstable build time (between 7 and 22 minutes) is *with* NCrunch. To our surprise there was no improvement at all from distributing our CI (it improved our developer machine testing times tremendously). I understand the build cannot be sped up by distributing, because there is a chance every node needs to build everything anyway.

Looking further at the logs, a surprising amount of time is spent doing so-called clean-up (around 3 minutes).

Our solution has 11,000+ unit tests; most of them run in a matter of milliseconds, but a few take around 20 seconds. We have about 256 projects in the solution, 72 of which are test assemblies.
Remco
#4 Posted : Tuesday, June 05, 2018 6:05:00 AM(UTC)
The benefits of distributed processing on end-to-end runs generally tend to favour solutions with a large number of medium to slow running tests. By this, I mean tests with a running time of 10 seconds up to about 10 minutes. When working with large test suites of very fast running tests, the overhead required to scale the execution of the tests can easily exceed the performance benefits of running them in parallel.

In your case, having 256 projects would probably give you a very good experience with NCrunch in the UI, because its build optimisations would allow it to very quickly progress to running tests after changes are made to the code. However, this would work against you when using the console tool with distributed processing, because every node in the grid would need to build each of those 256 projects before it could run all of the tests. The first time each project is built on any machine, NCrunch also needs to create a workspace for it, which can be quite I/O intensive when large files are involved.

When you say time is being spent on 'clean-up', can you share any more details on this? Is this something that NCrunch is doing?
robert-j-engdahl
#5 Posted : Tuesday, June 05, 2018 6:17:01 AM(UTC)
I remembered wrongly. The clean-up notifications are like

[Core-20] Cleaning up workspace: C:\Users\administrator.ACCU\AppData\Local\NCrunch\3748\25

and they amount to around 20 seconds. We only see them when NCrunch is configured to also build and test on the TeamCity server. I expect the nodes are doing something similar, but it doesn't make it into our logs.

If distributed NCrunch is better suited to longer-running tests, perhaps we can just increase the batch size? We didn't find an option to do that, but it would probably work well.
Remco
#6 Posted : Tuesday, June 05, 2018 9:12:00 AM(UTC)
NCrunch doesn't have an option to directly control the batching of tests, outside of AtomicAttribute.

Though in all honesty, given the structure of your solution I don't think you're going to make any big wins here. A build that runs in 8 minutes on a solution of that size does not give many options as far as optimisation is concerned. However, it's worth considering that the distributed processing will make your build system much more scalable. As the solution continues to grow and new tests are added, the extra capacity of your grid will likely perform wonders in preventing the build running time from increasing much beyond its current point.
robert-j-engdahl
#7 Posted : Tuesday, June 05, 2018 9:23:46 AM(UTC)
It appears the AtomicAttribute can be applied to assemblies as well. So it would be relatively easy for us to test if that improves speed (there are still 72 of them, so there is something to be parallelized). I'll keep you posted on how that goes.

If that works wonders, I would suggest a default-distribution-batch-size option for non-atomic tests.
Remco
#8 Posted : Tuesday, June 05, 2018 11:16:09 AM(UTC)
robert-j-engdahl;12281 wrote:
It appears the AtomicAttribute can be applied to assemblies as well. So it would be relatively easy for us to test if that improves speed (there are still 72 of them, so there is something to be parallelized). I'll keep you posted on how that goes.


It can, but unfortunately it likely won't have the effect you're after. When applied at assembly level, the engine behaves as though it's been applied to every fixture within the assembly. So every fixture becomes atomic, meaning that you can still have multiple batches for the same assembly (for example, roughly one per fixture).
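
For illustration, here is a minimal sketch of the two ways of applying the attribute, assuming the NCrunch.Framework package is referenced (the namespace, fixture and test names are made up):

// A minimal sketch, assuming the NCrunch.Framework NuGet package is referenced.
// The namespace, fixture and test names below are hypothetical.
using NCrunch.Framework;
using Xunit;

// Assembly-level: as described above, this effectively marks every fixture in
// the assembly as atomic, so the assembly can still be split into multiple
// batches (roughly one per fixture).
[assembly: Atomic]

namespace MyCompany.Tests
{
    // Fixture-level: all tests in this class are kept together in a single task.
    [Atomic]
    public class OrderProcessingTests
    {
        [Fact]
        public void Order_total_includes_tax() { /* ... */ }

        [Fact]
        public void Order_total_is_never_negative() { /* ... */ }
    }
}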
robert-j-engdahl
#9 Posted : Wednesday, June 06, 2018 9:30:15 AM(UTC)
OK,

That is consistent with the results I can measure. Then I'd like to make a feature request for a default-batch-size option, because I fear there is a large overhead in the batch communication.

I still think the distributed NCrunch builds are insanely unstable in terms of build times compared to invoking xunit directly. Here are the build times from our different branches (they are not many tests apart), and the distribution of build times is just very wide. We must be doing something wrong.



Before we tried out NCrunch (a week ago), these would be 7 minutes for a lone build on an idle build node and up to 12 minutes if other build nodes were busy (each build node is a virtual machine on the same server).
Remco
#10 Posted : Wednesday, June 06, 2018 10:51:41 AM(UTC)
The 'batching' of tests is very dynamic, as NCrunch will try to put faster-running tests into bigger batches. You'll find that if you have 1000 tests with a running time of <1ms, they'll usually all land in the same task. Slow-running tests will generally end up in small batches, and very often in a task of their own. This usually removes the need to micro-manage this sort of thing. It does also mean that there is no fixed 'batch size' inside the engine itself.
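
As a rough illustration only (this is NOT the engine's actual code; the types, the time budget and the greedy packing are assumptions made purely to show the idea), the behaviour is along these lines:

// A rough, hypothetical sketch of dynamic batching by expected running time.
// This is NOT NCrunch's actual algorithm; the types and the time budget are
// assumptions for illustration only.
using System;
using System.Collections.Generic;
using System.Linq;

public record TestCase(string Name, TimeSpan ExpectedDuration);

public static class BatchingSketch
{
    // Packs fast tests into large batches up to a time budget; anything at or
    // above the budget is given a batch (task) of its own.
    public static List<List<TestCase>> Batch(IEnumerable<TestCase> tests, TimeSpan budget)
    {
        var batches = new List<List<TestCase>>();
        var current = new List<TestCase>();
        var currentTotal = TimeSpan.Zero;

        foreach (var test in tests.OrderBy(t => t.ExpectedDuration))
        {
            if (test.ExpectedDuration >= budget)
            {
                batches.Add(new List<TestCase> { test }); // slow test gets its own task
                continue;
            }

            if (current.Count > 0 && currentTotal + test.ExpectedDuration > budget)
            {
                batches.Add(current);
                current = new List<TestCase>();
                currentTotal = TimeSpan.Zero;
            }

            current.Add(test);
            currentTotal += test.ExpectedDuration;
        }

        if (current.Count > 0)
            batches.Add(current);

        return batches;
    }
}

The important point is that the grouping is driven by recorded execution times, which is why having execution history available to the engine matters.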

If the nodes are being shared with other clients or build agents, this could cause huge variations in the availability of their resources during builds. I'd suggest taking a look at the logs to see if you can determine where the time is being spent.

I also recommend looking into the behaviour of the nodes themselves. If the nodes are doing other work or if there is other software running on them that might consume resources, they may give inconsistent execution times. Watch out for Windows Defender especially - we've seen this eat up our own nodes seemingly at random.

It occurs to me that we really lack an effective way to troubleshoot performance like this, because all we have are the summary execution times and a giant log file full of noise. A more visual display depicting where the time is spent would be much more useful. I'd like to do some work in this area and get back to you when I have a better way to help you drill into this data.
robert-j-engdahl
#11 Posted : Wednesday, June 06, 2018 11:06:30 AM(UTC)
That sounds awesome. Flying blind is not fun. I'd like to see some measurement of the communication overhead and how much time is spent waiting for NCrunch nodes to become ready.

A lot of different software is involved in this setup, so a troubleshooting guide would also be quite nice (but then again, we have the forums).

It is great that you have auto-batch sizing already. I agree setting a default would be redundant. How will tests with unknown execution times be handled in that setup, and could [AutoData] cause known tests to seem unknown?
Remco
#12 Posted : Wednesday, June 06, 2018 11:52:48 AM(UTC)
robert-j-engdahl;12295 wrote:

It is great that you have auto-batch sizing already. I agree setting a default would be redundant. How will tests with unknown execution times be handled in that setup, and could [AutoData] cause known tests to seem unknown?


By default, NCrunch considers a test to have an execution time of 1 second. So this would greatly reduce the efficiency of the test run if NCrunch doesn't have any execution history for your tests at the start of the run. You've probably already done this, but do check to make sure that you have the NCrunch cache file storage path set to an absolute path on your build server, so the test execution times can be shared across runs.

AutoData is a risk factor if it's creating tests with unstable parameter data. I think we did eventually find a fix for this with the developers of this library, but I recommend checking in your own NCrunch UI to make sure that you're not seeing new tests appearing all the time while you work.
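
To make the risk concrete, here is a hedged illustration using AutoFixture.Xunit2 with xunit (the class, test and parameter names are made up):

// A hedged illustration of the AutoData risk described above, assuming
// AutoFixture.Xunit2 and xunit are referenced.
using AutoFixture.Xunit2;
using Xunit;

public class PricingTests
{
    // AutoData generates fresh parameter values each time the test is
    // discovered or run. If those generated values become part of the test
    // case's identity, a runner that tracks execution history per test case
    // can end up treating every run's cases as brand-new tests with unknown
    // execution times.
    [Theory, AutoData]
    public void Applying_a_zero_discount_keeps_the_total(decimal total)
    {
        var discounted = total - (total * 0m);
        Assert.Equal(total, discounted);
    }
}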

I also agree that the documentation and general UX in this area of NCrunch is not very good. The console tool has mostly been considered a side-feature. It's only really been this year that we've had a renewed focus on improving it and bringing it up to a higher standard. I think that having a way to graphically represent the behaviour of the engine during the test run would be a huge help in understanding where the problems are.