Tests on grid nodes should have higher default priority
Ralf Koban
#1 Posted : Saturday, September 13, 2014 11:00:15 AM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 5/19/2014(UTC)
Posts: 44
Location: Germany

Thanks: 4 times
Was thanked: 10 time(s) in 9 post(s)
Hello Remco,

When I let the tests (about 7000) run in NCrunch (via "Run all tests" on test window), NCrunch tries to run all the tests locally before it actually starts to build them in the grid.

At least, this is what I conclude from the processing queue, where the remote builds have a priority of 1000 whereas the local tests have a priority below 0.

It would be really good if the tests on the grid nodes had a similar priority to the local ones, so that the grid can already start running tests while the local tests are running as well. That would dramatically reduce the overall feedback time (which is very high: depending on what I do, about 20 minutes or more in that specific situation).


Some background info:
On my machine I've set up 2 local fastlane test runner threads (threshold is 250 ms) and 4 test runner threads, and I've set up 3 additional grid nodes with 8-12 test runners each. So running all tests in the grid (and locally) would normally take about 1 minute to complete. But in that specific situation, only my 2 local non-fastlane test runner threads are executing most of the time, which slows down the feedback significantly.

BTW: With the solution I'm using, my VS consumes about 2 GB of RAM in this situation. I'm not sure whether garbage collection comes into play here, but I assume so, as SysInternals Process Explorer shows that the time spent in GC is about 20%.



Best regards,
Ralf Koban
Remco
#2 Posted : Saturday, September 13, 2014 10:54:38 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 6,976

Thanks: 931 times
Was thanked: 1257 time(s) in 1170 post(s)
Hi Ralf -

Something isn't behaving correctly here. The priority of tasks in the queue should only make a difference when there are no concurrency/resource constraints involved and the engine is out of capacity. If there are still execution threads available on the grid and there is work available, then these threads should be working.
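
To illustrate the point above, here is a minimal Python sketch of a scheduler in which priority only orders tasks that compete for the same pool of threads. It is not NCrunch's actual scheduler: the `pick_tasks` function, the pool names, and the assumption that a larger priority value is served first are all hypothetical.

```python
from collections import defaultdict


def pick_tasks(tasks, free_threads):
    """Choose which tasks start now.

    tasks: list of (priority, pool, name) tuples.
    free_threads: dict mapping pool name -> number of idle threads.
    Assumption for this sketch: a larger priority value is served first.
    Returns a dict mapping pool name -> list of task names started now.
    """
    by_pool = defaultdict(list)
    for priority, pool, name in tasks:
        by_pool[pool].append((priority, name))

    started = {}
    for pool, free in free_threads.items():
        # Priority only decides the order inside one pool.
        ordered = sorted(by_pool.get(pool, []), key=lambda t: -t[0])
        started[pool] = [name for _, name in ordered[:free]]
    return started


tasks = [
    (1000, "node1", "build A"),
    (-5, "local", "test 1"),
    (-7, "local", "test 2"),
    (-9, "local", "test 3"),
]
free = {"local": 2, "node1": 8}
result = pick_tasks(tasks, free)
```

In this model the node starts its build immediately because it has idle threads, regardless of how its priority compares with the local tasks. Priority only matters on the local pool, where three tasks compete for two threads.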

What happens if you increase your fast lane thread threshold? 250ms seems unusually low and in some cases may stop NCrunch from executing tests that could run fairly quickly anyway (usually around 1000ms-10000ms is more normal for this setting).

Also, does disabling the local processing in your distributed processing window make any difference?

It seems to me like your engine has an overall capacity of around 30 threads. Do you ever see the engine hit this full capacity if you hover your mouse over the corner spinner?

Also, what is the normal end-to-end run time for all of your 7000 tests? Do you have some slower ones in there?
Ralf Koban
#3 Posted : Monday, September 15, 2014 6:28:59 AM(UTC)
Hi Remco,

The end-to-end run time for all 7000 tests is about 4 minutes (done in ReSharper's unit tester with tests running on a single thread).

Yes, when running properly, my engine has an overall capacity of 32-36 threads. I've also seen often that the engine hits this full capacity.
The threshold has been set to 250ms to speed up overall performance. I've experimented with values of 1000ms, 500ms, 300ms and 100ms, and it seems that with 250ms I get the fastest feedback. I have plenty of tests that take less than 100ms and some that take up to 8-10s.

For the moment I believe that this has something to do with memory consumption. It seems that my VS is taking around 2 GB of RAM. When my VS takes more RAM (2.3 GB+) or much less RAM (1.5 GB), the problem doesn't seem to occur. When the problem occurs, it takes a very long time for windows such as the Processing Queue to update. But I see the corner spinner stating that it is executing, e.g., 29 test tasks. The strange thing is that when I look at the Distributed Processing page, only 1 node with about 8 test tasks seems to be processing tests; the others are just "online". But maybe this is the same update issue as with the Processing Queue.
And when I, for example, open the tooltip for the Risk/Progress bar, it states that there are about 58 seconds remaining, but it takes longer.

I then tried to run the tests with local processing disabled, but nothing happened. VS is definitely doing something (CPU usage is around 21-26%) but no test is triggered (the Tests window says that 7376 tests are queued for execution, but the corner spinner tooltip says that the NCrunch engine is idle). I sent you a bug report file for this via the Contact form.

After restarting VS, I can now run all the tests in a grid only mode (which takes about 1.25m according to the tooltip in the risk/progress bar). The RAM consumption is now only about 1.5 GB.

BR,
Ralf
Remco
#4 Posted : Monday, September 15, 2014 6:36:25 AM(UTC)
Hi Ralf,

Would you be able to try working with the Risk/Progress bar closed for a while? This UI element has some known performance issues that can block up the engine. Based on what you've described, I think the core engine is being overloaded by this element and it isn't able to keep up with the demand.

Something that can also help to narrow this down is to try playing with the filters on the Tests Window when you see the engine performing badly. Do you notice the Tests Window updates itself quickly when you do this? Or is there a long delay after each filter change?
Ralf Koban
#5 Posted : Monday, September 15, 2014 7:23:13 AM(UTC)
Hi Remco,

I will try to work with the Progress bar closed for a while and see whether it makes some difference. I will let you know the results.

Best regards,
Ralf
Ralf Koban
#6 Posted : Tuesday, September 16, 2014 6:33:40 AM(UTC)
Hi Remco,

It seems that the VS performance is better with having that Progress bar closed.

However, my initial problem, that NCrunch doesn't contact the grid nodes to run the tests, is still there. Today, when I opened the solution and let NCrunch re-run all the tests (via "Run all tests" on the Tests window) after NCrunch finished the initial test run, all tests were executed only locally and not in the grid. When I looked at the Processing Queue window, I saw that all test and build assembly entries for the grid nodes had a priority of 1000, whereas the ones on my local machine were below zero.

When I reset the engine (on the Tests window), all tests were able to execute in the grid as well. But when I let all the tests run again (via "Run all tests"), my VS crashed. At the time of the crash I was trying to get a tooltip on the spinning corner. First it was stating 7xxx tests for around 40 test threads (I had increased the local ones to 8 as well), then it was stating 6xxx tests, and I wanted to know how many tests were being executed in parallel. So I hovered over the spinning corner, but nothing happened. I hovered over it again, and then a third time, and then my VS crashed.


Best regards,
Ralf
Remco
#7 Posted : Tuesday, September 16, 2014 8:57:03 AM(UTC)
Hi Ralf,

Let's separate this into two issues we need to deal with :)

First of all, the crash issue. Broadly I can see two possible reasons that could make NCrunch crash in this situation. The first is poor handling of an exception thrown on the UI thread. This is pretty rare these days and continues to get rarer as the product matures. The second reason is a memory related issue (i.e. OutOfMemoryException). Is there any chance you can get a debugger on the VS process when it crashes? Knowing the exception being thrown here would be really helpful.

The other issue is around figuring out why the nodes aren't picking up work when they should be. It may take a little Q&A to figure out what is going wrong here, but the behaviour you've described doesn't seem normal to me and is probably the result of something going wrong. Unfortunately, because of the amount of activity caused by 40 execution threads, the logs aren't very helpful here. Do you mind if I bombard you with some questions?

When you have a test run that doesn't touch the nodes ...
- Can you confirm that the distributed processing window is showing the nodes as connected with a status of 'Online'?
- Do you see any build tasks completed on the nodes when you look in the processing queue?
- How are the tests distributed within the processing queue? I.e. is it 7000 tests split over 2-3 tasks, or 7000 tests split over several hundred tasks?
- How does the priority compare between the build tasks that are in the queue? Normally the priority of build tasks for remote nodes should be the same as that of the local build tasks.
- Have you managed to get a glimpse of what the nodes are doing when the local machine is running all the tests? Does the distributed processing window show them processing any tasks at all?
- What is your normal end-to-end test time for a full cycle?
Ralf Koban
#8 Posted : Tuesday, September 16, 2014 2:48:01 PM(UTC)
Hi Remco,

ok. :)

About the crash:
I will try to attach a debugger in case it happens again.

About the other issue (the strange thing is that most times it is working as expected):
- The normal end-to-end time for a full cycle (in case there is no build needed) depends a little bit on the test execution order on the grid. In best times it is about 1:05 minutes, currently it is about 1:36 minutes.
- The tests seem to get distributed over several dozens of tasks.
- If I remember correctly, then the nodes are reported with 'Online' in that situation. As soon as I uncheck and recheck them, the corresponding grid node seems to start processing.
- The build tasks are set to 1000 in that situation and are marked as pending.
- Actually, when the nodes are not running, it seems that the Distributed Processing window and other windows (Tests window, Processing Queue) are somewhat slow to refresh (maybe this was due to the Risk/Progress bar refresh issue).
But as far as I can see, they are not performing anything, just waiting.

Currently, the NCrunch Processing Queue shows something like "Reading cached data" while the tests run only locally. The nodes are "Online" but no task is shown for them in the Distributed Processing window. What I did to make this happen was to close and re-open the solution. Even when I now click "Run all tests", they run only on the local machine and not in the grid. As mentioned, when I uncheck and re-check the grid nodes' checkboxes, the grid nodes start to process the tests.

Could it be that something corrupts the cache somehow when I close the solution?
Remco
#9 Posted : Tuesday, September 16, 2014 9:55:48 PM(UTC)
Is there a consistent set of steps you can perform to make the tests run only on the local machine while you're connected to the grid nodes? I'm hoping there may be some way for me to reproduce this problem .... Also, do you have any other solutions on your machine that you can test this with? It would be great if we can determine if there is something in the solution itself that is triggering the problem.

Cache file corruption problems seem to be widespread at the moment and I'm expecting to release v2.10 soon to address this. I've prepared an early build if you'd like to give it a try. The build is backwards compatible with your grid nodes, so you'll only need to update the client. I'm afraid that this probably won't address the main problem you're experiencing at the moment, but at least it will get the caching issues out of the way:

http://downloads.ncrunch.net/NCrunch_Console_2.10.0.4.msi
http://downloads.ncrunch.net/NCrunch_Console_2.10.0.4.zip
http://downloads.ncrunch.net/NCrunch_GridNodeServer_2.10.0.4.msi
http://downloads.ncrunch.net/NCrunch_GridNodeServer_2.10.0.4.zip
http://downloads.ncrunch.net/NCrunch_VS2008_2.10.0.4.msi
http://downloads.ncrunch.net/NCrunch_VS2010_2.10.0.4.msi
http://downloads.ncrunch.net/NCrunch_VS2010_2.10.0.4.zip
http://downloads.ncrunch.net/NCrunch_VS2012_2.10.0.4.msi
http://downloads.ncrunch.net/NCrunch_VS2012_2.10.0.4.zip
http://downloads.ncrunch.net/NCrunch_VS2013_2.10.0.4.msi
http://downloads.ncrunch.net/NCrunch_VS2013_2.10.0.4.zip
Ralf Koban
#10 Posted : Wednesday, September 17, 2014 7:36:29 AM(UTC)
Hi Remco,

unfortunately there doesn't seem to be a consistent set of steps so far. I installed the new v2.10 early build and it seemed to go well. I could trigger the tests and they were run concurrently each time. I also closed and re-opened the solution and the tests were run concurrently.
But then I got the latest version from TFS. At that time the tests were run concurrently. One test failed and I let it re-run while the others were still running. After all have succeeded I triggered another run via the Tests window. Now the tests are run only locally. And if I now trigger another test run, the tests seem to run only locally as well.

When I take a look into the processing queue then I see that the grid nodes have a pending task to build the application's executable but I'm not sure as I don't see any information on the Distributed Processing page (there they are shown as online).
Remco
#11 Posted : Wednesday, September 17, 2014 7:40:16 AM(UTC)
Ralf Koban;6408 wrote:
Hi Remco,
When I take a look into the processing queue then I see that the grid nodes have a pending task to build the application's executable but I'm not sure as I don't see any information on the Distributed Processing page (there they are shown as online).


I think this is the key piece of information on this issue. If a grid node refuses to build a required project, it won't be able to run any tests that depend on this project.
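
The dependency rule described above can be sketched in a few lines of Python. This is only an illustration under assumed names (`runnable_tests`, and the project and test names in the example are hypothetical); it does not reflect how NCrunch tracks builds internally.

```python
def runnable_tests(tests, built_projects):
    """Return the tests a node can run, sorted by name.

    tests: dict mapping test name -> set of projects the test depends on.
    built_projects: set of projects the node has built successfully.
    A test is runnable only if every project it depends on has been built.
    """
    return sorted(
        name for name, deps in tests.items() if deps <= built_projects
    )


# Hypothetical solution layout: one test depends on the executable project.
tests = {
    "CoreTests.Adds": {"Core"},
    "AppTests.Starts": {"Core", "App.exe"},
}

# The node has built Core, but the App.exe build is still pending.
on_stalled_node = runnable_tests(tests, {"Core"})
on_healthy_node = runnable_tests(tests, {"Core", "App.exe"})
```

With the executable's build pending, only the test that does not depend on it is eligible on that node.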

Do you notice any pattern around which project(s) are stalling the nodes? Is it always the exe? Are you making use of any capabilities?
Ralf Koban
#12 Posted : Wednesday, September 17, 2014 7:58:50 AM(UTC)
Hi Remco,

The specific project had an issue with a license check some time ago, so I had set up some capabilities to prevent the project from running in the grid. Later on I found a workaround posted here in the forum, which stated I could exclude the .licx file by adding a condition to the .csproj file.
After I did this I was able to build the project also in the grid without any issue.

The strange thing is that a few minutes ago, when I discovered that the projects get built only locally, I saw that VS2013 was consuming 2.3-2.5 GB RAM but was still running at 20% or so (so 1 core was fully busy while I did nothing, just sitting and waiting). So I started up another VS and attached it to the VS process. It took a long time to attach (it had to load a lot of symbols from the symbol server), but when it was finally attached I paused it to get an idea of what it was doing. I did not get any, so I pressed Continue. When I then switched back to the original VS, I saw that it was now executing all the pending tests etc. on the grid.

For 1 grid node I also got an error stating that the disk is full (I use a RAMDisk with 2 GB RAM there), so I restarted the node which triggers the cleanup on the RAMDisk.

So I'm not sure whether this is an issue with the executable itself, also because I don't have tests that rely on it.


======
Update:

I just changed some code in the executable project. NCrunch ran the tests only on the local machine, not on the grid. After NCrunch was finished (according to the UI), I paused VS again.
Then I inspected the threads and saw 1 NCrunch-related thread with the following stack trace:

Not Flagged 19388 104 Worker Thread Worker Thread nCrunch.Core.dll! . Lowest
nCrunch.Core.dll! .(int )
nCrunch.Core.dll!nCrunch.Core.Processing.ResourceUsageStamp.ToString()
mscorlib.dll!string.Concat(object[] args)
nCrunch.Core.dll!nCrunch.Core.Grid.Messages.NodeWorkRequestMessage.ToString()
mscorlib.dll!string.Concat(object arg0, object arg1, object arg2)
nCrunch.Core.dll!nCrunch.Core.Grid.Connectivity.NetworkClientMessageReceivedEvent.GetEventSummary()
nCrunch.Common.dll!nCrunch.Common.RoutedEvent.ToString()
mscorlib.dll!string.Concat(object[] args)
nCrunch.Core.dll!nCrunch.Core.Threading.CoreMessageDispatcher.()
nCrunch.Core.dll!nCrunch.Core.Threading.PooledWorkItem.Start()
nCrunch.Core.dll!nCrunch.Core.Threading.ThreadFactory.(object )
mscorlib.dll!System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(object state)
mscorlib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx)
mscorlib.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state, bool preserveSyncCtx)
mscorlib.dll!System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
mscorlib.dll!System.Threading.ThreadPoolWorkQueue.Dispatch()
mscorlib.dll!System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()



After I copied the stacktrace and resumed VS, I switched back to the first VS. At that time I saw that NCrunch was again starting to build and execute the tests on the grid nodes. If I now run the tests via the Tests window, they are executed in the grid (40 nodes in parallel).


The strange thing is that when I now change the same code (e.g. undo and redo), all tests get executed concurrently.
But sometimes NCrunch doesn't execute the tests at all (engine mode is "Run all tests automatically"), it solely builds the executable project on the grid nodes.
Remco
#13 Posted : Wednesday, September 17, 2014 11:37:45 AM(UTC)
Do you have logging enabled on any of your grid nodes? It would be really interesting to get a log file of the grid node's side of this happening. If you manage to capture one, could you zip it up and submit it through the contact form?
GreenMoose
#14 Posted : Wednesday, September 17, 2014 11:46:37 AM(UTC)
Rank: Advanced Member

Groups: Registered
Joined: 6/17/2012(UTC)
Posts: 503

Thanks: 142 times
Was thanked: 66 time(s) in 64 post(s)
FWIW, I have sort of the same issue: with 5 grid nodes, if 1 is building and I want to queue the test "resetDb" manually on all nodes, it is only queued on the grid node that is building, and I have to wait for it to complete before I can queue the test on the other servers.
Ralf Koban
#15 Posted : Wednesday, September 17, 2014 2:50:26 PM(UTC)
Hi Remco,

I've sent you the requested logs via the contact form. Hopefully they came through. If not, please let me know.

In that case some additional information:
After I got the logs I deactivated the logging on the grid nodes and the NCrunch service on the grid nodes was restarted. On my local machine NCrunch was now starting to execute the tests on the specific grid node.

BR,
Ralf
Remco
#16 Posted : Wednesday, September 17, 2014 10:08:54 PM(UTC)
@Ralf - Thanks for this. I'll follow up with you via email

@GreenMoose - It looks like there is a bug here. According to the code, if you try and queue a test to run on a specific node, trying to re-queue the same test for a different node will result in the originally queued task disappearing unless the engine has already started processing it. I don't think this is the same problem that Ralf is reporting but it's certainly worth fixing. Thanks for making me aware of it.
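
The re-queue behaviour described above can be reproduced with a small Python model. This is a hypothetical sketch, not NCrunch code: the `ManualQueue` class and its `key_by_node` flag are invented here to contrast a queue keyed by test alone with one keyed by test and node.

```python
class ManualQueue:
    """Toy model of a queue of manually requested test runs."""

    def __init__(self, key_by_node):
        # key_by_node=False mimics the described bug: one pending slot per test.
        self.key_by_node = key_by_node
        self.pending = {}
        self.started = []

    def enqueue(self, test, node):
        key = (test, node) if self.key_by_node else test
        # With a test-only key, a second request overwrites the first one.
        self.pending[key] = (test, node)

    def start_next(self):
        # Once a task has started it is no longer pending, so it cannot be replaced.
        key = next(iter(self.pending))
        task = self.pending.pop(key)
        self.started.append(task)
        return task


buggy = ManualQueue(key_by_node=False)
buggy.enqueue("resetDb", "node1")
buggy.enqueue("resetDb", "node2")

fixed = ManualQueue(key_by_node=True)
fixed.enqueue("resetDb", "node1")
fixed.enqueue("resetDb", "node2")
```

In the first queue only the request for `node2` survives, while the second keeps one pending task per node. A task that has already started is unaffected in both cases, which matches the "unless the engine has already started processing it" condition.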