Setting up Distributed Processing - Daily Usage Issues

Setting up Distributed Processing

nCubed		#1 Posted : Friday, July 11, 2014 4:55:02 PM(UTC)
Rank: Member Groups: Registered Joined: 11/1/2012(UTC) Posts: 23 Location: United States of America Thanks: 2 times Was thanked: 4 time(s) in 4 post(s)

Hey Remco,

We have around 30 devs and are in the process of testing Distributed Processing (DP) to help with the performance issues we're experiencing due to our solution having 155 projects (we cannot break apart the projects right now).

Our standard dev setup is:
Win 7 x64
Intel i7 3.4GHz (8 cores)
16 GB RAM
Visual Studio 2012
NCrunch v2.latest
ReSharper v8.latest

We have a couple of VMs on our local network that we are testing with DP:
Win Server 2008 R2 Standard x64
AMD 2.3 GHz (4 cores)
4.0 GB RAM

When using NCrunch without DP, we've been using the following setup:
CPU Cores NCrunch: 0,1,2,3
CPU Cores Visual Studio: 4,5,6,7
Fast lane threads: 1
Max processing threads: 4
Max test runner pool: 1

Can you provide some recommendations for correctly configuring NCrunch as a starting point to be effective with DP?
CPU Cores NCrunch:
CPU Cores Visual Studio:
Fast lane threads:
Max processing threads:
Max test runner pool:

Or, maybe a better question is: how many clients can share the same node? In our tests, when more than one client shared the same node and we connected at the same time, NCrunch slowed to a crawl.

Thanks!

Remco		#2 Posted : Friday, July 11, 2014 11:16:43 PM(UTC)
Rank: NCrunch Developer Groups: Administrators Joined: 4/16/2011(UTC) Posts: 7,339 Thanks: 991 times Was thanked: 1334 time(s) in 1237 post(s)

Hi, thanks for posting!

The configuration needed in your situation will vary greatly depending upon the arrangement of the grid and the solution being tested. Often this will require some trial and error to get right, and you may need to do some monitoring of the resources on the machines involved (for example, keep a close eye on task manager or resource monitor on the grid nodes while they work).

155 projects is quite a large solution and you will notice this in your client->node synchronisation times, although this should not have a real impact on your response times unless you're using the 'Copy referenced assemblies to workspace' setting, in which case NCrunch needs to perform extra build work.

Can you share any more details about the nature of the performance problem you're experiencing? Is this simply manifesting in poor response times from the engine, or is there a problem with resource (CPU/memory) load on the grid node?

The number of clients that can share the same node actually depends quite heavily upon the network. This is because the grid nodes need to manage the work they perform in order to make sure all clients get a fair go. When a grid node's execution thread becomes available for work, it will poll out to each client synchronously asking for more work. If there are a large number of clients connected and the network is particularly slow (e.g. if you're using cloud servers based on another continent), you will notice this in the responsiveness of the grid node.

Something else to watch out for is synchronisation operations between the client and node. The node manages its work using a single thread, which right now is also used for synchronising it with new client connections (excluding data upload). Synchronisation actions on large solutions can sometimes take upwards of a minute on a slow grid node, and this will tie up the node, making it unable to perform work for other clients. This is expected to change in the future.

It may be worth trying to keep track of your performance issues to see if there is any relationship with new clients connecting to the grid nodes.
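As a rough illustration of why client count and network latency matter together, the sketch below models the synchronous polling described above: a free execution thread asks each connected client for work in turn, so its wait is the sum of the round trips. The function name and the latency figures are hypothetical and are not taken from NCrunch; Python is used purely for illustration.

```python
def polling_wait_ms(round_trips_ms):
    """Time one execution thread spends polling every client in turn.

    Assumes strictly sequential polling, so the individual waits add up.
    """
    return sum(round_trips_ms)


# Hypothetical latencies, chosen only to show how the wait scales.
local_network = [1] * 3      # 3 clients, ~1 ms round trip each
remote_cloud = [150] * 30    # 30 clients, ~150 ms round trip each

print(polling_wait_ms(local_network), "ms per polling pass on a fast LAN")
print(polling_wait_ms(remote_cloud), "ms per polling pass over a slow link")
```

Under this simplified model the wait grows with both the number of clients and the latency to each one, which is consistent with the advice to watch for slowdowns as more clients connect.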

nCubed		#3 Posted : Tuesday, July 15, 2014 6:56:06 PM(UTC)
Hi Remco,

The last paragraph, re: single thread, makes sense for some of the performance issues with the distributed nodes. We did a test with 3 of us all connecting at the same time and the node simply hung.

Most of the performance issues we've seen are around Visual Studio becoming very sluggish, to the point of being unresponsive for a few moments. Some of the devs have just disabled NCrunch because of this. We realize that the solution is quite large and there's nothing we can do for now to break it down into smaller solutions (think legacy software).

Interestingly enough, when I spun up an Azure server in our region, I had a really good experience, but I was the only one connecting. When we spun up a couple of local VMs, things just lagged, but then again, we were aggressively testing the local VM nodes with 3 connections. I suspect there may be a couple of issues with our local VM nodes, one of them being a much slower connection than I can get to an Azure VM.

If at some point you'd like to schedule a screen share where we can walk through our solution/configuration, I'd be happy to do so, especially given that we have 30 devs on NCrunch now.

Thanks!

Remco		#4 Posted : Tuesday, July 15, 2014 9:40:28 PM(UTC)
I think this is starting to make some sense. Looking at the specifications you've given me, it appears that there is a substantial difference between the node VMs and the dev workstations. Because the nodes only have 4GB RAM (vs dev at 16GB), it's possible that they're forced to use the swap file more heavily while processing for multiple connections. This will likely be more noticeable if they are using mechanical hard drives (not SSDs).

I think it would really be worth monitoring the performance of these VMs while they are processing with a heavy load. This will help us to establish whether the VM is suffering resource issues, and perhaps give us more information about where the limitations are (CPU, HDD, RAM, etc). I can then suggest some configuration options that may be able to help.

The hanging-on-new-connection issue should be fairly easy to eliminate from your scenario if you make sure that all clients are fully connected before assessing the performance of the node. I anticipate this issue will be more irritating if the node is regularly dropping connections (e.g. if the network is unreliable or something else is not right).

Can you share any details on how long it usually takes a client to fully connect and synchronise with a node?

nCubed		#5 Posted : Wednesday, July 16, 2014 12:35:12 AM(UTC)
Hi Remco,

I took some screenshots of the VM resource monitor and it was indeed pegged at nearly 100% while working with NCrunch; when idle (no NCrunch activity), it was sitting at the normal 5% blips. The VMs and dev machines are both on mechanical hard drives (7200 RPM, I believe).

I'll have to spend some time testing the various scenarios to see if I can find the bottleneck. At the end of the day, if we can get NCrunch working without the nodes, then that would be ideal since we wouldn't have to deal with the nodes' configuration for each dev.

With regards to connection times (with a small 6-project solution, 3 test projects):

Local (no nodes):
- Resync, Rebuild, Rerun all tests: ~7 seconds

VM (local disabled, no other nodes):
- Reset service, initialize, load snapshot, transfer files: ~23 seconds
- Process: ~22 seconds
- Connection typically takes 1 second.

Azure VM (local disabled, no other nodes), 8 Logical Cores / 4 Physical Cores / 16GB RAM:
- Reset service, initialize, load snapshot, transfer files and process: ~20 seconds
- Process: included in above timing
- Connection typically takes 1 second.

Clearly, the local VM is severely under-powered in comparison to the Azure VM. I realize the tests are not exactly the same tests, but they should give you some insight.

FWIW: I did enable all 3 nodes for this sample project and the 3 of them combined didn't seem to make much difference compared with running the local machine by itself. I'll have to try a similar test with the full 155-project production solution at a later date.

Thanks!

Remco		#6 Posted : Wednesday, July 16, 2014 1:20:52 AM(UTC)
Great, this is really useful information.

With the node being at 100% CPU, the problem is definitely related to its ability to keep up with the number of execution threads assigned to it. If you drop the max number of processing threads on the VM node to something lower (e.g. 2 or 3), you may see an improvement in response times.

As a general target you'll want to see the node's CPU fluctuating between 70% and 100% while it's under maximum load. If it stays stuck at 100% CPU, then most likely it's unable to keep up with the demand. This might be a good thing for overall throughput (i.e. you're not wasting any CPU), but it's bad for response times, as tests and builds will take longer to finish. The node also needs a bit of CPU to be able to manage the connections and exchange data with clients.

A useful test is sometimes to write a series of very CPU intensive tests (just a big for-loop that concatenates strings is often enough), then set these to work on the grid node. Compare the execution time of these tests with what you see when running them locally on your workstation. If there is a big difference in the processing times, then it's a sure sign that the node is either underpowered relative to the workstation, or it's overloaded.

What is the nature of the virtualisation environment you're using to run the node VMs? Do you have any opportunity to consolidate resources into fewer but larger VMs? Doing so will reduce the overhead of the engine and it may give you a bit more power to work with.

Considering the size of your solution and the specification of your workstations, the benefits of using the NCrunch grid for you may be limited to very specific scenarios that would depend upon what you are trying to test.
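The CPU-bound comparison described above could be sketched roughly as follows. In practice the workload would be written as tests in the solution's own .NET test framework and run through NCrunch; this Python version only illustrates the shape of such a workload, and the function names and iteration count are hypothetical.

```python
import time


def concat_workload(iterations=200_000):
    """CPU-bound busy work: build a string by repeated concatenation."""
    text = ""
    for i in range(iterations):
        text += str(i % 10)  # append one digit per iteration
    return len(text)


def time_workload(iterations=200_000):
    """Run the workload once and return (result length, elapsed seconds)."""
    start = time.perf_counter()
    length = concat_workload(iterations)
    elapsed = time.perf_counter() - start
    return length, elapsed


if __name__ == "__main__":
    length, elapsed = time_workload()
    print(f"built {length} characters in {elapsed:.3f} s")
```

Running the same workload on the workstation and on the node, then comparing the elapsed times, gives a simple like-for-like measure of how the two machines differ under load.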
Your workstations will easily outperform the node VMs in their current spec, and they aren't shared resources, so you may want them to be handling the bulk of your processing while delegating very specific tests to the nodes.

Underpowered nodes can sometimes be useful for running tests that are problematic or inconvenient to run on workstations. For example, if you have tests that need to interact with the UI, having them popping up windows all over a workstation's desktop can be a problem for continuous testing, but not when they are run on a remote server. It may also be useful to use the nodes to test the performance of code that may behave differently in environments with constrained resources (e.g. testing for race conditions in multi-threaded code). You can use capabilities to determine where the tests should be run on the grid.

Something else to consider is that for some solutions, additional processing power can have limited (or even negative) benefit. For example, if you have a solution with 20,000 tests with a total execution time of 20 seconds to run synchronously end-to-end, the overhead of splitting up and managing all of these tests across multiple machines would likely increase the overall processing time well beyond the normal 20 seconds. This is in contrast to a solution with 20,000 tests where each test takes around 5 seconds to run (total end-to-end time of 28 hours), where splitting the tests up across multiple machines could achieve a very significant reduction in overall processing time.
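The two 20,000-test scenarios above can be worked through with a deliberately simple model: tests split evenly across machines, and each test pays a fixed coordination overhead. The grid size and the per-test overhead below are made-up figures, not measurements of NCrunch, and the function names are hypothetical. Note that 20,000 tests at 5 seconds each is 100,000 seconds, or roughly 27.8 hours, which matches the "28 hours" quoted.

```python
def sequential_ms(test_count, test_ms):
    """End-to-end time when every test runs one after another on one machine."""
    return test_count * test_ms


def distributed_ms(test_count, test_ms, machines, overhead_ms):
    """Idealised grid time: tests split evenly across machines, and each test
    pays a fixed coordination overhead (an assumed, made-up cost)."""
    return test_count * (test_ms + overhead_ms) / machines


TESTS = 20_000
MACHINES = 4       # hypothetical grid size
OVERHEAD_MS = 10   # hypothetical per-test cost of distributing work

for label, test_ms in (("fast tests (1 ms each)", 1), ("slow tests (5 s each)", 5_000)):
    single = sequential_ms(TESTS, test_ms) / 1000
    grid = distributed_ms(TESTS, test_ms, MACHINES, OVERHEAD_MS) / 1000
    print(f"{label}: {single:.0f} s on one machine, {grid:.0f} s on the grid")
```

With these assumed figures the fast suite becomes slower on the grid, because the per-test overhead dwarfs the 1 ms of real work, while the slow suite finishes in roughly a quarter of the time.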

nCubed		#7 Posted : Thursday, July 17, 2014 7:08:51 PM(UTC)
Remco,

I wanted to give you a heads up that we are doing some additional testing and will get back to you in a week or so. Thank you for your excellent feedback and support!

Thanks!
1 user thanked nCubed for this useful post.		Remco on 7/17/2014(UTC)