bartj;15161 wrote:
While in an ideal world, tests would never time out (and of course they would always pass), in the messy world of legacy software, tests will not complete from time to time - particularly when under heavy concurrent load. This concurrency is part of what we use NCrunch for. We currently use NUnit single threaded on the CI server because it is reliable and works well, however we want to be able to complete a console test run locally leveraging NCrunch's faster execution times through concurrency.
Understood! When you think about it, the act of running tests over untested code is all about trying to surface these kinds of issues. If our testing systems can pick up the issues, then we can keep them out of production environments. It's unfortunate that the stresses we inflict with our testing can also affect the stability of the environment we use for this testing, which makes testing software in this manner a constant battle against intermittent problems that can be hard to pin down. If you haven't used it already, I strongly recommend making use of the churn mode feature to selectively trigger these sorts of problems in a desktop session, where they may be easier to analyse. It's no fun when the CI blows up or hangs for an issue that only appears when you are least able to inspect it.
bartj;15161 wrote:
Could you please explain what we might be doing that could prevent the NCrunch process from killing a child process? This seems like something that is always possible, unless the process has frozen somehow in the Windows kernel. I also haven't seen any logging that would suggest that the kill was attempted and failed. Any hints on what I should be looking for in the logs?
Unfortunately, I have no way to answer this question, as to do so would involve random speculation on the workings of your code and the functioning of your entire environment.
In terms of trying to flush out and understand these problems, here are my suggestions:
- Churn mode is a great way to make intermittent problems consistent. Use it regularly. Churn selected parts of your test suite that are suspected to contain intermittent issues.
- You can actually implement your own timeout handling inside your test code. For example, when the test starts, try starting a background thread that will make a call to System.Diagnostics.Debugger.Launch() if the test thread doesn't reach the end of the test in an allowable time. By doing this, you can interrupt your own code in the process of a timeout (before NCrunch attempts any enforcement), and have a good chance of understanding why it happened.
- Turn on the 'terminate test runners on complete' setting, then let the run continue until it hangs. The only TestHost processes that remain are categorically hung, because otherwise they would be terminated. You can hook the VS debugger onto these processes to examine their state and try to understand what happened.
- Try to build resilience into your test architecture. Add levels of exception handling and recovery. Instead of launching background threads directly from your production code, implement custom thread dispatchers that can be controlled by the tests, then track the background threads so you can identify and act on situations where these threads don't behave as they should. Your tests are clearly not just isolated unit tests; they are big, functional, complex and very valuable. Treat their operation with suspicion so that you can be more aware when race conditions appear and the code doesn't behave as it should.
- NCrunch.Framework.IsolatedAttribute is your friend. It works absolute magic when executing tests that might leave the process in an inconsistent state and affect tests executed later in the run.
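
To illustrate the self-implemented timeout suggestion above, here is a minimal sketch of a watchdog that calls System.Diagnostics.Debugger.Launch() if the test doesn't finish in time. TestWatchdog, its constructor parameters and the 30 second limit in the usage comment are hypothetical names and values chosen for illustration; this is not part of NCrunch or NUnit, and you would want to pick a limit shorter than whatever timeout NCrunch is configured to enforce.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper (not an NCrunch or NUnit API): fires an action,
// by default Debugger.Launch(), if the enclosing test has not finished
// within the allowed time.
public sealed class TestWatchdog : IDisposable
{
    private readonly ManualResetEventSlim _finished = new ManualResetEventSlim(false);
    private readonly Thread _thread;

    public TestWatchdog(TimeSpan allowed, Action onTimeout = null)
    {
        // The optional action exists only so the watchdog itself can be
        // exercised without a debugger prompt appearing.
        Action action = onTimeout ?? (() => Debugger.Launch());

        _thread = new Thread(() =>
        {
            // Wait returns false when the time elapses before Dispose signals completion.
            if (!_finished.Wait(allowed))
                action();
        });
        _thread.IsBackground = true; // never keep the test process alive
        _thread.Name = "TestWatchdog";
        _thread.Start();
    }

    public void Dispose()
    {
        _finished.Set();  // the test reached its end; release the watchdog
        _thread.Join();   // if the action is still running, this waits for it
        _finished.Dispose();
    }
}

// Usage inside a test method (hypothetical test body):
//
//   using (new TestWatchdog(TimeSpan.FromSeconds(30)))
//   {
//       // ... exercise the code that occasionally hangs ...
//   }
```

Because the watchdog fires inside the test process while the test thread is still stuck, the debugger attaches with the hung call stack intact, which is the whole point of doing this before any external enforcement kicks in.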