"Unable to write to file while transferring snapshot to node: some-nuget-package-dll"
Amarok
#1 Posted : Monday, October 22, 2018 10:48:17 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 9
Location: Austria

Thanks: 2 times
Hello!

We share a single grid node server (v3.22.0.1) between a handful of developers. After migrating our projects to the new PackageReference NuGet format,
we started to experience problems when using the grid node server. In the Distributed Processing window, we noticed error messages like:

Code:

"Unable to write to file while transferring snapshot to node: C:\Users\user\.nuget\packages\package-name\package-version\net471\some-dll

NCrunch was unable to write to the file at C:\Users\user\.nuget\packages\package-name\package-version\net471\some-dll when trying to transfer data to the grid node server.

It is possible the file is locked on the grid node by another process, or the grid node server does not have adequate permission to modify it.

If this file is not correctly aligned between the grid node and client, you may experience downstream issues when processing data using this node.

NCrunch will continue to synchronise data with this node and use it for further processing."



NCrunch stopped building and running tests on the grid node server. Sometimes it works, sometimes it doesn't. After restarting the grid node, it worked for a while.

We are not sure, but it seems that these errors and the related problems only start to appear when two developers use the grid node server at the same time. My suspicion is that the first developer machine deploys sources, nuget packages and the like to the grid node, the grid node then successfully builds everything and starts to run tests. Now, the second developer machine starts to deploy nuget packages and fails because the DLLs are locked by the still running tests.

It seems with PackageReference all developer machines deploy their NuGet packages to the same location, causing file locks to happen.

Could that be the case?

Kind regards,
Olaf Kober
Remco
#2 Posted : Monday, October 22, 2018 11:11:24 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 5,437

Thanks: 712 times
Was thanked: 888 time(s) in 844 post(s)
Hi Olaf,

Thanks for sharing this issue.

The problem here is that NCrunch is attempting to update the binary of a Nuget package while test runner processes are working on the grid node and making use of the same package. For NCrunch to be updating these package DLLs while they are in use, there must be a binary difference between the DLL on the grid node and on the client machine. So my bet here is that you have two clients using the same grid node and both clients have at least one Nuget package of the same name and version but with a different binary.

Normally you'd expect that a Nuget package of the same version and same package name should be identical, but sometimes they aren't. I don't think this is an expected outcome of the way Nuget works. It should be possible to fix the problem just by copying the Nuget binaries between your client machines so that they're the same.
Amarok
#3 Posted : Tuesday, October 23, 2018 8:03:03 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 9
Location: Austria

Thanks: 2 times
Hi Remco!

Thanks for your answer.

I'm a bit surprised that DLLs of the same package version should differ between our client PCs, as we use a central ProGet server to prevent exactly this scenario. So I can't really believe that this is the case, but I will check and get back to you.

Do you know why NuGet would behave that way?

Kind regards,
Olaf
Amarok
#4 Posted : Tuesday, October 23, 2018 10:52:55 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 9
Location: Austria

Thanks: 2 times
Hello again!

Based on your feedback, I locked all developers out of our central NCrunch server and cleaned that machine, meaning I deleted all workspaces, snapshots, NuGet caches, etc. Then I did the same on my workstation and on the workstation of another developer, so that we had a clean starting point for further investigation.

Before cleaning, I compared the NuGet packages (mainly the contained DLLs) on the different developer client machines. Here I found that we had tricked ourselves a bit. We sometimes compile a DLL on a local machine and put it into the local NuGet package cache to be able to build a dependent solution with the updated library. That explains why we had binary differences in DLLs of the exact same package version.

Well, that explains why we saw error messages for our own DLLs, but now, after cleaning all machines, we still experience the same issue for third-party packages like System.Threading.Tasks.Dataflow or MessagePack. Immediately after noticing these errors, I compared the DLLs on the NCrunch server and the developer machine and found no differences; they are exactly the same. So there must be a second reason for those errors to appear.
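A DLL comparison like the one described above can be scripted. The following is only a minimal sketch in Python, not the procedure that was actually used here; the function names `content_hashes` and `diff_packages` are hypothetical, and it assumes both package folders are reachable from one machine (for example via a network share).

```python
import hashlib
from pathlib import Path


def content_hashes(package_root):
    """Map each file path (relative to package_root) to the SHA-256 of its contents."""
    root = Path(package_root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def diff_packages(local_root, remote_root):
    """Return relative paths that are missing on one side or differ in content."""
    local = content_hashes(local_root)
    remote = content_hashes(remote_root)
    return sorted(
        name
        for name in local.keys() | remote.keys()
        if local.get(name) != remote.get(name)
    )
```

An empty result from `diff_packages` would correspond to the "no differences" finding above; any returned path points at a file worth inspecting by hand.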

Do you have an idea?


Not sure if this is related, but I noticed a slightly wrong path in the error detail message. Please take a look at the screenshot: https://amarok.blob.core.windows.net/public/2018-10-23 12_29_44-Darwin.DMA - Microsoft Visual Studio.png

The detailed error message says:

“NCrunch was unable to write to the file at C:\Users\build\.nuget\packages\MessagePack\1.7.3.4\lib
et47\MessagePack.dll when trying to transfer data to the grid node server.”

Here the path “..\libet47\” is wrong. It should be “..\lib\net47\”.

Maybe that’s related, maybe not.
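One hypothetical explanation for the broken path, not confirmed anywhere in this thread, is that some display step interprets the two characters `\n` in `\lib\net47` as a newline escape. The Python sketch below only illustrates that effect; `render_with_escapes` is an invented stand-in and says nothing about how NCrunch actually formats its messages.

```python
def render_with_escapes(text):
    # Hypothetical display step: treats a backslash followed by 'n' as a newline.
    return text.replace("\\n", "\n")


# The path as it should appear (raw string, so backslashes are kept literally).
path = r"C:\Users\build\.nuget\packages\MessagePack\1.7.3.4\lib\net47\MessagePack.dll"

shown = render_with_escapes(path)
print(shown)  # the path breaks after "\lib" and continues with "et47\MessagePack.dll"
```

If that is what happens, the file on disk would be unaffected and only the displayed message would be wrong.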


Kind regards,
Olaf
Amarok
#5 Posted : Tuesday, October 23, 2018 12:07:29 PM(UTC)
Rank: Newbie

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 9
Location: Austria

Thanks: 2 times
Hi Remco!

I did another experiment, the result of which might be of interest.

Once again, I cleaned my NCrunch server machine and also my developer workstation. This time, I’m the only one operating on the NCrunch server.

I opened 5 different Visual Studio solutions; all using similar NuGet packages. I restored NuGet packages and checked that all solutions were using the same packages from one single package cache folder.

Then I enabled NCrunch in all 5 Visual Studio instances, causing NCrunch to synchronize simultaneously with the NCrunch server for the first time. That means copying sources but also all NuGet packages.

Since all solutions were using the same NuGet sources and the same package cache folders, file mismatches shouldn’t happen in this case.

And then I again got various error messages saying it couldn’t write files while transferring snapshots to node. This happened after some of the solutions started to build and run tests, while other solutions were still transferring files.

IMHO, that seems to be a systematic problem.


Kind regards,
Olaf
Remco
#6 Posted : Tuesday, October 23, 2018 11:36:42 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 5,437

Thanks: 712 times
Was thanked: 888 time(s) in 844 post(s)
Hi Olaf,

I've just reviewed the code in question to update my understanding of what is happening and how all this works. I'll try to describe it in detail so that you can cross-check it with your own environment.

1. NCrunch client loads your projects and for each project, establishes a list of Nuget packages that are applicable to each project
2. All Nuget packages are enumerated and for every package, a 'surface hash' value is calculated which is derived from the name and file size of every file inside the Nuget package's directory. File contents are not included in this hash, just the sizes of the files and their names relative to the package's nuget directory.
3. The NCrunch client then connects to a remote grid node
4. The list of Nuget packages required for the solution along with their surface hash values is sent to the server
5. The server cross-checks the package list with the packages installed in its own Nuget package directory (typically %USERPROFILE%\.nuget\packages but can also be under your node's snapshot directory if it's set to run under the system account).
6. The server computes its own surface hash values from its own packages directory, and compares these to the hash values supplied by the client. Mismatched or missing packages are flagged up.
7. The server requests the mismatched or missing packages from the client
8. These files are then transferred to the server. Even if the files already exist with the same size on the server, they still get re-transferred. So each package is treated as a unit.
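The steps above can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not NCrunch's actual code: the hash function (SHA-256 here), the way names and sizes are combined, and the names `surface_hash` and `packages_to_request` are all invented for the example. Only the inputs (relative file names and file sizes, no file contents) come from the description.

```python
import hashlib
from pathlib import Path


def surface_hash(package_dir):
    """Hash the relative names and sizes (not the contents) of all files in a package directory."""
    root = Path(package_dir)
    digest = hashlib.sha256()  # assumption: the real hash function is not stated
    entries = sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )
    for name, size in entries:
        # Assumed encoding: name, NUL separator, size, newline.
        digest.update(f"{name}\x00{size}\n".encode("utf-8"))
    return digest.hexdigest()


def packages_to_request(client_hashes, server_hashes):
    """Return package ids that are missing on the server or whose surface hash differs (steps 6 and 7)."""
    return sorted(
        pkg
        for pkg, client_hash in client_hashes.items()
        if server_hashes.get(pkg) != client_hash
    )
```

Because contents are not hashed, two copies of a file with the same name and the same size give the same surface hash in this sketch even if their bytes differ; a size change or a missing file is what flags the whole package for transfer.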

A few things of note:
- There is no concurrency control on the transfer of files to the grid node. This means that if you have two clients trying to transfer the same Nuget package file at exactly the same time, only one of them will win.
- The error itself is not a critical one. As long as the Nuget package is intact enough for the CLR to be able to resolve the package files at runtime and work with them, your tests will run just fine. It just gets noisy.
- The packages and surface hash values for each package are stored on both the grid node and the client in-memory after they are first established. This means that if you modify the contents of the Nuget packages directory on the grid node, the grid node server will need to be restarted to update its view of the packages stored on the machine. Likewise, if you update the Nuget packages on the client machine after the NCrunch engine has initialised, you'll need to reset the engine for the changes to take effect.
- There is no consideration or manipulation by NCrunch of the Nuget package file names themselves. So if you have a malformed package file inside your Nuget package directory, that means something must be wrong in the Nuget packages on one of your machines.

The transfer of Nuget package files from client to server is expected to be a rare occurrence. It's the sort of thing that should only really happen the first time you use a grid node with a particular solution on the client side, or if you update to a new version of a Nuget package in your solution. If you're experiencing this regularly, then inconsistencies between the Nuget packages across the network are always the prime suspect (we've seen this happen on our end too).
Amarok
#7 Posted : Wednesday, October 24, 2018 10:16:35 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 9
Location: Austria

Thanks: 2 times
Hey Remco!

Many thanks for the detailed description of the inner workings.

That matches my observations, except for one thing.


Quote:
The error itself is not a critical one. As long as the Nuget package is intact enough for the CLR to be able to resolve the package files at runtime and work with them, your tests will run just fine. It just gets noisy.


If I understand you correctly, the build and tests should continue to work even though the error message is generated. But that is exactly what is not working for us. We observe clients starting to transfer files and packages, then generating one or more of those error messages, and then standing still until someone restarts the server. Disabling and re-enabling the client only restarts the transfer, which again generates an error and gets stuck again (because the DLLs are still locked). The transfer never completes, and thus the build and tests never start.

That's the real problem. Sorry, I didn't emphasize that enough in my original post.


Kind regards
Olaf
Remco
#8 Posted : Wednesday, October 24, 2018 10:54:26 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 5,437

Thanks: 712 times
Was thanked: 888 time(s) in 844 post(s)
Amarok;12791 wrote:

If I understand you correctly, the build and tests should continue to work even though the error message is generated. But that is exactly what is not working for us. We observe clients starting to transfer files and packages, then generating one or more of those error messages, and then standing still until someone restarts the server. Disabling and re-enabling the client only restarts the transfer, which again generates an error and gets stuck again (because the DLLs are still locked). The transfer never completes, and thus the build and tests never start.


Sorry, I hadn't realised that was the case. I've noted this down to be addressed. Thanks for letting me know about it :)