"Unable to write to file while transferring snapshot to node: some-nuget-package-dll"
Amarok
#1 Posted : Monday, October 22, 2018 10:48:17 AM(UTC)
Rank: Member

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 11
Location: Austria

Thanks: 2 times
Was thanked: 1 time(s) in 1 post(s)
Hello!

We share a single grid node server (v3.22.0.1) between a handful of developers. After migrating our projects to the new PackageReference NuGet format, we started to experience problems when using the grid node server. In the Distributed Processing window, we noticed error messages like:

Code:

"Unable to write to file while transferring snapshot to node: C:\Users\user\.nuget\packages\package-name\package-version\net471\some-dll

NCrunch was unable to write to the file at C:\Users\user\.nuget\packages\package-name\package-version\net471\some-dll when trying to transfer data to the grid node server.

It is possible the file is locked on the grid node by another process, or the grid node server does not have adequate permission to modify it.

If this file is not correctly aligned between the grid node and client, you may experience downstream issues when processing data using this node.

NCrunch will continue to synchronise data with this node and use it for further processing."



NCrunch stopped building and running tests on the grid node server. Sometimes it works, sometimes it doesn't. After restarting the grid node, it worked for a while.

We are not sure, but it seems that these errors and the related problems only start to appear when two developers use the grid node server at the same time. My suspicion is that the first developer machine deploys sources, nuget packages and the like to the grid node, the grid node then successfully builds everything and starts to run tests. Now, the second developer machine starts to deploy nuget packages and fails because the DLLs are locked by the still running tests.

It seems with PackageReference all developer machines deploy their NuGet packages to the same location, causing file locks to happen.

Could that be the case?

Kind regards,
Olaf Kober
Remco
#2 Posted : Monday, October 22, 2018 11:11:24 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,161

Thanks: 964 times
Was thanked: 1296 time(s) in 1202 post(s)
Hi Olaf,

Thanks for sharing this issue.

The problem here is that NCrunch is attempting to update the binary of a Nuget package while test runner processes are working on the grid node and making use of the same package. For NCrunch to be updating these package DLLs while they are in use, there must be a binary difference between the DLL on the grid node and on the client machine. So my bet here is that you have two clients using the same grid node and both clients have at least one Nuget package of the same name and version but with a different binary.

Normally you'd expect that a Nuget package of the same version and same package name should be identical, but sometimes they aren't. I don't think this is an expected outcome of the way Nuget works. It should be possible to fix the problem just by copying the Nuget binaries between your client machines so that they're the same.
Amarok
#3 Posted : Tuesday, October 23, 2018 8:03:03 AM(UTC)
Rank: Member

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 11
Location: Austria

Thanks: 2 times
Was thanked: 1 time(s) in 1 post(s)
Hi Remco!

Thanks for your answer.

I'm a bit surprised that DLLs of same package version should be different on our client PCs as we use a central ProGet server to prevent exactly this scenario. So, I can't really believe that this should be the case, but I will check and come back to you.

Do you know reasons why NuGet behaves that way?

Kind regards,
Olaf
Amarok
#4 Posted : Tuesday, October 23, 2018 10:52:55 AM(UTC)
Rank: Member

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 11
Location: Austria

Thanks: 2 times
Was thanked: 1 time(s) in 1 post(s)
Hello again!

Based on your feedback, I locked all developers out of our central NCrunch server and cleaned that machine, meaning I deleted all workspaces, snapshots, NuGet caches, etc. Then I did the same on my workstation and also on the workstation of another developer, so that we had a clean starting point for further investigation.

Before that, I compared the NuGet packages (mainly the contained DLLs) on the different developer client machines. Here I found that we had tricked ourselves a bit. We sometimes compile a DLL on a local machine and put it into the local NuGet package cache to be able to build a dependent solution with the updated library. That explains why we had binary differences in DLLs of the exact same package version.

Well, that explains why we saw error messages for our own DLLs, but now, after cleaning all machines, we still experience the same issue for third-party packages like System.Threading.Tasks.Dataflow or MessagePack. Immediately after noticing these errors I compared the DLLs on the NCrunch server and the developer machine and found no differences; they are exactly the same. So, there must be another, second reason for those errors to appear.

Do you have an idea?


Not sure if this is related, but I noticed a slightly wrong path in the error detail message. Please take a look at the screenshot (- BROKEN LINK -): 12_29_44-Darwin.DMA - Microsoft Visual Studio.png

The detailed error message says:

“NCrunch was unable to write to the file at C:\Users\build\.nuget\packages\MessagePack\1.7.3.4\lib
et47\MessagePack.dll when trying to transfer data to the grid node server.”

Here the path “..\libet47\” is wrong. It should be “..\lib\net47\”.

Maybe that’s related, maybe not.
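
A side note on the garbled path: the characters that go missing are exactly `\n`, which is what one would see if some display layer treated backslash-n as a newline escape. This is only a guess and nothing in this thread confirms it. The Python sketch below (the helper `render_with_escapes` is hypothetical and not part of NCrunch) merely shows that such a substitution reproduces the message text quoted above.

```python
def render_with_escapes(text: str) -> str:
    """Mimic a display layer that treats backslash-n as a newline escape.

    Hypothetical helper for illustration only.
    """
    return text.replace("\\n", "\n")


# The path as it should read, written as a raw string so Python itself
# does not interpret any backslashes.
path = r"C:\Users\build\.nuget\packages\MessagePack\1.7.3.4\lib\net47\MessagePack.dll"

rendered = render_with_escapes(path)
print(rendered)  # the path is now split in two right after "\lib"
```

If that guess were right, the files on disk would be fine and only the detail text would be affected, which would fit the summary line showing the correct path.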


Kind regards,
Olaf
Amarok
#5 Posted : Tuesday, October 23, 2018 12:07:29 PM(UTC)
Rank: Member

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 11
Location: Austria

Thanks: 2 times
Was thanked: 1 time(s) in 1 post(s)
Hi Remco!

I did another experiment, the result of which might be of interest.

Once again, I cleaned my NCrunch server machine and also my developer workstation. This time, I’m the only one operating on the NCrunch server.

I opened 5 different Visual Studio solutions; all using similar NuGet packages. I restored NuGet packages and checked that all solutions were using the same packages from one single package cache folder.

Then I enabled NCrunch in all 5 Visual Studio instances, causing NCrunch to synchronize simultaneously with the NCrunch server for the first time. That means copying sources but also all NuGet packages.

Since all solutions were using the same NuGet sources and the same package cache folders, file mismatches shouldn’t happen in this case.

And then I again got various error messages saying it couldn’t write files while transferring snapshots to node. This happened after some of the solutions started to build and run tests, while other solutions were still transferring files.

IMHO, that seems to be a systematic problem.


Kind regards,
Olaf
Remco
#6 Posted : Tuesday, October 23, 2018 11:36:42 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,161

Thanks: 964 times
Was thanked: 1296 time(s) in 1202 post(s)
Hi Olaf,

I've just reviewed the code in question to update my understanding of what is happening here and how all this works. I'll try to describe it in detail for you so that you can cross-check it with your own environment.

1. NCrunch client loads your projects and for each project, establishes a list of Nuget packages that are applicable to each project
2. All Nuget packages are enumerated and for every package, a 'surface hash' value is calculated which is derived from the name and file size of every file inside the Nuget package's directory. File contents are not included in this hash, just the sizes of the files and their names relative to the package's nuget directory.
3. The NCrunch client then connects to a remote grid node
4. The list of Nuget packages required for the solution along with their surface hash values is sent to the server
5. The server cross-checks the package list with the packages installed in its own Nuget package directory (typically %USERPROFILE%\.nuget\packages but can also be under your node's snapshot directory if it's set to run under the system account).
6. The server computes its own surface hash values from its own packages directory, and compares these to the hash values supplied by the client. Mismatched or missing packages are flagged up.
7. The server requests the mismatched or missing packages from the client
8. These files are then transferred to the server. Even if the files already exist with the same size on the server, they still get re-transferred. So each package is treated as a unit.
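
The steps above can be sketched roughly as follows. This is not NCrunch's actual code: the choice of SHA-256, the `name:size` encoding, and the function names `surface_hash` and `packages_to_transfer` are all assumptions made for illustration. The only parts taken from the description are that the hash covers relative file names and sizes (never contents) and that mismatched or missing packages are requested as whole units.

```python
import hashlib
import tempfile
from pathlib import Path


def surface_hash(package_dir: Path) -> str:
    """Hash the relative names and sizes of all files under package_dir.

    File contents are deliberately ignored (step 2). SHA-256 and the
    "name:size" encoding are illustrative assumptions.
    """
    entries = []
    for path in package_dir.rglob("*"):
        if path.is_file():
            rel = path.relative_to(package_dir).as_posix()
            entries.append(f"{rel}:{path.stat().st_size}")
    digest = hashlib.sha256()
    for entry in sorted(entries):  # sort so enumeration order does not matter
        digest.update(entry.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def packages_to_transfer(client: dict[str, str], server: dict[str, str]) -> list[str]:
    """Steps 6-7: package keys missing on the server or with a different hash."""
    return sorted(key for key, value in client.items() if server.get(key) != value)


def demo() -> list[str]:
    """Build a client and a server copy whose DLL sizes differ, then compare."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        hashes = {}
        for side, dll_size in (("client", 10), ("server", 12)):
            package = root / side / "messagepack" / "1.7.3.4"
            lib = package / "lib" / "net47"
            lib.mkdir(parents=True)
            (lib / "MessagePack.dll").write_bytes(b"\0" * dll_size)
            hashes[side] = {"messagepack/1.7.3.4": surface_hash(package)}
        return packages_to_transfer(hashes["client"], hashes["server"])


print(demo())
```

Because only names and sizes are hashed, two files of the same size with different bytes produce the same surface hash in this sketch.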

A few things of note:
- There is no concurrency control on the transfer of files to the grid node. This means that if you have two clients trying to transfer the same Nuget package file at exactly the same time, only one of them will win.
- The error itself is not a critical one. As long as the Nuget package is intact enough for the CLR to be able to resolve the package files at runtime and work with them, your tests will run just fine. It just gets noisy.
- The packages and surface hash values for each package are stored on both the grid node and the client in-memory after they are first established. This means that if you modify the contents of the Nuget packages directory on the grid node, the grid node server will need to be restarted to update its view of the packages stored on the machine. Likewise, if you update the Nuget packages on the client machine after the NCrunch engine has initialised, you'll need to reset the engine for the changes to take effect.
- There is no consideration or manipulation by NCrunch of the Nuget package file names themselves. So if you have a malformed package file inside your Nuget package directory, that means something must be wrong in the Nuget packages on one of your machines.

The transfer of Nuget package files from client to server is expected to be a rare occurrence. It's the sort of thing that should only really happen the first time you use a grid node with a particular solution on the client side, or if you update to a new version of a Nuget package in your solution. If you're experiencing this regularly, then inconsistencies between the Nuget packages across the network are always the prime suspect (we've seen this happen on our end too).
Amarok
#7 Posted : Wednesday, October 24, 2018 10:16:35 AM(UTC)
Rank: Member

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 11
Location: Austria

Thanks: 2 times
Was thanked: 1 time(s) in 1 post(s)
Hey Remco!

Many thanks for the detailed description of the inner workings.

That matches with my observations, except for one thing.


Quote:
The error itself is not a critical one. As long as the Nuget package is intact enough for the CLR to be able to resolve the package files at runtime and work with them, your tests will run just fine. It just gets noisy.


If I understand you correctly, even though the error message is generated, the build and tests should continue to work. But exactly that is not working for us. We observe clients starting to transfer files and packages, then generating one or more of those error messages, and then standing still until someone restarts the server. Disabling and re-enabling the client only restarts the transfer, which again generates an error and gets stuck again (because of still-locked DLLs). The transfer never completes, and thus build and tests never start.

That's the real problem. Sorry, I didn't emphasize that enough in my original post.


Kind regards
Olaf
Remco
#8 Posted : Wednesday, October 24, 2018 10:54:26 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,161

Thanks: 964 times
Was thanked: 1296 time(s) in 1202 post(s)
Amarok;12791 wrote:

If I understand you correctly, even though the error message is generated, the build and tests should continue to work. But exactly that is not working for us. We observe clients starting to transfer files and packages, then generating one or more of those error messages, and then standing still until someone restarts the server. Disabling and re-enabling the client only restarts the transfer, which again generates an error and gets stuck again (because of still-locked DLLs). The transfer never completes, and thus build and tests never start.


Sorry, I hadn't realised that was the case. I've noted this down to be addressed. Thanks for letting me know about it :)
Spielosoph
#9 Posted : Friday, November 30, 2018 12:34:15 PM(UTC)
Rank: Newbie

Groups: Registered
Joined: 11/30/2018(UTC)
Posts: 8
Location: Germany

Was thanked: 1 time(s) in 1 post(s)
Hello,

we have the same problem with the grid node server.
Server and clients have v3.22.0.1.

It seems the grid node server has a problem with NuGet package references (not the old way with packages.config).

Many developers use the same grid node servers.
Only one wins and all the others will get the mentioned error and Ncrunch will just build the solution locally.

(- BROKEN LINK -)



Remco
#10 Posted : Saturday, December 1, 2018 1:20:40 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,161

Thanks: 964 times
Was thanked: 1296 time(s) in 1202 post(s)
Spielosoph;12865 wrote:
Hello,

we have the same problem with the grid node server.
Server and clients have v3.22.0.1.

It seems the grid node server has a problem with NuGet package references (not the old way with packages.config).

Many developers use the same grid node servers.
Only one wins and all the others will get the mentioned error and Ncrunch will just build the solution locally.


Hi, thanks for sharing this.

We're looking at ways we can improve this in the product itself, but for the time being, the best solution is to make sure that the Nuget packages you have between your client machines are consistent (xcopy them across).

It's beyond me why we're seeing packages now with the same version but different binary content. To me this seems like a huge problem with the package manager in general.
Spielosoph
#11 Posted : Monday, December 3, 2018 9:15:27 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 11/30/2018(UTC)
Posts: 8
Location: Germany

Was thanked: 1 time(s) in 1 post(s)
Hello,

I do not think this will resolve our problem.

Our theory is this:
The first developer locks the file and all subsequent developers get the exception.
That is what the exception text (see image) says.

The old NuGet way (with packages.config) used relative paths.
The new way copies them to a user specific folder. All developers use the same grid node user, so all developers want to write the same file.
On the local machine this would be "C:\Users\[My Name]\.nuget\packages" for me, which works fine.
On the grid node Server it seems to be "C:\NCrunch Grid Node\Snapshots\.nuget\" which does not work, because all developers try to write and use these files.
Remco
#12 Posted : Monday, December 3, 2018 11:48:17 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,161

Thanks: 964 times
Was thanked: 1296 time(s) in 1202 post(s)
The problem is caused by developers' machines having binary differences between the Nuget packages in their Nuget package cache directories (i.e. C:\Users\[My Name]\.nuget\packages).

Since Nuget moved to using a user/global package cache for packages being referenced in a solution (i.e. no longer using packages.config), the packages need to be stored in a centralised location on the grid node to prevent servers from running out of space. The nuget package directory cache tends to be particularly large (often 5+ GB), so duplicating this for every user using the grid node isn't really a viable solution for most environments.

Assuming Nuget is working correctly, it should be safe to do this, because the same version of the same package should have the same binary content, so it should be safe to share a package between multiple users on a grid node. The way the package manager works by storing packages in a global place on your own machine implies that this is the way Nuget is designed to work (i.e. a new kind of GAC).

Having a global storage system like the Nuget package cache is somewhat of a problem when using a grid node, because it isn't safe to assume that the grid node has all the same packages installed on it as are being used by a client machine. A couple of years ago a mechanism was introduced into NCrunch to automatically copy over Nuget packages from the grid client machine to the server node, but only when the server doesn't have those packages or when the packages it has in its cache directory have binary differences to those on the client.

When everything is as it should be, the packages usually get copied over the first time you use a grid node with a particular solution. Then they stay there, and don't need to be copied again. When other developers use the node with the same solution, their packages should in theory already be identical to any pre-copied packages on the node, so they don't need to be copied either.

But what is happening is that we have at least one package installed on multiple developer machines that despite having the same package name and package version is not binary equivalent. So you have two developers with the same package with a matched version number that is actually a different package. This means that when user #2 connects to the grid node, the system notices their package is different and tries to copy it to the grid node, but can't because an existing package with the same name and version is already in use. So we get an exception and the connection attempt fails.

When you think about it, this is actually a very alarming problem. Binary differences between packages of the same name and version should not exist, because they give the potential for wildly different behaviour. You could have code running on one developer machine that behaves entirely different from another one, even though all the same things are supposed to be installed.

A problem like this is very hard to solve in NCrunch itself, because there isn't really a right answer to it. As noted earlier in this thread, the exception is supposed to be benign and shouldn't discontinue the connection attempt (we are working on fixing this), but there is still the question of which package the grid node should use. It may be that we need to implement a configuration setting to make the grid node store the nuget packages with the snapshot rather than in a global location, though this would cause a massive increase in disk space consumption on the node.

In my view, the correct solution is actually to avoid making use of different packages with the same name and version. Choose a developer machine with a 'master' set of nuget packages, wipe the package caches on all the other machines, then copy the packages over from the master. The reporting of this problem doesn't give me enough detail on what the nature of these differences is, but in the worst case scenarios they could give you much more serious problems than a failed NCrunch grid.
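
Before wiping anything, it may help to see which files actually differ between the 'master' cache and another machine's cache. The following is a minimal Python sketch for that comparison; it is not an NCrunch or NuGet tool, and the function names `content_hashes` and `diff_caches` are made up for this example. Unlike a name-and-size check, it hashes the file contents.

```python
import hashlib
from pathlib import Path


def content_hashes(cache_root: Path) -> dict[str, str]:
    """Map each file path (relative to cache_root) to the SHA-256 of its bytes."""
    result = {}
    for path in cache_root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(cache_root).as_posix()
            result[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return result


def diff_caches(master: Path, other: Path) -> dict[str, list[str]]:
    """Report files missing on either side and files whose bytes differ."""
    a = content_hashes(master)
    b = content_hashes(other)
    return {
        "only_in_master": sorted(a.keys() - b.keys()),
        "only_in_other": sorted(b.keys() - a.keys()),
        "different": sorted(key for key in a.keys() & b.keys() if a[key] != b[key]),
    }
```

It could be pointed at two copies of a packages folder, for example a local `.nuget\packages` directory and a copy of a colleague's; any entry under `different` is a file with the same name and package version but different binary content.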
Argamon
#13 Posted : Tuesday, December 4, 2018 2:48:35 PM(UTC)
Rank: Member

Groups: Registered
Joined: 12/4/2018(UTC)
Posts: 12
Location: Germany

Thanks: 1 times
Was thanked: 2 time(s) in 2 post(s)
Maybe it is just a visual glitch, but in the summary of the error list it shows:

Unable to write to file while transferring snapshot to node: C:\TFS\Snapshots\.nuget\AutoFixture.Idioms\4.5.0\lib\net452\AutoFixture.Idioms.dll

If I click on it and read the error message it shows (I marked the strange part in bold):

NCrunch was unable to write to the file at C:\TFS\Snapshots\.nuget\AutoFixture.Idioms\4.5.0\libet452\AutoFixture.Idioms.dll when trying to transfer data to the grid node server.

It is possible the file is locked on the grid node by another process, or the grid node server does not have adequate permission to modify it.

If this file is not correctly aligned between the grid node and client, you may experience downstream issues when processing data using this node.

NCrunch will continue to synchronise data with this node and use it for further processing.


Maybe it is just the part that creates the message, or maybe that is where the real problem lies.
Remco
#14 Posted : Tuesday, December 4, 2018 10:49:33 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,161

Thanks: 964 times
Was thanked: 1296 time(s) in 1202 post(s)
Argamon;12881 wrote:

NCrunch was unable to write to the file at C:\TFS\Snapshots\.nuget\AutoFixture.Idioms\4.5.0\libet452\AutoFixture.Idioms.dll when trying to transfer data to the grid node server.


This looks very suspect to me.

NCrunch doesn't perform any kind of processing or parsing of the package DLL file paths. It gets handed them straight from MSBuild. My bet here would be that you have at least one grid client with this erroneous path in its cache directory. This could be the binary difference causing files to be recopied.
Spielosoph
#15 Posted : Tuesday, December 11, 2018 7:07:17 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 11/30/2018(UTC)
Posts: 8
Location: Germany

Was thanked: 1 time(s) in 1 post(s)
(- BROKEN LINK -)

I included an image of NCrunch where you can see that the paths are different in the caption and in the first line of the message.

Can you please tell us where we have to check our paths?
Remco
#16 Posted : Wednesday, December 12, 2018 6:45:31 AM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,161

Thanks: 964 times
Was thanked: 1296 time(s) in 1202 post(s)
The paths of your local Nuget cache on your client machines will usually be under the user profile at C:\Users\USER\.nuget\packages, although this is also configurable. I would recommend clearing out this cache on all of your client machines and allowing Visual Studio to pull the packages down again so that they are binary consistent.

I'm wondering if you might be interested in giving the following build a try:

NCrunch_Console_3.23.0.7.msi
NCrunch_Console_3.23.0.7.zip
NCrunch_GridNodeServer_3.23.0.7.msi
NCrunch_GridNodeServer_3.23.0.7.zip
NCrunch_LicenseServer_3.23.0.7.zip
NCrunch_VS2008_3.23.0.7.msi
NCrunch_VS2010_3.23.0.7.msi
NCrunch_VS2010_3.23.0.7.zip
NCrunch_VS2012_3.23.0.7.msi
NCrunch_VS2012_3.23.0.7.zip
NCrunch_VS2013_3.23.0.7.msi
NCrunch_VS2013_3.23.0.7.zip
NCrunch_VS2015_3.23.0.7.msi
NCrunch_VS2015_3.23.0.7.msi.7z
NCrunch_VS2015_3.23.0.7.zip
NCrunch_VS2017_3.23.0.7.msi
NCrunch_VS2017_3.23.0.7.msi.7z
NCrunch_VS2017_3.23.0.7.zip


I'm not particularly happy with how the current release handles this situation, so the above build introduces 3 separate fixes in this area to try and improve the experience:

1. When the file transfer/locking error appears, the grid node should continue with its synchronisation and normal function as though the problem didn't happen (as the message currently suggests).
2. The grid node server will no longer attempt to re-transfer Nuget packages that are in-use by other grid clients. This means that the error actually shouldn't appear anymore.
3. I've found an issue where having excess files stored in the grid server's Nuget cache directory could mess with the hashing system and cause packages to be re-transferred even when they didn't need to be. I strongly suspect this issue is also appearing in your environment.

I'm hopeful that at least one of the above fixes should resolve this problem for you, as far as NCrunch is concerned. I still recommend clearing out your client's nuget caches and making sure these are aligned, however.
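
To illustrate fixes 2 and 3, here is a small Python sketch under stated assumptions. It assumes a hash built only from file names and sizes, as described earlier in this thread; the file listings, sizes, the `stray.tmp` name, and the function `should_retransfer` are all hypothetical and not taken from NCrunch. It shows how one excess file on the server alone makes the hashes disagree, and how an in-use check would suppress the re-transfer.

```python
import hashlib


def surface_hash(files: list[tuple[str, int]]) -> str:
    """Hash (relative name, size) pairs; file contents play no part."""
    digest = hashlib.sha256()
    for name, size in sorted(files):
        digest.update(f"{name}:{size}\n".encode("utf-8"))
    return digest.hexdigest()


def should_retransfer(
    client_files: list[tuple[str, int]],
    server_files: list[tuple[str, int]],
    in_use_by_other_client: bool,
) -> bool:
    """Re-transfer only on a hash mismatch, and never while the package is in use."""
    mismatch = surface_hash(client_files) != surface_hash(server_files)
    return mismatch and not in_use_by_other_client


# Hypothetical listings: identical packages, plus one stray file on the server.
client_files = [("lib/net47/MessagePack.dll", 1000), ("messagepack.nuspec", 500)]
server_files = client_files + [("stray.tmp", 1)]

print(should_retransfer(client_files, server_files, in_use_by_other_client=False))
print(should_retransfer(client_files, server_files, in_use_by_other_client=True))
```

In this sketch the stray file triggers a re-transfer even though every package file is identical, unless the package is currently in use by another client.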
Spielosoph
#18 Posted : Monday, December 17, 2018 2:06:27 PM(UTC)
Rank: Newbie

Groups: Registered
Joined: 11/30/2018(UTC)
Posts: 8
Location: Germany

Was thanked: 1 time(s) in 1 post(s)
We installed the hotfix on our machines.
We get a lot of "Cannot find contract with id: 4096" error messages (see image).
The file locks seem to be gone though.

(- BROKEN LINK -)
Remco
#19 Posted : Monday, December 17, 2018 10:55:30 PM(UTC)
Rank: NCrunch Developer

Groups: Administrators
Joined: 4/16/2011(UTC)
Posts: 7,161

Thanks: 964 times
Was thanked: 1296 time(s) in 1202 post(s)
Hi, this is the sort of error that could happen if the grid node and client are running on different versions. Can you confirm that all machines have been updated to the new version?
Amarok
#17 Posted : Tuesday, December 18, 2018 9:31:12 AM(UTC)
Rank: Member

Groups: Registered
Joined: 9/27/2013(UTC)
Posts: 11
Location: Austria

Thanks: 2 times
Was thanked: 1 time(s) in 1 post(s)
Remco;12918 wrote:

I'm not particularly happy with how the current release handles this situation, so the above build introduces 3 separate fixes in this area to try and improve the experience:

1. When the file transfer/locking error appears, the grid node should continue with its synchronisation and normal function as though the problem didn't happen (as the message currently suggests).
2. The grid node server will no longer attempt to re-transfer Nuget packages that are in-use by other grid clients. This means that the error actually shouldn't appear anymore.
3. I've found an issue where having excess files stored in the grid server's Nuget cache directory could mess with the hashing system and cause packages to be re-transferred even when they didn't need to be. I strongly suspect this issue is also appearing in your environment.

I'm hopeful that at least one of the above fixes should resolve this problem for you, as far as NCrunch is concerned. I still recommend clearing out your client's nuget caches and making sure these are aligned, however.



We are going to test that pre-release too. Point 3 sounds promising. It might be our problem, because we consume a really large number of packages...
Spielosoph
#20 Posted : Tuesday, December 18, 2018 10:58:51 AM(UTC)
Rank: Newbie

Groups: Registered
Joined: 11/30/2018(UTC)
Posts: 8
Location: Germany

Was thanked: 1 time(s) in 1 post(s)
Remco;12931 wrote:
Hi, this is the sort of error that could happen if the grid node and client are running on different versions. Can you confirm that all machines have been updated to the new version?

You were right.
One person still used the old version.

Verdict:
Seems to work.
Thank you
1 user thanked Spielosoph for this useful post.
Remco on 12/18/2018(UTC)