Hi.
I've been chasing a nasty heap corruption bug for a while and I think I finally found the culprit in Zebra.Sdk.Comm.Usb.Direct.Internal.UsbDataStream.BeginRead
.
Using WinDbg's !verifyheap
command I found that the heap corruption was always caused by a string similar to this one overwritting something:
{"device.host_status":"030,0,1,0642,003,0,0,0,000,0,0,0\r\n001,0,0,0,1,2,6,0,00000001,1,000\r\n0000,0"}
Which is the output of the following SGD command:
{}{"device.host_status":null}
This points to the Printer.GetCurrentStatus
method.
Using dotPeek I looked at the decompiled source code of the SDK (version 2.16.2905), tracing the code for GetCurrentStatus
until I arrived at Zebra.Sdk.Comm.Usb.Direct.Internal.UsbDataStream.BeginRead
where I found something fishy.
Here's the decompiled source code for BeginRead
:
public override unsafe IAsyncResult BeginRead(
byte[] buffer,
int offset,
int count,
AsyncCallback callback,
object state)
{
NativeMethods.SetLastError(0U);
GCHandle gcHandle = GCHandle.Alloc((object) buffer, GCHandleType.Pinned);
UsbAsyncResult ar = new UsbAsyncResult(state, callback, true);
Overlapped overlapped1 = new Overlapped(0, 0, ar.AsyncWaitHandle.SafeWaitHandle.DangerousGetHandle(), (IAsyncResult) ar);
try
{
byte[] userData = new byte[count];
NativeOverlapped* overlapped2 = overlapped1.Pack(this.m_IOCompletionCallback, (object) userData);
byte* buffer1 = (byte*) ((ulong) gcHandle.AddrOfPinnedObject().ToInt64() + (ulong) offset);
if (gcHandle.IsAllocated)
gcHandle.Free();
if (NativeMethods.ReadFile(this.m_hFile, buffer1, (uint) count, out uint _, overlapped2))
{
ar.m_bCompletedSynchronously = true;
return (IAsyncResult) ar;
}
int lastWin32Error = Marshal.GetLastWin32Error();
if (lastWin32Error == 997)
return (IAsyncResult) ar;
throw new IOException("Unable to initialize read. Error code: " + lastWin32Error.ToString());
}
finally
{
if (gcHandle.IsAllocated)
gcHandle.Free();
}
}
I think buffer
is being moved by the GC after the assignment of the pointer buffer1
and before ReadFile
finishes.
This could be avoided if buffer
was passed to overlapped1.Pack
instead of userData
.
From the documentation for Overlapped.Pack:
Remarks
[...]
The buffer or buffers specified in
userData
must be the same as those passed to the unmanaged operating system function that performs the asynchronous I/O.Note:
The runtime pins the buffer or buffers specified in
userData
for the duration of the I/O operation. If the application domain is unloaded, the runtime keeps the memory pinned until the I/O operation completes.
I patched that change into SdkApi.Desktop.Usb.dll using dnSpy and ran a 48 hour stress test and the issue has not reappeared (before, it crashed after about 4 to 8 hours). It could be luck, though. I'll continue the stress test as long as I can and report back.
I can provide a few memory dumps if necessary.
3 Replies
After 6 days of continuous stress test with the patched DLL, the issue has still not reappeared.
This is an interesting finding. Thanks for reporting and sharing solution. How do you conduct the stress test? Do you just keep calling the getCurrentStatus() at a fixed interval?
I conducted the stress test using a program I'm developing. It's not a minimal reproducible example by any means, but it's not that complex either.
It's a WPF app which basically boils down to calling
getCurrentStatus
and printing a label every time aNumberBox
is increased. To keep the UI responsive while avoiding two threads from talking to the printer at the same time, there's a queue implemented with a TPLDataFlow
which processes the print jobs one at a time onThreadPool
threads. For the test, I added a timer which increases theNumberBox
every two seconds.I ran the test using a paused ZD421. To avoid filling the printer's buffer, I send
~JA
every 5 prints (using the queue).To try and catch the error as close as possible to the corruption, I set the following flags using GFlags: Enable heap tail checking, Enable heap free checking, Enable heap parameter checking, Enable heap validation on call, Enable page heap, Enable heap tagging, Enable application verifier.
I think the important parts for reproducing are calling
getCurrentStatus
in a background thread while on the main thread having enough allocations and deallocations to trigger heap compaction. Also, accessing objects in the heap so corruption is eventually noticed (alternatively, setting the environment varibleCOMPLUS_HeapVerify=1
and enabling thegcUnmanagedToManaged
MDA to verify the heap every time a thread transitions from unmanaged to managed code, though I couldn't get that to work).