[BUG REPORT, C# SKD v2.16.2905] Managed heap corruption caused by Zebra.Sdk.Comm.Usb.Direct.Internal.UsbDataStream.BeginRead

[BUG REPORT, C# SKD v2.16.2905] Managed heap corruption caused by Zebra.Sdk.Comm.Usb.Direct.Internal.UsbDataStream.BeginRead // Expert user has replied. M Mijail Todorovich 1 year 1 month ago 230 3 0 Hi. I've been chasing a nasty heap corruption bug for a while and I think I finally found the culprit in `Zebra.Sdk.Comm.Usb.Direct.Internal.UsbDataStream.BeginRead`. Using WinDbg's `!verifyheap` command I found that the heap corruption was always caused by a string similar to this one overwritting something: `{"device.host_status":"030,0,1,0642,003,0,0,0,000,0,0,0\r\n001,0,0,0,1,2,6,0,00000001,1,000\r\n0000,0"}` Which is the output of the following SGD command: `{}{"device.host_status":null}` This points to the `Printer.GetCurrentStatus` method. Using dotPeek I looked at the decompiled source code of the SDK (version 2.16.2905), tracing the code for `GetCurrentStatus` until I arrived at `Zebra.Sdk.Comm.Usb.Direct.Internal.UsbDataStream.BeginRead` where I found something fishy. Here's the decompiled source code for `BeginRead`: public override unsafe IAsyncResult BeginRead( byte[] buffer, int offset, int count, AsyncCallback callback, object state) { NativeMethods.SetLastError(0U); GCHandle gcHandle = GCHandle.Alloc((object) buffer, GCHandleType.Pinned); UsbAsyncResult ar = new UsbAsyncResult(state, callback, true); Overlapped overlapped1 = new Overlapped(0, 0, ar.AsyncWaitHandle.SafeWaitHandle.DangerousGetHandle(), (IAsyncResult) ar); try { byte[] userData = new byte[count]; NativeOverlapped* overlapped2 = overlapped1.Pack(this.m_IOCompletionCallback, (object) userData); byte* buffer1 = (byte) ((ulong) gcHandle.AddrOfPinnedObject().ToInt64() + (ulong) offset); if (gcHandle.IsAllocated) gcHandle.Free(); if (NativeMethods.ReadFile(this.m_hFile, buffer1, (uint) count, out uint _, overlapped2)) { ar.m_bCompletedSynchronously = true; return (IAsyncResult) ar; } int lastWin32Error = Marshal.GetLastWin32Error(); if (lastWin32Error == 997) return (IAsyncResult) ar; throw new IOException("Unable to initialize read. Error code: " + lastWin32Error.ToString()); } finally { if (gcHandle.IsAllocated) gcHandle.Free(); } } I think `buffer` is being moved by the GC after the assignment of the pointer `buffer1` and before `ReadFile` finishes. This could be avoided if `buffer` was passed to `overlapped1.Pack` instead of `userData`. From the documentation for Overlapped.Pack: Remarks* [...] The buffer or buffers specified in `userData` must be the same as those passed to the unmanaged operating system function that performs the asynchronous I/O. Note: The runtime pins the buffer or buffers specified in `userData` for the duration of the I/O operation. If the application domain is unloaded, the runtime keeps the memory pinned until the I/O operation completes. I patched that change into SdkApi.Desktop.Usb.dll using dnSpy and ran a 48 hour stress test and the issue has not reappeared (before, it crashed after about 4 to 8 hours). It could be luck, though. I'll continue the stress test as long as I can and report back. I can provide a few memory dumps if necessary.

Please Register or Login to post a reply

3 Replies

M Mijail Todorovich 1 year ago

After 6 days of continuous stress test with the patched DLL, the issue has still not reappeared.

S Steven Si 1 year ago

This is an interesting finding. Thanks for reporting and sharing solution. How do you conduct the stress test? Do you just keep calling the getCurrentStatus() at a fixed interval?

M Mijail Todorovich 1 year ago

I conducted the stress test using a program I'm developing. It's not a minimal reproducible example by any means, but it's not that complex either.

It's a WPF app which basically boils down to calling getCurrentStatus and printing a label every time a NumberBox is increased. To keep the UI responsive while avoiding two threads from talking to the printer at the same time, there's a queue implemented with a TPL DataFlow which processes the print jobs one at a time on ThreadPool threads. For the test, I added a timer which increases the NumberBox every two seconds.

I ran the test using a paused ZD421. To avoid filling the printer's buffer, I send ~JA every 5 prints (using the queue).

To try and catch the error as close as possible to the corruption, I set the following flags using GFlags: Enable heap tail checking, Enable heap free checking, Enable heap parameter checking, Enable heap validation on call, Enable page heap, Enable heap tagging, Enable application verifier.

I think the important parts for reproducing are calling getCurrentStatus in a background thread while on the main thread having enough allocations and deallocations to trigger heap compaction. Also, accessing objects in the heap so corruption is eventually noticed (alternatively, setting the environment varible COMPLUS_HeapVerify=1 and enabling the gcUnmanagedToManaged MDA to verify the heap every time a thread transitions from unmanaged to managed code, though I couldn't get that to work).