Wednesday, June 17, 2009

High Performance Method to Copy C# KeyedCollection<,>, List<>, and Dictionary<,>

What do you do if you have a collection with 1,000,000 items in it and you want to make a quick copy?  C# says you should loop through the source collection and call .add() on the destination collection, but what if that is just waaaaaaaaaaaaay toooooooooooo sloooooooow?

You could write some unsafe code to get the job done.  What I did here was make some structures that contained all the instance variables of the aforementioned classes.  I then overlaid these structures on top of their classes so I can gain access to the private member variables.  Once inside a List<>, I just use Array.Copy to copy the internal array that makes up the List<>.  There are a couple more variables that have to be copied like the number of elements currently in use in the array, etc.  I do the same sort of thing with Dictionary<,> and KeyedCollection.  KeyedCollection actually calls the CopyList and CopyDict functions.

WARING!!!!!! If you don’t like unsafe code and hacking the C# host, then turn back now.

WARNING!!!!! This is for 32 bit only.  For 64 bit, we have to switch from int to Int64 when dealing with addresses.  I used a wacky way of getting around the C# casting complaints.  I used an integer pointer that I pointed to the stack.  By using array indexing [] on this pointer, I gained unchecked read/write access to the stack, which is where all the local variables live.   I can even change where managed objects point without C# complaining!!!  I don’t recommend changing them to point to just anything…  Hilarity will ensue.  Some reader out there might know a better way to do this.

WARNING!!!! This is for Windows Only.  Don’t try this on Mono.  Well you can, but expect to change the code a bit.

WARNING!!!! I tested this is for Net 3.5.  The layout of the instance variables of the classes might be different in 2.0 and 3.0.  Also, at any time, Redmond can release a patch and change the layout of these classes.  Hopefully they won’t, but don’t be surprised if they do. 

Warnings aside, if you have a performance problem related to copying collections, this will solve them.

using System;
using System.Collections.Generic;
using System.Text;
using System.Runtime.InteropServices;
using System.Collections.ObjectModel;
using System.Reflection;

namespace CopyCollection {

  // these are just some test classes that get copied in 
  // the main()
  class CFoo {
    public int Key;
    public string Name;
  }
  class MyKeyedCollection : KeyedCollection<int, CFoo> {
    public MyKeyedCollection() : base(null, 10) { }
    protected override int GetKeyForItem(CFoo foo) {
      return foo.Key;
    }
  }
  class MyObject {
    public MyKeyedCollection kc;
    // Copy constructor
    public MyObject(MyObject that) {
      this.kc = new MyKeyedCollection();
      if (that != null) {
        CollectionTools.CopyKeyedCollection<int, CFoo>(
         that.kc, this.kc);
      }
    }
  }
  class Program {

    static void Main(string[] args) {

      // here I create an object, containing a 
      // keyedcollection with 7 items
      MyObject mobj1 = new MyObject(null);
      for (int i = 0; i < 7; ++i)
        mobj1.kc.Add(
          new CFoo() { 
            Key = i, Name = i.ToString() 
          }
        );
      // quick copy it
      MyObject mobj2 = new MyObject(mobj1);
      // add a lot more items
      for (int i = 8; i < 712324; ++i)
        mobj2.kc.Add(
          new CFoo() { 
            Key = i, Name = i.ToString() 
          }
        );
      // quick copy this one.
      MyObject mobj3 = new MyObject(mobj2);
      // by looking in debugger, you can see that the first 
      // list has 7 items, the second has 712331 and so does 
      // the 3rd the first 7 items of all objects are the 
      // same the last 712324 items of the last 2 objects 
      // are the same!  hey this works!
    }
  }

  /* ---------------------------------------------------------
   * CollectionTools contains three functions that can be
   * used to copy a list, a dictionary, or a keyedcollection
   * ------------------------------------------------------ */
  public static class CollectionTools {

    public unsafe static KeyedCollection<TKey, TValue> 
     CopyKeyedCollection<TKey, TValue>(
     KeyedCollection<TKey, TValue> src, 
     KeyedCollection<TKey, TValue> dst) {

      object osrc = src;
      // TKeyedCollection* is pointer to a structure 
      // that is a template for the 
      // instance variables of A KeyedCollection<TKey, TValue>
      // What is going on here is that src is copied to osrc, 
      // which is on the stack exactly 4 bytes after psrc. 
      // hence the (int*)&psrc + 1.  osrc is a pointer to
      // the object on the heap, so we copy *osrc to &psrc
      // which causes our structure to overlay the class
      // if you change the order of these variables on the 
      // stack, then this trick wont work.
      TKeyedCollection* psrc = 
       (TKeyedCollection*)(*((int*)&psrc + 1));  
      object odst = dst;
      TKeyedCollection* pdst = 
       (TKeyedCollection*)(*((int*)&pdst + 1));
      object srcObj = null;  // i use i[2] to change this
      object dstObj = null;  // i use i[1] to change this
      int* i = (int*)&i;  // helps me find the stack

      // copy the internal list, by calling CopyList
      i[2] = (int)psrc->_01_items;  
      dstObj = CopyList<TValue>(srcObj as List<TValue>);
      pdst->_01_items = (uint)i[1];

      // copy the internal dictionary.  There is only a
      // dictionary once an internal count threshold has
      // been reached.  Sometimes there is no dictionary
      if (psrc->_04_dict != 0) {
        // copy the internal dictionary by calling CopyDict
        i[2] = (int)psrc->_04_dict;
        dstObj = CopyDict<TKey, TValue>(
         srcObj as Dictionary<TKey, TValue>);
        pdst->_04_dict = (uint)i[1];
      }

      // copy the rest of the values 
      pdst->_03_comparer = psrc->_03_comparer;
      pdst->_05_keyCount = psrc->_05_keyCount;
      pdst->_06_threshold = psrc->_06_threshold;
      return dst;
    }

    public unsafe static List<TValue> CopyList<TValue>(
     List<TValue> src) {

      // see previous comments for notes about these local
      // variables
      object osrc = src;
      TList* psrc = (TList*)(*((int*)&psrc + 1));  
      object srcArray = null;
      object dstArray = null;
      int* i = (int*)&i;  

      // a list has an array of items
      
      // I need c# to consider _01_items as an object,
      // so I use my handy dandy stack munger, i, to
      // set the srcArray and dstArray
      i[2] = (int)psrc->_01_items;  
      // its easy to create an empty array of the right
      // size using one of the List<> constructors
      int capacity = (srcArray as Array).Length;
      List<TValue> dst = new List<TValue>(capacity);
      // overlay a hack structure on the list we just
      // created, so we can copy stuff to it
      TList* pdst = (TList*)(*((int*)&pdst + 1));
      // now setup dstArray
      i[1] = (int)pdst->_01_items; 
      // now that srcArray and dstArray are set, we can  
      // let C# do the copying for us.  This saves us 
      // having to determine the size of each element
      Array.Copy(srcArray as Array, dstArray as Array, 
       capacity);

      // set the size, which is the # of used elements 
      // of the list
      pdst->_03_size = psrc->_03_size;

      return dst;
    }

    public unsafe static Dictionary<TKey, TValue> 
     CopyDict<TKey, TValue>(
     Dictionary<TKey, TValue> src) {

      // see previous comments for notes about these local
      // variables
      object osrc = src;
      TDictionary* psrc = 
       (TDictionary*)(*((int*)&psrc + 1)); 
      object srcArray = null;
      object dstArray = null;
      int* i = (int*)&i;  // helps me find the stack

      // a dictionary has a hash table called buckets

      // I need c# to consider _01_buckets as an object,
      // so I use my handy dandy stack munger, i, to
      // set the srcArray and dstArray
      i[2] = (int)psrc->_01_buckets;
      // create a dictionary the same size as the src 
      // dictionary
      int capacity = (srcArray as Array).Length;
      Dictionary<TKey, TValue> dst =
       new Dictionary<TKey, TValue>(capacity);
      // overlay the hack structure
      TDictionary* pdst = 
       (TDictionary*)(*((int*)&pdst + 1));
      // now setup dstArray
      i[1] = (int)pdst->_01_buckets;
      // now that srcArray and dstArray are set, we can  
      // let C# do the copying for us.  This saves us 
      // having to determine the size of each element
      Array.Copy(srcArray as Array, dstArray as Array,
       capacity);

      // a dictionary also has keeps the data stored 
      // in an array
      i[2] = (int)psrc->_02_entries;
      i[1] = (int)pdst->_02_entries;
      Array.Copy(srcArray as Array, dstArray as Array, 
       capacity);

      // copy the remaining simple variables
      pdst->_03_comparer = psrc->_03_comparer;
      pdst->_04_m_siInfo = psrc->_04_m_siInfo;
      pdst->_08_count = psrc->_08_count;
      pdst->_10_freeList = psrc->_10_freeList;
      pdst->_11_freeCount = psrc->_11_freeCount;

      return dst;
    }
    /* -------------------------------------------------------
     * These are the structs that map onto the classes which
     * I use to access the private member variables
     * Items marked with * I determined needed to be copied
     * -----------------------------------------------------*/
 
    struct TKeyedCollection {
      public uint _00_MethodInfo;   //
      // Collection
      public uint _01_items;      // * IList<T>
      public uint _02_syncRoot;   //   object
      // KeyedCollection
      public uint _03_comparer;   //   IEqualityComparer<TKey> 
      public uint _04_dict;       // * Dictionary<TKey, TItem> 
      public int _05_keyCount;    // *
      public int _06_threshold;   // *
    }

    struct TList {
      public uint _00_MethodInfo; //
      public uint _01_items;      // * T[] 
      public uint _02_syncRoot;   //   object
      public int _03_size;        // *
      public int _04_version;     //
    }

    struct TDictionary {
      public uint _00_MethodInfo; //
      public uint _01_buckets;    // * int[] 
      public uint _02_entries;    // * Entry<TKey, TValue>[] 
      public uint _03_comparer;   //   IEqualityComparer<TKey> 
      public uint _04_m_siInfo;   //   SerializationInfo
      public uint _05__syncRoot;  //   object 
      public uint _06_keys;       //   KeyCollection<,> 
      public uint _07_values;     //   ValueCollection<,> 
      public int _08_count;       // *
      public int _09_version;
      public int _10_freeList;    // * 
      public int _11_freeCount;   // *
    }

  }


}

This construct, “int* i = (int*)&i”, sets the pointer i (nice name for a pointer…) to point to itself.  “i” is located on the stack, so now I have a pointer to a spot on the stack.  Now if we use “i” as an array, we can access variables on the stack in ways that keep C# from complaining about casting.  You don’t get a “cannot cast a managed object to a void *” (or vice versa) error when you set an object this way: “i[2] = (int)psrc->_01_items”  “i[2]” means the int located 2 ints above “i” in the stack.  I assume everyone knows ints are 4 bytes.  I chose ints because almost everything I am messing with is a managed object, which is really a pointer, which is 4 bytes long.  Now if you look at the variables declared on the stack in front of “i”, just count up 2 ints.  This is the variable that is being set by “i[2]”.  Once I get better at this blogging thing I will draw a picture.

      object srcObj = null;  // i use i[2] to change this
      object dstObj = null;  // i use i[1] to change this
      int* i = (int*)&i;  // helps me find the stack

The same sort of thing is going on here

object osrc = src;
TDictionary* psrc = 
 (TDictionary*)(*((int*)&psrc + 1)); 

&psrc is on the stack, 4 bytes before it, “(int*)&psrc + 1”, is osrc.  The value of osrc, “*((int*)&psrc + 1)”, is the address of src in the heap.  Cast that as a TDictionary* and you have just overlaid a structure on a class.

The way I figured out what to copy was by using the RedGate .Net Relfector.  I just looked at the code for all the aforementioned collections to figure out what the instance variables were and then what was important and needed to be copied.

Unfortunately, the order of the instance variables in RedGate’s Reflector and in Type.GetFields() is not the order of the instance variables in memory.  To figure this out, I had to look at the objects in the memory dump of visual studio.  Techniques to figure out what random bytes in the heap actually are is a good topic for a future article.

3 comments:

Reed Arneson said...

Best post ever. \m/

Unknown said...

If you're using a serializable object like a KeyedCollection or a generic list why not just copy it with a memory stream?

public static class ObjectCopier
{
   public static T Clone<T>(T source)
   {
        if (!typeof(T).IsSerializable)
        {
          throw new ArgumentException("The              type must be serializable.", "source");
        }
        if (object.ReferenceEquals(source, null))
        {
               return default(T);
        }
        IFormatter bF = new BinaryFormatter();
        Stream stream = new MemoryStream();
        using (stream)
        {
            bF.Serialize(stream, source);
           stream.Seek(0L, SeekOrigin.Begin);
           return (T) bF.Deserialize(stream);
        }
   }
}

If the list is too big you could get an out of memory exception,
but it seems a little easier.

I haven't benchmarked the method but it works pretty good for lists
of objects under 10,000 records I've used.

Use Serialization-To-Deep-Copy-Objects

osm18 said...


porno izle