DataSets and Serialization

We all know DataSets are not good when you want to serialize them, because they are always serialized as XML plus the XML Schema, even if you are using a binary formatter. This means you get a big serialization payload and poor serialization performance every time you send a DataSet over Remoting or store one in an ASP.NET session variable, to mention two common scenarios.
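You can see the problem with a few lines of code. The following is a minimal sketch (the table and column names are just placeholders):

using System;
using System.Data;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

public class PayloadCheck
{
   public static void Main()
   {
      DataSet ds = new DataSet("Test");
      DataTable table = ds.Tables.Add("Customer");
      table.Columns.Add("CustomerId", typeof(short));
      table.Columns.Add("CustomerName", typeof(string));
      for (int i = 0; i < 1000; i++)
         table.Rows.Add(new object[] { (short) i, "Customer " + i });

      BinaryFormatter formatter = new BinaryFormatter();
      MemoryStream stream = new MemoryStream();
      formatter.Serialize(stream, ds);
      // Even with a binary formatter, the stream contains the schema
      // and the data as XML, so it is much larger than the raw data
      Console.WriteLine("Serialized size: {0} bytes", stream.Length);
   }
}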

I decided to do some research about this. A Google search pointed me to an MSDN Magazine article by Dino Esposito talking about ADO.NET binary serialization. His solution was to have a 'ghost' class that implements ISerializable and make that class responsible for serializing a DataSet.

His solution works, but it is not very friendly, as you need to wrap every Typed DataSet in a 'ghost' class.

Since with DeKlarit we generate the Typed DataSets ourselves, it seemed we could modify the way the serialization is done inside the DataSet itself, avoiding the need for an extra class.

The problem is that System.Data.DataSet implements ISerializable, and its GetObjectData is not virtual, so you cannot override it in a Typed DataSet. However, if the Typed DataSet itself implements ISerializable, you can define your own GetObjectData and do whatever you want.

In addition, the standard DataSet serialization needs to be very generic and support some scenarios that are not very common. For example, it supports adding a DataTable to an existing typed DataSet at runtime, so when the DataSet is deserialized, that DataTable is recreated in the deserialized object. With Typed DataSets you usually define your DataTables in the XSD, and it is unusual to add more tables or columns at runtime. So I decided not to support that case, which reduces the amount of data that needs to be serialized.

If the DataSet structure is not serialized, we only need to serialize the DataRows, preserving each row's DataRowState (i.e., if a row is flagged as Deleted, it should keep that flag when the object is deserialized). We also need to send a row's .Original values if the row was modified.

When deserializing, we just need to add the rows back to the tables, change the DataRowState, and set the .Original values for modified rows.
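For the two-column Customer table used later in this post, that scheme means each row is serialized as a small group of entries (an illustration of the layout the code below implements):

   Added/Unchanged: state, current Id, current Name
   Modified:        state, original Id, original Name, current Id, current Name
   Deleted:         state, original Id, original Name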

The numbers

The results were surprising. The DataSet had one DataTable with two columns, a System.Int16 (CustomerId) and a System.String (CustomerName). The following table shows the numbers for 1, 10, 100, 1000, and 10000 rows:

Rows  | % of serialization time | % of deserialization time | % of size
1     | 9.0%                    | 11.1%                     | 13.7%
10    | 10.8%                   | 14.1%                     | 14.5%
100   | 13.7%                   | 21.5%                     | 15.8%
1000  | 12.8%                   | 24.4%                     | 16.3%
10000 | 18.0%                   | 30.0%                     | 16.6%

All the numbers are expressed as percentages of the time (or size) taken by the serialization of the original ADO.NET DataSet. Serializing 1000 rows was almost 8 times faster, deserializing them around 4 times faster, and the result was about 6 times smaller.

Then I decided to compare it to serializing a simple object array... and the results were also surprising. The DataSet was faster to serialize, slower to deserialize, and about the same size.

The following table shows the numbers, where 100% is the time it takes to serialize a Customer[], the Customer class being composed of a System.Int16 (Id) and a System.String (name).

Rows  | % serialization time | % deserialization time | % size
1     | 149%                 | 350%                   | 146%
10    | 81%                  | 252%                   | 120%
100   | 46%                  | 208%                   | 101%
1000  | 42%                  | 228%                   | 99%
10000 | 48%                  | 237%                   | 99%

With 1000 rows, the DataSet is roughly twice as fast to serialize, roughly twice as slow to deserialize, and about the same size.

To benchmark the DataSet serialization, I filled the DataSet so that the rows were evenly distributed among the Added/Modified/Unchanged/Deleted RowStates, since the way the data is serialized/deserialized depends on the RowState.
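The following is a minimal sketch of how a row can be put into each RowState (the helper is my illustration, not the actual benchmark code; it assumes the two-column Customer table described above):

// Hypothetical helper: puts one row into each of the four RowStates
static void FillOneOfEach(DataTable table)
{
   // Added: just add the row
   table.Rows.Add(new object[] { (short) 1, "Added" });

   // Unchanged: add it, then AcceptChanges
   DataRow unchanged = table.Rows.Add(new object[] { (short) 2, "Unchanged" });
   unchanged.AcceptChanges();

   // Modified: add, AcceptChanges, then change a value
   DataRow modified = table.Rows.Add(new object[] { (short) 3, "Old name" });
   modified.AcceptChanges();
   modified["CustomerName"] = "New name";

   // Deleted: add, AcceptChanges, then Delete
   DataRow deleted = table.Rows.Add(new object[] { (short) 4, "Deleted" });
   deleted.AcceptChanges();
   deleted.Delete();
}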

When comparing the new DataSet serialization with the object array serialization, I thought it was unfair to have the RowStates distributed that way, because the object array does not carry that information. So I also compared the object array serialization against a DataSet with all the rows marked as Added, which is the best scenario for serialization/deserialization. In this case the numbers were:

Rows  | % serialization time | % deserialization time | % size
1     | 150%                 | 342%                   | 146%
10    | 73%                  | 223%                   | 108%
100   | 45%                  | 180%                   | 86%
1000  | 39%                  | 210%                   | 83%
10000 | 37%                  | 214%                   | 83%

As expected, the numbers were a little better for the DataSet than the previous case.

The code

To do the serialization, the DataSet needs to implement ISerializable (and, like any type handled by the BinaryFormatter, it must be marked [Serializable] itself, since the attribute is not inherited from the base DataSet):

[Serializable]
public class CustomerDataSet : System.Data.DataSet, System.Runtime.Serialization.ISerializable

The following is the DataSet’s GetObjectData method:

void System.Runtime.Serialization.ISerializable.GetObjectData(SerializationInfo si, StreamingContext context)
{ 
   si.AddValue("Customer", Customer.GetObjectData());
}

And the following method is added to the DataTable; it copies the data in the DataTable to an ArrayList:

public ArrayList GetObjectData()
{
   ArrayList dataRows = new ArrayList();
   // Copy each row's state and values into a worker list
   foreach(DataRow row in this.Rows)
   {
      dataRows.Add((int) row.RowState);
      switch (row.RowState)
      {
         case DataRowState.Modified:
            // Modified rows carry both the original and the current values
            dataRows.Add(row[columnCustomerId, DataRowVersion.Original]);
            dataRows.Add(row[columnCustomerName, DataRowVersion.Original]);
            dataRows.Add(row[columnCustomerId, DataRowVersion.Current]);
            dataRows.Add(row[columnCustomerName, DataRowVersion.Current]);
            break;
         case DataRowState.Deleted:
            // Deleted rows only have original values
            dataRows.Add(row[columnCustomerId, DataRowVersion.Original]);
            dataRows.Add(row[columnCustomerName, DataRowVersion.Original]);
            break;
         default:
            // Added and Unchanged rows only need the current values
            dataRows.Add(row[columnCustomerId, DataRowVersion.Current]);
            dataRows.Add(row[columnCustomerName, DataRowVersion.Current]);
            break;
      }
   }
   return dataRows;
}

To do the deserialization, we add the following constructor to the DataSet:

protected CustomerDataSet(SerializationInfo info, StreamingContext context)
{
   InitClass();
   // Disable constraint checking while the rows are loaded back
   this.EnforceConstraints = false;
   Customer.SetObjectData((ArrayList) info.GetValue("Customer", typeof(ArrayList)));
   this.EnforceConstraints = true;
}

And the following method is added to the DataTable; it adds the ArrayList contents back to the DataTable:

public void SetObjectData(ArrayList dataRows)
{
   int dataRowIdx = 0;
   int count = dataRows.Count;
   while (dataRowIdx < count)
   {
      DataRowState state = (DataRowState) ((int) dataRows[dataRowIdx++]);
      // Every row carries at least one pair of values; add it as a new row
      DataRow row = NewRow();
      row[columnCustomerId] = dataRows[dataRowIdx++];
      row[columnCustomerName] = dataRows[dataRowIdx++];
      Rows.Add(row);

      switch (state)
      {
         case DataRowState.Unchanged:
            // AcceptChanges turns the Added row into an Unchanged one
            row.AcceptChanges();
            break;
         case DataRowState.Deleted:
            // Accept the original values, then flag the row as Deleted
            row.AcceptChanges();
            row.Delete();
            break;
         case DataRowState.Modified:
            // The first pair were the original values; accept them, then
            // apply the current values to mark the row as Modified
            row.AcceptChanges();
            row[columnCustomerId] = dataRows[dataRowIdx++];
            row[columnCustomerName] = dataRows[dataRowIdx++];
            break;
      }
   }
}
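To check that the whole thing round-trips, here is a minimal usage sketch (assuming the usual System.IO and System.Runtime.Serialization.Formatters.Binary usings):

CustomerDataSet original = new CustomerDataSet();
// ... fill the DataSet with rows in various RowStates ...

BinaryFormatter formatter = new BinaryFormatter();
MemoryStream stream = new MemoryStream();
formatter.Serialize(stream, original);

stream.Position = 0;
// The serialization constructor above rebuilds the rows and their RowStates
CustomerDataSet copy = (CustomerDataSet) formatter.Deserialize(stream);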

You can get the VS.NET 2003 project for the benchmarks here. It takes about 1 hour and 20 minutes to run on my box; you can lower the number of iterations to make it run faster.

Moving forward

How can you take advantage of this? I can think of three ways:

  • Change the generated code for the DataSet. This is not a good option, because you should never modify the code generated by a tool.
  • Write a surrogate class that handles the serialization (see the sketch below). This could work, but it's not 'transparent', as you need to trigger the serialization yourself to use it.
  • Adapt ADO Guy's typed DataSet generator so it performs the serialization using the approach I followed.

And of course you can also buy a tool that generates Typed DataSets ;).
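If you go the surrogate route, this is a rough sketch of what it could look like, using the standard ISerializationSurrogate/SurrogateSelector machinery (my illustration, not code from the article; note that the formatter hands SetObjectData an uninitialized instance, and that you must be able to configure the formatter yourself, which is exactly why this option is not transparent):

using System.Collections;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;

public class CustomerDataSetSurrogate : ISerializationSurrogate
{
   public void GetObjectData(object obj, SerializationInfo info, StreamingContext context)
   {
      CustomerDataSet ds = (CustomerDataSet) obj;
      info.AddValue("Customer", ds.Customer.GetObjectData());
   }

   public object SetObjectData(object obj, SerializationInfo info, StreamingContext context, ISurrogateSelector selector)
   {
      // 'obj' is created without running any constructor, so build a
      // fresh DataSet and return it instead of populating 'obj'
      CustomerDataSet ds = new CustomerDataSet();
      ds.Customer.SetObjectData((ArrayList) info.GetValue("Customer", typeof(ArrayList)));
      return ds;
   }
}

And you register it with a formatter you control:

SurrogateSelector selector = new SurrogateSelector();
selector.AddSurrogate(typeof(CustomerDataSet), new StreamingContext(StreamingContextStates.All), new CustomerDataSetSurrogate());
BinaryFormatter formatter = new BinaryFormatter();
formatter.SurrogateSelector = selector;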

posted on Sunday, September 14, 2003 12:44 AM

Original post: https://www.cnblogs.com/oop/p/152416.html