I first added the boilerplate code to identify the blobs in my Storage account that contained XML. In the above code, Connection String is the connection string for the Azure Storage Account and Root Container Name is the name of the Blob container I wanted to access. The requirements rule out using binary split tools such as Unix `split`.
One of the optimizations I made to combat this was to store the list of unexpected tags in a HashSet, since hash-set lookups run in O(1) average time.
This means that searching for a value in the HashSet is extremely fast and, most importantly, the lookup time does not grow with N, the number of elements in the set.
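The difference between a set lookup and a list scan can be sketched as follows. This is a minimal illustration, not the post's actual code, and the tag names are hypothetical placeholders:

```python
# A hash set gives O(1) average-time membership tests, so the check stays
# fast no matter how many unexpected tags we track. The tag names below
# are made up for illustration.
unexpected_tags = {"legacyRecord", "deprecatedField", "tempNode"}

def is_unexpected(tag: str) -> bool:
    """Constant-time (average case) lookup; a list scan would be O(n)."""
    return tag in unexpected_tags

print(is_unexpected("legacyRecord"))  # True
print(is_unexpected("header"))        # False
```

With a plain list, every lookup would walk the elements one by one, so the cost would grow linearly with the number of unexpected tags; the set avoids that entirely.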
Since it's vendor-based middleware, we are not able to fix this ourselves.
Our best option is to create a pre-processing tool that first splits the big file into multiple smaller chunks before they are handed to the middleware.
The 64-bit version is recommended for editing and validating very large XML files (up to 1 GB).
I recently needed to verify the integrity of a large number of XML files that are stored in Azure Blob Storage.
The XML file comes with a corresponding W3C schema: a mandatory header part followed by a content element that can contain any number (0..*) of nested data elements.
For the demo code I re-created the schema in simplified form: The header is negligible in size.
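The splitting approach described above can be sketched like this. It is a minimal in-memory illustration, not the actual tool: the element names (`root`, `header`, `content`, `data`) are assumptions based on the simplified schema, and a production version handling gigabyte files would stream with `xml.etree.ElementTree.iterparse` instead of loading the whole document:

```python
import xml.etree.ElementTree as ET

# Assumed simplified schema: <root><header>...</header><content><data/>*</content></root>
SAMPLE = b"""<root>
  <header><id>42</id></header>
  <content>
    <data>a</data><data>b</data><data>c</data><data>d</data><data>e</data>
  </content>
</root>"""

def split_xml(source: bytes, chunk_size: int) -> list:
    """Split the data elements into chunks, repeating the header in each chunk."""
    root = ET.fromstring(source)
    header = root.find("header")
    data = root.find("content").findall("data")
    chunks = []
    for i in range(0, len(data), chunk_size):
        new_root = ET.Element("root")
        new_root.append(header)  # the header is tiny, so duplicating it is cheap
        content = ET.SubElement(new_root, "content")
        content.extend(data[i:i + chunk_size])
        chunks.append(ET.tostring(new_root))
    return chunks

chunks = split_xml(SAMPLE, chunk_size=2)
print(len(chunks))  # 3 chunks: [a,b], [c,d], [e]
```

Because the header is negligible in size, copying it into every chunk keeps each output file valid against the schema without meaningfully increasing the total size.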
In my case, I knew that any blob with an “extension” of or would contain XML, so the LINQ query filters on that criterion.
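The filtering idea can be illustrated with a plain extension check over blob names. This is a hedged sketch, not the post's LINQ query: the blob names are invented, and since the original post's actual extensions are not shown here, `.xml` is only an assumption:

```python
# Hypothetical blob names; ".xml" stands in for whatever extensions the
# original LINQ query matched.
blob_names = [
    "reports/2024/a.xml",
    "reports/2024/b.XML",
    "images/logo.png",
    "data/feed.json",
]

# Case-insensitive extension filter, analogous to a LINQ Where clause.
xml_blobs = [name for name in blob_names if name.lower().endswith(".xml")]

print(xml_blobs)  # ['reports/2024/a.xml', 'reports/2024/b.XML']
```

The lowercase comparison matters because blob names are case-sensitive, so `.XML` and `.xml` would otherwise be treated as different extensions.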