Optimal size of HDD for virtual Digitization server

=Introduction=

Efficient use of server storage and capacity is increasingly important for companies, as the volume of data they gather and produce is growing rapidly and is essential to their operation. This simulation calculates the optimal HDD size for a server used for processing documents, and it is based on real data obtained from a company production server.

=Problem definition=

The simulation is based on data gathered by the server application in its database, on observations of the server (focused mainly on batch counts and sizes), and on additional information about the solution running on the server taken from its documentation:
 * Processing takes place only on working days
 * Every document scanned to the server is processed by a digitization application, which creates a batch containing the original scanned document, extracted data in XML files, log files, enhanced images, and a searchable PDF
 * The application saves all important information about batches in a database
 * Backup images from scanning stay on the server for 6 months (an additional ~50% of the batch size) and are kept in a separate folder from the batch folders
 * Successfully processed batches older than 14 days are deleted daily by another application, which also produces log files about successful/failed deletions
 * As a precaution, assume that 5% of batches fail to be processed correctly (in the obtained sample, however, all batches were processed correctly)
 * Failed batches stay on the server and are handled by admins every 6 months, when the backups are deleted
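The retention rules above determine how much disk space a day's batches occupy over time. As a rough deterministic sanity check, they can be encoded as a short function (a sketch with illustrative names; the 10-day figure assumes 14 calendar days correspond to roughly 10 working days):

```python
def expected_usage_mb(days, daily_mb, retained_days=10,
                      backup_frac=0.50, fail_rate=0.05):
    """Deterministic estimate of disk usage (MB) after `days` working days,
    assuming a constant volume of `daily_mb` MB of new batches per day."""
    ok = (1 - fail_rate) * daily_mb * min(days, retained_days)  # recent successful batches
    failed = fail_rate * daily_mb * days       # failed batches linger until the 6-month cleanup
    backups = backup_frac * daily_mb * days    # backup images are kept for 6 months
    return ok + failed + backups

# With the observed averages (17 batches/day x 24 MB = 408 MB/day) over 132 working days:
print(expected_usage_mb(132, 17 * 24) / 1024)  # roughly 33 GB
```

This back-of-the-envelope figure lands close to the ~34 GB average reported later in the Results section, which suggests the rules are stated consistently.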

=Observation=

A first observation was made to assess the available data and to decide on a suitable approach for designing the simulation. Information gathered during the observation:
 * Each day an average of 17 batches was processed by the server, with an average batch size of 24 MB
 * The number of batches varies considerably and cannot be easily predicted, so this variability has to be taken into account
 * Information about deleted batches can be found in the database on the server

=Simulation environment=

MS Excel 2016

=Simulation method=

Monte Carlo

=Obtaining real data=

The data for the simulation was extracted from a database containing information about all the batches processed on the server during the last 65 working days. A total of 1,124 batches were processed during those days. The part of the database table needed for the simulation is included in the MS Excel file.
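As a quick consistency check on the extracted data, the batch total and the number of working days reproduce the daily average reported in the observation:

```python
total_batches = 1124
working_days = 65
avg_per_day = total_batches / working_days
print(round(avg_per_day, 1))  # 17.3, consistent with the observed average of 17 batches/day
```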

=Derivation of probability distributions=

After the data was gathered and analyzed in a contingency table, graphs were made to display the average batch size (in bins with an increment of 5,000 KB) and the average batch count per day. Both quantities appeared to follow a log-normal distribution, so the random values for the average batch size and the average batch count were generated from log-normal distributions fitted to the data.
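The fitting step can be sketched in Python (a sketch, not the Excel formulas actually used): taking logarithms of a log-normal sample yields a normal sample, whose mean and standard deviation are the m and s parameters used later in the simulation.

```python
import math

def lognormal_params(sample):
    """Estimate the log-scale mean (m) and standard deviation (s) of a
    log-normally distributed sample by fitting a normal to its logarithms."""
    logs = [math.log(x) for x in sample]
    m = sum(logs) / len(logs)
    var = sum((v - m) ** 2 for v in logs) / (len(logs) - 1)  # sample variance
    return m, math.sqrt(var)

# Example: a degenerate sample where every value is e has m = 1, s = 0
print(lognormal_params([math.e, math.e, math.e]))  # (1.0, 0.0)
```

New values would then be drawn with `random.lognormvariate(m, s)`, the Python analogue of Excel's LOGNORM.INV applied to a uniform random number.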

=Simulation=

For the simulation, 132 working days (6 months) were considered as the period before backup deletion and error fix-up take place. For both the average batch size and the average batch count, random values were generated from the log-normal distribution based on the scaled mean (m) and standard deviation (s) of the original data set.

These random values were generated 132 times and multiplied pairwise to obtain the data volume per day. From the daily volumes, the error and backup sizes were calculated, along with the cumulative size, which was reduced at each step starting from step 10 (14 calendar days, after which the oldest batch is deleted, correspond to 10 working days). The overall size after 6 months was fed into an Excel data table to repeat the simulation 1,000 times.
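The procedure described above can also be sketched outside Excel. The following Python sketch mirrors the steps; the log-scale standard deviations 0.4 and 0.6 are illustrative placeholders rather than the fitted values, and the log-scale means are shifted so that the distribution means match the observed 17 batches/day and 24 MB:

```python
import math
import random

DAYS = 132          # 6 months of working days
RETAIN = 10         # working days before a successful batch is deleted (14 calendar days)
BACKUP_FRAC = 0.50  # backup images add ~50% of each day's batch volume
FAIL_RATE = 0.05    # assumed share of batches that fail and stay on disk

def simulate_once(rng, count_s=0.4, size_s=0.6):
    """One 132-day run; returns disk usage in MB at the end of the period."""
    # Shift the log-scale means so the distribution means equal the observed averages.
    count_m = math.log(17) - count_s ** 2 / 2
    size_m = math.log(24) - size_s ** 2 / 2
    daily, usage = [], 0.0
    for day in range(DAYS):
        volume = (rng.lognormvariate(count_m, count_s)
                  * rng.lognormvariate(size_m, size_s))   # MB produced this day
        daily.append(volume)
        usage += volume * (1 + BACKUP_FRAC)               # new batches plus backup images
        if day >= RETAIN:                                 # delete old successful batches
            usage -= daily[day - RETAIN] * (1 - FAIL_RATE)
    return usage

runs = [simulate_once(random.Random(seed)) for seed in range(1000)]
gb = 1024.0
print(f"avg {sum(runs) / len(runs) / gb:.1f} GB, "
      f"max {max(runs) / gb:.1f} GB, min {min(runs) / gb:.1f} GB")
```

With these placeholder parameters, the average over 1,000 runs lands near the ~34 GB figure from the Results section, while the maximum varies considerably between repetitions, matching the behaviour noted there.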

=Results=

After generating 1,000 simulation runs, the average, highest, and lowest space needed were calculated and are displayed in the table below:

An issue with the results is that the highest space needed varies considerably between runs (because, as mentioned at the beginning of the paper, both batch size and batch count fluctuate a lot), even with 1,000 repetitions. This had to be taken into consideration when drawing conclusions.

=Conclusion=

One option for the company, when deciding how much space to reserve for document processing, is to provision the highest space needed across the 1,000 simulations, which is nearly 101 GB. With this option it is also recommended to monitor the server at least once a month, since the highest value varies a lot, and to adjust the size if needed. The upside of this approach is that it requires less initial work.

Because of the high variance, another approach is to provision only the average space needed, so the HDD would have "only" 34 GB free at the start of each 6-month period. With this approach, however, if the HDD starts getting full, the size must be increased automatically based on the last month in which backup deletion took place. The upside of this solution is better space utilization: if only the highest size needed is provisioned, a large part of the disk would sit unused most of the time.

=Code=

The Excel file with simulation can be found here: https://app.box.com/s/duj71rtgvf4g1u9gi011tcib2umnjdil