Mediatex 0.9: How much

1.6 How much

In order to prevent denial of service, we do not use a centralised database but text files spread across all servers using GIT.

Consequently, parsers need an amount of CPU and memory proportional to the size of these files (whereas databases without indexes do not). The generated HTML catalogue also requires much more space than a dynamic web site does (moreover, the limitation should come from the number of available inodes on the partition where the HTML catalogue is stored).
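Since inodes are likely to run out before disk space does, it can help to check the inode headroom of the target partition before generating a catalogue. The following is a minimal sketch, not part of MEDIATEX: the function name `inode_headroom` and its parameters are hypothetical, and it relies only on the standard `os.statvfs` call available on Unix systems.

```python
import os

def inode_headroom(path, inodes_needed):
    """Return (free_inodes, enough) for the filesystem that holds `path`.

    `free_inodes` is None when the filesystem does not report a fixed
    inode count (some filesystems allocate inodes dynamically).
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        # No fixed inode table reported: nothing to compare against.
        return None, True
    free = st.f_favail  # inodes available to unprivileged users
    return free, free >= inodes_needed

if __name__ == "__main__":
    # Hypothetical example: is there room for about 600,000 HTML files here?
    free, enough = inode_headroom(".", 600_000)
    print("free inodes:", free, "enough:", enough)
```

The path to pass is the directory where the HTML catalogue is written; its location depends on the installation and is not assumed here.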

All in all, the MEDIATEX system is not designed to handle collections of more than half a million archives (whereas databases easily handle millions). It should handle several such “not so big” collections, but not too many of them either.

The following tests are based on the GIT upgrade plus HTML catalogue generation, which is the most resource-consuming query (and which implies parsing most of the meta-data files). They give an idea of the resources involved (size on disk, amount of memory and CPU time).

archives	GIT	RAM	HTML	HTML inodes	time
27,550	30M	74M	357M	88,717	1’06
54,950	59M	132M	598M	148,294	1’47
82,398	88M	191M	840M	207,985	3’25
110,006	118M	251M	1.1G	268,029	3’18
137,561	147M	310M	1.3G	328,002	5’21
165,104	177M	371M	1.6G	387,836	5’11
192,771	207M	432M	1.8G	447,971	5’47
220,346	237M	493M	2.1G	507,945	5’18
247,861	267M	553M	2.3G	567,664
302,912	326M	674M	2.8G	687,278	8’31
330,371	356M	735M	3.0G	746,934
358,005	386M	796M	3.2G	807,007	10’43
385,425	416M	856M	3.5G	866,551	11’17
412,848	446M	916M	3.7G	926,102	12’29
440,405		977M	3.9G	985,990
467,899	505M	1038M	4.2G	1,045,755	23’37
495,383	535M	1098M	4.4G	1,105,445	14’33
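To turn the table into a rough per-archive cost, one can interpolate linearly between its first and last rows. The sketch below does only that; the names `per_archive_slopes` and `estimate` are hypothetical, and it assumes that 1G equals 1000M when converting the HTML column. It is an illustration of the arithmetic, not a sizing guarantee.

```python
# (archives, GIT in M, RAM in M, HTML in M, HTML inodes),
# copied from the first and last rows of the table above.
FIRST = (27_550, 30, 74, 357, 88_717)
LAST = (495_383, 535, 1098, 4400, 1_105_445)  # 4.4G taken as 4400M (assumption)
LABELS = ("GIT (M)", "RAM (M)", "HTML (M)", "HTML inodes")

def per_archive_slopes(first=FIRST, last=LAST):
    """Growth of each resource per additional archive between two rows."""
    span = last[0] - first[0]
    return [(b - a) / span for a, b in zip(first[1:], last[1:])]

def estimate(n_archives, first=FIRST, last=LAST):
    """Linear estimate of each resource for a collection of n_archives."""
    slopes = per_archive_slopes(first, last)
    return {
        label: base + slope * (n_archives - first[0])
        for label, base, slope in zip(LABELS, first[1:], slopes)
    }

if __name__ == "__main__":
    for label, slope in zip(LABELS, per_archive_slopes()):
        print(f"{label}: {slope:.5f} per archive")
    print(estimate(250_000))
```

Under these assumptions each additional archive costs roughly 1K of GIT data, 2K of RAM, 9K of HTML and a little over two inodes.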

Notice:

This benchmark was run on an i686 operating system with an AMD Sempron(tm) Processor 3200+ and 2G of RAM. It made me aware that AMD64 systems use twice as much memory.
The benchmark is done sequentially in order to test GIT merging with a constant amount of data added each time. Synchronisation times using GIT depend on the network connection, and the benchmark above uses the local network interface.
Time spent looks quite linear (I was expecting O(n*log(n))). GIT push and HTML serialisation take most of the time, about 40% each. Parsing and serialising metadata take about 10% each.
I was happily surprised that the logarithmic factor is not accentuated by the fact that the files are serialised and parsed (into b-trees) in the same order.
The audit report process is not optimised. It is possible to handle up to 1,000,000 archives or 500M of metadata per collection, but this particular process has not yet been tested with more than 50,000 archives (as it lasts hours).
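The remark on linearity in the notes above can be checked against the measured times. The sketch below fits both a linear model and an n*log(n) model through the origin by least squares and returns the residual of each, so the two can be compared. It is only an illustration: the helper names are hypothetical, and it assumes the time column reads as minutes’seconds. Rows without a measured time are left out.

```python
import math

# (archives, time) pairs copied from the table; rows without a time are omitted.
ROWS = [
    (27_550, "1’06"), (54_950, "1’47"), (82_398, "3’25"), (110_006, "3’18"),
    (137_561, "5’21"), (165_104, "5’11"), (192_771, "5’47"), (220_346, "5’18"),
    (302_912, "8’31"), (358_005, "10’43"), (385_425, "11’17"),
    (412_848, "12’29"), (467_899, "23’37"), (495_383, "14’33"),
]

def parse_time(text):
    """Convert "M’SS" to seconds (assumes minutes’seconds)."""
    minutes, seconds = text.replace("’", "'").split("'")
    return int(minutes) * 60 + int(seconds)

def fit_through_origin(xs, ys):
    """Least-squares fit of y = k * x; returns (k, residual sum of squares)."""
    k = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    rss = sum((y - k * x) ** 2 for x, y in zip(xs, ys))
    return k, rss

def compare_models(rows=ROWS):
    """Fit t = a*n and t = b*n*log(n) to the same measurements."""
    ns = [n for n, _ in rows]
    ts = [parse_time(t) for _, t in rows]
    return {
        "linear": fit_through_origin(ns, ts),
        "n log n": fit_through_origin([n * math.log(n) for n in ns], ts),
    }

if __name__ == "__main__":
    for name, (k, rss) in compare_models().items():
        print(f"{name}: coefficient={k:.3e}, residual={rss:.0f}")
```

With so few points, and one visibly slow run at 467,899 archives, a small difference between the two residuals should not be over-interpreted.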