mariush

Speed up things when dealing with HUGE amount of small files

mariush replied to mariush's topic in ImgBurn Suggestions

Well, very simple and often used method of reducing searching through arrays is to segment the array into segments that must satisfy some conditions. For example, use the first letter of the file name as a sort of key. All the file names that start with a specific letter must be stored in a specific location in the array. Reserve up to a specific number of entries in the array for file names starting with that letter. When you fill that area with file names starting with that letter, reserve another chunk for file names that start with that letter. Keep track of the order of the files in another array You use a tiny bit more memory (something like 8-16 bytes per entry depending on variable types you use) but you may reduce the number of comparisons by a lot. Here's an example I quickly wrote in PHP where file names added are stored in bucket, where each bucket has room for up to 1024 file names. When a bucket is filled, a new one is created. When searching for a file name, it's only searched in the "buckets" that store file names that start with a specific letter: <?php $buckets = array(); $buckets_cnt = -1; $records = array(); $records_cnt = -1; function bucket_create($letter) { global $buckets,$buckets_cnt; $buckets_cnt++; $buckets[$buckets_cnt] = array('letter'=> $letter, 'cnt' => -1, 'max' => 1023, 'entries' => array()); return $buckets_cnt; } function bucket_find($letter) { global $buckets,$buckets_cnt; if ($buckets_cnt == -1) return FALSE; for ($i=$buckets_cnt;$i>=0;$i--) { if ($buckets[$i]['letter']==$letter) { if ($buckets[$i]['cnt']!==$buckes[$i]['max']) return $i; // if the bucket is not full, add entries here } } return FALSE; } function filename_add($filename) { global $buckets, $records, $records_cnt; $first_letter = substr($filename,0,1); $bucket_id = bucket_find($first_letter); if ($bucket_id===FALSE) { $bucket_id = bucket_create($first_letter); } $i = $buckets[$bucket_id]['cnt']; $buckets[$bucket_id]['cnt']++; $i++; $buckets[$bucket_id]['entries'][$i] = $filename; $records_cnt++; $records[$records_cnt] = array('bucket'=> $bucket_id, 'cnt' => $buckets[$bucket_id]['cnt']); } function filename_find($filename) { global $buckets,$buckets_cnt; $first_letter = substr($filename,0,1); if ($buckets_cnt==-1) return FALSE; for ($i=$buckets_cnt;$i!=-1;$i--) { if ($buckets[$i]['letter'] == $first_letter) { if ($buckets[$i]['cnt']!=-1) { for ($j=0;$j<=$buckets[$i]['cnt'];$j++) { if ($buckets[$i]['entries'][$j] == $filename) return array('bucket'=>$i,'offset' => $j); } } } } return FALSE; } $ret = filename_add('test.htm'); $ret = filename_add('rabbit.txt'); $ret = filename_add('test2.htm'); var_dump($buckets); $ret = filename_find('test2.htm'); var_dump($ret); var_dump($records); ?> and here's the output... The buckets array, you can see two buckets were created, one for the file names starting with "t" and one for the file names starting with "r" : array(2) { [0]=> array(4) { ["letter"]=> string(1) "t" ["cnt"]=> int(1) ["max"]=> int(1023) ["entries"]=> array(2) { [0]=> string(8) "test.htm" [1]=> string(9) "test2.htm" } } [1]=> array(4) { ["letter"]=> string(1) "r" ["cnt"]=> int(0) ["max"]=> int(1023) ["entries"]=> array(1) { [0]=> string(10) "rabbit.txt" } } } and this is what the find file function returns , it finds the file name in first bucket ( 0 ) , second entry (the id 1 because entries start from 0) : array(2) { ["bucket"]=> int(0) ["offset"]=> int(1) } and this is what the "records" array holds, the order in which the file names were added .. you can see (bucket: offset) pairs : (0:0) , (1:0) , (0:1) array(3) { [0]=> array(2) { ["bucket"]=> int(0) ["cnt"]=> int(0) } [1]=> array(2) { ["bucket"]=> int(1) ["cnt"]=> int(0) } [2]=> array(2) { ["bucket"]=> int(0) ["cnt"]=> int(1) } } You could use one byte for offsets (up to 256 entries per "bucket") and maybe 4 bytes for bucket id and then you'd have a maximum of 1,099,511,627,776 files that could be tracked by you in memory ( 2^32 x 256) ..... so in my particular situation with 560k files, there would probably be around 800 - 1000 buckets (if defined as having max 1024 files per bucket) and when you're searching for a file you may only have to search within 100 or so buckets instead of going through all buckets.

Speed up things when dealing with HUGE amount of small files

mariush replied to mariush's topic in ImgBurn Suggestions

I'm sorry but I have to disagree. I'm a bit of a programmer in my spare time, but obviously as I don't have the source code of the software I can't say for sure how imgBurn works internally, I can only assume. However, based strictly on how long it takes, when it needs to validate all names, it looks to me like imgBurn takes one file entry at a time, and checks it against all the entries it already checked before so that the generated udf file name would be unique or whatever, and then adds the entry to some internal array, then moves on to the next entry. So when it adds the 10th file, it runs through the previous 9 entries, when it adds the 100th file, it checks previous 99 entries, and eventually when it adds the 560000-th entry, it checks all the previous 559999 entries. So the more entries, the time to complete everything increases exponentially. If I look with resource monitor, I don't see the disk being accessed while these names are validated (to comply with iso9660 or udf), so I doubt that it's disk access slowing things down. I also see imgBurn's memory consumption steadily increasing over time, as if every once in a while the software tries to resize the array creating a new array, copying data into the new location and removing the old array from memory. It truly seems to me like poorly coded scan through previous entries (or just usage of a basic algorithm that works well with small number of records) and maybe using too small buffers for the entries (like reserving memory for 1000 file entries at a time, instead of automatically adjusting this number based on the total number of files detected previously)

Speed up things when dealing with HUGE amount of small files

mariush posted a topic in ImgBurn Suggestions

This may be very uncommon and I would understand if it would be too complicated or not worth working on it, but I figured it can't hurt to ask. I often have to (or like to) mirror some websites locally, in order to preserve the information as it's shown at a particular point in time. Think documentation, reference, small articles. A lot of these websites use "pretty urls", which means the offline copy of a website ends up with a huge amount of files in a single folder, for example: ... in an offline copy I made last week, I ended up with around 560'000 such files (~5-20 KB in size each) in a single folder. Windows has a really dumb implementation of their NTFS file system and whenever there's more than around 10-15k files in a single folder, access times go up significantly, seeking in folder is slower and so on, not to mention the file allocation table is cluttered with so many file entries and the hard drive gets fragmented. So what I like to do is to create an ISO image of this folder and mount it as a virtual dvd whenever I actually need to access the contents. this also allows me to use a separate software to index the contents of the disk image for fast search and article retrieval The problem is imgBurn is extremely slow at those steps where it generates IS9660 and UDF compliant file names - with the last attempt that involved these 560k files and approximately 12 GB of content, it took probably more than one and a half hours for imgBurn to actually start writing the image to disk after I clicked the "go" button, it took that much to generate the compliant UDF file names (I selected only UDF in settings, otherwise it would have taken probably twice as much time). Note that it also takes the same amount of time when clicking on the Calculator icon to estimate things. I actually cancelled that process and that's what made me decide to disable iso9660 and leave just udf selected. Seems to me like this process of generating file names is single threaded, and probably the software working with very small memory allocations and maybe during that process of generating compliant names, memory keeps getting allocated in small chunks and copied over or arrays are being resized etc etc I guess it would make sense not to go overboard in that area, considering my case is an exception, that most DVDs would only deal with hundreds or thousands of file names, but would it be possible to at least have some sort of "advanced options" which would switch the behavior from "assume few files and use little memory and do things like they are now" to "go wild and use as much memory and cpu cores to do your stuff as long as you're fast"? Would it be possible to determine if user wants both iso9660 and udf compliant file names and compute those at same time instead of doing things separately, if this would speed things up? Thank you for taking time to read my request/suggestion, Marius

Sign In

Posts

Joined

Last visited

mariush's Achievements

ISF Newbie (1/5)

Speed up things when dealing with HUGE amount of small files

Speed up things when dealing with HUGE amount of small files

Speed up things when dealing with HUGE amount of small files

Browse

Activity

Important Information