
Speed things up when dealing with a HUGE number of small files


mariush


This may be a very uncommon use case and I would understand if it's too complicated or not worth working on, but I figured it can't hurt to ask.

I often have to (or like to) mirror some websites locally, in order to preserve the information as it's shown at a particular point in time. Think documentation, reference, small articles.

A lot of these websites use "pretty URLs", which means the offline copy of a website ends up with a huge number of files in a single folder, for example:
 

website/articles/125-reference-function-copyfile.htm
[...]
website/articles/500-examples-function-movefile.htm

... in an offline copy I made last week, I ended up with around 560,000 such files (~5-20 KB each) in a single folder.

Windows has a really dumb NTFS implementation: whenever there are more than around 10-15k files in a single folder, access times go up significantly, seeking within the folder gets slower and so on, not to mention the Master File Table gets cluttered with that many file entries and the hard drive ends up fragmented.

So what I like to do is create an ISO image of this folder and mount it as a virtual DVD whenever I actually need to access the contents. This also lets me use separate software to index the contents of the disc image for fast search and article retrieval.

The problem is that ImgBurn is extremely slow at the steps where it generates ISO9660- and UDF-compliant file names. With my last attempt, which involved those 560k files and approximately 12 GB of content, it took probably more than an hour and a half after I clicked the "go" button before ImgBurn actually started writing the image to disk; that's how long it took to generate the compliant UDF file names (I selected only UDF in the settings, otherwise it would probably have taken twice as long).

Note that it also takes the same amount of time when clicking on the Calculator icon to estimate things. I actually cancelled that process, and that's what made me decide to disable ISO9660 and leave just UDF selected.

It seems to me like this process of generating file names is single-threaded, and the software is probably working with very small memory allocations; maybe during the process of generating compliant names, memory keeps getting allocated in small chunks and copied over, or arrays keep being resized, and so on.

I guess it would make sense not to go overboard in that area, considering my case is an exception and most DVDs would only deal with hundreds or thousands of file names. But would it be possible to at least have some sort of "advanced options" switch that changes the behavior from "assume few files, use little memory and do things as they are now" to "go wild and use as much memory and as many CPU cores as you want, as long as you're fast"?

Would it also be possible to detect when the user wants both ISO9660- and UDF-compliant file names and compute both at the same time instead of doing them separately, if that would speed things up?

 

Thank you for taking the time to read my request/suggestion,

Marius

Edited by mariush

I'm sorry but I have to disagree.

I'm a bit of a programmer in my spare time, but obviously, since I don't have the source code of the software, I can't say for sure how ImgBurn works internally; I can only assume.

However, based strictly on how long it takes when it needs to validate all the names, it looks to me like ImgBurn takes one file entry at a time, checks it against all the entries it has already checked (so that the generated UDF file name is unique or whatever), adds the entry to some internal array, and then moves on to the next one. So when it adds the 10th file it runs through the previous 9 entries, when it adds the 100th file it checks the previous 99 entries, and eventually, when it adds the 560,000th entry, it checks all of the previous 559,999 entries. So the more entries there are, the more the total time grows: quadratically, not linearly.

If I look at Resource Monitor, I don't see the disk being accessed while these names are validated (to comply with ISO9660 or UDF), so I doubt it's disk access slowing things down. I also see ImgBurn's memory consumption steadily increasing over time, as if every once in a while the software resizes its array by creating a new one, copying the data into the new location and freeing the old array.

It really looks to me like a poorly optimized scan through previous entries (or just a basic algorithm that works fine with a small number of records), possibly combined with buffers that are too small for the entries (for example, reserving memory for 1,000 file entries at a time instead of automatically adjusting that number based on the total number of files detected so far).
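
To put rough numbers on that (purely speculative, of course; this is just the kind of one-against-all-previous check I suspect is happening, not ImgBurn's actual code):

<?php
// Speculative sketch: if every new name is compared against every name
// accepted so far, the total number of comparisons grows quadratically.
function count_comparisons($n) {
	$comparisons = 0;
	for ($i = 0; $i < $n; $i++) {
		$comparisons += $i;   // the i-th file is checked against the previous i names
	}
	return $comparisons;
}

foreach (array(10000, 100000, 560000) as $n) {
	printf("%7d files -> %15s comparisons\n", $n, number_format(count_comparisons($n)));
}
// 560,000 files work out to roughly 156.8 billion comparisons (n*(n-1)/2),
// so doubling the number of files roughly quadruples the time spent.
?>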

 

Edited by mariush

I'm somewhat of a programmer, too; I majored in computer science.  :wink:   Of course, all we're doing is speculating until LUK chimes in with how, exactly, he approached the code. I've tended to notice that things I've posted as issues/bugs with file navigation, access, etc. are generally unresolvable because they come down to how Microsoft implements them in Visual C++.

 

Of course, I got my degree in 1996, so things have changed considerably in that time.  You see, back in my day, we used to have these things called floppy disks...

Edited by dbminter

It dynamically adds items to an internal list based on the findings of the 'FindFirstFile' API command. There's no way of knowing in advance (that I know of, anyway) how many files/folders it'll be dealing with. It just traverses the tree.

Each file that gets added for a given directory has its name checked for compliance with the file system and a new one is calculated as required. It then has to check if that name is unique and calculate a new one if not.

I don't know how to improve upon that, so don't expect anything to change within the program - it's very unlikely!

Also, how else do you check for uniqueness if not by comparing it against all previous items? Is there something I'm missing?!
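
In rough pseudo-code terms, that flow looks something like this (PHP just for illustration; the compliance rule and rename scheme below are only placeholders, not the real ones):

<?php
// Placeholder sketch of the per-directory flow described above - not actual ImgBurn code.
function make_compliant($name) {
	// placeholder compliance rule: replace characters the target file system rejects
	return preg_replace('/[^A-Za-z0-9._-]/', '_', $name);
}

function add_to_directory(&$dir_entries, $name) {
	$candidate = make_compliant($name);
	$suffix = 0;
	// keep adjusting the name until it's unique within this directory;
	// in_array() is a linear scan over everything added so far
	while (in_array($candidate, $dir_entries)) {
		$suffix++;
		$candidate = make_compliant($name) . '_' . $suffix;
	}
	$dir_entries[] = $candidate;
	return $candidate;
}
?>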

 


Well, a very simple and often-used method of reducing searches through arrays is to segment the array into chunks that must each satisfy some condition.

For example, use the first letter of the file name as a sort of key. All file names that start with a specific letter must be stored in a specific area of the array: reserve up to a certain number of entries for file names starting with that letter, and when that area fills up, reserve another chunk for names starting with the same letter. Keep track of the order in which the files were added in a separate array.

You use a tiny bit more memory (something like 8-16 bytes per entry, depending on the variable types you use), but you can reduce the number of comparisons by a lot.

Here's an example I quickly wrote in PHP, where added file names are stored in buckets and each bucket has room for up to 1024 file names. When a bucket fills up, a new one is created.

When searching for a file name, only the "buckets" that hold file names starting with the same first letter are searched:

 

<?php

// Buckets: each bucket stores file names that share the same first letter,
// up to 1024 entries per bucket. A new bucket is created when the current
// one for that letter is full.
$buckets = array();
$buckets_cnt = -1;

// Records: remembers the order in which file names were added,
// as (bucket id, offset within bucket) pairs.
$records = array();
$records_cnt = -1;

// Create a new, empty bucket for file names starting with $letter.
function bucket_create($letter) {
	global $buckets, $buckets_cnt;
	$buckets_cnt++;
	$buckets[$buckets_cnt] = array('letter' => $letter, 'cnt' => -1, 'max' => 1023, 'entries' => array());
	return $buckets_cnt;
}

// Find a bucket for $letter that still has room, or FALSE if none exists.
function bucket_find($letter) {
	global $buckets, $buckets_cnt;
	if ($buckets_cnt == -1) return FALSE;
	for ($i = $buckets_cnt; $i >= 0; $i--) {
		if ($buckets[$i]['letter'] == $letter) {
			if ($buckets[$i]['cnt'] !== $buckets[$i]['max']) return $i;  // if the bucket is not full, add entries here
		}
	}
	return FALSE;
}

// Add a file name to a bucket matching its first letter, creating a new
// bucket if necessary, and record the insertion order.
function filename_add($filename) {
	global $buckets, $records, $records_cnt;
	$first_letter = substr($filename, 0, 1);
	$bucket_id = bucket_find($first_letter);
	if ($bucket_id === FALSE) {
		$bucket_id = bucket_create($first_letter);
	}
	$i = ++$buckets[$bucket_id]['cnt'];          // next free slot in the bucket
	$buckets[$bucket_id]['entries'][$i] = $filename;
	$records_cnt++;
	$records[$records_cnt] = array('bucket' => $bucket_id, 'cnt' => $i);
}

// Look for a file name; only buckets whose letter matches the first
// letter of $filename have their entries scanned.
function filename_find($filename) {
	global $buckets, $buckets_cnt;
	$first_letter = substr($filename, 0, 1);
	if ($buckets_cnt == -1) return FALSE;
	for ($i = $buckets_cnt; $i != -1; $i--) {
		if ($buckets[$i]['letter'] == $first_letter) {
			if ($buckets[$i]['cnt'] != -1) {
				for ($j = 0; $j <= $buckets[$i]['cnt']; $j++) {
					if ($buckets[$i]['entries'][$j] == $filename) return array('bucket' => $i, 'offset' => $j);
				}
			}
		}
	}
	return FALSE;
}


$ret = filename_add('test.htm');
$ret = filename_add('rabbit.txt');
$ret = filename_add('test2.htm');

var_dump($buckets);
$ret = filename_find('test2.htm');
var_dump($ret);

var_dump($records);

?>

 

And here's the output...

In the buckets array, you can see that two buckets were created: one for the file names starting with "t" and one for the file names starting with "r":

array(2) {
  [0]=>
  array(4) {
    ["letter"]=>
    string(1) "t"
    ["cnt"]=>
    int(1)
    ["max"]=>
    int(1023)
    ["entries"]=>
    array(2) {
      [0]=>
      string(8) "test.htm"
      [1]=>
      string(9) "test2.htm"
    }
  }
  [1]=>
  array(4) {
    ["letter"]=>
    string(1) "r"
    ["cnt"]=>
    int(0)
    ["max"]=>
    int(1023)
    ["entries"]=>
    array(1) {
      [0]=>
      string(10) "rabbit.txt"
    }
  }
}

And this is what the find function returns; it finds the file name in the first bucket (0), at the second entry (id 1, because entries start from 0):

array(2) {
  ["bucket"]=>
  int(0)
  ["offset"]=>
  int(1)
}

and this is what the "records" array holds, the order in which the file names were added .. you can see (bucket: offset) pairs (0:0) , (1:0) , (0:1)

array(3) {
  [0]=>
  array(2) {
    ["bucket"]=>
    int(0)
    ["cnt"]=>
    int(0)
  }
  [1]=>
  array(2) {
    ["bucket"]=>
    int(1)
    ["cnt"]=>
    int(0)
  }
  [2]=>
  array(2) {
    ["bucket"]=>
    int(0)
    ["cnt"]=>
    int(1)
  }
}

You could use one byte for offsets (up to 256 entries per "bucket") and maybe 4 bytes for the bucket id, and then you could track a maximum of 1,099,511,627,776 files in memory (2^32 x 256). So in my particular situation with 560k files there would probably be around 800-1000 buckets (if each is defined as holding at most 1024 files), and when searching for a file you may only have to search within 100 or so buckets instead of going through all of them.
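
Just to sanity check those numbers, here's a quick throwaway snippet:

<?php
// Quick sanity check for the figures above.
$max_tracked = pow(2, 32) * 256;   // 4-byte bucket id x 1-byte offset (256 entries per bucket)
printf("max trackable files: %s\n", number_format($max_tracked));                                // 1,099,511,627,776

$files = 560000;
$per_bucket = 1024;
printf("minimum buckets for %s files: %d\n", number_format($files), ceil($files / $per_bucket)); // 547
// in practice there would be more buckets than that, because they're split by
// first letter and the last bucket for each letter is usually only partly full
?>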

 


Ah, so there was something I was missing ;) I'd never thought to do it like that...I like that idea. :thumbup:

My 'buckets' are currently on a per directory basis and that's fine for 'most' situations. Of course, in a case like yours where you've got *loads* of files in the same directory, this method would make it more efficient.

I appreciate you taking the time to help.

