Well, a very simple and often-used way to reduce searching through arrays is to segment the array into chunks that must satisfy some condition.
For example, use the first letter of the file name as a sort of key: all file names that start with a given letter go into a specific region of the array. Reserve a fixed number of entries for names starting with that letter, and when that area fills up, reserve another chunk for the same letter. Keep track of the order in which the files were added in a separate array.
You use a tiny bit more memory (something like 8-16 bytes per entry, depending on the variable types you use), but you may reduce the number of comparisons by a lot.
Here's an example I quickly wrote in PHP where the added file names are stored in buckets, each with room for up to 1024 file names. When a bucket is filled, a new one is created.
When searching for a file name, only the buckets that hold file names starting with the same letter are searched:
<?php
// Buckets hold the file names, grouped by the first letter of the name;
// the records array remembers the order in which names were added
$buckets = array();
$buckets_cnt = -1;
$records = array();
$records_cnt = -1;

// Create a new bucket for file names starting with $letter
function bucket_create($letter) {
    global $buckets, $buckets_cnt;
    $buckets_cnt++;
    $buckets[$buckets_cnt] = array('letter' => $letter, 'cnt' => -1, 'max' => 1023, 'entries' => array());
    return $buckets_cnt;
}

// Find the most recently created bucket for $letter that still has room
function bucket_find($letter) {
    global $buckets, $buckets_cnt;
    if ($buckets_cnt == -1) return FALSE;
    for ($i = $buckets_cnt; $i >= 0; $i--) {
        if ($buckets[$i]['letter'] == $letter) {
            if ($buckets[$i]['cnt'] !== $buckets[$i]['max']) return $i; // if the bucket is not full, add entries here
        }
    }
    return FALSE;
}

// Add a file name to a bucket matching its first letter (creating one if
// needed) and record the (bucket, offset) pair to preserve insertion order
function filename_add($filename) {
    global $buckets, $records, $records_cnt;
    $first_letter = substr($filename, 0, 1);
    $bucket_id = bucket_find($first_letter);
    if ($bucket_id === FALSE) {
        $bucket_id = bucket_create($first_letter);
    }
    $buckets[$bucket_id]['cnt']++;
    $buckets[$bucket_id]['entries'][$buckets[$bucket_id]['cnt']] = $filename;
    $records_cnt++;
    $records[$records_cnt] = array('bucket' => $bucket_id, 'cnt' => $buckets[$bucket_id]['cnt']);
}

// Search only the buckets whose letter matches the file name's first letter
function filename_find($filename) {
    global $buckets, $buckets_cnt;
    $first_letter = substr($filename, 0, 1);
    if ($buckets_cnt == -1) return FALSE;
    for ($i = $buckets_cnt; $i != -1; $i--) {
        if ($buckets[$i]['letter'] == $first_letter) {
            if ($buckets[$i]['cnt'] != -1) { // skip buckets with no entries yet
                for ($j = 0; $j <= $buckets[$i]['cnt']; $j++) {
                    if ($buckets[$i]['entries'][$j] == $filename) return array('bucket' => $i, 'offset' => $j);
                }
            }
        }
    }
    return FALSE;
}

filename_add('test.htm');
filename_add('rabbit.txt');
filename_add('test2.htm');
var_dump($buckets);
$ret = filename_find('test2.htm');
var_dump($ret);
var_dump($records);
?>
and here's the output...
First the buckets array: you can see two buckets were created, one for file names starting with "t" and one for file names starting with "r":
array(2) {
  [0]=>
  array(4) {
    ["letter"]=>
    string(1) "t"
    ["cnt"]=>
    int(1)
    ["max"]=>
    int(1023)
    ["entries"]=>
    array(2) {
      [0]=>
      string(8) "test.htm"
      [1]=>
      string(9) "test2.htm"
    }
  }
  [1]=>
  array(4) {
    ["letter"]=>
    string(1) "r"
    ["cnt"]=>
    int(0)
    ["max"]=>
    int(1023)
    ["entries"]=>
    array(1) {
      [0]=>
      string(10) "rabbit.txt"
    }
  }
}
and this is what filename_find() returns: it finds the file name in the first bucket (0), at the second entry (offset 1, since entries start from 0):
array(2) {
  ["bucket"]=>
  int(0)
  ["offset"]=>
  int(1)
}
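If you need the actual file name back from those coordinates, you can index straight into the buckets array; a quick usage sketch based on the example above:
// resolve the coordinates returned by filename_find() back to the file name
$ret = filename_find('test2.htm');
if ($ret !== FALSE) {
    echo $buckets[$ret['bucket']]['entries'][$ret['offset']]; // prints "test2.htm"
}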
and this is what the "records" array holds: the order in which the file names were added. You can see the (bucket:offset) pairs (0:0), (1:0), (0:1):
array(3) {
  [0]=>
  array(2) {
    ["bucket"]=>
    int(0)
    ["cnt"]=>
    int(0)
  }
  [1]=>
  array(2) {
    ["bucket"]=>
    int(1)
    ["cnt"]=>
    int(0)
  }
  [2]=>
  array(2) {
    ["bucket"]=>
    int(0)
    ["cnt"]=>
    int(1)
  }
}
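If you ever need the file names back in insertion order, the records array has everything required. Here's a minimal sketch that walks it (list_in_order is just a name I made up, using the same structures as above):
// walk $records to print the file names in the order they were added
function list_in_order() {
    global $buckets, $records, $records_cnt;
    for ($i = 0; $i <= $records_cnt; $i++) {
        $r = $records[$i];
        echo $buckets[$r['bucket']]['entries'][$r['cnt']] . "\n";
    }
}
list_in_order(); // prints test.htm, rabbit.txt, test2.htm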
You could use one byte for offsets (up to 256 entries per "bucket") and maybe 4 bytes for the bucket id, and then you could track a maximum of 1,099,511,627,776 files in memory (2^32 x 256). So in my particular situation with 560k files there would probably be around 800-1000 buckets (if defined as holding at most 1024 files each), and when searching for a file you might only have to look through 100 or so buckets instead of going through all of them.
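If you went that route, PHP's pack()/unpack() could serialize each pair into 5 bytes; here's a rough sketch (pack_record/unpack_record are hypothetical names, assuming an unsigned 32-bit bucket id):
// pack one (bucket id, offset) pair into 5 bytes: 4-byte id + 1-byte offset
function pack_record($bucket_id, $offset) {
    return pack('V', $bucket_id) . pack('C', $offset); // 'V' = unsigned 32-bit LE, 'C' = unsigned byte
}
function unpack_record($bytes) {
    return unpack('Vbucket/Coffset', $bytes); // array('bucket' => ..., 'offset' => ...)
}
$bytes = pack_record(3, 200);    // 5 bytes per tracked file
var_dump(unpack_record($bytes)); // array(2) { ["bucket"]=> int(3) ["offset"]=> int(200) }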