Self-organizing list

A self-organizing list is a list that reorders its elements based on some self-organizing heuristic to improve average access time. The aim of a self-organizing list is to improve the efficiency of linear search by moving more frequently accessed items towards the head of the list.

The self-organizing list is sometimes described as a poor man's hash table (see the Gwydion Dylan Library Reference Guide): by using a probabilistic strategy, it yields nearly constant time in the best case for insert/delete operations, although the worst case remains linear.

There are four ways in which the list can be self-organized:

  1. Ordering: the list is ordered using criteria natural to the information under scrutiny.
  2. Transpose: after the desired element is located, swap it with its predecessor unless it is already at the head of the list.
  3. Move to front: after the desired element is located, move it to the beginning of the list.
  4. Count: order the list by the number of times each element has been accessed.


Basic Implementation of a List

In order to use memory resources efficiently, lists are generally implemented as linked lists, which allow dynamic memory allocation at runtime. A linked list consists of a sequence of nodes, each 'linked' to its successor (for example, in C the 'link' takes the form of a pointer to the succeeding node), with the last node storing a link to NULL. An array is generally considered unsuitable because its size must be known beforehand (at declaration), and inserting elements into an array is an inefficient operation.
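As an illustration, here is a minimal sketch of such a node in C (the names node and push_front are ours for illustration, not from any particular library):

```c
#include <stdlib.h>

/* A singly linked list node: each node stores its record and a
   pointer to the succeeding node; the last node's link is NULL. */
struct node {
    int key;            /* the record stored in this node       */
    struct node *next;  /* link to the succeeding node, or NULL */
};

/* Allocate a new node dynamically at runtime and link it in
   front of the current head; returns the new head of the list. */
struct node *push_front(struct node *head, int key) {
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return head;    /* allocation failed; list is unchanged */
    n->key = key;
    n->next = head;
    return n;
}
```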

Linked List Traversals

As each node is 'linked' only to the node after it, the only way to access a particular node in the list is by first accessing the node before it, and so on back to the head. So, to reach the nth element in the list, we must start at the first node and move to the second, from the second to the third, and so on until the desired element is reached. Compared to an array, the linked list is clearly inefficient at random access (in an array a[], the nth element can be accessed immediately as a[n]). In big O notation, retrieving the nth element of a linked list is an O(n) operation, whereas in an array it is O(1). Even binary search is inefficient on a linked list, because finding the middle element (as the algorithm requires) is exceedingly costly: every element before it must be traversed. A linear traversal through the list is therefore generally used for random access or searches. (Binary search on an array is O(log n).)
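A linear search over such a list might look as follows in C (a sketch reusing the struct node defined above):

```c
/* Linear search: follow the links from the head until the key is
   found or the end of the list (NULL) is reached. This performs
   up to n comparisons, i.e. O(n); there is no O(1) random access. */
struct node *search(struct node *head, int key) {
    struct node *cur;
    for (cur = head; cur != NULL; cur = cur->next)
        if (cur->key == key)
            return cur;
    return NULL;        /* key is not present in the list */
}
```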

Concept of the Self-Organizing List

A self-organizing list modifies the order in which its records are stored, based on their actual or expected access pattern. Its basis can be explained in terms of the 80-20 rule, which states that 20% of the records in a list or data store account for 80% of the accesses, while the remaining 80% of the records account for only 20% (this is not a strict rule but merely a characterization of typical behavior). The self-organizing list aims to keep these commonly accessed elements at the head for quicker access.


Analysis of Running Times for Access/Search in a List

Consider a list (self-organized or randomly arranged) implemented as a linked list containing n elements. The cost of a search is measured by the number of comparisons that must be performed to find the target record (for now we assume that the target record exists in the list).
We consider the worst-case and average-case running times of accessing a particular element by linear search as follows:

Average Case

In the average case, suppose that N search operations are performed on the list, and let freq(i) denote the number of times the ith element was searched for (for example, if the third element was searched for 45 times out of N, then freq(3) = 45). Then

$$\sum_{i=1}^{n} \mathrm{freq}(i) = N$$

Now, the time required to search for (or retrieve) the ith element is i comparisons. Thus, the time spent retrieving only the ith element is equal to

(number of times the ith element was searched for) * (time required to search for the ith element) = freq(i) * i.

Hence, the total time for N searches is given by

$$T_N = \sum_{i=1}^{n} \mathrm{freq}(i) \cdot i$$

The average time per search operation, T(n), is obtained by dividing the above expression by N:

$$T(n) = \frac{T_N}{N} = \sum_{i=1}^{n} \frac{\mathrm{freq}(i)}{N} \cdot i$$

But for the ith element, freq(i)/N is precisely the probability p(i) of accessing the ith element:

$$p(i) = \frac{\mathrm{freq}(i)}{N}$$

Substituting,

$$T(n) = \sum_{i=1}^{n} p(i) \cdot i$$

If the access probability of each element is the same (i.e. p(1) = p(2) = p(3) = ... = p(n) = 1/n), then the ordering of the elements is irrelevant and the average time complexity is given by

$$T(n) = \sum_{i=1}^{n} \frac{i}{n} = \frac{1}{n} \cdot \frac{n(n+1)}{2} = \frac{n+1}{2}$$

and T(n) does not depend on the individual access probabilities of the elements in this case. However, for searches on lists with non-uniform access probabilities (i.e. lists in which the probability of accessing one element differs from another), the average time complexity can be reduced drastically by proper positioning of the elements: pairing small positions i with large access probabilities lowers the overall average. Consider for example the list shown below:
Given List: A(0.1), B(0.1), C(0.3), D(0.1), E(0.4)
Arrangement 1 (random order): A(0.1), B(0.1), C(0.3), D(0.1), E(0.4)
Without rearranging, the average search time required is:

$$T_{\mathrm{avg}} = 1(0.1) + 2(0.1) + 3(0.3) + 4(0.1) + 5(0.4) = 0.1 + 0.2 + 0.9 + 0.4 + 2.0 = 3.6$$

Now suppose the nodes are rearranged so that those with the highest probability of access are placed closest to the front. The rearranged list is now:
Arrangement 2 (elements with highest probability at the front):
E(0.4), C(0.3), D(0.1), A(0.1), B(0.1)
Here, the average search time is:

$$T_{\mathrm{avg}} = 1(0.4) + 2(0.3) + 3(0.1) + 4(0.1) + 5(0.1) = 0.4 + 0.6 + 0.3 + 0.4 + 0.5 = 2.2$$

Thus the average time required for searching in an organized list is (in this case) around 40% less than in a randomly arranged list. This is the central idea of the self-organizing list: the average speed of data retrieval is increased by rearranging the nodes according to their access frequency.
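The same arithmetic can be checked with a short C sketch (expected_cost is a hypothetical helper written for this example) that evaluates the sum of p(i) * i for both arrangements:

```c
#include <stdio.h>

/* Expected number of comparisons for a list whose ith position
   (1-based) holds an element with access probability p[i-1]. */
double expected_cost(const double p[], int n) {
    double t = 0.0;
    for (int i = 0; i < n; i++)
        t += p[i] * (i + 1);    /* p(i) * i, with positions from 1 */
    return t;
}

int main(void) {
    double arrangement1[] = {0.1, 0.1, 0.3, 0.1, 0.4};  /* A B C D E */
    double arrangement2[] = {0.4, 0.3, 0.1, 0.1, 0.1};  /* E C D A B */
    printf("random order:    %.1f comparisons\n", expected_cost(arrangement1, 5));  /* 3.6 */
    printf("organized order: %.1f comparisons\n", expected_cost(arrangement2, 5));  /* 2.2 */
    return 0;
}
```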

Worst Case

In the worst case, the element to be located is at the very end of the list (whether self-organized or randomly arranged), and n comparisons must be made to reach it. Therefore the worst-case running time of a linear search on the list is O(n), independent of the type of list used. Note that the above argument for the average search time is a probabilistic one: keeping the commonly accessed elements at the head of the list merely reduces the probability of the worst case occurring, it does not eliminate it. Even in a self-organizing list, if an element with the lowest access probability (which will lie near the end of the list) is to be accessed, the entire list must be traversed to retrieve it. This is the worst-case search.

Techniques for Rearranging Nodes

1. Move to Front Method

Any node or element requested is moved to the front of the list.
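A possible C implementation on a singly linked list (a sketch reusing the struct node defined earlier; search_mtf is an illustrative name) combines the search with the reorganization:

```c
/* Move-to-front search: if the key is found anywhere but at the
   head, unlink its node and relink it at the front of the list.
   'headp' points to the head pointer so the head can be updated. */
struct node *search_mtf(struct node **headp, int key) {
    struct node *prev = NULL, *cur = *headp;
    while (cur != NULL && cur->key != key) {
        prev = cur;
        cur = cur->next;
    }
    if (cur != NULL && prev != NULL) {  /* found, not already at head */
        prev->next = cur->next;         /* unlink the node            */
        cur->next = *headp;             /* relink it before the head  */
        *headp = cur;
    }
    return cur;                         /* NULL if the key was absent */
}
```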

Pros:

1. This method is easily implemented and does not require any extra memory or storage (say, for counter variables).
2. This method adapts readily to quickly changing access patterns: even if the priorities of the nodes change dynamically at runtime, the list reorganizes itself very quickly in response.

Cons:

1. This method may prioritize infrequently accessed nodes: if an uncommon node is accessed even once, it is moved to the head of the list and given maximum priority, even if it will not be accessed frequently in the future. These 'over-rewarded' nodes clog up the list and lead to slower access times for commonly accessed elements.
2. This method, though easily adaptable to changing access patterns, may change too readily. In effect, the technique has a very short memory of access patterns: even an optimal arrangement of the list can be disturbed immediately by a single access to an infrequent node.

2. Count Method

Each node counts the number of times it was searched for, i.e. every node keeps a separate counter variable that is incremented every time it is accessed. The nodes are then rearranged in order of decreasing count.
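One way to sketch this in C (illustrative names; a separate cnode type carries the counter) is to increment the node's counter on every successful search and then re-insert the node at the position dictated by its new count:

```c
/* For the count method, each node carries its own access counter. */
struct cnode {
    int key;
    long count;           /* number of times this node was accessed */
    struct cnode *next;
};

/* Search for 'key', increment its counter on a hit, and move the
   node forward so the list stays sorted by decreasing count. */
struct cnode *search_count(struct cnode **headp, int key) {
    struct cnode *prev = NULL, *cur = *headp;
    while (cur != NULL && cur->key != key) {
        prev = cur;
        cur = cur->next;
    }
    if (cur == NULL)
        return NULL;                      /* key not present */
    cur->count++;
    if (prev != NULL && prev->count < cur->count) {
        prev->next = cur->next;           /* unlink the node */
        /* re-insert before the first node with a smaller count */
        struct cnode **p = headp;
        while (*p != NULL && (*p)->count >= cur->count)
            p = &(*p)->next;
        cur->next = *p;
        *p = cur;
    }
    return cur;
}
```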

Pros:

1. Reflects the actual access pattern more realistically.

Cons:

1. A counter must be stored and maintained for each node, increasing the amount of memory required.
2. It does not adapt quickly to rapid changes in the access pattern. For example, suppose the count of the head element A is 100 and the count of some later node B is 40. Even if B becomes the new most commonly accessed element, it must still be accessed at least 100 - 40 = 60 more times before it can become the head element.

3. Transpose Method

Any node requested is swapped with its immediate predecessor, unless it is already at the head of the list, as in the sketch below.
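A C sketch of this (again reusing struct node; search_transpose is an illustrative name) needs to track two predecessors so the swap can be done by relinking:

```c
/* Transpose search: if the key is found anywhere but at the head,
   swap its node with the immediate predecessor by relinking. */
struct node *search_transpose(struct node **headp, int key) {
    struct node *pprev = NULL;           /* predecessor of prev */
    struct node *prev = NULL, *cur = *headp;
    while (cur != NULL && cur->key != key) {
        pprev = prev;
        prev = cur;
        cur = cur->next;
    }
    if (cur != NULL && prev != NULL) {   /* found, not at the head */
        prev->next = cur->next;          /* unlink cur             */
        cur->next = prev;                /* place cur before prev  */
        if (pprev != NULL)
            pprev->next = cur;           /* fix the earlier link   */
        else
            *headp = cur;                /* cur is the new head    */
    }
    return cur;
}
```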

Pros:

1. Easy to implement and requires little memory.
2. More likely to keep frequently accessed nodes at the front.

Cons:

1. More cautious than move to front: many accesses are required to move an element all the way to the head of the list.

References

  • NIST DADS entry
  • A. Drozdek, Data Structures and Algorithms in Java, 3rd edition