Sorting algorithms are used at different stages of data processing. In many situations, the efficiency of the sorting algorithm determines the throughput and execution speed of the application. Methods for implementing high speed sorting in hardware are often based on Batcher's Odd/Even sort or Bitonic sort algorithms. These algorithms are computation intensive, require a large number of logic gates to implement, and consume considerable power. The higher the number of logic gates, the more silicon area may be required, which may lead to higher cost. Insertion sort is a relatively simpler sorting algorithm and may require fewer logic gates to implement. However, the throughput achieved using the Insertion sort algorithm is much lower than the throughput achieved using high speed sorting algorithms. A method and apparatus enable an efficient hardware design capable of simultaneously sorting multiple data inputs for high throughput at reduced complexity.
|
9. An apparatus for sorting n data elements into m most significant data elements in sorted order, wherein M<n, the apparatus comprising:
circuitry configured to control:
inputting L data elements of the n data elements at a same time into at least one sorting unit of a cascade of s sorting units of the circuitry, in which the L input data elements are sorted by a first comparator circuit of the circuitry only among the L input data elements before the inputting, in which the sorting units are arranged in the cascade in order of priority, in which each of the sorting units includes b registers for storing m/s data elements in sorted order and in order in relation to data elements respectively in the b registers of a neighbor sorting unit in the cascade, in which the s sorting units store a current set of most significant data elements of data elements previously input thus far, and in which s*B=M, B≥L and L≥1;
for each sorting unit of the cascade, sorting (i) each comparison data element determined to be inserted in the sorting based on a comparison by a second comparison circuit of the circuitry of each of the L input data elements with a most significant value of the b registers of the sorting unit and (ii) by a third comparison circuit of the circuitry the data elements respectively of the b registers of the sorting unit, to obtain a sorted array of data elements in order of significance; and
at a given sorting unit of the cascade,
storing, into the b registers in sorted order, data elements determined from
(i) when no shift data elements is output from a preceding neighbor sorting unit in the cascade, the b most significant values of the sorted array of data elements for the given sorting unit determined by a fourth comparison circuit of the circuitry, and
(ii) when SH shift data elements is output from the preceding neighbor sorting unit in the cascade, the SH shift data elements and a subset of the data elements of the sorted array determined by a fifth comparison circuit of the circuitry in accordance with the value of SH, in which 1≤SH≤B, and
outputting, as SHN shift-next data elements in order, SHN data elements from the sorted array of data elements more or less significant than the most or least significant data element of the sorted array stored in the b registers by the storing, in which 1≤SHN≤B.
1. A method for sorting, by circuitry, n data elements into m most significant data elements in sorted order, wherein M<n, the method comprising:
controlling, by a processing device, inputting L data elements of the n data elements at a same time into at least one sorting unit of a cascade of s sorting units, in which the L input data elements are sorted by a first comparator circuit of the circuitry only among the L input data elements before the inputting, in which the sorting units are arranged in the cascade in order of priority, in which each of the sorting units includes b registers for storing m/s data elements in sorted order and in order in relation to data elements respectively in the b registers of a neighbor sorting unit in the cascade, in which the s sorting units store a current set of most significant data elements of data elements previously input thus far, and in which s*B=M, B≥L and L≥1;
controlling, by the processing device, for each sorting unit of the cascade, sorting (i) each comparison data element determined to be inserted in the sorting based on a comparison by a second comparison circuit of the circuitry of each of the L input data elements with a most significant value of the b registers of the sorting unit and (ii) by a third comparison circuit of the circuitry the data elements respectively of the b registers of the sorting unit, to obtain a sorted array of data elements in order of significance; and
controlling, by the processing device, at a given sorting unit of the cascade,
storing, into the b registers in sorted order, data elements determined from
(i) when no shift data elements is output from a preceding neighbor sorting unit in the cascade, the b most significant values of the sorted array of data elements for the given sorting unit determined by a fourth comparison circuit of the circuitry, and
(ii) when SH shift data elements is output from the preceding neighbor sorting unit in the cascade, the SH shift data elements and a subset of the data elements of the sorted array determined by a fifth comparison circuit of the circuitry in accordance with the value of SH, in which 1≤SH≤B, and
outputting, as SHN shift-next data elements in order, SHN data elements from the sorted array of data elements more or less significant than the most or least significant data element of the sorted array stored in the b registers by the storing, in which 1≤SHN≤B.
17. A device comprising:
a processing device to receive data elements,
wherein the processing device includes circuitry configured to sort n data elements which are received into m most significant data elements in sorted order, wherein M<n, by controlling:
inputting L data elements of the n data elements at a same time into at least one sorting unit of a cascade of s sorting units of the processing device, in which the L input data elements are sorted by a first comparator circuit of the circuitry only among the L input data elements before the inputting, in which the sorting units are arranged in the cascade in order of priority, in which each of the sorting units includes b registers for storing m/s data elements in sorted order and in order in relation to data elements respectively in the b registers of a neighbor sorting unit in the cascade, in which the s sorting units store a current set of most significant data elements of data elements previously input thus far, and in which s*B=M, B≥L and L≥1;
for each sorting unit of the cascade, sorting (i) each comparison data element determined to be inserted in the sorting based on a comparison by a second comparison circuit of the circuitry of each of the L input data elements with a most significant value of the b registers of the sorting unit and (ii) by a third comparison circuit of the circuitry the data elements respectively of the b registers of the sorting unit, to obtain a sorted array of data elements in order of significance; and
at a given sorting unit of the cascade,
storing, into the b registers in sorted order, data elements determined from
(i) when no shift data elements is output from a preceding neighbor sorting unit in the cascade, the b most significant values of the sorted array of data elements for the given sorting unit determined by a fourth comparison circuit of the circuitry, and
(ii) when SH shift data elements is output from the preceding neighbor sorting unit in the cascade, the SH shift data elements and a subset of the data elements of the sorted array determined by a fifth comparison circuit of the circuitry in accordance with the value of SH, in which 1≤SH≤B, and
outputting, as SHN shift-next data elements in order, SHN data elements from the sorted array of data elements more or less significant than the most or least significant data element of the sorted array stored in the b registers by the storing, in which 1≤SHN≤B.
2. The method of
wherein, when the SH shift data elements is output from the preceding neighbor sorting unit in the cascade,
the SH shift data elements is stored into the b registers of the given sorting unit in sequence starting from a least or most significant register of the b registers, and
b-SH data elements from the sorted array, starting from the SH+1 data element of the sorted array, is stored in the b registers in sequence starting from the register of the b registers neighboring the register of the b registers in which the least or most significant of the shift data elements is stored.
3. The method of
wherein each of the s sorting units includes a load-shift control block and a value selector block as a part of the processing device, and is associated with an Internal Parallel Sorter (IPS) which is a part of the processing device,
wherein, in each of the s sorting units,
the load-shift control block controls the comparison of each of the L data elements with the most significant value of the b registers,
when at least one comparison data element is determined from the comparison, the IPS generates the sorted array including each comparison data element determined from the comparison and the data elements of the b registers in the order of significance, and
the value selector block selects b data elements for storing into the b registers in sorted order based on a number of data elements input into one or more sorting units of the cascade having one of higher and lower priority.
4. The method of
5. The method of
6. The method of
7. The method of
8. The method of
10. The apparatus of
wherein, when the SH shift data elements is output from the preceding neighbor sorting unit in the cascade,
the SH shift data elements is stored into the b registers of the given sorting unit in sequence starting from a least or most significant register of the b registers, and
b-SH data elements from the sorted array, starting from the SH+1 data element of the sorted array, is stored in the b registers in sequence starting from the register of the b registers neighboring the register of the b registers in which the least or most significant of the shift data elements is stored.
11. The apparatus of
wherein each of the s sorting units includes a load-shift control block and a value selector block, and is associated with an Internal Parallel Sorter (IPS) which is a part of the circuitry,
wherein, in each of the s sorting units,
the load-shift control block controls the comparison of each of the L data elements with the most significant value of the b registers,
when at least one comparison data element is determined from the comparison, the IPS generates the sorted array including each comparison data element determined from the comparison and the data elements of the b registers in the order of significance, and
the value selector block selects b data elements for storing into the b registers in sorted order based on a number of data elements input into one or more sorting units of the cascade having one of higher and lower priority.
12. The apparatus of
13. The apparatus of
14. The apparatus of
15. The apparatus of
16. The apparatus of
18. The device of
wherein, when the SH shift data elements is output from the preceding neighbor sorting unit in the cascade,
the SH shift data elements is stored into the b registers of the given sorting unit in sequence starting from a least or most significant register of the b registers, and
b-SH data elements from the sorted array, starting from the SH+1 data element of the sorted array, is stored in the b registers in sequence starting from the register of the b registers neighboring the register of the b registers in which the least or most significant of the shift data elements is stored.
19. The device of
wherein each of the s sorting units includes a load-shift control block and a value selector block, and is associated with an Internal Parallel Sorter (IPS) which is a part of the processing device,
wherein, in each of the s sorting units,
the load-shift control block controls the comparison of each of the L data elements with the most significant value of the b registers,
when at least one comparison data element is determined from the comparison, the IPS generates the sorted array including each comparison data element determined from the comparison and the data elements of the b registers in the order of significance, and
the value selector block selects b data elements for storing into the b registers in sorted order based on a number of data elements input into one or more sorting units of the cascade having one of higher and lower priority.
20. The device of
wherein, in the given sorting unit, when the SH shift data elements is output from the preceding neighboring sorting unit, the subset of the data elements of the sorted array does not include the SH data elements of the sorted array starting in sequence from a first or last data element of the sorted array.
|
Sorting algorithms may be used at different stages in many data processing systems. In many applications, the efficiency of the sorting algorithm used determines the throughput and the execution speed of the data processing systems. Methods and algorithms for implementing high speed sorting in hardware are often based on Batcher's Odd/Even sort algorithm or Bitonic sort algorithm as described in “Sorting Networks and their Applications,” K. E. Batcher, Proceedings of AFIPS Spring Joint Computing Conference, Vol. 32, 307-314, 1968.
Some sorting algorithms such as Quicksort and Heapsort that are efficient for software implementation are not suitable for hardware implementation because they have high algorithmic complexity and the execution may be limited to a single comparison operation at a time. Simpler sorting algorithms, which utilize the parallelism available in hardware implementation, perform better than these complex algorithms in hardware implementations.
Batcher's Odd/Even sort algorithm is based on Merge sort and is data independent, i.e., the same comparisons are performed regardless of the actual data. Merge sort normally proceeds by sorting the two halves of the input and then merging the two sorted halves. For sorting N elements, Batcher's algorithm has a complexity of the order of N×(log N)² and a latency of the order of (log N)² because of the logic depth. Logic depth in a digital circuit is the maximum number of basic gates (AND, OR, INV, etc.) a signal needs to travel through from a source flip-flop to a destination flip-flop.
There are other sorting algorithms based on Merge sort, such as the Bitonic sorting and Shell sorting algorithms, that have a similar complexity of N×(log N)² for sorting N elements. However, Batcher's Odd/Even merge sorting algorithm requires the fewest comparators when compared to the Bitonic sorting algorithm and the Shell sorting algorithm.
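As a non-limiting illustration of the data-independent property described above, the following Python sketch generates the fixed comparator list of an odd-even merge sorting network and applies it to an input. The function names `batcher_network` and `apply_network` are hypothetical and chosen only for this example, and the iterative index formulation is one commonly published form of the network, not necessarily the form used in any particular hardware design.

```python
def batcher_network(n):
    """Return the comparator list (lo, hi) of an odd-even merge sorting
    network for n inputs. The list depends only on n, never on the data."""
    comparators = []
    p = 1
    while p < n:                      # p = size of the sorted runs being merged
        k = p
        while k >= 1:                 # k = distance between compared wires
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # only compare wires that belong to the same 2*p block
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        comparators.append((i + j, i + j + k))
            k //= 2
        p *= 2
    return comparators


def apply_network(values, comparators):
    """Run every compare-and-swap of the network, in order, on a copy."""
    out = list(values)
    for lo, hi in comparators:
        if out[lo] > out[hi]:
            out[lo], out[hi] = out[hi], out[lo]
    return out
```

For example, `apply_network([5, 1, 7, 3, 8, 2, 6, 4], batcher_network(8))` runs the same 19 compare-and-swap operations that would be run for any other eight inputs, which is what allows the comparators to be laid out as fixed hardware.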
The complexity of Batcher's Odd/Even sorting algorithm increases rapidly with the number of elements to be sorted. For large values of N, an excessive number of parallel comparisons may have to be performed. One method to overcome this drawback is to group the N values into disjoint sets of fewer elements and use resource-sharing techniques to reduce the complexity at the cost of reduced throughput. To operate at a higher clock frequency, a pipelining technique may be used to reduce the critical path delay due to the logic depth. Registering intermediate results at each stage introduces latency. This method produces high throughput only when an independent set of N elements is sorted at each iteration. However, pipelining may not be suitable for sorting progressive N inputs because the result of each iteration has to be merged with the previously sorted results. The pipelining delay may therefore have a direct impact on the throughput.
The Insertion sorting method uses cascaded sorting units. A sorting unit comprises basic compare and swap units organized in such a way that input data is sorted as it streams through the pipeline. A single such sorting unit is shown in
The structure is easily scalable and requires minimal control circuitry to control the data movement. For example, to select M most significant elements out of N elements, M basic Insertion sort units are cascaded as shown in
The above Insertion sort method is capable of selecting the M most significant elements from the incoming elements. The total number of elements N may be finite, or the input elements may arrive continuously in a streaming manner. The method continuously selects the M most significant elements from all the input elements at any given time and is therefore referred to as a streaming sorter. However, the above architecture is capable of inserting only one element at a time. This method takes N clock cycles to sort N input elements.
Each insertion operation involves a comparison of Rin with the element present in each sorting unit, i.e., M comparisons. Note that as the insertion process progresses, each element is inserted into an array that is already partially sorted. Hence, most of the comparison operations performed are redundant. After N elements have been inserted, a total of N*M comparisons may have been performed.
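By way of illustration only, the behavior of the conventional streaming Insertion sorter described above may be modeled in software as follows. The sketch assumes that the smallest values are the most significant and that empty registers initially hold a maximum sentinel value; the name `insertion_stream_sorter` is hypothetical. Each loop iteration stands for one clock cycle in which all M units compare the input Rin with their stored element.

```python
import math


def insertion_stream_sorter(stream, m):
    """Keep the m smallest values seen so far, inserting one value per cycle.

    Returns the final register contents and the number of comparisons."""
    regs = [math.inf] * m          # assumption: empty registers hold "maximum"
    comparisons = 0
    for r_in in stream:            # one element enters per clock cycle
        # every unit compares r_in with its own register: m comparisons
        flags = [r_in < r for r in regs]
        comparisons += m
        if any(flags):
            pos = flags.index(True)                    # first larger register
            # insert at pos, shift the rest by one, drop the last register
            regs = regs[:pos] + [r_in] + regs[pos:-1]
    return regs, comparisons
```

Under these assumptions, inserting N elements always costs N*M comparisons regardless of the data, which is the redundancy noted above.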
Selecting M most significant elements out of N elements is a common problem faced in many data processing systems. In a case where N is a small quantity, Batcher's Odd/Even sorting algorithm may be used to obtain the desired performance. For large values of N, Insertion sort logic shown in
In accordance with an aspect of the present invention, a method may sort N data elements into M most significant data elements in sorted order, wherein M<N. The method may include: controlling, by a processing device, inputting L data elements of the N data elements at a same time into at least one sorting unit of a cascade of S sorting units, in which the L input data elements are sorted only among the L input data elements before the inputting, in which the sorting units are arranged in the cascade in order of priority, in which each of the sorting units includes B registers for storing M/S data elements in sorted order and in order in relation to data elements respectively in the B registers of a neighbor sorting unit in the cascade, in which the S sorting units store a current set of most significant data elements of data elements previously input thus far, and in which S*B=M, B≥L and L≥1; controlling, by the processing device, for each sorting unit of the cascade, sorting (i) each comparison data element determined to be inserted in the sorting based on a comparison of each of the L input data elements with a most significant value of the B registers of the sorting unit and (ii) the data elements respectively of the B registers of the sorting unit, to obtain a sorted array of data elements in order of significance; and controlling, by the processing device, at a given sorting unit of the cascade, storing, into the B registers in sorted order, data elements determined from (i) when no shift data elements is output from a preceding neighbor sorting unit in the cascade, the B most significant values of the sorted array of data elements for the given sorting unit, and (ii) when SH shift data elements is output from the preceding neighbor sorting unit in the cascade, the SH shift data elements and a subset of the data elements of the sorted array in accordance with the value of SH, in which 1≤SH≤B, and outputting, as SHN shift-next data elements in order, SHN data elements from the sorted array of data elements more or less significant than the most or least significant data element of the sorted array stored in the B registers by the storing, in which 1≤SHN≤B.
In one alternative, when the SH shift data elements is output from the preceding neighbor sorting unit in the cascade, the SH shift data elements may be stored into the B registers of the given sorting unit in sequence starting from a least or most significant register of the B registers, and B-SH data elements from the sorted array, starting from the SH+1 data element of the sorted array, may be stored in the B registers in sequence starting from the register of the B registers neighboring the register of the B registers in which the least or most significant of the shift data elements is stored.
In one alternative, each of the S sorting units may include a load-shift control block and a value selector block as a part of the processing device, and be associated with an Internal Parallel Sorter (IPS) which is a part of the processing device, wherein, in each of the S sorting units, the load-shift control block controls the comparison of each of the L data elements with the most significant value of the B registers, when at least one comparison data element is determined from the comparison, the IPS generates the sorted array including each comparison data element determined from the comparison and the data elements of the B registers in the order of significance, and the value selector block selects B data elements for storing into the B registers in sorted order based on a number of data elements input into one or more sorting units of the cascade having one of higher and lower priority.
In one alternative, in the given sorting unit, the sorted array may be generated by the IPS using Batcher's Odd/Even sorting algorithm when L is less than a predetermined value.
In one alternative, in the given sorting unit when L is less than B, an input of the IPS that is unused may be set to a maximum value to be discarded when the value selector block is selecting B data elements for storing into the B registers.
In one alternative, in each of the S sorting units, each comparison data element determined based on the comparison may be less than a maximum value or greater than a minimum value of the B registers for the sorting unit.
In one alternative, in the given sorting unit, when the SH shift data elements is output from the preceding neighboring sorting unit, the subset of the data elements of the sorted array may not include the SH data elements of the sorted array starting in sequence from a first or last data element of the sorted array.
In one alternative, the inputting of the L input data elements may be into each sorting unit of the cascade at the same time.
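As a non-limiting sketch of the alternative in which an unused input of the IPS is set to a maximum value, the following Python function pads a fixed-width sorter input so that the padding is discarded when B data elements are selected. The sketch assumes that the smallest values are the most significant, that the IPS has 2*B inputs (B register values plus up to B candidates), and that `math.inf` stands in for the maximum value; the name `ips_select` is hypothetical.

```python
import math


def ips_select(registers, candidates, b):
    """Sort a fixed-width bus of 2*b inputs and keep the b smallest values.

    Returns (kept values, remaining values); padding ends up in the latter."""
    assert len(registers) == b and len(candidates) <= b
    pad = [math.inf] * (b - len(candidates))   # unused inputs tied to "maximum"
    bus = list(registers) + list(candidates) + pad   # always exactly 2*b inputs
    ordered = sorted(bus)                      # stands in for the parallel sorter
    return ordered[:b], ordered[b:]
```

Because the padding values sort to the end of the array, selecting the first B values never picks up a padded input as long as at least B real values are present.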
In accordance with an aspect of the present invention, an apparatus may sort N data elements into M most significant data elements in sorted order, wherein M<N. The apparatus may include circuitry configured to control: inputting L data elements of the N data elements at a same time into at least one sorting unit of a cascade of S sorting units of the circuitry, in which the L input data elements are sorted only among the L input data elements before the inputting, in which the sorting units are arranged in the cascade in order of priority, in which each of the sorting units includes B registers for storing M/S data elements in sorted order and in order in relation to data elements respectively in the B registers of a neighbor sorting unit in the cascade, in which the S sorting units store a current set of most significant data elements of data elements previously input thus far, and in which S*B=M, B≥L and L≥1; for each sorting unit of the cascade, sorting (i) each comparison data element determined to be inserted in the sorting based on a comparison of each of the L input data elements with a most significant value of the B registers of the sorting unit and (ii) the data elements respectively of the B registers of the sorting unit, to obtain a sorted array of data elements in order of significance; and at a given sorting unit of the cascade, storing, into the B registers in sorted order, data elements determined from (i) when no shift data elements is output from a preceding neighbor sorting unit in the cascade, the B most significant values of the sorted array of data elements for the given sorting unit, and (ii) when SH shift data elements is output from the preceding neighbor sorting unit in the cascade, the SH shift data elements and a subset of the data elements of the sorted array in accordance with the value of SH, in which 1≤SH≤B, and outputting, as SHN shift-next data elements in order, SHN data elements from the sorted array of data elements more or less significant than the most or least significant data element of the sorted array stored in the B registers by the storing, in which 1≤SHN≤B.
In one alternative of the apparatus, when the SH shift data elements is output from the preceding neighbor sorting unit in the cascade, the SH shift data elements may be stored into the B registers of the given sorting unit in sequence starting from a least or most significant register of the B registers, and B-SH data elements from the sorted array, starting from the SH+1 data element of the sorted array, may be stored in the B registers in sequence starting from the register of the B registers neighboring the register of the B registers in which the least or most significant of the shift data elements is stored.
In one alternative of the apparatus, each of the S sorting units may include a load-shift control block and a value selector block, and be associated with an Internal Parallel Sorter (IPS) which is a part of the circuitry, wherein, in each of the S sorting units, the load-shift control block controls the comparison of each of the L data elements with the most significant value of the B registers, when at least one comparison data element is determined from the comparison, the IPS generates the sorted array including each comparison data element determined from the comparison and the data elements of the B registers in the order of significance, and the value selector block selects B data elements for storing into the B registers in sorted order based on a number of data elements input into one or more sorting units of the cascade having one of higher and lower priority.
In one alternative of the apparatus, in the given sorting unit, the sorted array may be generated by the IPS using Batcher's Odd/Even sorting algorithm when L is less than a predetermined value.
In one alternative of the apparatus, in the given sorting unit when L is less than B, an input of the IPS that is unused may be set to a maximum value to be discarded when the value selector block is selecting B data elements for storing into the B registers.
In one alternative of the apparatus, in each of the S sorting units, each comparison data element determined based on the comparison may be less than a maximum value or greater than a minimum value of the B registers for the sorting unit.
In one alternative of the apparatus, in the given sorting unit, when the SH shift data elements is output from the preceding neighboring sorting unit, the subset of the data elements of the sorted array may not include the SH data elements of the sorted array starting in sequence from a first or last data element of the sorted array.
In one alternative of the apparatus, the inputting of the L input data elements may be into each sorting unit of the cascade at the same time.
In accordance with an aspect of the present invention, a device may include a processing device to receive data elements. The processing device may be configured to sort N data elements which are received into M most significant data elements in sorted order, wherein M<N, by controlling: inputting L data elements of the N data elements at a same time into at least one sorting unit of a cascade of S sorting units of the processing device, in which the L input data elements are sorted only among the L input data elements before the inputting, in which the sorting units are arranged in the cascade in order of priority, in which each of the sorting units includes B registers for storing M/S data elements in sorted order and in order in relation to data elements respectively in the B registers of a neighbor sorting unit in the cascade, in which the S sorting units store a current set of most significant data elements of data elements previously input thus far, and in which S*B=M, B≥L and L≥1; for each sorting unit of the cascade, sorting (i) each comparison data element determined to be inserted in the sorting based on a comparison of each of the L input data elements with a most significant value of the B registers of the sorting unit and (ii) the data elements respectively of the B registers of the sorting unit, to obtain a sorted array of data elements in order of significance; and at a given sorting unit of the cascade, storing, into the B registers in sorted order, data elements determined from (i) when no shift data elements is output from a preceding neighbor sorting unit in the cascade, the B most significant values of the sorted array of data elements for the given sorting unit, and (ii) when SH shift data elements is output from the preceding neighbor sorting unit in the cascade, the SH shift data elements and a subset of the data elements of the sorted array in accordance with the value of SH, in which 1≤SH≤B, and outputting, as SHN shift-next data elements in order, SHN data elements from the sorted array of data elements more or less significant than the most or least significant data element of the sorted array stored in the B registers by the storing, in which 1≤SHN≤B.
In one alternative of the device, when the SH shift data elements is output from the preceding neighbor sorting unit in the cascade, the SH shift data elements may be stored into the B registers of the given sorting unit in sequence starting from a least or most significant register of the B registers, and B-SH data elements from the sorted array, starting from the SH+1 data element of the sorted array, may be stored in the B registers in sequence starting from the register of the B registers neighboring the register of the B registers in which the least or most significant of the shift data elements is stored.
In one alternative of the device, each of the S sorting units may include a load-shift control block and a value selector block, and be associated with an Internal Parallel Sorter (IPS) which is a part of the processing device, wherein, in each of the S sorting units, the load-shift control block controls the comparison of each of the L data elements with the most significant value of the B registers, when at least one comparison data element is determined from the comparison, the IPS generates the sorted array including each comparison data element determined from the comparison and the data elements of the B registers in the order of significance, and the value selector block selects B data elements for storing into the B registers in sorted order based on a number of data elements input into one or more sorting units of the cascade having one of higher and lower priority.
In one alternative of the device, in the given sorting unit, when the SH shift data elements is output from the preceding neighboring sorting unit, the subset of the data elements of the sorted array may not include the SH data elements of the sorted array starting in sequence from a first or last data element of the sorted array.
The foregoing aspects, features and advantages of the present invention will be further appreciated when considered with reference to the following description of exemplary embodiments and accompanying drawings, wherein like reference numerals represent like elements. In describing the exemplary embodiments of the invention illustrated in the appended drawings, specific terminology will be used for the sake of clarity. However, the aspects of the invention are not intended to be limited to the specific terms used.
In many applications, it is required to extract the M smallest or largest elements from a set of N elements. The total number of elements N in a set may be infinite in theory or very large in practice. The number of smallest or largest elements M may be generally much smaller.
The elements to be sorted may be available all at once as a block or may become available one at a time in a serial manner. Other intermediate scenarios where small sets of the elements to be sorted become available at once are also possible. For the description of the present invention, let the number of elements that are input at a time to the sorting apparatus be denoted by L.
According to an aspect of the present invention, a hybrid streaming sorter 400 based on the combination of several small parallel sorting units 410 and several insertion sort units 420 is disclosed. The block diagram of the high speed streaming sorter according to the aspects of the present invention is shown in
According to an aspect of the present invention, the L input elements may be sorted amongst themselves in a parallel sorting block 402 before inserting them into the current set of M most significant elements as shown in
According to another aspect of the present invention, the M elements are grouped into disjoint segments such that each segment contains B elements. The value B may be chosen in such a way that B≥L and M is an integral multiple of B. Grouping of M elements into disjoint segments is shown in
According to an aspect of the innovation, the hybrid streaming sorter may only require S*L comparisons to be performed per L input values, whereas in a conventional insertion sorting method a total of M*L comparisons may be performed per L input values over L clock cycles. The total number of comparisons performed may be reduced by a factor of B. The choice of B may be flexible, and based on the application the HSU architecture may be modified accordingly.
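The comparison counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name and the example values M=64, B=8, L=4 are illustrative assumptions and do not come from the description.

```python
def comparisons_per_batch(m: int, b: int, l: int) -> dict:
    """Per-batch comparison counts, using the S*L and M*L figures from the text."""
    if m % b != 0:
        raise ValueError("M must be an integral multiple of B")
    if not (b >= l >= 1):
        raise ValueError("B >= L >= 1 is required")
    s = m // b  # number of segments, one per sorting unit
    hybrid = s * l       # one comparison per input per sorting unit
    insertion = m * l    # one comparison per input per stored element
    return {"S": s, "hybrid": hybrid, "insertion": insertion,
            "reduction": insertion // hybrid}  # equals B


print(comparisons_per_batch(m=64, b=8, l=4))
```

For these example values the ratio of the two counts is 8, which is B, in line with the stated reduction factor.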
According to an aspect of the present invention, the Load-Shift Control block 602 compares the L input elements with the maximum value out of the existing B elements in the unit. Based on the resultant comparison metric, Load-Shift Control block 602 determines the following:
According to an aspect of the present invention, the Data Insertion into an HSU occurs based on priority. Data Insertion in HSU(1) gets higher priority than HSU(2), which in turn gets higher priority than HSU(3) and so on until all the HSUs are considered. Hence, the number of elements to be inserted into HSU(k) depends on the number of L input elements inserted in HSU(1) to HSU(k−1) and the comparison metric of HSU(k).
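The priority rule can be sketched in software as follows. This is only an illustration under stated assumptions: the smallest values are the most significant ones, the comparison metric of an HSU is taken to be the number of inputs strictly smaller than its current maximum, and the function name `allocate_inputs` is hypothetical.

```python
from bisect import bisect_left


def allocate_inputs(sorted_inputs, hsu_maxima):
    """Return how many of the L inputs are inserted into each HSU.

    sorted_inputs: the new elements in ascending order.
    hsu_maxima: largest stored value of HSU(1), HSU(2), ... in cascade
                order (non-decreasing, since the cascade as a whole is sorted).
    """
    counts = []
    taken = 0  # inputs already claimed by higher-priority HSUs
    for hsu_max in hsu_maxima:
        # Assumed comparison metric: inputs strictly smaller than this maximum.
        metric = bisect_left(sorted_inputs, hsu_max)
        inserted = max(0, metric - taken)
        counts.append(inserted)
        taken += inserted
    return counts


print(allocate_inputs([0, 5, 9, 20], [8, 16, 24]))
```

Inputs that are not smaller than the maximum of any HSU are claimed by none of them, which corresponds to those inputs being discarded.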
According to an aspect of the present invention, if no Data Insertion takes place inside an HSU, the existing B elements in that HSU remain in sorted order. According to an aspect of the present invention, in case Data Insertion is performed in an HSU, the elements present in all the following HSUs in a pipeline may be shifted in such a way that elements with maximum values are discarded. According to an aspect of the present invention, each of the HSUs has B elements in sorted order as the HSU gets new elements from a partially sorted array except for the elements present in an HSU where Data Insertion takes place.
As an example, suppose two elements are inserted in HSU(k−1). In this case, two maximum valued elements out of the total (B+2) elements are moved from HSU(k−1) to the HSU(k). Now in HSU(k), two maximum valued elements out of (B+2) are shifted to HSU(k+1) and so on. As seen in this example, the IPS may be used to sort the elements of HSU(k−1) to find two maximum elements to be moved to HSU(k). However in HSU(k) and HSU(k+1), new data elements are inserted which were part of partially sorted array. Hence, the B elements of HSU(k) and HSU(k+1) are still in sorted order.
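The shifting of the largest elements from one HSU to the next can be modeled in software. The sketch below is a sequential behavioral model, not the hardware data path: it merges in every HSU on every step, whereas the description merges only where Data Insertion takes place. It assumes the M smallest values are wanted and uses `math.inf` to mark empty registers; all names are illustrative.

```python
import heapq
import math


def hybrid_stream_smallest(stream, m, b, l):
    """Keep the M smallest values of a list, consuming L inputs per step."""
    if m % b != 0 or not (b >= l >= 1):
        raise ValueError("need M a multiple of B and B >= L >= 1")
    s = m // b
    # Each HSU holds B values in ascending order; +inf marks an empty register.
    hsus = [[math.inf] * b for _ in range(s)]
    for start in range(0, len(stream), l):
        carry = sorted(stream[start:start + l])  # pre-sort the batch
        for k in range(s):
            merged = list(heapq.merge(hsus[k], carry))  # two sorted arrays
            hsus[k] = merged[:b]  # the B smallest stay in this unit
            carry = merged[b:]    # the largest are shifted to the next unit
        # Values leaving the last HSU are discarded.
    return [v for unit in hsus for v in unit if v != math.inf]


print(hybrid_stream_smallest([7, 3, 9, 1, 8, 2, 6, 4], m=4, b=2, l=2))
```

In this model every HSU stays in sorted order after each step, and the values shifted out of HSU(k) are never smaller than the values it keeps.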
Data Insertion may take place in different ways. For example, all the L input elements may be inserted into a single HSU or each element out of L input elements may be inserted in different HSUs. According to an aspect of the present invention, to handle the worst-case scenario of Data Insertion, L IPSs, which are capable of sorting (L+B) elements at a time, are used as shown in
According to an aspect of the present invention, the Value Selector block 604 selects the B elements to be stored into the registers of the HSU after Data Insertion and internal parallel sorting operation. The Value Selector selects B elements to be stored in HSU(k) based on the number of inputs inserted in HSU(0) to HSU(k). In
To illustrate the interaction of the blocks in
The Load-Shift Control block 602 provides to the Internal Parallel Sorter 608 the B=4 sorted contents and two new sorted elements to be sorted, i.e., the set [2 4 6 8] and the set [0 5]. The Internal Parallel Sorter 608 operates on these two already sorted arrays and outputs a single sorted array of elements, i.e., [0 2 4 5 6 8] which is input to the Value Selector block 604. The Value Selector block 604 selects the B=4 smallest elements and stores them in the B registers of HSU(k). The Value Selector block 604 shifts the remaining two elements [6 8] out to the next HSU(k+1).
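The numeric example above can be reproduced with a small sketch that models the three blocks as plain functions. The function names mirror the block names but are hypothetical, and the rule that an input is inserted only when it is smaller than the unit's current maximum is an assumption consistent with the comparison described earlier.

```python
import heapq


def load_shift_control(registers, sorted_inputs):
    """Pick the inputs that are smaller than the unit's current maximum."""
    current_max = registers[-1]  # registers are kept in ascending order
    return [x for x in sorted_inputs if x < current_max]


def internal_parallel_sorter(registers, inserted):
    """Merge two already sorted arrays into a single sorted array."""
    return list(heapq.merge(registers, inserted))


def value_selector(sorted_array, b):
    """Keep the B smallest values; the remainder is shifted to the next unit."""
    return sorted_array[:b], sorted_array[b:]


registers = [2, 4, 6, 8]                       # B = 4 stored elements
inserted = load_shift_control(registers, [0, 5])
merged = internal_parallel_sorter(registers, inserted)
kept, shifted = value_selector(merged, b=4)
print(merged, kept, shifted)
```

With these inputs the merged array is [0, 2, 4, 5, 6, 8], the unit keeps [0, 2, 4, 5] and shifts out [6, 8], matching the example.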
To further illustrate the interaction of the blocks in
A further detailed structure for Value selector block 604 of HSU(k) in
Turning to the HSU(k−1) portion of
The Data Insertion step performed within an HSU is based on the case that the B elements present in an HSU are in sorted order. After performing Data Insertion of L input elements into the S HSUs, the elements within each HSU in general may not be in sorted order. Hence, the IPS 608 is used, which can sort the elements of an HSU and the new input elements inserted in the same HSU. Batcher's Odd/Even sorting algorithm may be used to get higher performance with low complexity for the IPS with a smaller number of elements.
After Data Insertion, there are two arrays, one sorted array with B elements and another one with up to L elements. In case the number of elements inserted in an HSU is less than L, the unused inputs of the Internal Parallel Sorter may be set to a maximum value that may be discarded during the Value Selector operation. If the array with L elements is in sorted order, then the task of the IPS block 608 reduces to a simple merge operation of two sorted arrays with B elements each (since B≥L, the merge operation with the larger of the two lengths is considered). In the worst case, the merge operation of two arrays with B elements requires (3B−1) comparators, i.e., B comparators for sorting and (2B−1) comparators for merging the results of the sort operation.
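The padding of unused sorter inputs can be illustrated with a short sketch. It assumes a fixed-width port of L inputs, uses `math.inf` as a stand-in for the hardware's maximum value, and keeps the B smallest values; the function name is hypothetical.

```python
import heapq
import math


def padded_merge(registers, inserted, l, b):
    """Merge B stored values with up to L sorted inserted values."""
    if len(registers) != b or len(inserted) > l or l > b:
        raise ValueError("need len(registers) == B and len(inserted) <= L <= B")
    # Unused sorter inputs are tied to the maximum value.
    padded = list(inserted) + [math.inf] * (l - len(inserted))
    merged = list(heapq.merge(registers, padded))  # always B + L values
    kept = merged[:b]
    # Padding values sort to the end and are dropped instead of shifted out.
    shifted = [v for v in merged[b:] if v != math.inf]
    return kept, shifted


print(padded_merge([2, 4, 6, 8], [5], l=2, b=4))
```

Because the padding value is larger than any real element, it always ends up past the B kept positions and never displaces stored data.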
Each register in an HSU(k) is connected to the other registers of HSU(k), HSU(k−1) and IPS serving HSU(k) and HSU(k−1) through the Value Selector block as shown in
The table contained in
In scenario 1 of the table contained in
In scenario 2 of the table contained in
In scenario 3 of the table contained in
In scenario 4 of the table contained in
The rest of the scenarios listed in the table contained in
In general, each register in HSU(k) takes a value from the registers of HSU(k) if the number of shifts to be performed in HSU(k) is less than the total number of registers within HSU(k) that have values smaller than the register value under consideration. If the number of shifts to be performed in HSU(k) is more than the total number of registers within HSU(k) that have values smaller than the register value under consideration, then the register takes its value from HSU(k−1).
The number of elements in a segment, i.e., the value of B, defines the number of HSUs in the system. The total number of data comparisons performed per clock cycle is S*L, i.e., (M/B)*L. Increasing the number of elements in a segment reduces the total number of comparisons to be performed. This in turn simplifies the design of the controller used to control the data flow. However, with an increase in B, the parallel sorting block complexity increases. This results in increased hardware resources.
The number of inputs considered per iteration, i.e., L, defines the throughput of the system. When L<<N this architecture provides significant improvement in terms of hardware resources and power consumption. With an increase in L, the overall number of comparisons performed in the HSUs increases. However, the throughput of the system is also higher. With smaller values of L, the complexity of the method decreases significantly and the throughput reduces.
The number of IPS (Merge operations) units required for sorting the intermediate results in each HSU depends on the choice of B and L. When S>L, L IPS units which can sort (L+B) elements at a time are required. The complexity of the IPS units increases with increase in B and L. When S<L, S Parallel Sorting blocks are required and complexity increases with the value of B. In order to operate at higher clock frequencies both B and L are expected to have smaller values in practice. However, to meet the desired complexity and performance tradeoffs, the method presented according to aspects of the invention can be applied to any combination of values of B, L, and S for a given value of M as long as B≥L.
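The tradeoff between B and the sorter resources can be tabulated with a rough estimate. The sketch below combines the S*L comparison count, the rule for the number of IPS units, and the worst-case (3B−1) comparator figure for merging two B-element arrays. Using min(S, L) for the number of IPS units, which also covers the case S=L that the text does not address, and multiplying that by (3B−1) are assumptions of this sketch, as are the function name and the example values.

```python
def resource_estimate(m: int, b: int, l: int) -> dict:
    """Rough per-design figures derived from the formulas quoted in the text."""
    if m % b != 0 or not (b >= l >= 1):
        raise ValueError("need M a multiple of B and B >= L >= 1")
    s = m // b
    ips_units = min(s, l)  # L units when S > L, S units when S < L
    return {
        "S": s,
        "hsu_comparisons_per_cycle": s * l,
        "ips_units": ips_units,
        "ips_comparators_worst_case": ips_units * (3 * b - 1),
    }


for b in (4, 8, 16):
    print(b, resource_estimate(m=64, b=b, l=4))
```

For M=64 and L=4, growing B from 4 to 16 lowers the per-cycle comparisons from 64 to 16 while the worst-case IPS comparator estimate grows from 44 to 188, which is the tension described above.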
The parallel sorting algorithms such as Batcher's Odd/Even sort can be used to produce high throughput. However, Batcher's Odd/Even sort algorithm can only be applied to arrays with the number of elements being a power of two, i.e., 2^j where j≥2. Hence the value of N should be rounded up to the next power of 2 such that 2^(j−1)<N≤2^j, which results in redundant comparisons. In addition, the number of comparators required for sorting increases rapidly with increase in N. For an increase in N by a factor of two, the logic depth may increase by 2*log2(N). This may reduce the clock frequency of operation. Introducing pipelining to break the logic depth may decrease the throughput by a factor proportional to the number of pipelining stages introduced. Overall, the tradeoffs among complexity, logic depth and clock speed become less practical as the value of N grows.
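The cost of the power-of-two constraint can be made concrete with a small helper that computes the padded array size and the number of redundant inputs. This is an illustrative sketch; the function name is hypothetical and the minimum size of 4 reflects the j≥2 condition stated above.

```python
def padded_size(n: int) -> int:
    """Smallest power of two that is >= n, with a minimum of 4 (j >= 2)."""
    if n < 1:
        raise ValueError("n must be positive")
    return max(4, 1 << (n - 1).bit_length())


for n in (1000, 1024, 1025):
    size = padded_size(n)
    print(n, size, size - n)  # elements, padded size, redundant inputs
```

An array of 1025 elements is padded to 2048, so nearly half of the sorter inputs carry no useful data.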
The disclosed method provides an advantageous tradeoff between complexity, logic depth and the achievable clock speed. Furthermore, the disclosed method offers flexibility to choose B and L that may suit the target application.
To select M elements, the disclosed hybrid streaming sorter requires B times fewer comparisons than Insertion sort. The number of comparators required in the IPS is small for smaller values of B. In addition, the hybrid streaming sorter is L times faster than Insertion sort. Hence, the hybrid streaming sorter offers both complexity and power consumption advantages over Insertion sort. In contrast to the parallel sorting algorithms, the number of comparators required remains the same for given values of M, B and L and does not depend on the value of N, whereas for the parallel sorting algorithms it grows rapidly with increase in N. For the disclosed hybrid streaming sorter, an increase in N does not have any impact on the throughput or operating frequency, whereas for a parallel sorting algorithm the operating frequency reduces with increase in N.
Aspects of the present invention may be implemented in firmware of a micro-processor or micro-controller. In another alternative, aspects of the present invention may also be implemented as any combination of firmware, software and hardware running on a controller, such as a computer processing unit (CPU) or circuitry. The hardware may be an application specific integrated circuit (ASIC), field programmable gate array (FPGA), discrete logic components or any combination of such devices.
Patel, Bhaskar, Bhat, Raghavendra H.
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Oct 21 2015 | BHAT, RAGHAVENDRA H | MBIT WIRELESS, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 036913 | /0180 | |
Oct 21 2015 | PATEL, BHASKAR | MBIT WIRELESS, INC | ASSIGNMENT OF ASSIGNORS INTEREST SEE DOCUMENT FOR DETAILS | 036913 | /0180 | |
Oct 28 2015 | MBIT WIRELESS, INC. | (assignment on the face of the patent) | / |