A method determines a configuration for inter-processor communication for a heterogeneous multi-processor system. The method determines at least one subgraph of a graph representing communication between processors of the heterogeneous multi-processor system. For each subgraph the method (i) determines a plurality of subgraph design points. Each subgraph design point has a variation of channel mapping between any two of the processors in the subgraph by selecting from first-in-first-out (FIFO) memory and shared cache, and varying the shared cache and a local memory associated with at least one of the processors according to the channel mapping; and (ii) selects a memory solution for the subgraph, based on a cost associated with the selected memory solution. The method then determines a configuration for the graph of the heterogeneous multi-processor system, based on the selected memory solutions, to determine the configuration for inter-processor communication for the heterogeneous multi-processor system.
1. A method of determining a configuration for inter-processor communication for a heterogeneous multi-processor system, the method comprising:
determining subgraphs of a graph representing communication between processors of the heterogeneous multi-processor system, the subgraphs being determined by grouping communication channels carrying same data, the communication channels being sorted based on an estimate of time a processor may spend for communication of data as a receiver processor;
determining a memory solution for each of the subgraphs by exploring a design space of the subgraphs in isolation, the design space being variations of first-in-first-outs (FIFOs) and shared caches;
wherein the determining of the memory solution for each of the subgraphs comprises:
determining a plurality of subgraph design points for said subgraph, each of the subgraph design points having a variation of channel mapping between any two of the processors in the subgraph by selecting from the design space and a local memory associated with at least one of the processors according to the channel mapping; and
selecting a memory solution for said subgraph, from the plurality of determined subgraph design points, based on a cost associated with the selected memory solution; and
determining a configuration for the graph of the heterogeneous multi-processor system, based on the selected memory solutions, to determine the memory configuration for inter-processor communication for the heterogeneous multi-processor system.
18. A heterogeneous multi-processor system having an inter-processor communication memory storing a computer-executable program of instructions for causing the system to perform:
determining at least one subgraph of a graph representing communication between processors of the heterogeneous multi-processor system, the subgraph being determined by grouping communication channels carrying same data, the communication channels being sorted based on an estimate of time a processor may spend for communication of data as a receiver processor;
determining a memory solution for each of the subgraphs by exploring a design space of the subgraphs in isolation, the design space being variations of first-in-first-outs (FIFOs) and shared caches;
wherein the determining of the memory solution for each of the subgraphs comprises:
determining a plurality of subgraph design points for said subgraph, each of the subgraph design points having a variation of channel mapping between any two of the processors in the subgraph by selecting from the design space and a local memory associated with at least one of the processors according to the channel mapping; and
selecting the memory solution for said subgraph, from the plurality of determined subgraph design points, based on a cost associated with the selected memory solution; and
determining a configuration for the graph of the heterogeneous multi-processor system, based on the selected memory solutions, to determine the memory configuration for inter-processor communication for the heterogeneous multi-processor system.
10. A non-transitory computer readable storage medium having a program recorded thereon, the program being executable by a processor to determine a configuration for inter-processor communication for a heterogeneous multi-processor system, the program comprising:
code for determining at least one subgraph of a graph representing communication between processors of the heterogeneous multi-processor system, the subgraph being determined by grouping communication channels carrying same data, the communication channels being sorted based on an estimate of time a processor may spend for communication of data as a receiver processor;
code for determining a memory solution for each of the subgraphs by exploring a design space of the subgraphs in isolation, the design space being variations of first-in-first-outs (FIFOs) and shared caches,
wherein the code for determining the memory solution for each of the subgraphs comprises:
code for determining a plurality of subgraph design points for said subgraph, each of the subgraph design points having a variation of channel mapping between any two of the processors in the subgraph by selecting from the design space and a local memory associated with at least one of the processors according to the channel mapping; and
code for selecting the memory solution for said subgraph, from the plurality of determined subgraph design points, based on a cost associated with the selected memory solution; and
code for determining a configuration for the graph of the heterogeneous multi-processor system, based on the selected memory solutions, to determine the configuration for inter-processor communication for the heterogeneous multi-processor system.
20. A computer system having at least a processor, a non-transitory memory storing a program recorded on the memory, the program being executable by the processor to determine a memory configuration for inter-processor communication for a heterogeneous multi-processor system for causing the computer system to perform a method comprising:
determining subgraphs of a graph representing communication between processors of the heterogeneous multi-processor system, the subgraphs being determined by grouping communication channels carrying same data, the communication channels being sorted based on an estimate of time a processor may spend for communication of data as a receiver processor;
determining a memory solution for each of the subgraphs by exploring a design space of the subgraphs in isolation, the design space being variations of first-in-first-outs (FIFOs) and shared caches;
wherein the determining of the memory solution for each of the subgraphs comprises:
determining a plurality of subgraph design points for said subgraph, each of the subgraph design points having a variation of channel mapping between any two of the processors in the subgraph by selecting from the design space and a local memory associated with at least one of the processors according to the channel mapping; and
selecting the memory solution for said subgraph, from the plurality of determined subgraph design points, based on a cost associated with the selected memory solution; and
determining a configuration for the graph of the heterogeneous multi-processor system, based on the selected memory solutions, to determine the memory configuration for inter-processor communication for the heterogeneous multi-processor system.
2. The method according to
3. The method according to
4. The method according to
5. The method according to
6. The method according to
7. The method according to
8. The method according to
9. The method according to
11. The non-transitory computer readable storage medium according to
12. The non-transitory computer readable storage medium according to
13. The non-transitory computer readable storage medium according to
14. The non-transitory computer readable storage medium according to
15. The non-transitory computer readable storage medium according to
16. The non-transitory computer readable storage medium according to
17. The non-transitory computer readable storage medium according to
19. The heterogeneous multi-processor system according to
This application claims the benefit under 35 U.S.C. §119 of the filing date of Australian Patent Application No. 2014203218, filed Jun. 13, 2014, hereby incorporated by reference in its entirety as if fully set forth herein.
The present invention relates to automation tools for designing digital hardware systems in the electronics industry and, in particular, to the memory configuration for inter-processor communication in multiprocessor system-on-chip (MPSoC).
The continuous increase in transistor density on a single die has enabled integration of more and more components in a system-on-chip (SoC), such as multiple processors, memories, etc. Although the integration of more and more components has significantly improved the intrinsic computational power of SoCs, such integration has also significantly increased the design complexity. Continuously increasing design complexity is exacerbating the well-known design productivity gap. To meet time-to-design and time-to-market deadlines, industry is gradually shifting towards the use of automation tools at a higher level of design abstraction.
Heterogeneous multiprocessor system-on-chip (MPSoC) devices integrate multiple different processors to handle the high performance requirements of applications. An MPSoC primarily consists of multiple computational elements (examples of which include general-purpose processors, application-specific processors and custom hardware accelerators) and communication channels. Hereafter in this document, such computational elements are collectively referred to as processors. A communication channel connects two processors, where a first processor, in this instance operating as a “sender-processor”, sends data and a second processor, in this instance operating as a “receiver-processor”, receives the data. Communication channels in an MPSoC can be implemented using first-in-first-out (FIFO) memory, shared memory, shared cache, etc. Processors can also have private on-chip local memory (LM) used as a scratchpad memory for temporary storage of data. The mapping of communication channels can influence the size of the LM associated with a receiver-processor. The memory configuration, including FIFOs, shared memory, shared cache and LMs, used for data communication contributes significantly to the overall area and performance of an MPSoC. A complex MPSoC can have a large number of communication channels between processors. The design space for memory configuration for inter-processor communication is defined as all the possible combinations of the implementation of communication channels along with the variations of LMs connected to the processors. One combination of the implementation of communication channels, along with a selected size of LMs for all the processors, represents one design point.
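By way of illustration only, and not as part of the original disclosure, one design point of this kind might be represented along the following lines; the channel identifiers, processor identifiers and sizes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

# Illustrative sketch of one design point: every communication channel is
# mapped either to a FIFO or to a shared cache, and each processor is assigned
# a local memory (LM) size in bytes.
@dataclass
class DesignPoint:
    channel_mapping: Dict[str, str] = field(default_factory=dict)  # channel id -> "FIFO" or "SHARED_CACHE"
    lm_size: Dict[str, int] = field(default_factory=dict)          # processor id -> LM size in bytes

# Example: the first channel stays on a FIFO; the second is mapped to a shared
# cache, so the receiver-processor of that channel keeps little or no LM.
point = DesignPoint(
    channel_mapping={"ch0": "FIFO", "ch1": "SHARED_CACHE"},
    lm_size={"p0": 4096, "p1": 0},
)
```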
Mapping a complex streaming application on to an MPSoC to achieve performance requirements can be a very time intensive task. There has been an increased focus on automating the implementation of streaming multimedia applications on MPSoC platforms.
In one known method, an area of a pipelined MPSoC is optimized under a latency or throughput constraint using an integer linear programming (ILP) approach for a multimedia application. The optimization method assumes that data communication between processors is achieved by using queues implementing a FIFO protocol. The size of the queues is sufficiently large to hold an iteration of a data block, which can vary depending on the application. For example, a data block may include a group of pixels of an input image stream needed by any processor to independently execute the task mapped on the processor. The size of the queues can significantly increase the area of an MPSoC for applications having a large data block size, which, as a result, increases the cost of the MPSoC.
In another method, a design space exploration approach using linear constraints and a pseudo Boolean solver is proposed for optimization of the topology and communication routing of a system. Communication channels are commonly restricted to be mapped to memory resources. This approach does not consider multiple levels of memory hierarchies involving shared caches. Shared caches are on-chip memories which contain a subset of the contents of the external off-chip memory and provide better performance in comparison to the use of external off-chip memories alone. Not including shared caches may result in a significant increase in the on-chip memory area for a range of applications.
In another method, memory aware mapping of applications onto MPSoCs is proposed using evolutionary algorithms. Memory resources include private memories and shared memories. The limitation of this approach is that the method maps the application on a fixed memory platform, which is provided as an input to the method. In addition, the memory platform does not include shared caches. Including shared caches in the design space provides the flexibility to map communication data to off-chip memories and reduce on-chip memory area.
In another method, memory mapping is automatically determined and generated to optimize execution of the program on the target device. The memory mapping includes a description of the placement of the uniquely named memory sections to portions of the one or more memory elements of the target device. One limitation of this approach is that the approach optimizes the memory mapping for a fixed memory platform, which is provided as an input to the method.
The memory configuration for inter-processor communication (“MC-IPrC”) can have a significant impact on the area and performance of an MPSoC. There is a need for design automation methods to consider MC-IPrC including FIFOs, shared caches and local memories when mapping streaming applications onto MPSoCs.
The present disclosure focuses on how to efficiently and effectively sample the design space in order to determine MC-IPrC in an MPSoC to satisfy the performance requirements of the application. Disclosed are arrangements to determine a memory configuration for inter-processor communication (MC-IPrC) for a heterogeneous multiprocessor system-on-chip (MPSoC). According to this disclosure, the design space is efficiently sampled to determine a MC-IPrC for an MPSoC to satisfy the constraints provided by the designer. MC-IPrC includes FIFOs connecting processors, shared caches and local memories of the processors. The disclosed arrangements focus on determining a MC-IPrC based on the subgraph memory solutions, which are determined by exploring each subgraph design space in isolation. Earlier methods either do not consider shared caches in the memory hierarchies or map the application onto a fixed memory platform, which limits their use.
According to one aspect of the present disclosure there is provided a method of determining a configuration for inter-processor communication for a heterogeneous multi-processor system, the method comprising:
determining at least one subgraph of a graph representing communication between processors of the heterogeneous multi-processor system;
for each said subgraph:
determining a configuration for the graph of the heterogeneous multi-processor system, based on the selected memory solutions, to determine the configuration for inter-processor communication for the heterogeneous multi-processor system.
In one embodiment, the determining of the configuration for the graph comprises recursively combining subgraphs subject to a combination cost.
In one embodiment the determining the configuration of the graph comprises combining the memory solutions for each of the subgraphs in a pool in a single step.
Typically the subgraphs are created based on common data transferred between processors.
In one embodiment the subgraphs are created based on communication channels associated with one of a sender-processor or a receiver-processor.
In a specific implementation, the cost associated with the selected memory solution comprises a combination cost associated with the area of on-chip memory consumed for a combination of particular subgraphs. In one embodiment, the combination cost is associated with an area saving for on-chip memory associated with a performance constraint.
Beneficially the cost associated with the selected memory solution is associated with energy savings under a performance constraint.
According to another aspect of the present disclosure there is provided a method of determining a configuration for inter-processor communication for a heterogeneous multi-processor system, the method comprising:
determining at least one subgraph of a graph representing communication between processors of the heterogeneous multi-processor system, the subgraph being determined based on common data;
for each said subgraph:
determining a configuration for the graph of the heterogeneous multi-processor system, based on the selected memory solutions, to determine the configuration for inter-processor communication for the heterogeneous multi-processor system.
According to another aspect, disclosed is a heterogeneous multi-processor system having an inter-processor communication configuration formed according to the methods described herein.
Other aspects are also disclosed.
At least one embodiment of the present invention will now be described with reference to the following drawings, in which:
Context
The mapping of streaming applications on MPSoCs involves parallelisation of application software, mapping of software on to multiple processors, and the mapping of communication channels between processors on appropriate resources. Data communication between processors is generally restricted to be mapped on to MPSoC memory subsystems including queues connecting processors, shared memories, shared caches and local memories. A complex MPSoC can have a large number of communication channels resulting in a huge design space for memory configurations for inter-processor communication (MC-IPrC). Hence, it takes a long time to select a MC-IPrC for an MPSoC to satisfy requirements of a designer that include area, performance etc. The disclosed arrangements address this issue to effectively select the MC-IPrC by efficiently and effectively sampling the design space.
Overview
Proposed is a method to determine the memory configuration for inter-processor communication (MC-IPrC) in an MPSoC. MC-IPrC specifies multi-level memory hierarchy including local memories (LM), first-in-first-out (FIFO) memories and shared caches. LM is a private memory connected to the processor. FIFOs connect two processors, where a sender-processor produces data and a receiver-processor consumes the data. Shared caches are accessible by a plurality of processors enabling communication between them. Shared caches are on-chip memories which contain a subset of the contents of the external off-chip memory and provide better performance in comparison to external off-chip memories.
Due to the large design space, it is not feasible to determine a MC-IPrC which satisfies constraints of a designer by performing an exhaustive search. The presently disclosed method efficiently and effectively samples the design space to determine MC-IPrC for an MPSoC. The described method operates by dividing the MC-IPrC design space into multiple sub-divided design spaces, which are explored in isolation. The memory configuration for inter-processor communication, MC-IPrC, for an MPSoC is determined based on the results obtained by exploring sub-divided design spaces.
As seen in
The computer module 1201 typically includes at least one processor unit 1205, and a memory unit 1206. For example, the memory unit 1206 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM). The computer module 1201 also includes a number of input/output (I/O) interfaces including: an audio-video interface 1207 that couples to the video display 1214, loudspeakers 1217 and microphone 1280; an I/O interface 1213 that couples to the keyboard 1202, mouse 1203, scanner 1226, camera 1227 and optionally a joystick or other human interface device (not illustrated); and an interface 1208 for the external modem 1216 and printer 1215. In some implementations, the modem 1216 may be incorporated within the computer module 1201, for example within the interface 1208. The computer module 1201 also has a local network interface 1211, which permits coupling of the computer system 1200 via a connection 1223 to a local-area communications network 1222, known as a Local Area Network (LAN). As illustrated in
The I/O interfaces 1208 and 1213 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 1209 are provided and typically include a hard disk drive (HDD) 1210. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 1212 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g., CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable, external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the system 1200.
The components 1205 to 1213 of the computer module 1201 typically communicate via an interconnected bus 1204 and in a manner that results in a conventional mode of operation of the computer system 1200 known to those in the relevant art. For example, the processor 1205 is coupled to the system bus 1204 using a connection 1218. Likewise, the memory 1206 and optical disk drive 1212 are coupled to the system bus 1204 by connections 1219. Examples of computers on which the described arrangements can be practised include IBM-PC's and compatibles, Sun Sparcstations, Apple Mac™ or like computer systems.
The methods of determining a configuration for inter-processor communication for a heterogeneous multi-processor system may be implemented using the computer system 1200 wherein the processes of
The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 1200 from the computer readable medium, and then executed by the computer system 1200. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 1200 may effect an apparatus for determining a configuration for inter-processor communication for a heterogeneous multi-processor system.
The software 1233 is typically stored in the HDD 1210 or the memory 1206. The software is loaded into the computer system 1200 from a computer readable medium, and executed by the computer system 1200. Thus, for example, the software 1233 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 1225 that is read by the optical disk drive 1212. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer system 1200 may effect an apparatus for determining a configuration for inter-processor communication for a heterogeneous multi-processor system.
In some instances, the application programs 1233 may be supplied to the user encoded on one or more CD-ROMs 1225 and read via the corresponding drive 1212, or alternatively may be read by the user from the networks 1220 or 1222. Still further, the software can also be loaded into the computer system 1200 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 1200 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray Disc™, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 1201. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of software, application programs, instructions and/or data to the computer module 1201 include radio or infra-red transmission channels as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.
The second part of the application programs 1233 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 1214. Through manipulation of typically the keyboard 1202 and the mouse 1203, a user of the computer system 1200 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 1217 and user voice commands input via the microphone 1280.
When the computer module 1201 is initially powered up, a power-on self-test (POST) program 1250 executes. The POST program 1250 is typically stored in a ROM 1249 of the semiconductor memory 1206 of
The operating system 1253 manages the memory 1234 (1209, 1206) to ensure that each process or application running on the computer module 1201 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the system 1200 of
As shown in
The application program 1233 includes a sequence of instructions 1231 that may include conditional branch and loop instructions. The program 1233 may also include data 1232 which is used in execution of the program 1233. The instructions 1231 and the data 1232 are stored in memory locations 1228, 1229, 1230 and 1235, 1236, 1237, respectively. Depending upon the relative size of the instructions 1231 and the memory locations 1228-1230, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 1230. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 1228 and 1229.
In general, the processor 1205 is given a set of instructions which are executed therein. The processor 1205 waits for a subsequent input, to which the processor 1205 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 1202, 1203, data received from an external source across one of the networks 1220, 1222, data retrieved from one of the storage devices 1206, 1209 or data retrieved from a storage medium 1225 inserted into the corresponding reader 1212, all depicted in
The disclosed communications determining arrangements use input variables 1254, which are stored in the memory 1234 in corresponding memory locations 1255, 1256, 1257. The arrangements produce output variables 1261, which are stored in the memory 1234 in corresponding memory locations 1262, 1263, 1264. Intermediate variables 1258 may be stored in memory locations 1259, 1260, 1266 and 1267.
Referring to the processor 1205 of
(i) a fetch operation, which fetches or reads an instruction 1231 from a memory location 1228, 1229, 1230;
(ii) a decode operation in which the control unit 1239 determines which instruction has been fetched; and
(iii) an execute operation in which the control unit 1239 and/or the ALU 1240 execute the instruction.
Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 1239 stores or writes a value to a memory location 1232.
Each step or sub-process in the processes of
First Implementation
Each communication channel of an MPSoC may be implemented either as a FIFO or by mapping to a shared cache. The design space of the MPSoC is primarily dependent on the number of communication channels and the number of shared cache configurations. Because the design space is so large, it is not feasible to perform an exhaustive search for an optimal solution, and a manual approach may not provide satisfactory results. The disclosed method determines the MC-IPrC comprising sizes of the local memories, appropriate implementation of communication channels and shared cache parameters for an input application. Shared cache parameters include cache size, associativity and line size.
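As a rough illustration of scale (an assumption for exposition, not a formula taken from the disclosure), the number of design points grows exponentially with the number of channels and multiplicatively with the number of cache configurations:

```python
def design_space_size(num_channels, num_cache_configs):
    # Each channel may be implemented as a FIFO or mapped to a shared cache
    # (2 ** num_channels combinations), and each combination can be paired
    # with any of the candidate shared cache configurations.
    return (2 ** num_channels) * num_cache_configs

# e.g. 30 channels and 50 cache configurations already exceed 5e10 design
# points, which is why an exhaustive search is not feasible.
print(design_space_size(30, 50))
```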
An implementation determines the MC-IPrC to reduce the on-chip memory area while satisfying a performance constraint provided by the designer.
Each of the steps of the method 200 is further described in detail below. Step 202, to determine subgraphs, is described using an example shown in
Step 203 of the method 200 explores the design points in the subgraph design space to determine a memory solution for each of the subgraphs.
The method 500 then proceeds to the step 502, where several variables are initialised by the CPU 1205. Variable CURR_SOL points to the current memory solution of the subgraph and is initialised to a default memory solution in which all the channels are mapped to FIFOs and all processors of the subgraph have associated LMs. For example, in the subgraph table 600 of
Also at step 502, the further variables INIT_AREA, PREV_SAVINGS and CURR_SAVINGS are initialised by the CPU 1205. INIT_AREA represents the MPSoC area in which all the channels are mapped to FIFOs. PREV_SAVINGS and CURR_SAVINGS are initialised to 0. CURR_SAVINGS and PREV_SAVINGS indicate the savings in the MPSoC area achieved by using the current memory solution and the previous memory solution of the subgraph in comparison to the INIT_AREA. Savings in the area is determined by subtracting the area of the MPSoC based on a subgraph memory solution from the area of the MPSoC with all channels mapped to FIFOs.
The method 500 uses a half-interval approach to evaluate the groups of channels. In the half-interval approach, a group of channels of a subgraph, representing a “current interval”, is created by iteratively increasing or decreasing the value of the variable END. It is to be noted that the current interval includes channels with index values ranging from START to END. The initial current interval established at step 502 includes all the channels of the subgraph, where channels are sorted from the highest to the lowest value of communication budget. During the operation of the loop of steps 503-512 to be described, the value of END is iteratively increased or decreased by (N/2^i), where ‘N’ is the total number of channels in the subgraph and ‘i’ is the iteration number of the loop. The iteration number, ‘i’, is initialised with the value of 1 and transition of the feedback loop from step 512 to step 503 increments the iteration number by one in the method 500. Channels indicated by a current interval are mapped to each of the caches for performance evaluation. The channels which are not part of the current interval are mapped to FIFOs. If the performance constraints are not satisfied then the value of END is decreased by (N/2^i) to remove channels from the current interval. Otherwise, if the performance constraints are met, then the value of END is increased by (N/2^i) to add channels to the current interval.
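A simplified sketch of this half-interval update of END is given below; it is not the method 500 verbatim, it omits the area-savings bookkeeping of steps 508-510, and it assumes a hypothetical callback meets_constraint that evaluates the MPSoC with the channels of the current interval mapped to a shared cache and the remaining channels mapped to FIFOs.

```python
def half_interval_search(channels, meets_constraint):
    """Channels are assumed to be sorted from the highest to the lowest
    communication budget, matching the ordering described for step 502."""
    n = len(channels)
    start, end = 0, n                     # initial current interval: all channels
    i = 1
    while True:
        step = n // (2 ** i)              # N / 2^i channels to add or remove
        if meets_constraint(channels[start:end]):
            end = min(n, end + step)      # constraint met: add channels back (step 511)
        else:
            end = max(start, end - step)  # violated: drop lowest-budget channels (step 507)
        if n / 2 ** i < 1:                # step 512: stop once N / 2^i falls below 1
            # Channels in the final interval go to the shared cache; the rest use FIFOs.
            return channels[start:end]
        i += 1
```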
The half-interval approach is further described using a memory solution 700 shown in
The half interval approach is further explained using the
As noted above, the loop of steps 503-512 operates on a single group of channels forming the current interval. The initial current interval comprises all channels of the input subgraph 520. At step 503, the channels indicated by the current interval are each mapped to each of a number of possible solutions of shared cache for performance evaluation. Each possible solution may have only a single shared cache for the subgraph, with the size, values and parameters of the shared cache varying between the possible solutions. In one embodiment, a solution is for the entire graph to be implemented with a single shared cache. At step 504, a cache design space for the channels of the current interval of the subgraph is determined. In this implementation, the cache design space includes variations of cache memory size from half of the data size to the sum of the data size and local memories in the current interval of the subgraph. As will be appreciated from the description of steps 507 and 511 where channels are removed from or added to the current interval, the cache memory size will correspondingly vary. Data size for a channel indicates the amount of data transferred in one iteration of a streaming application. For example, the streaming application can be a video processing application with a group of pixels (i.e. a subset of a single video frame) being processed by the MPSoC in each iteration. In the subgraph table 600, data size 605 indicates the amount of data transferred on each channel for each iteration. The cache design space also includes variations of associativity and line size. As an example, for the subgraph table 600, data size 605 is 4 KB. The cache design space includes variations of cache memory size, for example from 2 KB (half the data size) to 28 KB (the sum discussed above), where the cache memory may be divided into multiple sets governed by the cache associativity. Cache size is incremented based on a predetermined value, for example, 4 KB. An upper limit of the cache associativity and line size variations is provided as input to the method 500 by the designer. In this example, cache associativity variations include 1, 2 and 4 ways. Line size variations include 1 word, 2 words, 4 words and 8 words, where 1 word is 32 bits. For each of the cache sizes determined, there can be line size and associativity variations; associativity and line size are standard cache parameters. Cache size is rounded up to the nearest power of two in each of the cache variations in the design space of a subgraph, since on-chip memory blocks are configured in such size steps. In this implementation, for all the channels which are mapped to a shared cache, the LM associated with the corresponding receiver-processor is removed. In another implementation, for all the channels which are mapped to a shared cache, the LM associated with the corresponding receiver-processor is reduced to a size specified by the designer. The size of the LM specified by the designer is governed by the input application.
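A sketch of how the cache design space described in this paragraph might be enumerated is shown below. The 24 KB of local memory used in the example is an assumption chosen so that the upper bound matches the 28 KB figure above; the helper names are hypothetical.

```python
from itertools import product

def next_power_of_two(n):
    # On-chip memory blocks come in power-of-two size steps, so each candidate
    # cache size is rounded up to the nearest power of two.
    p = 1
    while p < n:
        p *= 2
    return p

def cache_design_space(data_size, lm_size, step=4 * 1024,
                       associativities=(1, 2, 4), line_sizes_words=(1, 2, 4, 8)):
    """Enumerate cache sizes from half the data size up to the data size plus
    local memories, in fixed increments, crossed with the associativity and
    line-size variations supplied by the designer."""
    sizes = set()
    size = data_size // 2
    while size <= data_size + lm_size:
        sizes.add(next_power_of_two(size))
        size += step
    return [{"size": s, "associativity": a, "line_size_words": l}
            for s, a, l in product(sorted(sizes), associativities, line_sizes_words)]

# Example matching the figures in the text: 4 KB of data and 24 KB of LMs give
# candidate sizes between 2 KB and 28 KB before power-of-two rounding.
configs = cache_design_space(4 * 1024, 24 * 1024)
```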
At step 505, performance of the MPSoC is evaluated using simulations. MPSoC performance is evaluated for all the variations of the cache design space at this step. It should be noted that performance evaluation can be done using any of a number of known simulation methods including, but not limited to, RTL simulation, instruction set simulation or simulation using SystemC based models. In addition to simulation methods, cache estimation models can be used to provide an estimate of the performance. Various simulation and estimation models differ in the level of accuracy and associated runtimes.
At step 506, it is ascertained whether any of the design points in the subgraph design space meet the performance constraint. If no design point of the subgraph design space satisfies the performance constraint, then the method 500 proceeds to step 507 where the current interval is updated to reduce the number of channels under consideration by decreasing the value of END. In one embodiment, the value of the variable END is decreased by (N/2^i), where ‘N’ is the total number of channels in the subgraph and ‘i’ is the iteration number of the method 500. In this fashion, at least one communication channel is removed from the current interval, namely the channel whose index is END or adjacent to END, being that or those with the lowest communication budget. Channels with the lowest communication budget are removed first because such channels are harder to implement using a shared cache.
If the performance constraint is determined in step 506 to be met by at least one cache variation of the subgraph design space for the current interval, the method 500 proceeds to step 508 instead of step 507.
At step 508, the design point of the subgraph design space with the minimum area is selected as the current memory solution of the subgraph. The area for a memory solution is the sum of the shared cache area, FIFO area and LM area. CURR_SAVINGS is determined by subtracting the area of the MPSoC based on the selected memory solution from the INIT_AREA. At step 509, the CPU 1205 evaluates whether CURR_SAVINGS is greater than PREV_SAVINGS or not. If the CURR_SAVINGS is less than the PREV_SAVINGS, then the method 500 is terminated, and the CURR_SOL provides the memory solution for the subgraph. This is because the list of channels of the input subgraph is ordered and the subgraphs are based upon common data; under these conditions, savings will only reduce and further iterations are counterproductive.
If the CURR_SAVINGS is determined at step 509 to be more than PREV_SAVINGS, then CURR_SOL and PREV_SAVINGS are updated at step 510. PREV_SAVINGS is updated with the value of CURR_SAVINGS and CURR_SOL is updated to indicate the current memory solution of the subgraph.
At step 511 of the method 500, channels are added to the current interval by updating the value of the variable END. The value of the variable END is increased by (N/2^i), where ‘N’ is the total number of channels in the subgraph and ‘i’ is the iteration number of the method 500. As a consequence of this update, channels of the input subgraph 520 with higher indices than those just processed are added to the group forming the current interval.
Step 512 assesses whether all possible groups of channels have been assessed based on the half-interval approach discussed previously. At step 512, the method is terminated if the value of ‘N/2^i’ is less than 1; otherwise, the method 500 returns to step 503 and the value of the iteration number is incremented by 1.
As an example, assuming a subgraph has 4 channels and a first iteration of the method 500 violates the constraints at step 506, then 2 channels are removed at step 507. The second iteration, based on 2 channels in the current interval, then satisfies the constraints, and step 511 then adds back into the current interval the next highest index channel previously removed. The third iteration assesses the 3 channels of the current interval and two alternatives may occur. In the first alternative, the 3 channels may satisfy the constraint and step 512 terminates, resulting in one memory solution with the 3 lowest index channels mapped to shared cache and remaining channels mapped to FIFOs. That subgraph memory solution is for a single shared cache representing the 3 lowest index channels. The remaining channel of the subgraph, not in the current interval, is then implemented by a FIFO. In the second alternative, the third iteration may violate the constraints, and step 507 then removes from the current interval the channel just added. The loop is processed again and the 2 channels of the current interval again satisfy the constraint at step 506, and step 511 does not add any channel due to the update criteria not being satisfied. The 2 channels of the current interval mapped to shared cache form part of a memory solution for the input subgraph 520. The remaining 2 channels are implemented using respective FIFOs. As such, at the conclusion of step 512, all channels in the current interval are implemented using a shared cache, and any remaining channels are each implemented using a corresponding FIFO.
The method 500 selects one memory solution for the subgraph. Memory solution 700, described earlier, is an example of the outcome of the method 500. Similarly, memory solutions for other subgraphs are determined using the method 500.
After the memory solutions for the subgraphs are determined, the MC-IPrC for the graph representing the MPSoC can be determined according to the step 204 of the method 200. Method 800 describes the step 204 in detail. According to the method 800, the memory solution of a subgraph providing maximum area savings is recursively combined with the memory solutions of other subgraphs to determine a MC-IPrC for the graph representing the MPSoC. The method 800 takes a list of subgraphs 809 and their corresponding memory solutions as input. The method 800 starts at step 801 by selecting a first subgraph which provides the maximum area savings from the list of subgraphs 809. The memory configuration for the graph is initialised with the memory solution for the selected subgraph. The memory configuration initialised at the step 801 has channels, which are not part of the selected subgraph, mapped to FIFOs. At step 802, a check is performed to determine if there are any more subgraphs to be assessed. If there are more subgraphs to assess, the method 800 proceeds to step 803 where a second subgraph is selected, based on area savings, from the remaining subgraphs of the list 809. Step 804 assesses a combination cost of the first and second subgraphs. The combination cost measures the impact on the area savings of on-chip memory when combining two memory solutions under a performance constraint. The two memory solutions may be combined to create a new memory solution if the combined memory solution provides greater area savings in comparison to the area savings provided by the memory solution of the first subgraph. For example, a first subgraph provides savings when considered in isolation, and a second subgraph provides savings when considered in isolation. The solutions of those two subgraphs can be combined, according to
In addition to the area savings, memory solutions of the two subgraphs can only be combined if the MPSoC based on the combined memory solution meets the performance constraint. The result of the assessment of the combination cost is checked at the step 805 to see if the combination meets the performance constraint. It should be noted that the channels which are not part of the two subgraphs under consideration are mapped to FIFOs for performance evaluation of the MPSoC. If the performance constraint is not met, then the second subgraph is discarded from further assessment at step 806 and the method returns to the step 802 to assess any remaining subgraphs. Any subgraphs whose memory solutions cannot be combined with other memory solutions have all their channels mapped to FIFOs. The combination of subgraphs to determine a combined memory solution is repeated until all the subgraphs have been assessed. If the combination of the memory solutions of two subgraphs meets the performance constraint, then a new memory solution is created and a new set of subgraphs is created at step 807 which includes the first and second subgraphs. This new set of subgraphs is treated as the first subgraph at step 802 in the next iteration. The memory configuration for the graph is updated to include the combined memory solution. Once all of the subgraphs have been assessed, step 808 provides the memory configuration for inter-processor communication for the graph representing the heterogeneous MPSoC.
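A compact sketch of this greedy combination is given below; it is not the method 800 verbatim, and savings_of, combine and meets_constraint are hypothetical callbacks standing in for the area evaluation, the merging of two memory solutions and the performance check respectively.

```python
def combine_subgraph_solutions(solutions, savings_of, combine, meets_constraint):
    """`solutions` maps a subgraph identifier to its memory solution (as
    selected by the per-subgraph exploration of method 500)."""
    remaining = dict(solutions)
    # Step 801: start from the subgraph whose memory solution saves the most area.
    first = max(remaining, key=lambda sg: savings_of(remaining[sg]))
    current = remaining.pop(first)
    # Steps 802-807: try to fold in the remaining subgraphs, best savings first.
    for sg in sorted(remaining, key=lambda s: savings_of(remaining[s]), reverse=True):
        candidate = combine(current, remaining[sg])
        # Combine only if the merged solution saves more area and still meets
        # the performance constraint; otherwise that subgraph's channels stay on FIFOs.
        if savings_of(candidate) > savings_of(current) and meets_constraint(candidate):
            current = candidate
    # Step 808: channels not covered by the combined solution are mapped to FIFOs.
    return current
```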
For the remaining rows in
Second Implementation
In an alternative implementation, a MC-IPrC is determined to reduce communication energy of the heterogeneous MPSoC while satisfying performance constraints provided by the designer. Communication energy, typically measured in joules, includes the energy consumed by the processors in relation to the data communication and the energy consumed by the on-chip memories that are part of the memory configuration for inter-processor communication. In this implementation, step 505 of the method 500 is replaced with a step which gathers performance as well as energy consumption values from simulations. It should be noted that a variety of simulations can be used including RTL simulations, instruction set simulation or simulation using SystemC based models. Industry standard tools can be used to extract the energy consumption values of the processors and the memory solutions of the subgraphs. Further, step 508 is replaced with a step to select the memory solution with a minimum communication energy, and step 804 of the method 800 assesses the combination cost. In this implementation, the combination cost includes the savings in the communication energy provided by a combined memory solution of the subgraphs. The combined memory solution is feasible if it provides more communication energy savings when compared to the memory solution of the first subgraph. In addition, the combined memory solution should also satisfy the performance constraint.
In another implementation, at step 202, subgraphs are determined based on the direct connections between a sender-processor and one or more receiver-processors. Two processors are considered to have a direct connection if any communication channel connects the two processors. For example,
In another implementation, a list of subgraphs is determined from the graph representing the MPSoC. A high priority set of subgraphs is determined from the list of the subgraphs based on the savings provided by their memory solutions. The high priority set of subgraphs is determined such that it includes only those subgraphs whose memory solutions provide more savings than a predetermined savings threshold, which is provided as input by a user. For example, the threshold can be specified as 0, which implies that all subgraphs whose memory solutions provide positive savings are included in the high priority set of subgraphs.
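A minimal sketch of forming such a high priority set, assuming a hypothetical savings_of helper, is:

```python
def high_priority_subgraphs(solutions, savings_of, threshold=0):
    # Keep only the subgraphs whose memory solution saves more than the
    # user-supplied threshold (a threshold of 0 keeps every subgraph with
    # positive savings).
    return {sg: sol for sg, sol in solutions.items() if savings_of(sol) > threshold}
```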
The arrangements described are applicable to the computer and data processing industries and particularly to the assisted automated design of MPSoC devices. In particular, the arrangements disclosed provide for the design and development of heterogeneous multi-processor systems having inter-processor communication configurations tailored to the specific application of the designer and generally optimised for performance within the available chip space.
The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.
Inventors: Yusuke Yachide, Haris Javaid, Sridevan Parameswaran, Kapil Batra, Su Myat Min Shwe