Unit IX: Introduction to Parallel Processing - Computer Architecture - BCA Notes (Pokhara University)


Friday, June 19, 2020


Parallel Processing:

Parallel processing is a class of techniques that enables a system to carry out several data-processing tasks simultaneously in order to increase its computational speed. A parallel processing system achieves faster execution time by overlapping work: for instance, while one instruction is being processed in the ALU component of the CPU, the next instruction can be read from memory.

The primary purpose of parallel processing is to enhance computer processing capability and increase its throughput, i.e. the amount of processing that can be accomplished during a given interval of time. A parallel processing system can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously. The data can be distributed among various multiple functional units.

The following diagram shows one possible way of separating the execution unit into eight functional units operating in parallel. The operation performed in each functional unit is indicated in each block of the diagram:


Fig: Operation Performed In Each Functional Unit

  1. The adder and integer multiplier perform arithmetic operations on integer numbers.
  2. The floating-point operations are separated into three circuits operating in parallel.
  3. The logic, shift, and increment operations can be performed concurrently on different data. All units are independent of each other, so one number can be shifted while another number is being incremented.

Parallelism in Uniprocessor System:

A uniprocessor system can also perform two or more tasks simultaneously. The tasks need not be related to each other, so a system that processes two different instructions simultaneously can be considered to perform parallel processing.


Fig: Parallelism in Uniprocessor System

The instruction dispatch unit assigns the current instruction to the relevant functional unit based on the decoding result. The assigned unit then continues with the following steps, such as operand address calculation, operand fetch, and execution. As soon as the current instruction is assigned to a functional unit, the dispatch unit assigns the next instruction to another unit that is free. The system can have more than one functional unit of the same type, such as multiple adders or multiple shifters.

The control unit is more complex than that of a simple processor, but the throughput achieved is much higher. The control unit must also handle conflicts such as data dependencies: for example, an instruction A whose operand depends on the result of instruction B must wait until the execution of instruction B is completed.
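As a toy illustration (a hypothetical model, not a description of real hardware), the dispatch-and-stall behaviour above can be sketched in Python: each instruction names the unit type it needs and, optionally, an earlier instruction it depends on.

```python
# Toy sketch of an instruction dispatch unit: instructions issue to a
# free functional unit of the required type, and an instruction whose
# operand depends on another one stalls until that producer completes.

def dispatch(instructions, units):
    """instructions: list of (name, unit_type, depends_on or None).
    units: dict unit_type -> number of identical units available.
    Returns the issue slot assigned to each instruction."""
    finished = set()
    slots = {}
    slot = 0
    pending = list(instructions)
    while pending:
        busy = {t: 0 for t in units}          # units occupied this slot
        issued = []
        for name, utype, dep in pending:
            # issue only if the dependency completed in an earlier slot
            # and a unit of the right type is still free this slot
            if (dep is None or dep in finished) and busy[utype] < units[utype]:
                busy[utype] += 1
                slots[name] = slot
                issued.append((name, utype, dep))
        for ins in issued:
            pending.remove(ins)
        finished.update(name for name, _, _ in issued)
        slot += 1
    return slots
```

With one adder and one shifter, two independent adds issue in different slots, and a dependent instruction waits an extra slot for its operand to become available.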

Multiprocessor Systems:

A multiprocessor system is defined as "a system with more than one processor", and more precisely, "a number of central processing units linked together to enable parallel processing to take place". The key objective of a multiprocessor is to boost a system's execution speed. The other objectives are fault tolerance and application matching.

The term "multiprocessor" can be confusing with the term "multiprocessing". While multiprocessing is a type of processing in which two or more processors work together to execute multiple programs simultaneously, multiprocessor refers to a hardware architecture that allows multiprocessing.


Multiprocessors are further classified into two groups depending on the way their memory is organized. Processors with shared memory are called tightly coupled or shared-memory multiprocessors. Information in these systems is shared through the common memory; each processor can also have its own local memory.

The other class is loosely coupled or distributed-memory multiprocessors. Here each processor has its own private memory, and processors share information with each other through an interconnection switching scheme or message passing.

The principal characteristic of a multiprocessor is its ability to share a set of main memory and some I/O devices. This sharing is possible through some physical connections between them called the interconnection structures.

Advantages of Multiprocessor Systems:

1. More Reliable Systems:

In a multiprocessor system, even if one processor fails, the system does not halt. This ability to continue working despite hardware failure is known as graceful degradation. For example, if one of the 5 processors in a multiprocessor system fails, the remaining 4 processors keep working, so the system only becomes slower and does not grind to a halt.

2. Enhanced Throughput:

If multiple processors work in tandem, the throughput of the system increases, i.e. the number of processes executed per unit of time increases. With N processors, the throughput increases by an amount just under N, since some capacity is lost to coordination overhead.

3. More Economic Systems:

Multiprocessor systems are cheaper than single-processor systems in the long run because they share data storage, peripheral devices, power supplies, etc. If there are multiple processes that share data, it is better to schedule them on multiprocessor systems with shared data than have different computer systems with multiple copies of the data.

Disadvantages of Multiprocessor Systems:

1. Increased Expense:

Even though multiprocessor systems are cheaper in the long run than using multiple computer systems, they are still quite expensive. It is much cheaper to buy a simple single-processor system than a multiprocessor system.

2. Complicated Operating System Required:

There are multiple processors in a multiprocessor system that share peripherals, memory, etc., so it is much more complicated to schedule processes and allocate resources to them than in single-processor systems. Hence, a more complex and complicated operating system is required in multiprocessor systems.

3. Large Main Memory Required:

All the processors in the multiprocessor system share the memory. So a much larger pool of memory is required as compared to single-processor systems.

Flynn’s Classification:

The most popular taxonomy of computer architecture was defined by Flynn in 1966. Flynn's classification scheme is based on the notion of a stream of information. Two types of information flow into a processor: instructions and data.


The instruction stream is defined as the sequence of instructions performed by the processing unit. The data stream is defined as the data traffic exchanged between the memory and the processing unit.

According to Flynn's classification, either of the instruction or data streams can be single or multiple. Computer architecture can be classified into the following four distinct categories:

1. Single Instruction, Single Data (SISD) Systems:

A SISD computing system is a uniprocessor machine that is capable of executing a single instruction, operating on a single data stream. In SISD, machine instructions are processed in a sequential manner, and computers adopting this model are popularly called sequential computers. Most conventional computers have SISD architecture. All the instructions and data to be processed have to be stored in primary memory.


The speed of the processing element in the SISD model is limited by the rate at which the computer can transfer information internally. Dominant representative SISD systems are the IBM PC and workstations.

2. Single Instruction, Multiple Data (SIMD) Systems:

A SIMD system is a multiprocessor machine capable of executing the same instruction on all CPUs while operating on different data streams. Machines based on the SIMD model are well suited to scientific computing, since it involves many vector and matrix operations. So that information can be passed to all the processing elements (PEs), the organized data elements of a vector can be divided into multiple sets (N sets for an N-PE system), and each PE can process one data set.


A dominant representative SIMD system is Cray's vector processing machine.
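The SIMD idea — one instruction, many data streams — can be sketched in Python (an illustrative model, not real SIMD hardware):

```python
# SIMD sketch: every processing element executes the same instruction,
# but each operates on its own data set (data stream).
def simd_apply(instruction, data_sets):
    # one instruction broadcast to all PEs; one data set per PE
    return [instruction(d) for d in data_sets]

# the common instruction: double every element of a slice
double = lambda xs: [2 * x for x in xs]
# e.g. two PEs, each holding one slice of a larger vector
```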

3. Multiple Instruction, Single Data (MISD) Systems:

An MISD computing system is a multiprocessor machine capable of executing different instructions on different PEs, all of them operating on the same data set.


Example Z = sin(x) + cos(x) + tan(x)

The system performs different operations on the same data set. Machines built using the MISD model are not useful in most applications; a few machines have been built, but none of them are available commercially.
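The example above can be sketched as an MISD-style computation, with each "processing element" applying a different instruction to the same data stream:

```python
import math

# MISD sketch: different instructions (sin, cos, tan) applied by
# different PEs to the same data x, matching the example
# Z = sin(x) + cos(x) + tan(x).
def misd_apply(instructions, x):
    return [f(x) for f in instructions]

def z_of(x):
    return sum(misd_apply([math.sin, math.cos, math.tan], x))
```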

4. Multiple Instruction, Multiple Data (MIMD) Systems:

A MIMD system is a multiprocessor machine that is capable of executing multiple instructions on multiple data sets. Each PE in the MIMD model has separate instruction and data streams; therefore machines built using this model are capable of any kind of application. Unlike SIMD and MISD machines, PEs in MIMD machines work asynchronously.


MIMD machines are broadly categorized into shared-memory MIMD and distributed-memory MIMD based on the way PEs are coupled to the main memory.

Interconnection Structures in Multiprocessors:

The components that form a multiprocessor system are CPUs, IOPs connected to input/output devices, and a memory unit. The interconnection between the components can have different physical configurations, depending on the number of transfer paths that are available between the processors and memory in a shared memory system or among the processing elements in a loosely coupled system. There are several physical forms available for establishing an interconnection network.

1. Time Shared Common Bus:

A common-bus multiprocessor system consists of a number of processors connected through a common path to a memory unit. In this interconnection structure, only one processor can communicate with the memory or with another processor at any given time. As a consequence, the total overall transfer rate within the system is limited by the speed of the single path.

Transfer operations are conducted by the processor that is in control of the bus at the time. Any other processor wishing to initiate a transfer must first determine the availability status of the bus, and only after the bus becomes available can the processor address the destination unit to initiate the transfer. A command is issued to inform the destination unit of what operation is to be performed.

The receiving unit recognizes its address in the bus and responds to the control signals from the sender, after which the transfer is initiated. The system may exhibit transfer conflicts since one common bus is shared by all processors. These conflicts must be resolved by incorporating a bus controller that establishes priorities among the requesting units.


A more economical implementation of a dual bus structure is depicted as:


Here we have a number of local buses, each connected to its own local memory and to one or more processors. Each local bus may be connected to a CPU, an IOP, or any combination of processors. A system bus controller links each local bus to a common system bus.

The Input/Output devices connected to the local IOP, as well as the local memory, are available to the local processor. The memory connected to the common system bus is shared by all processors. Only one processor can communicate with the shared memory and other common resources through the system bus at any given time. The other processors are kept busy communicating with their local memory and Input/Output devices.

2. Multiport Memory:

A multiport memory system employs separate buses between each memory module and each CPU. Each processor bus is connected to each memory module. A processor bus consists of the address, data, and control lines required to communicate with memory.

In a system with four CPUs, each memory module has four ports, and each port accommodates one of the buses. The module must have internal control logic to determine which port has access to memory at any given time. Memory access conflicts are resolved by assigning fixed priorities to each memory port: CPU 1 has priority over CPU 2, CPU 2 over CPU 3, and CPU 4 has the lowest priority.
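The fixed-priority arbitration can be sketched as follows (a minimal model, assuming CPU numbers start at 1, with 1 as the highest priority):

```python
# Multiport-memory arbitration sketch: among the CPUs requesting this
# memory module in a cycle, the port with the highest fixed priority
# (lowest CPU number) is granted access.
def grant(requests):
    """requests: set of CPU numbers asking for this memory module.
    Returns the CPU granted access, or None if no one is requesting."""
    return min(requests) if requests else None
```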


The advantage of the multiport memory organization is the high transfer rate that can be achieved because of the multiple paths between processors and memory.

The disadvantage is that it requires expensive memory control logic and a large number of cables and connectors. As a consequence, this interconnection structure is usually appropriate for systems with a small number of processors.

3. Crossbar Switch:

A crossbar switch system contains a number of crosspoints placed at the intersections between processor buses and memory module paths. At each crosspoint, a small square represents a switch that establishes the path from a processor to a memory module. Each switch point has control logic to set up the transfer path between processor and memory. It examines the address placed on the bus to determine whether its particular module is being addressed, and it resolves multiple requests for access to the same memory module on a predetermined priority basis.


The functional design of a crossbar switch connected to one memory module is shown in the figure. The circuit contains multiplexers that choose the data, address, and control lines from one CPU for communication with the memory module. Arbitration logic establishes priority levels to select one CPU when two or more CPUs attempt to access the same memory. The multiplexers are controlled by the binary code produced by a priority encoder within the arbitration logic.


A crossbar switch system permits simultaneous transfers from all memory modules because there is a separate path associated with each module. However, the hardware needed to implement the switch may become quite large and complex.

4. Multistage Switching Network:

The basic component of a multistage network is a two-input, two-output interchange switch. As shown in Figure below, the 2 x 2 switch has two inputs, labeled A and B, and two outputs, labeled 0 and 1.


There are control signals (not shown) associated with the switch that establish the interconnection between the input and output terminals. The switch has the capability of connecting input A to either of the outputs. Terminal B of the switch behaves in a similar fashion. The switch also has the capability to arbitrate between conflicting requests. If inputs A and B both request the same output terminal, only one of them will be connected; the other will be blocked. Using the 2 x 2 switch as a building block, it is possible to build a multistage network to control the communication between a number of sources and destinations.

To see how this is done, consider a binary tree shown in the Figure below.


The two processors P1 and P2 are connected through switches to eight memory modules marked in binary from 000 through 111. The path from a source to a destination is determined from the binary bits of the destination number.

The first bit of the destination number determines the switch output on the first level. The second bit specifies the output of the switch in the second level, and the third bit specifies the output of the switch in the third level.

For example, to connect P1 to memory 101, it is necessary to form a path from P1 to output 1 in the first-level switch, output 0 in the second-level switch, and output 1 in the third-level switch. Either P1 or P2 can be connected to any one of the eight memories; certain request patterns, however, cannot be satisfied simultaneously. For example, if P1 is connected to one of the destinations 000 through 011, P2 can be connected only to one of the destinations 100 through 111.
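The destination-bit routing rule can be sketched with a small helper (hypothetical, assuming 3-bit destinations as in the figure):

```python
# Multistage-network routing sketch: each bit of the destination
# address selects output 0 or 1 of the switch at that level.
def switch_outputs(destination):
    """destination: binary string, e.g. '101'.
    Returns the output taken at each successive switch level."""
    return [int(bit) for bit in destination]

# connecting P1 to memory 101 uses output 1, then 0, then 1
```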

Many different topologies have been proposed for multistage switching networks to control processor and memory communication in a tightly coupled multiprocessor system or to control the communication between the processing elements in a loosely coupled system. One such topology is the omega switching network shown in the Figure below.


In this configuration, there is exactly one path from each source to any particular destination. Some request patterns, however, cannot be connected simultaneously. For example, two sources cannot be connected simultaneously to destinations 000 and 001, since both of those destinations are reached through the same final-level switch.

A particular request is initiated in the switching network by the source, which sends a 3-bit pattern representing the destination number. As the binary pattern moves through the network, each level examines a different bit to determine the 2 x 2 switch setting. Level 1 inspects the most significant bit, level 2 inspects the middle bit, and level 3 inspects the least significant bit.

When the request arrives on either input of the 2 x 2 switch, it is routed to the upper output if the specified bit is 0 or to the lower output if the bit is 1. In a tightly coupled multiprocessor system, the source is a processor and the destination is a memory module. The first pass through the network sets up the path. Succeeding passes are used to transfer the address into memory and then transfer the data in either direction, depending on whether the request is a read or a write. In a loosely coupled multiprocessor system, both the source and destination are processing elements. After the path is established, the source processor transfers a message to the destination processor.

5. Hypercube Interconnection:

The hypercube or binary n-cube multiprocessor structure is a loosely coupled system composed of N = 2^n processors interconnected in an n-dimensional binary cube. Each processor forms a node of the cube. Although it is customary to refer to each node as having a processor, in effect it contains not only a CPU but also local memory and Input/Output interface.

Each processor has direct communication paths to n other neighbor processors. These paths correspond to the edges of the cube. There are 2^n distinct n-bit binary addresses that can be assigned to the processors. Each processor address differs from that of each of its n neighbors by exactly one-bit position.

The figure below shows the hypercube structure for n = 1, 2, and 3.


A one-cube structure has n = 1 and 2^1 = 2; it contains two processors interconnected by a single path. A two-cube structure has n = 2 and 2^2 = 4; it contains four nodes interconnected as a square. A three-cube structure has eight nodes interconnected as a cube. In general, an n-cube structure has 2^n nodes with a processor residing in each node. Each node is assigned a binary address in such a way that the addresses of two neighbors differ in exactly one bit position. For example, the three neighbors of the node with address 100 in a three-cube structure are 000, 110, and 101; each of these binary numbers differs from 100 in exactly one bit.

Routing messages through an n-cube structure may take from one to n links from a source node to a destination node. For example, in a three-cube structure, node 000 can communicate directly with node 001. It must cross at least two links to communicate with 011 (from 000 to 001 to 011, or from 000 to 010 to 011). It is necessary to go through at least three links to communicate from node 000 to node 111. A routing procedure can be developed by computing the exclusive-OR of the source node address with the destination node address. The resulting binary value has 1 bits in the positions where the two addresses differ, i.e. on the axes along which the message must travel.
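This exclusive-OR routing procedure can be sketched as follows (an illustrative implementation that corrects the differing bits from least to most significant; real routers may choose another order):

```python
# Hypercube routing sketch: XOR of source and destination marks the
# axes on which the addresses differ; crossing one such link per hop
# yields a shortest route whose length equals the number of 1 bits.
def hypercube_route(src, dst):
    path = [src]
    node = src
    diff = src ^ dst
    bit = 0
    while diff:
        if diff & 1:               # addresses differ on this axis
            node ^= (1 << bit)     # cross the corresponding link
            path.append(node)
        diff >>= 1
        bit += 1
    return path

# e.g. routing from node 000 to node 111 takes three links
```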

Cache Coherence:

In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of the memory hierarchy. For example, the cache and the main memory may have inconsistent copies of the same object.


As multiple processors operate in parallel and independently, multiple caches may possess different copies of the same memory block. This creates the cache coherence problem.

Cache Write Policies:

1. Write Back (WB):

Write operations are usually made only to the cache. The main memory is only updated when the corresponding cache line is flushed from the cache.

In the WB protocol, multiple copies of a cache block may exist if different processors have loaded (read) the block into their caches. In this approach, if some processor wants to change this block, it must first become the exclusive owner of the block.

Ownership is granted to this processor by the memory module that is the home location of the block, and all other copies, including the one in the memory module, are invalidated. The owner of the block may then change its contents. When another processor wishes to read this block, the data are sent to it by the current owner. The data are also sent to the home memory module, which reacquires ownership and updates the block to contain the latest value.

2. Write Through (WT):

All write operations are made to main memory as well as to the cache, ensuring that main memory is always valid. There are two fundamental implementations of the WT protocol:

1.   Write Through With Update Protocol:

When a processor writes a new value into its cache, the new value is also written into the memory module that holds the cache block being changed. Since copies of this block may exist in other caches, those copies must be updated to reflect the change caused by the write operation.

The other cache copies are updated by broadcasting the new data to all processor modules in the system. When a processor module receives the broadcast data, it updates the contents of the affected cache block if that block is present in its cache.

2.   Write Through With Invalidation Of Copies:

When a processor writes a new value into its cache, this value is written into memory, and all copies in other caches are invalidated. This is again done by broadcasting the invalidation request through the system; every cache that receives the request and contains the now-stale block invalidates that cache line.

From the above description it is clear that the write-back policy can result in inconsistency: if two caches contain the same line and the line is updated in one of them, the other cache unknowingly holds a stale value. Subsequent reads of that line produce stale results.

On closer inspection, even the write-through policy has consistency issues. Although memory is updated, inconsistency can still occur unless the other caches monitor the memory traffic or receive some direct notification of the update.
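The two write policies can be contrasted with a toy single-cache model (an illustrative sketch, not a full coherence protocol):

```python
# Write-policy sketch: write-through updates memory on every write,
# write-back defers the memory update until the line is flushed.
class Cache:
    def __init__(self, memory, write_back=False):
        self.memory = memory       # dict: address -> value
        self.write_back = write_back
        self.lines = {}            # cached lines: address -> value

    def write(self, addr, value):
        self.lines[addr] = value
        if not self.write_back:    # write-through: memory always valid
            self.memory[addr] = value

    def flush(self, addr):         # write-back: update memory on flush
        if self.write_back and addr in self.lines:
            self.memory[addr] = self.lines.pop(addr)
```

Under write-back, memory holds the old value until `flush`, which is exactly the window in which another cache could read stale data.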

The Solution of Cache Coherence Problem:

1. Software Level Solution - Compiler-Based Cache Coherence Mechanism:

In the software approach, we try to detect at compile time the code segments that might cause cache coherence issues and treat them separately. The potential cache coherence problem is thus transferred from run time to compile time, and the design complexity is transferred from hardware to software.

The downside of this approach is that compile-time software approaches generally make conservative decisions, which leads to inefficient cache utilization. Compiler-based cache coherence mechanisms perform an analysis of the code to determine which data items may become unsafe for caching, and they mark those items accordingly.

These items are then marked noncacheable, and the operating system or hardware does not cache them. The simplest approach is to prevent any shared data variable from being cached at all, but this is not optimal and is too conservative, because a shared data structure may be exclusively used during some periods and effectively read-only during others. It is only during the exclusively accessed periods that the cache coherence issue arises.

More efficient approaches analyze the code to determine safe periods for shared variables. Then the compiler inserts instructions into generated code to enforce cache coherence during the critical periods.

2. Hardware Level Solutions:

Hardware-level solutions are better than software-level solutions because they provide dynamic recognition at run time of potential inconsistency conditions. Because the problem is only dealt with when it actually arises, caches are used more effectively, leading to improved performance over a software approach. Hardware schemes can be divided into two categories:

1.   Directory Protocols:

In this approach, information about where copies of lines reside in each cache is collected and maintained. A typical implementation has a centralized controller that is part of the main memory controller, and a directory stored in main memory that contains global state information about the contents of the various local caches.

When an individual cache controller makes a request, the centralized controller checks and issues necessary commands for data transfer between memory and caches or between caches themselves. It is also responsible for keeping the state information up to date, therefore, every local action that can affect the global state of a line must be reported to the central controller. The controller maintains information about which processors have a copy of which lines. Before a processor can write to a local copy of a line, it must request exclusive access to the line from the controller.

Let’s see the flow when an individual cache controller tries to update its local copy of a cache line:

  1. The local cache controller requests exclusive access to the line from the centralized controller.
  2. Before granting exclusive access, the controller sends a message to all processors with a cached copy of this line, forcing each processor to invalidate its copy.
  3. The centralized controller receives an acknowledgment back from each processor.
  4. The controller grants exclusive access to the requesting processor.
  5. When another processor tries to read a line that is exclusively granted to a different processor, this results in a cache miss, and it sends a miss notification to the controller.
  6. Upon receiving the cache miss notification, the controller issues a command to the processor holding exclusive access to write the line back to main memory.

Drawbacks of Directory Schema:

  1. Central cache controller bottleneck.
  2. Communication overhead between various cache controllers and the central controller.

2.   Snoopy Protocols:

Snoopy protocols distribute the responsibility for maintaining cache coherence among all of the cache controllers in a multiprocessor system. A cache must recognize when a line that it holds is shared with other caches. When an update action is performed on a shared cache line, it must be announced to all other caches by a broadcast mechanism.

Each cache controller is able to “snoop” on the network to observe this broadcast notification and react accordingly. Snoopy protocols are ideally suited to a bus-based multiprocessor because the shared bus provides a simple means for broadcasting and snooping. There are two basic approaches to snoopy protocols:

1. Write-Invalidate Protocol:

There can be multiple readers but only one writer at a time. Initially, a line may be shared among several caches for reading purposes. When one of the caches wants to perform a write to the line, it first issues a notice that invalidates that line in the other caches, making the line exclusive to the writing cache. Once the line is exclusive, the owning processor can make local writes until some other processor requires the same line.
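The invalidation step can be sketched as follows (a minimal model with each cache as a plain dictionary; real protocols also track per-line states such as shared or exclusive):

```python
# Write-invalidate sketch: before writing to a shared line, the writer
# broadcasts an invalidation so every snooping cache drops its copy,
# making the line exclusive to the writing cache.
def write_invalidate(caches, writer, addr, value):
    """caches: list of dicts (addr -> value), one per processor."""
    for i, cache in enumerate(caches):
        if i != writer:
            cache.pop(addr, None)     # snooping caches invalidate
    caches[writer][addr] = value      # writer now owns the line
```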

2. Write-Update Protocol:

There can be multiple writers as well as multiple readers. When a processor wishes to update a shared line, the word to be updated is distributed to all others, and caches containing that line can update it.

Introduction to Vector Processing and Array Processors:

1. Vector Processing:

Vector processing performs arithmetic operations on large arrays of integers or floating-point numbers. It operates on all the elements of the array in parallel, provided each operation is independent of the others, and thus avoids the overhead of the loop control mechanism that occurs in general-purpose computers.

Computers with vector processing are able to handle large amounts of data efficiently, and they have applications in the following fields:

  1. Long Range Weather Forecasting
  2. Petroleum Exploration
  3. Seismic Data Analysis
  4. Medical Diagnosis
  5. Aerodynamics And Space Simulation
  6. Artificial Intelligence And the Expert System
  7. Mapping The Human Genome
  8. Image Processing

Vector Operation:

A vector ‘V’ of length ‘n’ is represented as a row vector by V = [V1, V2, V3, …, Vn]. The element Vi of vector ‘V’ is written as V(I), where the index ‘I’ refers to a memory address or register where the number is stored.

Let us consider a program in assembly-like pseudocode that adds two vectors A and B of length 100 and puts the result in vector C:

             Initialize I = 1
20        Read A (I)
Read B (I)
Store C (I) = A (I) + B (I)
Increment I = I + 1
If I <= 100 go to 20
Continue

A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in the program loop. It allows operations to be specified with a single vector instruction of the form: C (1:100) = A (1:100) + B (1:100)
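The contrast between the scalar loop and the single vector operation can be sketched in NumPy, whose elementwise arithmetic mirrors a vector instruction such as C(1:100) = A(1:100) + B(1:100). This is an analogy, not hardware vector processing: NumPy simply pushes the loop into optimized native code.

```python
# Scalar loop vs. vector-style operation on two length-100 vectors.
import numpy as np

A = np.arange(100, dtype=float)        # A(1:100)
B = np.arange(100, dtype=float) * 2    # B(1:100)

# Scalar style: explicit loop with index bookkeeping on every pass.
C_loop = np.empty(100)
for i in range(100):
    C_loop[i] = A[i] + B[i]

# Vector style: one expression, no loop-control overhead in user code,
# analogous to C(1:100) = A(1:100) + B(1:100).
C_vec = A + B

print(np.array_equal(C_loop, C_vec))   # True
```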


Vector Arithmetic Unit:

The vector arithmetic unit performs different arithmetic operations in parallel. It contains multiple functional units: some perform addition, others subtraction, and so on for the remaining operations.


To add two numbers, the control unit routes the values to an adder unit.

For the operations A ← B + C and D ← E − F, the control unit routes B and C to the adder and E and F to the subtractor. This allows the CPU to execute both instructions simultaneously.
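The routing of A ← B + C and D ← E − F to separate functional units can be sketched with a thread pool standing in for the parallel units. The `adder` and `subtractor` names are illustrative; real hardware dispatches both operations in the same cycle rather than via threads.

```python
# Sketch: two independent operations routed to separate "functional units",
# modeled with a thread pool. Unit names are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

def adder(x, y):        # adder unit
    return x + y

def subtractor(x, y):   # subtractor unit
    return x - y

B, C, E, F = 4, 3, 10, 6
with ThreadPoolExecutor(max_workers=2) as units:
    fut_add = units.submit(adder, B, C)        # route B and C to the adder
    fut_sub = units.submit(subtractor, E, F)   # route E and F to the subtractor
    A, D = fut_add.result(), fut_sub.result()

print(A, D)   # 7 4
```

Because the two operations share no operands, neither unit has to wait for the other, which is exactly the independence the vector arithmetic unit exploits.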

2. Array Processor:

An array processor is a processor that performs computations on large arrays of data. There are two different types of array processors:

1. Attached Array Processor:

It is designed as a peripheral for a conventional host computer. Its purpose is to enhance the performance of the computer by providing vector processing. It achieves high performance by means of parallel processing with multiple functional units.


2. SIMD Array Processor:

It is a processor which consists of multiple processing units operating in parallel. The processing units are synchronized to perform the same task under the control of a common control unit. Each processing element (PE) includes an ALU, a floating-point arithmetic unit, and working registers.
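The SIMD idea described above can be sketched in a few lines: one control unit broadcasts a single operation, and every PE applies it to its own local operands in lockstep. `broadcast` and the per-PE data layout are illustrative assumptions, not a real SIMD API.

```python
# Sketch of SIMD execution: one instruction, many processing elements.
# broadcast() and the pe_data layout are illustrative assumptions.

def broadcast(operation, pe_data):
    # The common control unit issues one instruction; each PE executes it
    # on the operands held in its own working registers.
    return [operation(a, b) for (a, b) in pe_data]

# Four PEs, each holding one pair of operands locally.
pe_data = [(1, 5), (2, 6), (3, 7), (4, 8)]
result = broadcast(lambda a, b: a + b, pe_data)
print(result)   # [6, 8, 10, 12]
```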


Why use the Array Processor?

  1. Array processors increase the overall instruction-processing speed.
  2. Most array processors operate asynchronously from the host CPU, which improves the overall capacity of the system.
  3. An array processor has its own local memory, providing extra memory for systems with limited memory.

Introduction to Multithreaded Architecture:

Multithreaded architectures now appear across the entire range of computing devices, from the highest-performing general-purpose devices to low-end embedded processors. Multithreading enables a processor core to utilize its computational resources more effectively, as a stall in one thread need not cause execution resources to be idle. This enables the computer architect to maximize performance within area, power, or energy constraints. However, the architectural options for the processor designer or architect looking to implement multithreading are quite extensive and varied, as evidenced not only by the research literature but also by the variety of commercial implementations.

This approach differs from multi-processing as threads have to share the resources of a single or multiple cores. Multi-threading aims to increase the utilization of a single core by using thread-level as well as instruction-level parallelism.
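The stall-hiding benefit can be sketched with Python threads, where a simulated long-latency wait in one thread overlaps with the waits of the others instead of serializing. The timings are illustrative assumptions, and OS threads are only a stand-in for hardware thread contexts.

```python
# Sketch: overlapping stalls with multiple threads. Timings are illustrative.
import threading
import time

def worker(results, idx):
    time.sleep(0.2)            # simulated long-latency stall (e.g. cache miss)
    results[idx] = idx * idx   # useful work once the "data" arrives

results = [None] * 4
start = time.perf_counter()
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(results)           # [0, 1, 4, 9]
print(elapsed < 0.8)     # True: four 0.2 s stalls overlap rather than add up
```

Run serially, the four stalls would cost about 0.8 s; overlapped, the total is close to a single stall, which is the utilization gain multithreading aims for.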


Features of Multithreaded Architecture:

  1. In its regular form, a multithreaded processor is made up of a number of thread processing elements connected to each other by a unidirectional ring, which helps the system perform better.
  2. Each thread processing element has its own private level-one instruction cache, but all of them share the level-one data cache and the level-two cache.
  3. Multithreaded systems also contain shared register files, used for maintaining global registers and a lock register.
  4. At run time, each thread processing element has its own program counter and instruction execution path, so the processor can fetch and execute instructions from multiple program locations simultaneously.
  5. Each thread processing element has a private memory buffer to cache speculative stores, which is also used to support run-time data-dependency checking.
