## Multi-Adapter: Integrated + Discrete GPUs

Allen Hux, Intel

Intel GameDev **BOOST** 



## Legal Notices and Disclaimers

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at [intel.com].

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Intel, Core and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

\*Other names and brands may be claimed as the property of others

© Intel Corporation.







### Agenda

- Opportunity: Integrated + Discrete
- D3D12 Multi-Adapter Background
- Practical Asymmetrical Multi-GPU
- Results
- Conclusion & Call to Action
- References



# Integrated Graphics Opportunity

- Many gaming PCs have both integrated and discrete GPUs
- Usually the integrated is idle
- Integrated graphics is a lot of compute!
  - ... and, we have an extension that can help extract more performance
- However, D3D12 multi-adapter has many pitfalls

For a class of algorithms, there is a recipe for tapping the integrated GPU for more performance with modest engineering effort.





## D3D12 Multi-Adapter Support

D3D12 supports multiple GPUs in two ways:

#### 1. LDA: Linked Display Adapter

- Appears as one adapter (D3D Device) with multiple nodes
- Transparently copy or use\* resources across/between nodes
- Typically "symmetrical" i.e. identical GPUs

#### 2. Explicit Multi-Adapter

- Cross-adapter shared resources, with many restrictions
- May be "asymmetrical": this is what we're doing



## Multi-Adapter Approaches

Shared rendering: split frame, alternate frame, checkerboard

- Low ROI for asymmetric GPUs

Post-processing: CMAA, SSAO, camera effects...

- Requires crossing the PCIe bus twice

Occlusion culling, physics, AI

- Producer-consumer
- Even better when running async from rendering



## **Platform Overview**



- Integrated graphics memory *is* system memory
- The iGPU can use cross-adapter shared resources with little or no penalty



## Driving Workload: Microsoft D3D12 n-body particle sim

#### Uses Async Compute

- Separate render and compute queues

#### Modifications for this talk:

- Multi-adapter
- Only 1 gravity source
  - O(n) instead of O(n^2)

#### Caveat: atypical graphics workload

- All alpha + geometry shader, no depth








## D3D12 Cross-Adapter Resources

- Resources are allocated by D3D Device -> bound to adapter
- How to move data between adapters?
- Shared, Cross-Adapter Resources
- Must be *Placed* in a Cross Adapter Shared Heap

# Heap Creation 1, 2, 3

1. Get aligned data size

2. Create shared heap on any device

3. Create handle

```
const UINT dataSize = m_numParticles * sizeof(Render::Particle);

D3D12_RESOURCE_DESC crossAdapterDesc =
    CD3DX12_RESOURCE_DESC::Buffer(dataSize,
        D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS |
        D3D12_RESOURCE_FLAG_ALLOW_CROSS_ADAPTER);

D3D12_RESOURCE_ALLOCATION_INFO textureInfo =
    m_device->GetResourceAllocationInfo(0, 1, &crossAdapterDesc);

UINT64 alignedDataSize = textureInfo.SizeInBytes;

CD3DX12_HEAP_DESC heapDesc(
    m_NUM_BUFFERS * alignedDataSize,
    D3D12_HEAP_TYPE_DEFAULT,
    0, // An alias for 64KB. See documentation for D3D12_HEAP_DESC
    D3D12_HEAP_FLAG_SHARED | D3D12_HEAP_FLAG_SHARED_CROSS_ADAPTER);

ThrowIfFailed(m_device->CreateHeap(&heapDesc, IID_PPV_ARGS(&m_sharedHeap)));

m_sharedHandles.m_alignedDataSize = alignedDataSize;
```
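The aligned size matters because each placed buffer must start at an aligned offset inside the shared heap. The sketch below illustrates the padding and the resulting offsets, assuming the 64KB default placement alignment that D3D12 uses for buffers. The helper names `AlignUp` and `PlacedOffset` and the example numbers are hypothetical; in real code, take the size from `GetResourceAllocationInfo` as shown above rather than computing it by hand.

```cpp
#include <cstdint>

// 64KB: the default placement alignment for D3D12 buffers
// (D3D12_DEFAULT_RESOURCE_PLACEMENT_ALIGNMENT).
constexpr std::uint64_t kPlacementAlignment = 64 * 1024;

// Round size up to the next multiple of alignment.
// The alignment must be a power of two.
constexpr std::uint64_t AlignUp(std::uint64_t size, std::uint64_t alignment)
{
    return (size + alignment - 1) & ~(alignment - 1);
}

// Byte offset of buffer `index` in a heap that packs
// equally sized, aligned buffers back to back.
constexpr std::uint64_t PlacedOffset(std::uint32_t index, std::uint64_t alignedSize)
{
    return index * alignedSize;
}
```

For example, 1000 particles of a hypothetical 32 bytes each need 32000 bytes, which pads to 65536. The third buffer then starts at offset 131072, and a heap for three buffers needs 196608 bytes.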



## **Cross-Adapter Resource Creation**

- Open handle on 2nd device
- Both adapters: create placed resources within the cross-adapter heap
- Use same alignment and size

```
ID3D12Heap* pSharedHeap = 0;
m_device->OpenSharedHandle(in_sharedHandles.m_heap,
    IID_PPV_ARGS(&pSharedHeap));

D3D12_RESOURCE_DESC crossAdapterDesc =
    CD3DX12_RESOURCE_DESC::Buffer(in_sharedHandles.m_alignedDataSize,
        D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS |
        D3D12_RESOURCE_FLAG_ALLOW_CROSS_ADAPTER);

for (UINT i = 0; i < m_NUM_BUFFERS; i++)
{
    ThrowIfFailed(m_device->CreatePlacedResource(
        pSharedHeap,
        i * in_sharedHandles.m_alignedDataSize,
        &crossAdapterDesc,
        D3D12_RESOURCE_STATE_COPY_SOURCE,
        nullptr,
        IID_PPV_ARGS(&m_sharedBuffers[i])));
}
pSharedHeap->Release();
```



## **Cross-Adapter Resource Restrictions**

- Textures have many restrictions
  - Row-major alignment, displayable with format limitations
  - Possible, and maybe someday efficient
- Focus on buffers: good for async compute scenarios



## Async compute in particle sample

Ping-pong buffers hold source state and destination state:

- Read initial state
- Compute next state
- Render results

Parallelized by rendering the prior state while computing the next state (async compute).

Buffer count matches swap chain length.

## Multi-Adapter: Add copy stage

Idea: each adapter ping-pongs

Forms a parallel pipeline:

- 2 readers of state n
- Compute state n+1
- Render state n-1





## Multi-Adapter: Pong

#### Swap buffers each frame
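One way to picture the swap is as a rotation of buffer indices. The sketch below is an assumed bookkeeping scheme, not the sample's actual code: the struct `FrameBuffers` and the function `BuffersForFrame` are hypothetical names. It shows compute and copy both reading state n while compute writes state n+1 and render draws state n-1.

```cpp
#include <cstdint>

// Which buffer each stage touches during one frame (hypothetical layout).
struct FrameBuffers
{
    std::uint32_t computeSrc; // shared buffer holding state n (read by compute and copy)
    std::uint32_t computeDst; // shared buffer receiving state n+1
    std::uint32_t copyDst;    // render-local buffer receiving state n
    std::uint32_t renderSrc;  // render-local buffer holding state n-1
};

// Rotate indices each frame so no stage reads a buffer another stage is writing.
// Requires numBuffers >= 2.
constexpr FrameBuffers BuffersForFrame(std::uint64_t frame, std::uint32_t numBuffers)
{
    return FrameBuffers{
        static_cast<std::uint32_t>(frame % numBuffers),
        static_cast<std::uint32_t>((frame + 1) % numBuffers),
        static_cast<std::uint32_t>(frame % numBuffers),
        static_cast<std::uint32_t>((frame + numBuffers - 1) % numBuffers)};
}
```

With two buffers, frame 0 computes from buffer 0 into buffer 1, and frame 1 does the reverse. The buffer that the copy stage fills in one frame is the buffer that render reads in the next.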







## **Resource Allocations**

Render buffers:

- Local adapter heap
- Default, committed
- Adapter discrete memory
- Adapter-preferred layout

Compute buffers:

- Cross-adapter heap
- Cross-adapter, placed
- CPU memory
- Linear layout





# Copy Queue

**Key Insight:** 

Copy Queue on Discrete Adapter

Copy from system memory to discrete

Integrated memory *is* system mem, so this is a logical arrangement.

Explicit copy stage relaxes timing



Note: the sample actually uses CopyBufferRegion, because the aligned data size may be padded.

## Render Time: Max(compute, render, copy)

Frame time is determined by the long pole of three parallel stages

Example shown: 16ms
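As a rough model of the "long pole" rule, assuming the three stages overlap perfectly and ignoring synchronization overhead, frame time is the maximum of the stage times rather than their sum. The function names and the stage timings in the note below are hypothetical, chosen only to match the 16ms example.

```cpp
#include <algorithm>

// Frame time when compute, render, and copy fully overlap:
// the slowest stage sets the pace.
constexpr double PipelinedFrameMs(double computeMs, double renderMs, double copyMs)
{
    return std::max({computeMs, renderMs, copyMs});
}

// For comparison: the same work run back to back on one timeline.
constexpr double SerialFrameMs(double computeMs, double renderMs, double copyMs)
{
    return computeMs + renderMs + copyMs;
}
```

With 10ms of compute, 16ms of render, and 4ms of copy, the pipelined frame takes 16ms, where running the stages serially would take 30ms.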





## **Resource Creation**

#### **Compute Adapter**

```
for (UINT i = 0; i < m_NUM_BUFFERS; i++)
{
    ThrowIfFailed(m_device->CreatePlacedResource(
        m_sharedHeap.Get(),
        i * alignedDataSize,
        &crossAdapterDesc,
        D3D12_RESOURCE_STATE_UNORDERED_ACCESS,
        nullptr,
        IID_PPV_ARGS(&m_positionBuffers[i])));
}
```

#### **Render Adapter**

```
for (UINT i = 0; i < m_NUM_BUFFERS; i++)
{
    ThrowIfFailed(m_device->CreatePlacedResource(
        m_sharedHeap.Get(),
        i * alignedDataSize,
        &crossAdapterDesc,
        D3D12_RESOURCE_STATE_COPY_SOURCE,
        nullptr,
        IID_PPV_ARGS(&m_positionBuffers[i])));
}
```



## **Cross-Adapter Synchronization**

- Share fence handles once
- Pass event values per-frame

```
ThrowIfFailed(m_device->CreateFence(
    m_fenceValue,
    //D3D12_FENCE_FLAG_NONE,
    D3D12_FENCE_FLAG_SHARED | D3D12_FENCE_FLAG_SHARED_CROSS_ADAPTER,
    IID_PPV_ARGS(&m_fence)));

ThrowIfFailed(m_device->CreateSharedHandle(
    m_fence.Get(), nullptr, GENERIC_ALL,
    L"RenderSharedFence", &m_sharedFenceHandle));
```

```
// copy simulation results
void CopySimulationResults(UINT64 in_fenceValue, int in_numActiveParticles)
{
    // cross-adapter sync:
    // copy waits for previous compute to complete
    ThrowIfFailed(m_copyQueue->Wait(m_sharedComputeFence.Get(), in_fenceValue));
    // ...
}
```



# Synchronization Cycle of Life

- Compute waits on copy fence (cross-adapter)
- Copy waits on compute fence (cross-adapter) AND render fence (cross-engine)
- Render waits on copy fence (cross-engine)
- CPU waits on render fence (which covers all fences)





## Intel Extension: Command Queue Throttle

Maintains performance even when load is inconsistent, e.g. when the integrated GPU is not 100% active because it is waiting on the discrete GPU.

All public Intel drivers support the extension:

D3D12\_COMMAND\_QUEUE\_THROTTLE\_MAX\_PERFORMANCE

Header file from your friendly Intel contact



# Intel Throttle Extension

below: toggling extension on/off



- When integrated idles between commands, clock rate drops
- With extension enabled, commands are executed full-speed
- Your mileage may vary, but this is another tool you can try








## **GpuView Shows Stages Run in Parallel**

[Figure: two GPUView captures (10ms scale) showing hardware queue activity. First: Intel(R) UHD Graphics 630 (3D queue) with NVIDIA GeForce RTX 2080 Ti (copy and graphics queues). Second: Intel(R) UHD Graphics 630 (3D queue, flip queue) with Radeon RX 550 (3D queue).]

Two examples, full-screen application

When 3D is dominant, there is time for compute and copy

In some scenarios, Copy can dominate (frame time > render + compute)



[Figure: sample application UI. Adapters listed: Radeon (TM) RX 480 Graphics, NVIDIA GeForce RTX 2080, Intel(R) HD Graphics 530. 4194304 particles rendered, copied, and simulated. render ms: 12.949157, simulate ms: 9.546483, frameTime: 13.329394]

4M particles, Intel HD 530 + AMD RX 480

#### Queues and Adapters run in-sync

[Figure: GPUView capture (Merged.etl) showing the Intel(R) HD Graphics 530 3D queue alongside the Radeon (TM) RX 480 Graphics 3D and copy queues.]








## Observations

This technique is best when:

- Render GPU is saturated
- Pure producer-consumer (data crosses bus only once)
- Task can be completely offloaded (no collaboration)
- Render not waiting (pipeline has room to breathe)
- Best: compute allowed to take > 1 frame

Many async compute tasks fit this pattern



## Be aware of PCIe bandwidth

- Gen3 x16: 16GB/s
- 4M particles, one float4 each: 64MB
- 16GB / 64MB = 256Hz maximum frame rate
  - Some GPUs/configs are x8: half bandwidth

Keep data transfer size as low as possible!

Splitting data buffers by usage has perf benefit
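The bandwidth arithmetic above can be written as a small helper to try other particle counts or link widths. This is a sketch under the slide's own assumptions: a nominal 16GB/s link, binary units, and a link that carries nothing but particle data. The function names are hypothetical.

```cpp
#include <cstdint>

// Bytes moved across the bus per frame.
constexpr std::uint64_t BytesPerFrame(std::uint64_t numParticles,
                                      std::uint64_t bytesPerParticle)
{
    return numParticles * bytesPerParticle;
}

// Upper bound on frame rate if the link carried only this data.
constexpr double MaxFramesPerSecond(double linkBytesPerSecond,
                                    std::uint64_t bytesPerFrame)
{
    return linkBytesPerSecond / static_cast<double>(bytesPerFrame);
}

// Nominal link rates used on the slide (binary gigabytes per second).
constexpr double kGen3x16 = 16.0 * 1024.0 * 1024.0 * 1024.0;
constexpr double kGen3x8  = kGen3x16 / 2.0;
```

For 4194304 particles at 16 bytes each (one float4), `BytesPerFrame` gives 67108864 bytes (64MB). That caps the frame rate at 256Hz on x16 and 128Hz on x8, before any other traffic shares the link.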



# Low Code Complexity

Essentially an enhancement of async compute

#### Simplifies transition barriers (vs. single adapter)

- Copy Queue benefits from Implicit State Transitions
  - No transitions to/from COPY\_DEST or COPY\_SOURCE
- Each Adapter/Queue views resource exactly as it needs it
  - No transitions between UAV or SRV

#### Little specific cross-adapter code

- Shared Resources from only one adapter
- Share fence(s)



## Call to Action

This recipe works for more than particles

Could be physics, mesh deformation, AI, shadows

Many async compute tasks fit this pattern

Check for Intel integrated graphics!








## References

- Intel® DevMesh
- Multi-Adapter-Particles sample code on GitHub
- Microsoft® n-Body Gravity Sample
- GPUOpen nBody Async Sample



