Optimal DMA scatter read strategy pass #
Looking to optimize FPGA DMA card reads by using scatter reads, etc. Working out a fairly optimal way to read target process memory via an FPGA DMA card. The scenario is typically small reads, maybe 100 or more at a time.
Optimal'ish read rules:
A) Faster to read in a batch using VMMDLL_MemReadScatter.
B) VMMDLL_MemReadScatter reads cannot bleed from one page to another, the minimum size is 8 bytes, and a single read can cover at most 4096 bytes (one page) at a time. Need to either break down and manage such reads as a series of smaller MEM_SCATTER entries and/or do an additional, separate series of VMMDLL_MemRead type reads.
C) Better to read in 4096 (page size) chunks vs a bunch of small reads.
D) Best to scatter read from addresses that are close to each other.
E) Can prefetch pages to read via a series of VMMDLL_MemPrefetchPages calls for performance.
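Rule B can be sketched roughly as below: a read that crosses a page boundary, or is larger than a page, gets cut into page-bounded pieces. This is only a minimal sketch. ReadChunk and SplitAtPageBoundaries are hypothetical names, not part of the VMM API, and the 8 byte minimum size from rule B is not handled here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kPageSize = 4096;

// Hypothetical stand-in for one scatter entry: an address and a byte count.
struct ReadChunk {
    std::uint64_t address;
    std::uint32_t size;
};

// Split [address, address + size) into pieces that never cross a page boundary.
std::vector<ReadChunk> SplitAtPageBoundaries(std::uint64_t address, std::uint64_t size) {
    std::vector<ReadChunk> chunks;
    while (size > 0) {
        const std::uint64_t offsetInPage = address & (kPageSize - 1);
        const std::uint64_t room = kPageSize - offsetInPage;   // bytes left in this page
        const std::uint64_t take = size < room ? size : room;
        // Invariant: first and last byte of the piece share the same page.
        assert((address & ~(kPageSize - 1)) == ((address + take - 1) & ~(kPageSize - 1)));
        chunks.push_back({address, static_cast<std::uint32_t>(take)});
        address += take;
        size -= take;
    }
    return chunks;
}
```

For example, a 16 byte read at 0x1FF8 would come back as two 8 byte pieces, one ending the first page and one starting the next.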
TODO: Is VMMDLL_MemReadScatter or the VMMDLL_Scatter_ExecuteRead setup faster to use?
TODO: More optimal on page (or some other higher) aligned addresses?
TODO: Scatter reads are limited to only 12 max reads per batch?
C/C++ pseudocode:
Required components:
- A std::list<MEM_SCATTER> to build up the scatter read list. A linked list makes it cheap to shuffle/sort entries around. Alternatively it could be built directly in a std::vector<MEM_SCATTER> so it can be fed into VMMDLL_MemReadScatter, vs committing the list to a vector at the end.
- Novel (maybe) peephole optimization: A "peephole"-like optimization that will coalesce groups of small reads into larger, up to page sized reads. Optimizes by turning a bunch of small reads into a single larger page block read. Uses a system of containers to encapsulate and manage the sub-MEM_SCATTER containers. At the top level will be a container that has, or subclasses, a single MEM_SCATTER. Then a subordinate internal array of MEM_SCATTER containers that describe the reads in the large block source container.
And even without the peephole optimization, this setup is required to handle reads that are larger than a page and reads that land on page boundaries. Let's call this system "MemPeephole" for short.
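A rough idea of what the MemPeephole containers and the coalescing pass might look like. This is a sketch under assumptions: SubRead, PageBlock and CoalesceByPage are hypothetical names, plain structs stand in for MEM_SCATTER, and every incoming read is assumed to be already cut so that it fits inside one page.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kPageSize = 4096;

// Hypothetical: one small read the caller asked for (already page-bounded).
struct SubRead {
    std::uint64_t address;  // target address
    std::uint32_t size;     // bytes wanted
    std::uint8_t* dest;     // caller buffer to fill later
};

// Hypothetical top-level container: one page sized block read plus the
// sub-reads it serves. A real version would hold or wrap a MEM_SCATTER.
struct PageBlock {
    std::uint64_t pageAddress;
    std::vector<std::uint8_t> data;   // page sized receive buffer
    std::vector<SubRead> subReads;
};

// Sort by ascending address, then group all sub-reads that share a page.
std::vector<PageBlock> CoalesceByPage(std::vector<SubRead> reads) {
    std::sort(reads.begin(), reads.end(),
              [](const SubRead& a, const SubRead& b) { return a.address < b.address; });
    std::vector<PageBlock> blocks;
    for (const SubRead& r : reads) {
        assert(r.size > 0);
        const std::uint64_t page = r.address & ~(kPageSize - 1);
        // Assumption: the read does not cross into the next page.
        assert(((r.address + r.size - 1) & ~(kPageSize - 1)) == page);
        if (blocks.empty() || blocks.back().pageAddress != page) {
            blocks.push_back(PageBlock{page, std::vector<std::uint8_t>(kPageSize), {}});
        }
        blocks.back().subReads.push_back(r);
    }
    return blocks;
}
```

With this layout, a hundred small reads that land on three pages would become three block reads. Whether always reading the whole page beats reading only the covered range is an open question here.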
Read poll steps:
- Organize the series of desired reads into a MEM_SCATTER (or a smaller local container) linked list (for fast sorting and shuffling around). Or put them directly into the final std::vector<MEM_SCATTER>, as that's what is needed to feed VMMDLL_MemReadScatter.
- Sort by ascending addresses (covers rule 'D' and sets up for the following rules).
- MemPeephole: Iterate the list, breaking down and coalescing small MEM_SCATTER entries into one large MEM_SCATTER block read. Also handles reads across page boundaries and reads larger than a page. Once this pass is done, typically the MEM_SCATTER list will be much smaller since so many sub-reads were merged. Covers rules 'B' and 'C' (still maintaining 'D').
- Build a temp list of pages based on the read list and feed the page addresses to VMMDLL_MemPrefetchPages for a potential boost; rule 'E'.
- Feed the list to VMMDLL_MemReadScatter and wait for all the reads to complete. TODO: The function is apparently asynchronous, so might need to add a WaitForScatterReads function to VMM, or maybe the VMMDLL_Scatter_ExecuteRead series are the ones to use.
- From the MemPeephole series of reads, copy the memory from the block reads into the series of small by-pointer ones.
For the code that needs the data, it could be maintained in these lists by pointer, avoiding unnecessary memory copies.
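The page list and copy-out steps might look something like the following. Again only a sketch: SubRead, PageBlock, CollectPageAddresses and ScatterToSubReads are hypothetical names, the blocks are assumed to be sorted by address already, and the actual VMM read that fills each block's data is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::uint64_t kPageSize = 4096;

// Hypothetical containers, same shape as described for MemPeephole.
struct SubRead {
    std::uint64_t address;
    std::uint32_t size;
    std::uint8_t* dest;
};
struct PageBlock {
    std::uint64_t pageAddress;
    std::vector<std::uint8_t> data;   // filled by the block read
    std::vector<SubRead> subReads;
};

// Build the temp list of page addresses (candidate input for a prefetch call).
// Assumes blocks are sorted, so duplicates would be adjacent.
std::vector<std::uint64_t> CollectPageAddresses(const std::vector<PageBlock>& blocks) {
    std::vector<std::uint64_t> pages;
    for (const PageBlock& b : blocks) {
        if (pages.empty() || pages.back() != b.pageAddress) {
            pages.push_back(b.pageAddress);
        }
    }
    return pages;
}

// After the block reads complete, copy each slice out to its small read.
void ScatterToSubReads(const std::vector<PageBlock>& blocks) {
    for (const PageBlock& b : blocks) {
        assert(b.data.size() == kPageSize);
        for (const SubRead& r : b.subReads) {
            assert(r.address >= b.pageAddress);
            const std::uint64_t offset = r.address - b.pageAddress;
            assert(offset + r.size <= kPageSize);
            std::memcpy(r.dest, b.data.data() + offset, r.size);
        }
    }
}
```

To skip the memcpy, a consumer could instead be handed a pointer to data.data() + offset, as long as the PageBlock outlives that use.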