Optimal DMA scatter read strategy pass #

Looking to optimize FPGA DMA card reads by using scatter reads, etc.

Working out a fairly optimal way to read target process memory via an FPGA DMA card. The typical scenario is many small reads, perhaps 100 or more at a time.

Optimal'ish read rules:
A) It is faster to read in a batch using VMMDLL_MemReadScatter.
B) VMMDLL_MemReadScatter reads cannot bleed from one page into another, the minimum size is 8 bytes, and a single read can cover at most one page (4096 bytes) at a time. Anything that breaks these limits needs to be either broken down and managed as a series of smaller MEM_SCATTER entries and/or done as an additional separate series of VMMDLL_MemRead type reads.
C) Better to read in 4096-byte (page size) chunks than as a bunch of small reads.
D) Best to scatter read from addresses that are close to each other.
E) Pages to be read can be prefetched via a series of VMMDLL_MemPrefetchPages calls for performance.
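Rule B is the one that forces the extra bookkeeping, so here is a minimal sketch of just the split, assuming 4096-byte pages. ReadChunk and SplitAtPageBoundaries are made-up names for illustration and are not part of the VMM API. The 8-byte minimum size is not handled here.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

constexpr uint64_t kPageSize = 0x1000;  // assumption: 4096-byte pages

// Hypothetical request piece, NOT the library's MEM_SCATTER.
struct ReadChunk {
    uint64_t address;  // start address of this piece
    uint32_t size;     // bytes in this piece, never crosses a page
};

// Split one read into pieces that each stay inside a single page.
std::vector<ReadChunk> SplitAtPageBoundaries(uint64_t address, uint32_t size) {
    std::vector<ReadChunk> chunks;
    while (size > 0) {
        // First address of the next page after 'address'.
        const uint64_t pageEnd = (address & ~(kPageSize - 1)) + kPageSize;
        const uint32_t n =
            static_cast<uint32_t>(std::min<uint64_t>(size, pageEnd - address));
        chunks.push_back({address, n});
        address += n;
        size -= n;
    }
    return chunks;
}
```

For example, a 16-byte read at 0x1FF8 would come back as two 8-byte pieces, one ending the 0x1000 page and one starting the 0x2000 page.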

TODO: Which is faster, VMMDLL_MemReadScatter or the VMMDLL_Scatter_ExecuteRead setup?
TODO: Is it more optimal to read on page-aligned (or some other higher alignment) addresses?
TODO: Are scatter reads limited to a maximum of only 12 reads per batch?

C/C++ pseudocode:

Required components:

A std::list<MEM_SCATTER> to build up the scatter read list. A linked list makes it cheap to shuffle and sort entries around. Alternatively it could be a std::vector<MEM_SCATTER> from the start so it can be fed into VMMDLL_MemReadScatter, versus committing the list to a vector at the end.
Novel (maybe) peephole optimization: A "peephole"-like optimization that coalesces groups of small reads into larger reads of up to one page. It optimizes by turning a bunch of small reads into a single larger page block read. It uses a system of containers to encapsulate and manage the sub-MEM_SCATTER containers. At the top level is a container that has (or subclasses) a single MEM_SCATTER. Below it is a subordinate internal array of MEM_SCATTER containers that describe the individual reads inside the large source block.

Even without the peephole optimization, this setup is required to handle reads that are larger than a page and reads that land on page boundaries. Let's call this system "MemPeephole" for short.
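A rough sketch of what the MemPeephole container could look like, assuming 4096-byte pages. PageBlock and SubRead are made-up names. A plain page address plus byte buffer stands in here for the top-level MEM_SCATTER, so nothing below depends on the real library structs.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

constexpr uint64_t kPageSize = 0x1000;  // assumption: 4096-byte pages

// One small read that lives inside the parent page block (hypothetical type).
struct SubRead {
    uint32_t offset;  // byte offset inside the page
    uint32_t size;    // bytes wanted
    void*    dest;    // caller's buffer that receives the bytes
};

// Top-level container: one page-sized block read plus its sub-reads.
struct PageBlock {
    uint64_t             pageAddress;  // page-aligned target address
    std::vector<uint8_t> data;         // filled by the single page read
    std::vector<SubRead> subReads;     // small reads served from 'data'

    explicit PageBlock(uint64_t pageAddr)
        : pageAddress(pageAddr), data(kPageSize) {}

    // Accept a small read only if it fits entirely inside this page.
    bool TryAdd(uint64_t address, uint32_t size, void* dest) {
        if (size == 0 || address < pageAddress) return false;
        const uint64_t offset = address - pageAddress;
        if (offset >= kPageSize || size > kPageSize - offset) return false;
        subReads.push_back({static_cast<uint32_t>(offset), size, dest});
        return true;
    }

    // After the page read completes, copy each piece out to its owner.
    void Distribute() const {
        for (const SubRead& s : subReads)
            std::memcpy(s.dest, data.data() + s.offset, s.size);
    }
};
```

TryAdd refusing a read that spills past the page is deliberate in this sketch: such a read would have to be split first, with each piece handed to the block for its own page.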

Read poll steps:

Organize the series of desired reads into a MEM_SCATTER (or a smaller local container) linked list (for fast sorting and shuffling around), or directly into the final std::vector<MEM_SCATTER>, as that's what is needed to feed VMMDLL_MemReadScatter.
Sort by ascending addresses (covers rule 'D' and sets up for the following rules).
MemPeephole: Iterate the list, coalescing small MEM_SCATTER reads into one large MEM_SCATTER block read. This pass also handles reads across page boundaries and reads larger than a page. Once this pass is done, the MEM_SCATTER list will typically be much smaller since so many sub-reads were folded away. Covers rules 'B' and 'C' (while still maintaining 'D').
Build a temp list of pages based on the read list and feed the page addresses to VMMDLL_MemPrefetchPages for a potential boost; rule 'E'.
Feed the list to VMMDLL_MemReadScatter and wait for all the reads to complete. TODO: The function is apparently asynchronous, so might need to add a WaitForScatterReads function to VMM, or maybe the VMMDLL_Scatter_ExecuteRead series are the ones to use.
From the MemPeephole block reads, copy the memory out of the block reads into the series of small by-pointer reads.
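The planning part of the steps above (organize, sort, coalesce, build the page list) could be sketched roughly like this, assuming 4096-byte pages. Request, Piece, PagePlan, BuildPagePlan and PageAddresses are all made-up names. A std::map keyed by page address is swapped in for the explicit list sort, since it keeps the pages in ascending order on its own. The actual VMM calls are left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

constexpr uint64_t kPageSize = 0x1000;  // assumption: 4096-byte pages

// A desired read (hypothetical type, not MEM_SCATTER).
struct Request {
    uint64_t address;
    uint32_t size;
};

// The part of one request that is served by one page.
struct Piece {
    std::size_t requestIndex;   // index into the request vector
    uint32_t    requestOffset;  // where this piece starts in the request
    uint32_t    pageOffset;     // where this piece starts in the page
    uint32_t    size;           // bytes in this piece
};

// Page address -> pieces served by that page, in ascending page order.
using PagePlan = std::map<uint64_t, std::vector<Piece>>;

// Split every request at page boundaries and group the pieces by page.
PagePlan BuildPagePlan(const std::vector<Request>& requests) {
    PagePlan plan;
    for (std::size_t i = 0; i < requests.size(); ++i) {
        uint64_t address   = requests[i].address;
        uint32_t remaining = requests[i].size;
        uint32_t done      = 0;
        while (remaining > 0) {
            const uint64_t page       = address & ~(kPageSize - 1);
            const uint32_t pageOffset = static_cast<uint32_t>(address - page);
            const uint32_t n = std::min<uint32_t>(
                remaining, static_cast<uint32_t>(kPageSize) - pageOffset);
            plan[page].push_back({i, done, pageOffset, n});
            address   += n;
            done      += n;
            remaining -= n;
        }
    }
    return plan;
}

// Sorted, de-duplicated page list, e.g. as input for a prefetch call.
std::vector<uint64_t> PageAddresses(const PagePlan& plan) {
    std::vector<uint64_t> pages;
    pages.reserve(plan.size());
    for (const auto& entry : plan) pages.push_back(entry.first);
    return pages;
}
```

With this shape, each map entry becomes one page-sized block read, and the copy-out step walks the pieces of each page, copying `size` bytes from `pageOffset` in the page buffer to `requestOffset` in the owning request's buffer.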

The code that needs the data could access it by pointer straight from these lists, avoiding unnecessary memory copies.
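One way the by-pointer access might look, as a sketch only. PageBuffer and ViewInPage are made-up names. It only works for reads that sit entirely inside one page buffer, and the returned pointer is only valid while that buffer is alive and not resized.

```cpp
#include <cstdint>
#include <vector>

// One completed page read (hypothetical type).
struct PageBuffer {
    uint64_t             pageAddress;  // page-aligned target address
    std::vector<uint8_t> data;         // bytes read for this page
};

// Return a pointer straight into the page buffer for [address, address+size),
// or nullptr if that range is not fully covered by this buffer.
const uint8_t* ViewInPage(const PageBuffer& page, uint64_t address,
                          uint32_t size) {
    if (address < page.pageAddress) return nullptr;
    const uint64_t offset = address - page.pageAddress;
    if (offset > page.data.size() || size > page.data.size() - offset)
        return nullptr;
    return page.data.data() + offset;
}
```

Reads that straddle two pages would still need a copy into a contiguous buffer, so this only saves the copy for the common single-page case.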