Optimal DMA scatter read strategy pass #
Looking to optimize FPGA DMA card reads by using scatter reads, etc. Working out a fairly optimal way to read target process memory via an FPGA DMA card. The scenario is typically small reads, maybe 100 or more at a time.
Optimal'ish read rules:
A) Faster to read in a batch using VMMDLL_MemReadScatter
B) VMMDLL_MemReadScatter reads cannot bleed from one page into another, the minimum size is 8 bytes, and each read can cover at most 4096 bytes (one page) at a time.
Need to either break down and manage larger reads as a series of smaller MEM_SCATTER entries and/or do an additional separate series of VMMDLL_MemRead type reads.
C) Better to read in 4096 (page size) chunks vs a bunch of small reads.
D) Best to scatter read from addresses that are close to each other.
E) Can prefetch pages to read via a series of VMMDLL_MemPrefetchPages for performance.
TODO: Which is faster, VMMDLL_MemReadScatter or the VMMDLL_Scatter_ExecuteRead style setup?
TODO: More optimal on page (or some other higher) aligned addresses?
TODO: Scatter reads are limited to only 12 max reads per batch?
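Rule B can be sketched in isolation as a helper that cuts one requested read into pieces that each stay inside a single page. This is only an illustration under the rules listed above: `ReadChunk`, `SplitAtPageBoundaries` and `kPageSize` are hypothetical names, nothing here calls into VMM, and the 8 byte minimum is not handled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t kPageSize = 0x1000;  // 4096-byte pages, per rule B

// One piece of a requested read that lies entirely inside a single page.
struct ReadChunk {
    uint64_t address;  // virtual address in the target process
    uint32_t size;     // bytes to read, never crossing a page boundary
};

// Split [address, address + size) into chunks that each stay within one page.
std::vector<ReadChunk> SplitAtPageBoundaries(uint64_t address, uint32_t size) {
    assert(size > 0 && "zero-length reads are not meaningful here");
    std::vector<ReadChunk> chunks;
    while (size > 0) {
        // Bytes left between 'address' and the end of its page.
        const uint64_t roomInPage = kPageSize - (address & (kPageSize - 1));
        const uint32_t take =
            size < roomInPage ? size : static_cast<uint32_t>(roomInPage);
        chunks.push_back({address, take});
        address += take;
        size -= take;
    }
    return chunks;
}
```

For example, a 16 byte read at 0x1FF8 comes back as two 8 byte chunks, one ending page 0x1000 and one starting page 0x2000. Each chunk could then become its own MEM_SCATTER entry.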
C/C++ pseudocode:
Required components:
- A std::list<MEM_SCATTER> to build up the scatter read list. A linked list makes shuffling/sorting cheap. Alternatively it could be a std::vector<MEM_SCATTER> from the start so it can be fed into VMMDLL_MemReadScatter, vs. committing the list to a vector at the end.
- Novel (maybe) peephole optimization: A "peephole"-like optimization that coalesces groups of small reads into larger reads of up to one page in size. It optimizes by turning a bunch of small reads into a single larger page block read. It uses a system of containers to encapsulate and manage the sub-MEM_SCATTER containers. At the top level is a container that has (or subclasses) a single MEM_SCATTER. Under it is a subordinate internal array of MEM_SCATTER containers that describe the individual reads inside the large source block.
Even without the peephole optimization, this setup is required to handle reads that are larger than a page and reads that land on page boundaries. Let's call this system "MemPeephole" for short.
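A rough sketch of what the MemPeephole containers might look like, to make the two-level layout concrete. All names (`SubRead`, `PageBlock`, `MemPeephole`, `Distribute`) are hypothetical, and a plain byte buffer stands in for the real MEM_SCATTER entry so the sketch compiles without the VMM headers. It assumes each sub-read has already been cut so it does not cross a page boundary.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

constexpr uint64_t kPageSize = 0x1000;

// A small read the caller actually wants; must not cross a page boundary.
struct SubRead {
    uint64_t address;  // target virtual address
    uint32_t size;     // byte count
    void*    dest;     // caller buffer that receives the bytes
};

// One page-sized block read plus the small reads it satisfies.
// In real code this would have (or subclass) a single MEM_SCATTER.
struct PageBlock {
    uint64_t pageBase = 0;
    std::vector<uint8_t> data = std::vector<uint8_t>(kPageSize);
    std::vector<SubRead> subs;

    // Copy each sub-read out of the page buffer once the block read is done.
    void Distribute() const {
        for (const SubRead& s : subs)
            std::memcpy(s.dest, data.data() + (s.address - pageBase), s.size);
    }
};

// Top-level container: page base address -> block. std::map keeps the
// blocks in ascending address order, which also lines up with rule D.
class MemPeephole {
public:
    void Add(const SubRead& r) {
        const uint64_t base = r.address & ~(kPageSize - 1);
        assert(r.size > 0 && r.address + r.size <= base + kPageSize);
        PageBlock& block = blocks_[base];  // created on first use
        block.pageBase = base;
        block.subs.push_back(r);
    }
    std::map<uint64_t, PageBlock>& Blocks() { return blocks_; }

private:
    std::map<uint64_t, PageBlock> blocks_;
};
```

Adding three small reads where two share a page yields two blocks, so only two page reads would need to go out. A std::map is used here instead of the std::list mentioned above purely to keep the sketch short.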
Read poll steps:
- Organize the series of desired reads into a MEM_SCATTER (or a smaller local container) linked list (for fast sorting and shuffling around), or directly into the final std::vector<MEM_SCATTER>, as that's what is needed to feed VMMDLL_MemReadScatter.
- Sort by ascending addresses (covers rule 'D' and sets up for the following rules).
- MemPeephole: Iterate the list, breaking down oversized reads and coalescing small MEM_SCATTER entries into one large MEM_SCATTER block read. This also handles reads across page boundaries and reads larger than a page. Once this pass is done, the MEM_SCATTER list will typically be much smaller since so many sub-reads have been merged. Covers rules 'B' and 'C' (still maintaining 'D').
- Build a temp list of pages based on the read list and feed the page addresses to VMMDLL_MemPrefetchPages for a potential boost; rule 'E'.
- Feed the list to VMMDLL_MemReadScatter and wait for all the reads to complete. TODO: The function is apparently asynchronous, so might need to add a WaitForScatterReads function to VMM, or maybe the VMMDLL_Scatter_ExecuteRead series are the ones to use.
- From the MemPeephole series reads, copy the memory from the block reads into the series of small by-pointer ones.
For the code that needs the data, the results could instead be kept in these lists and accessed by pointer, avoiding unnecessary memory copies.
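The whole pass can be sketched end to end roughly as follows. This is a minimal sketch, not a tested implementation against a real card: `Request`, `Block`, `BatchReader`, `BuildBlocks`, `PageList` and `ExecuteReads` are hypothetical names, and the `BatchReader` callback is a stand-in for whatever ends up doing the actual batch read (a wrapper around VMMDLL_MemReadScatter, for instance), assumed to have filled every block by the time it returns.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

constexpr uint64_t kPageSize = 0x1000;

struct Request {  // one small read wanted by the caller
    uint64_t address;
    uint32_t size;
    void*    dest;
};

struct Block {  // one page-sized read covering several requests
    uint64_t pageBase;
    std::vector<uint8_t> data;
    std::vector<Request> subs;
};

// Stand-in for the real batch read; must fill block.data for every block.
using BatchReader = std::function<void(std::vector<Block>&)>;

// Steps 1-3: split at page boundaries, sort ascending, coalesce per page.
std::vector<Block> BuildBlocks(const std::vector<Request>& reqs) {
    std::vector<Request> pieces;
    for (Request r : reqs) {
        auto* out = static_cast<uint8_t*>(r.dest);
        while (r.size > 0) {
            const uint64_t room = kPageSize - (r.address & (kPageSize - 1));
            const uint32_t take =
                r.size < room ? r.size : static_cast<uint32_t>(room);
            pieces.push_back({r.address, take, out});
            r.address += take;
            r.size -= take;
            out += take;
        }
    }
    std::sort(pieces.begin(), pieces.end(),
              [](const Request& a, const Request& b) {
                  return a.address < b.address;
              });
    std::vector<Block> blocks;
    for (const Request& p : pieces) {
        const uint64_t base = p.address & ~(kPageSize - 1);
        if (blocks.empty() || blocks.back().pageBase != base)
            blocks.push_back({base, std::vector<uint8_t>(kPageSize), {}});
        blocks.back().subs.push_back(p);
    }
    return blocks;
}

// Step 4: page addresses that could be handed to a prefetch call.
std::vector<uint64_t> PageList(const std::vector<Block>& blocks) {
    std::vector<uint64_t> pages;
    for (const Block& b : blocks) pages.push_back(b.pageBase);
    return pages;
}

// Steps 5-6: run the batch read, then copy bytes out to the caller buffers.
void ExecuteReads(std::vector<Block>& blocks, const BatchReader& reader) {
    reader(blocks);
    for (const Block& b : blocks) {
        assert(b.data.size() == kPageSize);  // reader must not resize buffers
        for (const Request& s : b.subs)
            std::memcpy(s.dest, b.data.data() + (s.address - b.pageBase), s.size);
    }
}
```

With three requests where one straddles pages 0x1000 and 0x2000 and the other two sit in page 0x2000, `BuildBlocks` produces just two blocks. To skip the final copy, callers could read straight out of `Block::data` at the sub-read offset instead of supplying `dest`.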