http://arxiv.org/abs/2303.06689
https://arxiv.org/abs/2203.11171
https://arxiv.org/abs/2303.17071
https://arxiv.org/abs/2211.01910
https://arxiv.org/abs/2305.02897
https://arxiv.org/pdf/2108.08877.pdf
https://arxiv.org/pdf/2006.08671.pdf
https://arxiv.org/pdf/2009.09634.pdf
https://arxiv.org/abs/2304.11477
http://arxiv.org/abs/2303.09014
http://arxiv.org/abs/2303.17491
http://arxiv.org/abs/2304.05128
http://arxiv.org/abs/2304.08103
https://arxiv.org/ftp/arxiv/papers/2303/2303.17482.pdf
https://paperswithcode.com/dataset/dr-bench
https://arxiv.org/abs/2205.11916
http://arxiv.org/abs/1901.02860
https://arxiv.org/abs/2207.06881
https://paperswithcode.com/method/absolute-position-encodings
https://arxiv.org/abs/2205.05131v1
https://arxiv.org/abs/2305.01625
https://arxiv.org/pdf/2304.15004.pdf
https://arxiv.org/abs/2302.07842
https://openreview.net/forum?id=1ikK0kHjvj
https://openreview.net/pdf?id=ByME42AqK7
https://arxiv.org/abs/2104.09864
https://arxiv.org/abs/2112.09118
https://arxiv.org/abs/2305.06161
https://arxiv.org/abs/2305.08291
https://log10.io/
https://neuralmagic.com/deepsparse/
https://gpt-index.readthedocs.io/
https://python.langchain.com/
https://crfm.stanford.edu/helm/latest
https://instruction-tuning-with-gpt-4.github.io/
https://github.com/project-baize/baize-chatbot
https://github.com/curai/curai-research/tree/main/DERA
https://github.com/bhargaviparanjape/language-programmes/
https://github.com/ofirpress/attention_with_linear_biases
https://github.com/lucidrains
https://github.com/lm-sys/
https://github.com/Alignment-Lab-AI/FOP
https://github.com/Alignment-Lab-AI/DRCLib
https://github.com/Alignment-Lab-AI/TEAL
https://github.com/Alignment-Lab-AI/dino
https://github.com/Alignment-Lab-AI/mt-dnn
https://github.com/Alignment-Lab-AI/identifiable-transformers
https://github.com/Alignment-Lab-AI/FLAN
https://github.com/Alignment-Lab-AI/LMOps
https://github.com/Lightning-AI/lightning
https://github.com/booydar/LM-RMT
https://github.com/lucidrains/recurrent-memory-transformer-pytorch
https://github.com/Cranial-XIX/llm-pddl
https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-chat
https://github.com/microsoft/deepspeed-mii
https://github.com/hpcaitech/ColossalAI
https://github.com/MouseAndKeyboard/SmartGPT
https://github.com/Alignment-Lab-AI/task_vectors
https://github.com/smol-ai/developer
https://github.com/IBM/Dromedary/tree/main
https://github.com/microsoft/TaskMatrix/tree/main/LowCodeLLM
https://github.com/microsoft/guidance
https://github.com/posgnu/rci-agent
https://github.com/chakkaradeep/pyCodeAGI/
https://github.com/mosaicml/streaming
https://github.com/AkariAsai/learning_to_retrieve_reasoning_paths
https://github.com/bigcode-project/Megatron-LM
https://github.com/CarperAI/trlx/tree/main/examples/randomwalks
https://huggingface.co/datasets/codeparrot/github-code-clean
https://huggingface.co/facebook/galactica-120b
https://huggingface.co/Udoy/tomekkorbak-python-github-code-tokenizer
https://huggingface.co/groov/code-search-net-tokenizer
https://huggingface.co/tczhang/sample-python-code-tokenizer
https://huggingface.co/datasets/facebook/content_rephrasing
https://huggingface.co/datasets/universal_morphologies
https://huggingface.co/models?dataset=dataset:empathetic_dialogues
https://huggingface.co/CarperAI/FIM-NeoX-1.3B
https://huggingface.co/dorkai/codeX-1.0
https://huggingface.co/datasets/baizhi002/python3.10.8
https://huggingface.co/datasets/datablations/python-megatron
https://huggingface.co/datasets/Nan-Do/instructional_code-search-net-python
https://huggingface.co/datasets/trelent/the-stack-dedup-python-docstrings-1.0-percent-unified
https://huggingface.co/datasets/reshinthadith/synthetic_program_synthesis_python_1M
https://huggingface.co/datasets/loubnabnl/python_comment_code_ratio_08
https://huggingface.co/datasets/Dahoas/code-review-instruct-critique-revision-python
https://huggingface.co/datasets/Nan-Do/code-search-net-python
https://huggingface.co/datasets/Sridevi/python_textbooks
https://huggingface.co/datasets/sia-precision-education/pile_python
https://huggingface.co/datasets/notional/notional-python
https://huggingface.co/datasets/semeru/code-text-python
https://huggingface.co/datasets/formermagic/github_python_1m
https://huggingface.co/datasets/Fraser/python-state-changes
https://huggingface.co/datasets/bigcode/the-stack-dedup
https://huggingface.co/datasets/EleutherAI/arithmetic
https://huggingface.co/datasets/stanfordnlp/SHP
https://huggingface.co/datasets/lighteval/synthetic_reasoning_natural
https://huggingface.co/datasets/lintang/numerical_reasoning_arithmetic
https://huggingface.co/datasets/lighteval/synthetic_reasoning/
https://huggingface.co/datasets/jxu124/llava_complex_reasoning_77k/
https://huggingface.co/wordcab/llama-natural-instructions-13b
https://huggingface.co/datasets/Alignment-Lab-AI/AILabAssistant/blob/main/pretraining%20sets
https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered
https://huggingface.co/datasets/tasksource/mmlu
https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/GPTeacher
https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/dolly
https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/instruct
https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/MOSS
https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/CodeAlpaca
https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chinese-instruction-collection
More available upon request. Sorry for the disorder; I wasn't expecting to suddenly have a reason to compile it all in a readable format. There is more the project draws from, but many of those sources contribute very little to the overall structure, and this list is additionally missing the pretraining data links and much of my own polished dataset made for the purpose, since it is held in local storage.
https://www.canva.com/design/DAFjnCZymIo/xH_bnShuEVwZIKVZRIwRtw/edit?utm_content=DAFjnCZymIo&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton
This is a hasty reorganization of the documentation I have been accumulating and studying for the last year or so. The project has been thoroughly discussed with people at various levels of skill and rank within the industry, many of whom will be available to help fill gaps in my skill as necessary. I'll post a more detailed version on the Canva, but the rough workflow involves:

- A pretraining set of books structured to teach the logical relationship between human expression, logic, and math within the domain of language, with a structure based loosely on the related logic in https://en.wikipedia.org/wiki/Laws_of_Form. This set is extensively tokenized with performant language and Python models; optimized embedding relationships for model explainability were initially planned, back when time was not a factor.
- A blended dataset with weighted ratios of Python, staging up from general understanding to high accuracy on programming instructions that demonstrate the recursive code skeleton steps (a size-agnostic method for developing code bases, demonstrated in the Canva).
- Mixed in throughout: a portion of data covering moral heuristics, RLHF, natural instructions, chain of thought, tool use, synthetic logic, abstractive logic, instruction following, kindness, helpfulness, and a bunch of SCIENCE.
- Sequence-length amplitude modulation, to hopefully optimize performance on the chosen attention mechanisms, if any are used. (A rough sketch of the staging and modulation follows below.)
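To make the staging concrete, here is a minimal sketch of how the weighted-ratio blending and the sequence-length amplitude modulation could be expressed. Every stage boundary, source name, and ratio below is a placeholder chosen for illustration, not the project's actual schedule:

import math
import random

# Illustrative stage schedule: per-source sampling weights shift from
# general language toward high-accuracy Python instruction data as
# training progresses. All names and numbers are placeholders.
STAGES = [
    # (fraction of training, {source: sampling weight})
    (0.4, {"logic_books": 0.6, "python_general": 0.2, "mixed_instruct": 0.2}),
    (0.4, {"logic_books": 0.3, "python_general": 0.4, "mixed_instruct": 0.3}),
    (0.2, {"logic_books": 0.1, "python_instruct": 0.6, "mixed_instruct": 0.3}),
]

def weights_at(progress):
    """Return the source-weight dict for a training progress in [0, 1)."""
    cumulative = 0.0
    for fraction, weights in STAGES:
        cumulative += fraction
        if progress < cumulative:
            return weights
    return STAGES[-1][1]

def sample_source(progress, rng=random):
    """Pick the source to draw the next training example from."""
    names, probs = zip(*weights_at(progress).items())
    return rng.choices(names, weights=probs, k=1)[0]

def seq_len_at(progress, base=2048, amplitude=1024, cycles=8):
    """One reading of 'seq-length amplitude modulation': oscillate the
    packed sequence length around a base value across training."""
    return int(base + amplitude * math.sin(2 * math.pi * cycles * progress))

For example, sample_source(0.95) draws from the final-stage weights (60% python_instruct here), and seq_len_at sweeps the packed length between 1024 and 3072 tokens eight times over the run.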
I will likely realize I forgot several things after sending this; updates, or links to updates, will be posted at the Canva link until a GitHub repo is compiled when time allows. I've also been up for a long time, so if you see this tonight and it's not yet updated, please be patient; I've been preparing for the Anthropic hackathon. :)
Fortunately enough, I was already making the Canva as I saw the link posted! It contains (or will contain) a flow chart of the initial draft of the training process, the scripts used to collect the unique programming training data required, an example of the code-handling structure, and additional material not present in this partial list: the pretraining structure, a proposed solution for coding, a solution to instruct decoherence, a possible solution to hallucination, new optimization techniques (some of which I only fleshed out a few days ago), predicted results of training, and several potential methods for rendering the context limit arbitrary, with cost scaling linearly in input length.
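On the arbitrary-context point: one family of approaches linked above, the Recurrent Memory Transformer (booydar/LM-RMT, lucidrains/recurrent-memory-transformer-pytorch), processes long inputs as fixed-size segments while carrying a small set of memory embeddings from one segment to the next, which is what makes total cost linear in input length. A toy sketch of that segment loop follows; the shapes, names, and layer choices are mine for illustration and are not the linked repos' API:

import torch
import torch.nn as nn

class SegmentRecurrentLM(nn.Module):
    """Toy recurrent-memory loop: a fixed-size transformer runs over
    successive segments, and `num_mem` memory vectors read out of one
    segment are prepended to the next."""

    def __init__(self, dim=256, num_mem=16, seg_len=512, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.mem_init = nn.Parameter(torch.zeros(num_mem, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.core = nn.TransformerEncoder(layer, num_layers=2)
        self.seg_len, self.num_mem = seg_len, num_mem

    def forward(self, tokens):                     # tokens: (batch, total_len)
        mem = self.mem_init.expand(tokens.size(0), -1, -1)
        outputs = []
        for seg in tokens.split(self.seg_len, dim=1):
            h = self.core(torch.cat([mem, self.embed(seg)], dim=1))
            mem = h[:, : self.num_mem]             # carried state; per-segment
            outputs.append(h[:, self.num_mem :])   # cost => linear in length
        return torch.cat(outputs, dim=1)

Running SegmentRecurrentLM()(torch.randint(0, 32000, (1, 2048))) processes four 512-token segments at constant per-segment cost; in practice the carried memory is usually detached between segments for truncated backpropagation.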
More documentation is available upon request, including current funding sources and the other open-source developers involved, both directly and tangentially. Happy to answer any questions as they come up :)