Understanding DeepSeek R1
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also ships with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
What makes DeepSeek-R1 especially interesting is its transparency. Unlike the less-open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs. o1's $15) and output tokens at $2.19 per million (vs. o1's $60).
Until roughly GPT-4, the conventional wisdom was that better models required more data and compute. While that is still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
The Essentials
The DeepSeek-R1 paper presented multiple models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 uses two major ideas:
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking inside a `<think>` tag before responding with a final summary.
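To make this concrete, here is a small, hypothetical Python helper (not from the paper) that splits an R1-style response into its reasoning and final answer, assuming the model wraps its chain-of-thought in a `<think>...</think>` block as the R1 models do:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style response into (reasoning, final_answer).

    Assumes the chain-of-thought is wrapped in <think>...</think> and the
    final summary follows the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if not match:
        return "", response.strip()
    return match.group(1).strip(), response[match.end():].strip()

# Example:
# reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
```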
R1-Zero vs R1
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.

R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing several languages in a single response. R1 fixes that by adding limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.

It is fascinating how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
Training Pipeline
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they built such strong reasoning models and what you can expect from each stage, including the problems that the models resulting from each stage have and how they addressed them in the next stage.

It's interesting that their training pipeline differs from the usual one:
The usual training strategy: pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF

R1-Zero: Pretrained → RL

R1: Pretrained → Multi-stage training pipeline with several SFT and RL stages
Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to begin RL from.

First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved on to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.

Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples (a minimal sketch of this filtering step follows these pipeline steps).

Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general abilities.

Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
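As referenced in the rejection-sampling step above, here is a minimal sketch (not DeepSeek's code) of what that filtering could look like; `generate` and `is_acceptable` are hypothetical stand-ins for the actual sampling and grading:

```python
# Rejection sampling for SFT data: sample several candidates from the RL
# checkpoint and keep only those that pass a quality check.
def rejection_sample(generate, is_acceptable, prompts, n_candidates=8):
    kept = []
    for prompt in prompts:
        for candidate in generate(prompt, n=n_candidates):
            if is_acceptable(prompt, candidate):
                kept.append({"prompt": prompt, "completion": candidate})
    return kept  # this kind of output feeds the reasoning SFT set
```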
They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model. The teacher is typically a larger model than the student.
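In this case the distillation is of the "hard-label" kind: the student is simply fine-tuned with the ordinary next-token loss on traces the teacher generated, rather than matching the teacher's logits. A minimal PyTorch sketch, assuming the teacher traces have already been tokenized:

```python
import torch.nn.functional as F

def distillation_sft_loss(student_logits, teacher_token_ids):
    """Plain next-token cross-entropy on teacher-generated traces.

    student_logits: (batch, seq_len, vocab) from the student model
    teacher_token_ids: (batch, seq_len) tokenized teacher reasoning traces
    """
    return F.cross_entropy(
        student_logits[:, :-1].reshape(-1, student_logits.size(-1)),
        teacher_token_ids[:, 1:].reshape(-1),
    )
```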
Group Relative Policy Optimization (GRPO)
The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on expensive external models or human-graded examples as in conventional RLHF, the RL used for R1 relies on simple criteria: it might give a higher reward if the answer is correct, if it follows the expected format, and if the language of the answer matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training one, and it doesn't take memory and compute away from your main model.
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates several responses.
2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.
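Here is a minimal numerical sketch of steps 3 and 4 in PyTorch, leaving out everything model-related; it illustrates the group-relative advantage and the clipped update, and is not DeepSeek's implementation:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Step 3: normalize each response's reward against its own group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_policy_objective(logp_new, logp_old, advantages, eps=0.2):
    """Step 4: PPO-style clipped surrogate; in practice a KL penalty toward
    the reference model is added on top of this."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return torch.min(unclipped, clipped).mean()

# Example: four sampled responses to one prompt, scored by rule-based rewards.
rewards = torch.tensor([1.0, 0.2, 0.9, 0.0])
print(group_relative_advantages(rewards))
```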
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a reward when the model correctly uses the expected syntax, to guide the training.
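For instance, a couple of toy rule-based rewards in the spirit of (but much cruder than) what the paper describes could look like this; the checks DeepSeek actually used, such as verified math answers and language-consistency scoring, are more involved:

```python
import re

def format_reward(completion: str) -> float:
    """Reward responses that wrap their reasoning in <think>...</think>."""
    return 1.0 if re.search(r"<think>.+?</think>", completion, re.DOTALL) else 0.0

def correctness_reward(completion: str, reference_answer: str) -> float:
    """Crude exact-match check on whatever follows the reasoning block."""
    final_answer = completion.split("</think>")[-1].strip()
    return 1.0 if final_answer == reference_answer.strip() else 0.0

def total_reward(completion: str, reference_answer: str) -> float:
    return format_reward(completion) + correctness_reward(completion, reference_answer)
```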
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
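To give a sense of what that looks like in practice, here is a rough outline of GRPO training with TRL; the class names, arguments, model, and dataset below are assumptions based on the TRL documentation at the time of writing, so verify them against the version you install:

```python
# Rough outline of GRPO training with TRL; names and arguments are assumptions
# from the TRL docs — check the current documentation before running.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_unique_chars(completions, **kwargs):
    # Toy rule-based reward: favour completions with more distinct characters.
    return [float(len(set(c))) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset works

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_unique_chars,
    args=GRPOConfig(output_dir="grpo-demo"),
    train_dataset=dataset,
)
trainer.train()
```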
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
Is RL on LLMs the path to AGI?
As a final note on explaining DeepSeek-R1 and the approaches they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
"These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities."

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the variety of correct answers) is largely already present in the pretrained model.
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses than about endowing the model with entirely new capabilities. Consequently, while RL methods such as PPO and GRPO can produce significant performance gains, there appears to be a fundamental ceiling determined by the underlying model's pretrained knowledge.
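One way to see this claim concretely: if you sample k answers per prompt from the base model and from the RL-tuned model, the observation is roughly that greedy/top-1 accuracy improves a lot while pass@k improves much less. A small hypothetical evaluation helper (the sampled answers and `is_correct` are stand-ins for your own generation and grading code):

```python
def pass_at_k(sampled_answers, is_correct, k):
    """Fraction of prompts where at least one of the first k samples is correct.

    sampled_answers: list of lists, one list of sampled answers per prompt.
    """
    solved = sum(
        1 for samples in sampled_answers if any(is_correct(s) for s in samples[:k])
    )
    return solved / len(sampled_answers)

# pass_at_k(answers, is_correct, k=1)   -> roughly what RL improves
# pass_at_k(answers, is_correct, k=64)  -> largely already there in the base model
```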
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
Running DeepSeek-R1
I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
671B via llama.cpp
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.
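For reference, loading a setup like this through the llama-cpp-python bindings looks roughly as follows; the GGUF filename is a placeholder and the KV-cache quantization options are omitted, so treat this as a sketch rather than the exact invocation:

```python
# Sketch of partial GPU offloading with llama-cpp-python; the model path is a
# placeholder for the downloaded Unsloth 1.58-bit GGUF shards.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # hypothetical local filename
    n_gpu_layers=29,   # the sweet spot found above; the rest stays on CPU
    n_ctx=8192,
)
out = llm("Why is the sky blue? Think step by step.", max_tokens=512)
print(out["choices"][0]["text"])
```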
Performance:
A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup. Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get roughly 3.5 to 4.25 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher. We need to both maximize usefulness and minimize time-to-usefulness.
70B via Ollama
70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

GPU utilization shoots up here, as expected compared to the mostly CPU-powered run of the 671B model showcased above.
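Querying that model through Ollama's local HTTP API looks like this; the model tag is an assumption, so check `ollama list` for the exact name of the quant you pulled:

```python
# Minimal request against the local Ollama server for the 70B distill.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:70b",   # assumed tag; verify with `ollama list`
        "prompt": "What is 17 * 24? Think it through.",
        "stream": False,
    },
)
print(resp.json()["response"])
```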
Resources
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
The Illustrated DeepSeek-R1 - by Jay Alammar.
Explainer: What's R1 & Everything Else? - Tim Kellogg.
DeepSeek R1 Explained to your grandma - YouTube
DeepSeek
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that matches the performance of OpenAI's o1. It presents a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to improve code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
Interesting events
- Hong Kong University replicates R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
Liked this post? Join the newsletter.