A black box on the desk

For about a month, there has been a small black computer weighing just over a kilo on my desk called DGX Spark. Produced by NVIDIA, the company that builds the processors on which practically all the world's artificial intelligence runs, the idea behind this device is simple to explain: take the type of computation that usually lives in large data centers and put it in a box to keep at home, plugged into a regular power outlet.

For the first few days, I used it like almost everyone else does, i.e., to run language models, programs like ChatGPT that generate text. Then I decided to do what I had actually bought it for: turning it into a small video production studio, where artificial intelligence generates videos from written descriptions, all locally.

Locally means that every computation happens inside the box on my desk, without going through the internet, without monthly subscriptions, and without my ideas ending up on someone else's servers.

What it means to generate a video

For those who have never tried: there are artificial intelligence programs that receive a sentence, for example, "rain on a glass at night, neon lights reflected, the camera advances slowly," and produce a video of a few seconds that corresponds to that description. The sentence is called a prompt, the program is called a generative model, and the quality of results in the last two years has gone from "laboratory curiosity" to "stuff an advertising agency would use without shame."

These models are usually used for a fee, through online services that perform the computation in their data centers and return the file to you. It works, but each clip costs money, the queues at peak times are long, and everything you produce passes through the computers of a company. The alternative exists and is called open weight models: versions that companies (especially Chinese, it must be said: Alibaba and Tencent above all) publish for free, downloadable and usable on your own hardware. The problem is that they require machines with a lot, a lot of memory. And here we are at the point.

The memory problem, explained with a desk

A regular computer has two separate memories: the general one, where the system lives, and that of the graphics card, which matters for artificial intelligence. Even the most powerful gaming graphics cards stop at 32 Gigabytes, and a high-end AI video generation model alone can require much more. It’s like having an office with a small desk: if the project you’re working on won’t fit, you’ll spend time moving papers back and forth from the filing cabinet, and the day will go by in moving.

The DGX Spark solves the problem in an unusual way: it has a single 128 Gigabyte memory, shared between the processor and graphics card. It’s called unified memory, and it’s like replacing a school desk with a four-meter table: all the work material is there at the same time, always at hand. In my case, on the table coexist the video model, the model that generates the starting images, and a language assistant that I will discuss shortly. All together, all loaded, no moving.

There is a price to pay, and it should be said upfront because it is this machine's true limit: its memory is large but not very fast, and every single operation turns out to be slower than on a traditional high-end graphics card. My solution has been more organizational than technical: the Spark can work at night, queuing processes, and in the morning I check the results. The slowness, when you sleep, is not noticeable.

The tools I've used

The program that serves as a control center is called ComfyUI, is free and works from the browser: you build the workflow by connecting blocks on a diagram, somewhat like a flowchart, and each block does part of the work (here the text comes in, here the model generates, here the video is saved). NVIDIA publishes an official guide for installing it on the Spark, and, to my surprise, it worked on the first try in less than an hour.

As for the models, after weeks of trials, I assigned a role to each. Wan 2.2, released by Alibaba in 2025, is the one I use for the most polished videos, especially when there are people and complicated movement in the scene; it’s slow, and I reserve it for final renderings. LTX-2, from an Israeli company, is much faster and also generates audio along with images: perfect for drafts, when I need to see if an idea works before investing half an hour of computation. For fixed starting images, I use FLUX.2, and when legible text inside the image is required (a sign, a title, the graphics of a poster), I switch to the Qwen-Image 2.0 family, which, for that specific task, in mid-2026, is among the most convincing I've tried.

A word you will hear often if you approach this world is quantization. It’s a form of compression: models are reduced in size by accepting a slight loss of precision, like an MP3 compared to a CD. On a machine like the Spark, quantization is not a fallback; it’s the right choice: compressed models take up less memory and run faster, and the difference in quality is scarcely visible to the eye.

Hermes, the assistant that works at night

The part of the experiment I cared about most involves Hermes 4, an open-weight language model published by Nous Research, an American lab. I installed it on the same machine, next to the video models, and assigned it the boring task: I write an idea in two lines, like "eight-second spot for a coffee, night atmosphere," and it transforms it into six detailed and professional descriptions, complete with directing instructions, lighting, and technical parameters already ready for ComfyUI.

Hermes comes in various sizes measured in billions of parameters, which is how we indicate the size of these models: more parameters, more capabilities, more memory used, and slower performance. I tried the 70 billion version, which comfortably fits into the 128 gigabytes but generates text at the speed of a telegraph, and I settled on the 36 billion version, which thinks well and responds in decent time. I even held a small formal contest among the three sizes available, giving them the same tasks and scoring every response, because I trust impressions little, especially my own.

The workflow operates like this: in the evening, I leave my notes in a folder, at 11:00 PM a timer wakes the system, Hermes expands every note into variants and queues them in ComfyUI, the machine generates until dawn consuming less than a microwave oven, and in the morning I find the clips sorted by project. I choose the best ones with a coffee in hand, and only those proceed to high-quality rendering. I’m also experimenting with a further step, which is an automatic controller: a second model examines the frames of the clips, describes them to Hermes, and Hermes discards those that are obviously wrong before I see them. It lacks my taste, but it recognizes gross mistakes very well and saves me half of my morning review.

A step back: what are local models and how do they work

Before continuing, it’s worth pausing for a moment because I’ve taken a lot for granted so far. An artificial intelligence model, stripped of all mystique, is a file. A very large file, ranging from a few Gigabytes (GB) to several hundred GB, that contains billions of numbers: these are the parameters, i.e., the result of training, the compressed knowledge of the model. That file alone does nothing, just like an MP3 does not play without a player. It requires a program to load it into memory and make it work: the most widespread ones are called Ollama, LM Studio, and llama.cpp, they are free, and installing one takes about ten minutes. I write about it on RunLocal (runlocal.blog), where I also keep a guide for choosing the right model for the computer you have, with something like a configurator.

The operation, simplifying a bit: the model reads your request and generates the answer piece by piece, one token after another, and each token requires sifting through all those billions of parameters in memory. Hence the two practical rules that govern the whole local world. First, the model must fit into the computer's memory; otherwise, it either doesn’t start or slows down grotesquely. Second, the faster the memory, the more tokens per second come out. Everything else, from quantization to format acronyms, descends from these two rules.

One term worth knowing because it ties together the two rules is MoE (mixture of experts). Instead of one monolithic block that activates entirely for every word, an MoE model is divided into many specialized sub-models, the experts, and for each token, it activates only a small part, those that are needed at that moment. The point, for a machine like the Spark or a Mac with 128 GB of unified memory, is that the two things pull in opposite directions and both in your favor: in memory, you still have to keep the entire model, experts included, so large capacity is indeed useful, but for every token, only a slice actually runs, and since speed depends on how many parameters you need to reread each time, you end up with an enormous model that responds swiftly as if it were much smaller. That’s why many of the models I’ve mentioned, from Wan 2.2 to the more capable Qwen and DeepSeek, are constructed this way: specifically designed for those with a lot of memory but not equally generous bandwidth, which is the exact condition of this box.

I add the question many ask me: are they as good as ChatGPT? My answer is that the best open models of 2026 are playing in the same league as commercial services from a year ago, and for most everyday tasks, the difference is imperceptible. However, the three advantages that made me choose this path are noticeable: the data stays in-house, the model downloaded today will work even in 2030 regardless of what the publisher decides, and if better ones come out in the meantime, so be it, what I have on disk remains mine and keeps running; and after the hardware purchase, my usage costs are zero.

Which model for which job

The typical mistake of beginners is to seek the best model overall, which does not exist, just as there is no best tool in a toolbox. There is the right model for the task, and often the workflow matters as much as the model. Here’s my map, updated to June 2026, with the models I’ve actually used. Where I indicate large sizes, I think of my DGX Spark or a Mac with a lot of unified memory; for a regular laptop, the smaller version of the same family applies.

For chatting and getting daily help.

The Qwen 3.x family from Alibaba is the starting point I recommend: permissive license in smaller sizes, excellent Italian, and a range of sizes (from 4B for lightweight laptops to larger sizes for serious machines) that allows for growth without changing habits. I write 3.x and not a specific number on purpose because the line is updated so often that reasoning by family, without chasing the last released digit, is the only way to avoid being left with an outdated piece after a month. Workflow: install Ollama, download the model with a command, and keep the chat window open like you would keep a browser open. On the Spark, I run one of the larger sizes and feel no lack of the cloud.

To summarize documents.

Here speed counts more than intelligence because a mediocre summary in ten seconds beats a perfect summary in five minutes, at least when there are fifty documents. A small model like a Qwen 3.x in the 7-8 billion sizes, or an Llama 3.1 8B, does the job excellently. The flow I use: low temperature (it’s the parameter that regulates creativity, and in summaries, creativity is a flaw), a fixed request model specifying length and format, and documents fed in series with a script. It’s the classic work to run while having lunch.

For daily office work: emails, rewrites, translations.

My European recommendation is Mistral Medium 3.5, released by Mistral at the end of April 2026: a robust model with a very wide context window and an open license. There's also an aspect that matters to me, even if it's more of a practical comfort than a real legal guarantee: it comes from a French lab, and for those handling data subject to European regulations, this makes some conversations with the legal department a bit less awkward. On my machine, Hermes 4.3 36B holds the fixed position, which I keep for affection and the already mentioned calling tool. Workflow: the model runs as a permanently on service, and you talk to it from the editor or a keyboard shortcut, without opening a dedicated program each time.

For programming: generating code and debugging.

Among the most interesting options in mid-2026 is DeepSeek V4, particularly the Flash version, which has an MIT license, the one that truly allows any use, and which performs really well for generating code and debugging. The workflow here is richer, and I copy that of professionals, who keep two models instead of just one. A small and lightning-fast model for autocomplete while writing, and the large model, queried by the editor (VS Code and similar connect to any local service compatible with OpenAI APIs), for heavy tasks: explaining an error, refactoring a file, writing tests. It must be said that it’s the most cumbersome piece of the whole setup because DeepSeek V4 Flash barely fits into the 128 Gigabytes only with a really aggressive quantization (project ds4/DwarfStar and asymmetric quantization Q2), but it's probably not the most suitable model for daily use, given that it nearly saturates all 128 GB of memory in the DGX Spark.

For analyzing large amounts of personal documents.

The most underrated use case is where local embarrasses the cloud, because uploading a two-decade archive of a professional studio to an online service is exactly what many cannot or do not want to do. The right tool depends on the volume. For several hundred pages, a long-context model is used, which reads them all together: Llama 4 Scout from Meta declares a window of ten million tokens, and in the compressed version, it fits into the 128 GB of the Spark. Beyond that threshold, you switch to a technique called RAG: the documents are indexed once, and for each question, the system retrieves the relevant passages and passes them to the model. It’s more cumbersome to set up, but it scales to entire archives, and everything stays on the home disk.

For math and reasoning.

There is a recent category called reasoning models, popularized by the Chinese DeepSeek-R1: before answering, they generate a long chain of intermediate thought, check their steps, and only at the end write the answer. They are slow, sometimes exasperating, and for asking for the capital of France, they are an absolute waste. For a math problem, a logic question, or a decision with many constraints, the difference compared to normal models is clearly visible. Workflow: well-written question containing all the data, and then patience; it’s the kind of request I let simmer while I do something else, like the video clips. Distilled, smaller versions bring a piece of this skill even to laptops.

In summary, my toolbox:

Task	Recommended Model	Tool
Daily chatting	Qwen 3.x (from 4B to large sizes)	Ollama or LM Studio
Serial summaries	Qwen 3.x 7-8B / Llama 3.1 8B	Ollama + script
Emails, rewrites, translations	Mistral Medium 3.5 / Hermes 4.3 36B	Always-on service
Code and debugging	DeepSeek V4 Flash + a small model	Editor linked to local API
Analysis of many documents	Llama 4 Scout (or RAG beyond a certain volume)	llama.cpp / local index
Math and reasoning	DeepSeek-R1 and distilled versions	In queue, with patience

How long does it really take?

Indicative times for my scenario, for a 5-second clip:

Operation	Tool	Time
Quick draft, with audio	LTX-2	~3 minutes
Full quality clip	Wan 2.2	~25 minutes
Starting image	FLUX.2	~40 seconds
Upscaling to 4K	NVIDIA Upscaler	~2 minutes
From idea to 6 ready descriptions	Hermes 4.3 36B	~90 seconds

One night of automatic work produces between forty and sixty drafts plus about ten final clips. Those who pay for these volumes on online services can compare with their monthly bill; I did, and my smile lasts.

Is it worth it?

It depends on who you are, and I feel comfortable saying this without beating around the bush because a lot of unfiltered enthusiasm circulates about this machine. If you thrive on fast iteration, if you need to see the result ten seconds after asking for it, a workstation with a traditional graphics card remains the right choice, and the Spark will make you nervous. If, however, your way of working resembles mine, i.e., projects that mature slowly and production that can run while you do other things, then the 128 unified gigabytes open a door that until last year required a cloud subscription or a small business investment. On the price, it’s worth being explicit: we’re talking about a machine costing over 4000 Euro, so before pulling out your credit card, it’s wise to do some real accounting with how much you currently spend on the cloud, because the break-even point only comes if you keep it on for a long time.

There’s also a less technical argument that has counted for me a lot. Everything I described, from the video models to Hermes, works without a connection, without accounts, and without a provider potentially changing conditions tomorrow morning. In a time when artificial intelligence seems to exist only for rent, having a version at home that you own, with its limitations and times, is a feeling I haven’t experienced since my first assembled PC. The next step would be to connect two Sparks together, which the machine allows, but the wallet has currently cast a contrary vote, and at home, its vote carries weight.

Small glossary

Term	Meaning
Generative model	An AI program trained to produce new content (text, images, videos) based on a written request.
Prompt	The written description provided to the model to obtain the desired result.
Open weights	Models published for free, downloadable, and usable on your own computer, without going through an online service.
Unified memory	Architecture where the processor and graphics card share the same memory, avoiding continuous data transfers.
Parameters	The 'tunable neurons' of a model; size is measured in billions of parameters (14B, 36B, 70B...).
MoE (mixture of experts)	A model divided into many specialized sub-models (the experts): only a portion activates for each token.
Quantization	Compression of a model to fit into less memory, with a small loss of precision.
Token	The minimum unit of text a model reads or writes, roughly a syllable or a short word; speed is measured in tokens per second.
Batch (queue)	A mode of work where tasks are accumulated and executed in series, typically at night, instead of one at a time on demand.
ComfyUI	A free program that allows you to build generation flows by connecting visual blocks, like in a diagram.
Context window	The amount of text a model can keep in mind at one time, measured in tokens; once the threshold is exceeded, the model forgets.
RAG	Technique for querying large archives: documents are indexed, and at each question, the system retrieves only the relevant passages to pass to the model.
Reasoning model	A model that generates an intermediate chain of thought before the final answer: slower, but more reliable for math and logic.
Temperature	The parameter that regulates the creativity of the responses: low for precise tasks like summaries, high for brainstorming.
Upscaling	Intelligent enlargement of an image or video to a higher resolution (e.g., from 1080p to 4K).

For those who made it to the end: I created an Operating Guide for you, explaining step by step how to create a video pipeline with ComfyUI, cutting-edge models, and Hermes 4 locally. You can find it here.

Thanks to MSI Italia for providing me with its DGX Spark model.

I put a video studio with Artificial Intelligence into a box the size of a book