Spaces or not, and is 0.6b enough?
I've noticed that a missing space in one part of the prompt can sometimes improve the (perceived) quality, while in other cases it breaks everything. I use the official order [quality/meta/year/safety tags] [1girl/1boy/1other etc.] [character] [series] [artist] [general tags], and so far:
- missing spaces in the first section can improve quality
- missing spaces between [character] [series] [artist] break artist styles
- artist tags sometimes work better with spaces, and sometimes with underscores
- artist tags sometimes work better with escaped brackets, like \( \), and sometimes without
I have a suspicion that a 0.6B LLM is just a bit too small to be consistent - have you tested this? Is tagging consistent in the dataset?
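To make the comparison concrete, here's roughly how I build the variants I'm testing. The tags are made up and the helper is just my own sketch of the spacing/underscore toggles, nothing official:

```python
# Hypothetical tags, following the official order:
# [quality/meta] [1girl/1boy/1other] [character] [series] [artist] [general]
sections = {
    "quality": ["masterpiece", "best quality", "newest"],
    "subject": ["1girl"],
    "character": ["example_character"],       # made-up tag
    "series": ["example_series"],             # made-up tag
    "artist": [r"example_artist_\(style\)"],  # escaped-parentheses variant
    "general": ["solo", "smile", "outdoors"],
}

def build_prompt(sections, sep=", ", use_underscores=True):
    """Join the sections in order; toggle ', ' vs ',' and '_' vs ' '."""
    tags = [t for group in sections.values() for t in group]
    if not use_underscores:
        tags = [t.replace("_", " ") for t in tags]
    return sep.join(tags)

print(build_prompt(sections, sep=", "))                         # spaces after commas
print(build_prompt(sections, sep=","))                          # no spaces
print(build_prompt(sections, sep=", ", use_underscores=False))  # spaces instead of underscores
```

I generate all of these from the same tag set, so the only difference between runs is the separator and underscore handling.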
It's up to you whether to use underscores or not, but in ComfyUI parentheses "(" ")" are prompt weighting syntax. What you're experiencing is just placebo: the model isn't finished training, so you're bound to get varied outputs. You can be sure that 0.6B can handle more than SDXL...
> It's up to you whether to use underscores or not, but in ComfyUI parentheses "(" ")" are prompt weighting syntax. What you're experiencing is just placebo: the model isn't finished training, so you're bound to get varied outputs.
Looking at the ComfyUI code, I don't see weighting disabled - would it conflict with the model's capabilities?
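For reference, here's the kind of behaviour I mean. This is a toy approximation of the "(tag:weight)" syntax, not ComfyUI's actual parser, just to show why unescaped parentheses in artist tags get treated as weighting while escaped ones survive as literal characters:

```python
import re

# Toy weighting parser (illustrative only, NOT ComfyUI's real implementation).
# "(text:1.2)" is read as a weighted span; "\(" and "\)" are literal parentheses.
WEIGHT_RE = re.compile(r"(?<!\\)\((.+?):([\d.]+)\)")

def strip_weights(prompt):
    """Return (plain_text, weights) the way a weighting-aware frontend might."""
    weights = [(m.group(1), float(m.group(2))) for m in WEIGHT_RE.finditer(prompt)]
    plain = WEIGHT_RE.sub(lambda m: m.group(1), prompt)
    # escaped parentheses pass through as literal characters
    plain = plain.replace(r"\(", "(").replace(r"\)", ")")
    return plain, weights

print(strip_weights(r"example_artist \(style\), (smile:1.2)"))
# -> ('example_artist (style), smile', [('smile', 1.2)])
```

So an artist tag written with bare parentheses could be reinterpreted by the frontend before the text encoder ever sees it, which would explain why escaping sometimes changes the result.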
> You can be sure that 0.6B can handle more than SDXL...
That's not a good comparison - CLIP is outdated. Is it comparable to the T5-XXL used in Chroma? T5 is not an LLM, but it's also bigger, so I'm not sure which gives better text "understanding".
I would agree with the part about 0.6B being a very questionable choice of text encoder; I feel it's the main component holding back the model's true potential. A 4B LLM would enhance the model immensely. Also, it's the least demanding component, as it's often offloaded to RAM.
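For what it's worth, this is the offloading pattern I mean - a minimal sketch assuming a Qwen3-0.6B-class encoder (placeholder checkpoint, not necessarily what the model actually ships with). The encoder runs entirely in system RAM and only the resulting embeddings move to VRAM for the DiT:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder 0.6B-class LM used as a text encoder; kept on CPU (system RAM).
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
text_encoder = AutoModel.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.float32)

with torch.no_grad():
    ids = tok("1girl, solo, smile", return_tensors="pt")
    cond = text_encoder(**ids).last_hidden_state      # computed in RAM

cond = cond.to("cuda", dtype=torch.bfloat16)           # only the embeddings go to VRAM
# dit(latents, timestep, encoder_hidden_states=cond)   # DiT forward stays on the GPU
```

Even at 4B the encoder would mostly cost system RAM rather than VRAM, which is why I don't think the TE is where the hardware budget is tight.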
> A 4B LLM would enhance the model immensely. Also, it's the least demanding component, as it's often offloaded to RAM.
That would probably be a bit too much for the 2B main model. A 2B text encoder would be nice, though - still quite slim, but it should give better prompt adherence. As an example, there's a project to adapt the 1B Gemma and the 2B T5-Gemma to SDXL.
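The general shape of those adapter projects (not their actual code, just an illustrative sketch with made-up dimensions) is to freeze the LM and train only a small projection from its hidden size into whatever width the denoiser's cross-attention expects:

```python
import torch
import torch.nn as nn

# Illustrative adapter: map frozen-LM hidden states (e.g. 2048-dim) to the
# conditioning width the image model was trained with (e.g. 1024-dim).
# The dimensions are placeholders, not the actual Gemma/SDXL sizes.
class TextEncoderAdapter(nn.Module):
    def __init__(self, lm_hidden=2048, target_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(lm_hidden, target_dim),
            nn.GELU(),
            nn.Linear(target_dim, target_dim),
        )
        self.norm = nn.LayerNorm(target_dim)

    def forward(self, lm_hidden_states):                # (batch, seq, lm_hidden)
        return self.norm(self.proj(lm_hidden_states))   # (batch, seq, target_dim)

adapter = TextEncoderAdapter()
fake_lm_states = torch.randn(1, 77, 2048)
print(adapter(fake_lm_states).shape)  # torch.Size([1, 77, 1024])
```

Only the adapter (and perhaps a bit of the image model) needs training, which is why swapping in a 1-2B encoder is feasible as a community project rather than a full retrain.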
> A 4B LLM would enhance the model immensely. Also, it's the least demanding component, as it's often offloaded to RAM.
> That would probably be a bit too much for the 2B main model. A 2B text encoder would be nice, though - still quite slim, but it should give better prompt adherence. As an example, there's a project to adapt the 1B Gemma and the 2B T5-Gemma to SDXL.
2B, most definitely. Of course the model is in preview, but I find prompting quite inconsistent, which makes sense for the 600M size.
4B is overkill for this type of model, imo.
I'm happy that the text encoder is small enough to be almost instantaneous in a workflow, even on a slower system, as opposed to most other models these days, where the TE can't even run on low VRAM (Gemma 12B for LTX2, for example). Sometimes Anima seems to have problems generating parts of the prompt, but at this point I'm not sure whether it's limited by the TE or by the DiT still being a bit undertrained, lacking familiarity with some concepts or strong enough relationships between them.
It would of course be best if text encoders of the same family could simply be interchanged, but I don't think any image model has that kind of general ability? ACE-Step 1.5 can use a thinking LM of either 0.6b, 1.7b or 4.0b, but that's a bit different, because the LM generates plain text that is subsequently fed into the embedding model to be encoded into tensors.
> I'm happy that the text encoder is small enough to be almost instantaneous in a workflow, even on a slower system, as opposed to most other models these days, where the TE can't even run on low VRAM (Gemma 12B for LTX2, for example). Sometimes Anima seems to have problems generating parts of the prompt, but at this point I'm not sure whether it's limited by the TE or by the DiT still being a bit undertrained, lacking familiarity with some concepts or strong enough relationships between them.
Yes, LTX2 and Flux.2 (24B Mistral for text encoding) are unfortunate examples of poorly chosen text encoder sizes, but those are extreme cases. A slightly bigger size should help, I think.
> It would of course be best if text encoders of the same family could simply be interchanged, but I don't think any image model has that kind of general ability? ACE-Step 1.5 can use a thinking LM of either 0.6b, 1.7b or 4.0b, but that's a bit different, because the LM generates plain text that is subsequently fed into the embedding model to be encoded into tensors.
That's actually a great example, and it's very interesting - in my experience, 0.6b + 4b works better for ACE-Step 1.5, and something similar would be awesome (but also a tremendous amount of additional work) for an image generation model.
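To sketch why the swap is cheap in that design (placeholder model names, not ACE-Step's or Anima's actual components): the thinking LM only rewrites the prompt as plain text, so the conditioning tensors always come from the same fixed encoder, no matter which LM size produced the text:

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

def expand_prompt(prompt, lm_name="Qwen/Qwen3-0.6B"):
    """Stage 1: the 'thinking' LM only produces plain text, so it can be swapped freely."""
    tok = AutoTokenizer.from_pretrained(lm_name)
    lm = AutoModelForCausalLM.from_pretrained(lm_name)
    msgs = [{"role": "user", "content": f"Expand this image prompt with more detail: {prompt}"}]
    ids = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt")
    out = lm.generate(ids, max_new_tokens=128)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def encode(text, encoder_name="google/t5-v1_1-base"):
    """Stage 2: a fixed embedding model turns the text into conditioning tensors."""
    tok = AutoTokenizer.from_pretrained(encoder_name)
    enc = AutoModel.from_pretrained(encoder_name).encoder
    with torch.no_grad():
        return enc(**tok(text, return_tensors="pt")).last_hidden_state

cond = encode(expand_prompt("1girl, rainy street at night"))
# lm_name could just as well point at a 1.7b or 4b checkpoint; `cond` keeps the
# same shape and meaning, because only stage 2 is what the diffusion model sees.
```

Doing the same for an image model would mean the DiT is only ever conditioned on the fixed encoder's output, with the bigger LM acting as an optional prompt rewriter on top - which fits your point that it would be a lot of extra work rather than a drop-in change.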