LLM-server

Some BOSS-Offline reports use generative AI based on a large language model (LLM), so to use them you need to configure the settings on this page.

You can configure a local server, a cloud server, or both at the same time.
If both are configured, the local server takes priority, except when neutral data (data that does not contain confidential or personal information) is being transmitted.

The Ollama framework is supported for a local server; ChatGPT, YandexGPT, and Gemini are supported for a cloud server.

Server URL
Specify the http or https URL of the server where Ollama is installed.
Usually this is http on port 11434.
Example:
http://192.168.0.111:11434
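
Before entering the URL, it can help to confirm that an Ollama server actually answers at that address. The sketch below assumes curl is available and uses Ollama's /api/tags endpoint, which lists the models installed on the server. The helper name ollama_tags_url is hypothetical, and the address is the illustrative one from the example above.

```shell
# Build the URL of Ollama's model-list endpoint from a base server URL.
# ollama_tags_url is a hypothetical helper name, not part of Ollama or BOSS-Offline.
ollama_tags_url() {
    # Strip one trailing slash so the result never contains "//api".
    printf '%s/api/tags\n' "${1%/}"
}

# Example (needs network access to the server, so it is left commented out):
#   curl -fsS "$(ollama_tags_url http://192.168.0.111:11434)"
```

If the request returns a JSON list of models, the same base URL should be usable in this field.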

API-key
ChatGPT: create an API key and copy it here.
YandexGPT: create a billing account, then obtain an OAuth token and copy it here.
Gemini: create an API key, connect billing to it, top up the balance, and then copy the key here.

Model
Ollama: specify the loaded model to use. Currently, models from the qwen3 or deepseek-r1 families are recommended.
For example:
deepseek-r1:14b
deepseek-r1:32b
qwen3:14b
qwen3:32b
Specify the exact model that has been downloaded and installed in Ollama. The complete list of models is available on the Ollama website.
ChatGPT:
gpt-4o
o4-mini
gpt-4.1
gpt-4.1-mini
gpt-5
gpt-5-mini
gpt-5.1
and others
YandexGPT:
gpt://<folder_ID>/yandexgpt
gpt://<folder_ID>/yandexgpt/latest
gpt://<folder_ID>/yandexgpt-lite
Gemini:
gemini-2.5-flash
gemini-2.5-flash-lite
gemini-2.5-pro
gemini-3.1-pro-preview
gemini-3-flash-preview
gemini-flash-latest
gemini-pro-latest
and others

Ollama:
- a GPU with CUDA support is not required, but is highly recommended: performance is an order of magnitude higher, even compared with multi-core CPU servers;
- the model must fit completely into the video memory or RAM;
- the larger the model, the better the quality, but the slower the speed;
- several GPUs can be used (if the video memory of one GPU is not enough to hold the entire model);
- when using a GPU, CPU and RAM requirements can be minimal (for example, 2 CPUs and 4 GB of RAM are quite enough).
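
To judge whether a model is likely to fit into memory, a rough lower bound can be computed from the parameter count and the number of bits stored per weight. The sketch below is only an estimate: the 4-bit figure is an assumption about how the model was quantized, real model files are somewhat larger, and the context window needs additional memory on top. The helper name estimate_weights_gb is hypothetical.

```shell
# Rough lower bound for the memory taken by model weights, in GB.
# Usage: estimate_weights_gb <parameters in billions> <bits per weight>
# estimate_weights_gb is a hypothetical helper, not an Ollama command.
estimate_weights_gb() {
    # 1 billion parameters at 8 bits per weight is about 1 GB.
    echo $(( $1 * $2 / 8 ))
}

estimate_weights_gb 32 4   # 32B parameters, assumed 4-bit quantization
estimate_weights_gb 14 4   # 14B parameters, assumed 4-bit quantization
```

With these inputs the script prints 16 and 7, so a 32b model needs noticeably more than 16 GB in practice, while a 14b model leaves more room for the context window on the same GPU.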

Example of installing Ollama on Ubuntu Linux (assuming the GPU drivers are already installed):

curl -fsSL https://ollama.com/install.sh | sh

To allow access from other hosts and to increase the allowed model loading time, the following additional settings are recommended:

sudo nano /etc/systemd/system/ollama.service

The following lines should be added to the [Service] section:

Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_LOAD_TIMEOUT=60m"

Then save the file and execute:

sudo systemctl daemon-reload
sudo systemctl restart ollama
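
After the restart, it may be worth checking that both variables are really present in the unit file. The sketch below only inspects the file text; the helper name check_ollama_unit is hypothetical, and it assumes the Environment lines are written exactly in the form shown above, without leading spaces.

```shell
# Succeed only if the given systemd unit file sets both variables described above.
# check_ollama_unit is a hypothetical helper; pass the unit file path as $1.
check_ollama_unit() {
    grep -q '^Environment="OLLAMA_HOST=' "$1" &&
    grep -q '^Environment="OLLAMA_LOAD_TIMEOUT=' "$1"
}

# Example (left commented out because it needs the real unit file):
#   check_ollama_unit /etc/systemd/system/ollama.service && echo "settings present"
```

To see what the running service actually received, systemctl show ollama --property=Environment prints the environment that systemd passed to it.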

Then download and install the model, for example qwen3:32b:

ollama run qwen3:32b

Attention! If the error message "model requires more system memory than is available" appears while loading a model, even though there is enough VRAM to hold the model, the most likely cause is that Ollama's default context window is fairly large (64K or more), which takes up additional VRAM. In this case, ignore the error and check the "Context window size" setting on this page.

This parameter is specified in tokens and literally means "how much information the model can hold in memory simultaneously during a request". The larger the value, the more VRAM is required and the larger the request to the neural network can be. If the context window is too small, the response will be of lower quality, because the neural network may "forget" or not see part of the query. For current system tasks, 16384 is generally sufficient (it can be set higher if VRAM allows), and the minimum recommended value is 4096. If it is set to 0, the Ollama framework determines the value itself based on the loaded model. The value should not exceed the maximum supported by the given model (see the description of the specific model).
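
To try a particular context size against Ollama directly, a request can pass it in the num_ctx option of the /api/generate endpoint. The sketch below is an illustration only: this page does not say how BOSS-Offline passes the "Context window size" setting to Ollama, so the link to num_ctx is an assumption. The helper name build_generate_body, the fixed prompt, and the server address are hypothetical.

```shell
# Build a JSON body for Ollama's /api/generate endpoint with an explicit context window.
# build_generate_body is a hypothetical helper; $1 = model name, $2 = context size in tokens.
build_generate_body() {
    printf '{"model":"%s","prompt":"Hello","stream":false,"options":{"num_ctx":%d}}\n' "$1" "$2"
}

# Example (needs a running Ollama server, so it is left commented out):
#   build_generate_body qwen3:32b 16384 |
#     curl -fsS http://192.168.0.111:11434/api/generate -d @-
```

If the model loads with 16384 but fails with a larger value, the difference is most likely the extra VRAM taken by the larger context.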

The current VRAM usage can usually be checked with the command:

nvidia-smi
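
For a compact view, nvidia-smi can also print just the memory figures in CSV form, which are easy to turn into a percentage. The sketch below assumes the "used, total" line format produced by the query shown in the comment; the helper name vram_used_percent is hypothetical.

```shell
# Print VRAM usage as a whole percentage, one line per GPU.
# Reads "used, total" pairs in MiB from stdin; vram_used_percent is a hypothetical helper.
vram_used_percent() {
    awk -F', *' '{ printf "%d%%\n", $1 * 100 / $2 }'
}

# Example (needs an NVIDIA GPU and driver, so it is left commented out):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits |
#     vram_used_percent
```

Running the query before and after loading a model shows how much VRAM the model and its context window take.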

v11.3.3651 (build: May 13 2026)
© KICKIDLER DLP