====== ksoftirqd process utilizes 100% of a CPU ======

Linux runs one ksoftirqd kernel thread per CPU to process soft IRQs. Here is an example on a 4-core CPU:

<code>
ps uax | grep ksoftirqd | grep -v grep
root          14  0.0  0.0      0     0 ?        S    Nov07   0:00 [ksoftirqd/0]
root          21  1.4  0.0      0     0 ?        S    Nov07  21:43 [ksoftirqd/1]
root          26  0.0  0.0      0     0 ?        S    Nov07   0:03 [ksoftirqd/2]
root          31  0.0  0.0      0     0 ?        S    Nov07   0:00 [ksoftirqd/3]
</code>

As shown, one of these threads (PID 21) has consumed far more CPU time than the others (TIME 21:43). The thread is bound to **CPU #1** [ksoftirqd/1].
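
The same comparison can be scripted. The following is a minimal sketch, not part of the original procedure: the function name ''busiest_ksoftirqd'' is hypothetical, and the usage line assumes the procps-ng ''ps'', whose ''cputimes'' column reports accumulated CPU time in seconds.

```shell
# Print the ksoftirqd thread with the largest accumulated CPU time.
# Expects three columns on stdin: PID, CPU time in seconds, command name.
busiest_ksoftirqd() {
    awk '$3 ~ /^ksoftirqd\// && (best == "" || $2 + 0 > max) {
             max = $2 + 0
             best = $3 " pid=" $1 " cpu_seconds=" $2
         }
         END { if (best != "") print best }'
}

# Usage on a live system (assumes procps-ng ps):
# ps -eo pid=,cputimes=,comm= | busiest_ksoftirqd
```

With the values from the listing above (21:43 is 1303 seconds), the function would pick ksoftirqd/1.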

Inspecting the pseudofile **/proc/interrupts** shows which device generates the most interrupts on **CPU1**:

<code>
cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       
   0:         19          0          0          0  IR-IO-APIC    2-edge      timer
   8:          0          0          0          0  IR-IO-APIC    8-edge      rtc0
   9:          0          0          0          0  IR-IO-APIC    9-fasteoi   acpi
  14:          0          0          0          0  IR-IO-APIC   14-fasteoi   INTC1057:00
  16:          0        590          0          0  IR-IO-APIC   16-fasteoi   mmc0, idma64.0, i801_smbus, ttyS0
  37:          0          0          0          0  IR-IO-APIC   37-fasteoi   idma64.1, pxa2xx-spi.1
 120:          0          0          0          0  DMAR-MSI     0-edge       dmar0
 121:          0          0          0          0  DMAR-MSI     1-edge       dmar1
 125:          0    4289304          0          0  IR-PCI-MSI   1048576-edge net0
 126:          0          0    4456557          0  IR-PCI-MSI   1572864-edge lan0
 127:          0          0          0     432003  IR-PCI-MSI   376832-edge  ahci[0000:00:17.0]
 128:        442          0          0          0  IR-PCI-MSI   327680-edge  xhci_hcd
 129:          0         42          0          0  IR-PCI-MSI   360448-edge  mei_me
 130:          0          0        164          0  IR-PCI-MSI   32768-edge   i915
 131:          0          0          0          0  IR-PCI-MSI   524288-edge  rtw88_pci
 132:        665          0          0          0  IR-PCI-MSI   514048-edge  snd_hda_intel:card0
...
</code>
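
On a machine with many IRQ lines it can help to reduce this table to the busiest numbered IRQ for each CPU. The sketch below assumes the layout shown above (a header row of CPU names, then one row per IRQ); the function name ''top_irq_per_cpu'' is hypothetical, and for shared lines only the last device name on the row is reported.

```shell
# For each CPU column, print the numbered IRQ with the highest interrupt count.
# Reads text in /proc/interrupts format on stdin.
top_irq_per_cpu() {
    awk '
    # Header row: remember how many CPU columns there are and their names.
    NR == 1 { ncpu = NF; for (c = 1; c <= ncpu; c++) name[c] = $c; next }
    # Only numbered IRQ rows; summary rows such as NMI or LOC are skipped.
    $1 ~ /^[0-9]+:$/ {
        id = $1
        sub(/:$/, "", id)
        for (c = 1; c <= ncpu; c++) {
            if ($(c + 1) + 0 > max[c] + 0) {
                max[c] = $(c + 1) + 0
                irq[c] = id
                dev[c] = $NF    # last field of the row (last device name)
            }
        }
    }
    END {
        for (c = 1; c <= ncpu; c++)
            if (max[c] > 0)
                printf "%s irq=%s count=%d device=%s\n", name[c], irq[c], max[c], dev[c]
    }'
}

# Usage on a live system:
# top_irq_per_cpu < /proc/interrupts
```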

In **/proc/interrupts**, **IRQ 125** shows //affinity// to **CPU1**, just as **IRQ 126** shows //affinity// to **CPU2**, and so on. The kernel does not enforce this affinity, however, as the pseudofile **/proc/irq/IRQ_NUMBER/smp_affinity** shows:

<code>
cat /proc/irq/125/smp_affinity
f
</code>

Here **f** is a hexadecimal value representing a bitmask of CPU cores (binary 1111, i.e. cores 0 through 3); this IRQ can therefore be serviced by any of the cores.
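
To make the bitmask reading concrete, here is a small sketch that expands a mask into the CPU numbers it allows. The function name ''mask_to_cpus'' is hypothetical, and because it relies on shell integer arithmetic it should only be trusted for masks covering fewer than 64 CPUs.

```shell
# Expand a hexadecimal smp_affinity mask into the list of allowed CPU numbers.
# Commas are removed because wide masks are printed in comma-separated groups.
mask_to_cpus() {
    mask=$(printf '%s' "$1" | tr -d ',')
    value=$((0x$mask))
    cpu=0
    list=""
    while [ "$value" -gt 0 ]; do
        # Bit N set means CPU N may service the IRQ.
        if [ $((value & 1)) -eq 1 ]; then
            list="${list:+$list }$cpu"
        fi
        value=$((value >> 1))
        cpu=$((cpu + 1))
    done
    printf '%s\n' "$list"
}

# Usage on a live system:
# mask_to_cpus "$(cat /proc/irq/125/smp_affinity)"
```

For the mask **f** this yields CPUs 0 1 2 3, while a mask of **2** would restrict the IRQ to CPU 1 only.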

The interrupt list suggests that the critical situation probably involves the **net0** interface; the large number of **dropped** packets confirms it:

<code>
ifconfig net0
net0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.254  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::2e0:4cff:fe60:11e4  prefixlen 64  scopeid 0x20<link>
        ether 00:e0:4c:60:11:e4  txqueuelen 1000  (Ethernet)
        RX packets 3943557  bytes 3439040842 (3.2 GiB)
        RX errors 0  dropped 107001  overruns 0  frame 0
        TX packets 2410528  bytes 714236244 (681.1 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
</code>
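
To put the dropped counter in proportion, it can be expressed as a percentage of the received packets. This is only a rough sketch: the function name ''rx_drop_percent'' is hypothetical, and whether the driver counts dropped frames inside "RX packets" is an assumption that is not verified here.

```shell
# Print the percentage of received packets that were dropped.
# Arguments: RX packet count, RX dropped count.
rx_drop_percent() {
    awk -v rx="$1" -v dropped="$2" \
        'BEGIN { if (rx > 0) printf "%.2f\n", 100 * dropped / rx; else print "n/a" }'
}

# Counters taken from the ifconfig output above.
rx_drop_percent 3943557 107001    # prints 2.71
```

On a live system the same counters can be read from ''/sys/class/net/net0/statistics/rx_packets'' and ''/sys/class/net/net0/statistics/rx_dropped''.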

===== softnet_stat =====

The contents of **/proc/net/softnet_stat** provide statistics on how the softirqs handle network packets. Each row refers to one CPU core; the columns are:

  - Total number of processed packets (processed).
  - Number of frames dropped because the backlog queue was full (dropped).
  - Times net_rx_action ran out of budget with work remaining and was rescheduled (time_squeeze).
  - Number of times processed all lists before quota.
  - Number of times did not process all lists due to quota.
  - Number of times net_rx_action was rescheduled for GRO (Generic Receive Offload) cells.
  - Number of times GRO cells were processed.

To display the table in decimal rather than hexadecimal, use the following command (''strtonum'' requires GNU awk):

<code>
awk '{for (i=1; i<=NF; i++) printf "%d%s", strtonum("0x" $i), (i==NF ? "\n" : " ")}' /proc/net/softnet_stat | column -t
</code>
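
Since what matters is whether the second and third columns keep growing, it is useful to compare two snapshots taken some seconds apart. The sketch below assumes two saved copies of **/proc/net/softnet_stat**; the function name ''softnet_delta'' and the output labels are hypothetical, and the row index is used as the CPU number, which may not hold if some CPUs are offline. The hexadecimal conversion is written out by hand so that GNU awk is not required.

```shell
# Print, per row, how much column 2 (dropped) and column 3 (budget exhausted)
# grew between two saved copies of /proc/net/softnet_stat.
# Arguments: older snapshot file, newer snapshot file.
softnet_delta() {
    awk '
    # Portable hexadecimal to decimal conversion (no strtonum needed).
    function hex2dec(h,    i, n) {
        n = 0
        h = tolower(h)
        for (i = 1; i <= length(h); i++)
            n = n * 16 + index("0123456789abcdef", substr(h, i, 1)) - 1
        return n
    }
    # First file: remember the old counters for each row.
    NR == FNR { drop[FNR] = hex2dec($2); squeeze[FNR] = hex2dec($3); next }
    # Second file: print the growth of each counter.
    {
        printf "cpu%d dropped+%d squeezed+%d\n", FNR - 1,
               hex2dec($2) - drop[FNR], hex2dec($3) - squeeze[FNR]
    }' "$1" "$2"
}

# Usage on a live system:
# cat /proc/net/softnet_stat > /tmp/softnet.a; sleep 10
# cat /proc/net/softnet_stat > /tmp/softnet.b
# softnet_delta /tmp/softnet.a /tmp/softnet.b
```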

===== netdev_max_backlog =====

If the number in the **second column** grows steadily (frames dropped because the backlog queue is full), the queue size can be increased. The current value is shown by:

<code>
sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 1000
</code>

Then create a file **/etc/sysctl.d/10-netdev_max_backlog.conf** with the following content:

<file>
net.core.netdev_max_backlog = 2000
</file>

and apply it with:

<code>
sysctl -p /etc/sysctl.d/10-netdev_max_backlog.conf
</code>

===== netdev_budget =====

The **third column** of **/proc/net/softnet_stat** can also indicate a problem if it grows steadily. In that case the budget allotted to processing received traffic has been exhausted and the softirq is rescheduled. In practice it counts the number of times a ksoftirqd thread could not process all the packets from an interface during one NAPI polling cycle.

The budget is set by the **netdev_budget** and **netdev_budget_usecs** parameters, which can be read with:

<code>
sysctl net.core.netdev_budget
net.core.netdev_budget = 300

sysctl net.core.netdev_budget_usecs
net.core.netdev_budget_usecs = 8000
</code>

Here ksoftirqd has, in each polling cycle, at most 8000 microseconds to process up to 300 packets from the network card. Both limits can be raised by creating a file **/etc/sysctl.d/10-netdev_budget.conf** with:

<file>
net.core.netdev_budget = 600
net.core.netdev_budget_usecs = 24000
</file>

and running:

<code>
sysctl -p /etc/sysctl.d/10-netdev_budget.conf
</code>

===== Shorewall (netfilter) logging =====

Unfortunately, in our case **raising the softirq budget brought no improvement**. The host in question is a GNU/Linux firewall with two network cards; the problem was tied to the external network interface, which is protected by **netfilter rules** managed through **Shorewall**. The total number of rules was quite modest (157 rules):

<code>
iptables -S | wc -l
157
</code>

The problem appeared during intense port scans, at around 3 packets/second. In that situation, logging the dropped packets (written via syslog to **/var/log/syslog** and **/var/log/kern.log**) caused an appreciable loss of network packets.

With **logging of dropped packets disabled**, the problem appears to be resolved.
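
One way to check whether any logging rules remain active is to count the rules that jump to a logging target in the ''iptables -S'' listing. This is a sketch under assumptions: the function name ''count_log_rules'' is hypothetical, and it only recognises the LOG, NFLOG and ULOG targets, so rules logging through any other mechanism would be missed.

```shell
# Count the rules that jump to a logging target.
# Reads `iptables -S` output on stdin and prints the number of matching rules.
count_log_rules() {
    grep -c -E -- '-j (LOG|NFLOG|ULOG)( |$)'
}

# Usage on a live system (requires root):
# iptables -S | count_log_rules
```

A result of 0 would suggest that packet logging is fully disabled in the loaded ruleset.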

===== Web References =====

  * **[[https://portal.nutanix.com/page/documents/kbs/details?targetId=kA07V000000LUR3SAO|Linux VM performance: ksoftirqd process utilizes 100% of a vCPU ...]]**
  * **[[https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/9/html/monitoring_and_managing_system_status_and_performance/tuning-the-network-performance_monitoring-and-managing-system-status-and-performance|Tuning the network performance]]**
  * **[[https://jsevy.com/network/Linux_network_stack_walkthrough.html|Linux Network Stack Walkthrough]]**
  * **[[https://learn.netdata.cloud/docs/collecting-metrics/linux-systems/network/softnet-statistics|Softnet Statistics]]**