cmk version: CEE v2.1.0p24
OS: Ubuntu 22.04.2 LTS

Problem: fetcher usage is increasing until site restart or server reboot

The site has 40 fetchers defined. Right after a site restart the fetcher usage is about 20%, but it soon starts to climb slowly. After 4 weeks the usage is about 80% and heading toward 100%.

How can I fix this, or find the root cause? This issue has also occurred with older 2.1 versions on this server.

In the kernel message log I found these messages:
[Apr 4 16:51] python3[4068036]: segfault at 8 ip 00007f74056f66ec sp 00007ffdc51510e0 error 4 in libpython3.9.so.1.0[7f740561b000+1ce000]
[ +0.000036] Code: 89 e2 48 89 de 48 0f ba f1 3f 4c 89 ef e8 dc 84 f2 ff eb b2 e8 65 8e f2 ff eb b9 0f 1f 00 55 48 89 fd 48 89 f7 53 48 83 ec 18 <48> 8b 45 08 48 8b 40 40 48 85 c0 0f 85 a6 d2 f3 ff e8 6e 57 f2 ff
[Apr 4 18:57] python3[4084587]: segfault at 8 ip 00007fe4a88546ec sp 00007ffee5b0c940 error 4 in libpython3.9.so.1.0[7fe4a8779000+1ce000]
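
A quick way to check whether the two segfaults come from the same place is to subtract the library base address from the instruction pointer. The following is only a minimal sketch: the regular expression and the function name `parse_segfault` are my own assumptions, written to match the two kernel lines quoted above.

```python
import re

# Pattern written for the kernel segfault lines quoted above (an assumption,
# not an official format description).
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return pid, library and the crash offset inside that library, or None."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    offset = int(match["ip"], 16) - int(match["base"], 16)
    return {"pid": int(match["pid"]), "lib": match["lib"], "offset": offset}


KERNEL_LINES = [
    "[Apr 4 16:51] python3[4068036]: segfault at 8 ip 00007f74056f66ec "
    "sp 00007ffdc51510e0 error 4 in libpython3.9.so.1.0[7f740561b000+1ce000]",
    "[Apr 4 18:57] python3[4084587]: segfault at 8 ip 00007fe4a88546ec "
    "sp 00007ffee5b0c940 error 4 in libpython3.9.so.1.0[7fe4a8779000+1ce000]",
]

for line in KERNEL_LINES:
    crash = parse_segfault(line)
    print(crash["pid"], crash["lib"], hex(crash["offset"]))
# 4068036 libpython3.9.so.1.0 0xdb6ec
# 4084587 libpython3.9.so.1.0 0xdb6ec
```

Both crashes land at the same offset (0xdb6ec) inside libpython3.9.so.1.0, which suggests the same code path fails each time rather than random memory corruption.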

I guess they have something to do with the increasing fetcher usage. I also noticed that some fetcher processes have been restarted:

root@cmk-server:~# ps -ef |grep fetch
site1 1481 1431 0 Mar16 ? 00:03:52 python3 /omd/sites/site1/bin/fetcher
site1 1487 1431 0 Mar16 ? 00:06:17 python3 /omd/sites/site1/bin/fetcher
site1 1498 1431 0 Mar16 ? 00:05:20 python3 /omd/sites/site1/bin/fetcher
site1 1499 1431 0 Mar16 ? 00:04:59 python3 /omd/sites/site1/bin/fetcher
site1 1500 1431 0 Mar16 ? 00:05:51 python3 /omd/sites/site1/bin/fetcher
site1 1507 1431 0 Mar16 ? 00:07:04 python3 /omd/sites/site1/bin/fetcher
site1 1509 1431 0 Mar16 ? 00:06:10 python3 /omd/sites/site1/bin/fetcher
site1 1511 1431 0 Mar16 ? 00:41:46 python3 /omd/sites/site1/bin/fetcher
site1 8436 1431 0 Mar20 ? 00:04:45 python3 /omd/sites/site1/bin/fetcher
site1 46673 1431 1 08:43 ? 00:06:23 python3 /omd/sites/site1/bin/fetcher
site1 89353 1431 0 Mar20 ? 00:06:54 python3 /omd/sites/site1/bin/fetcher
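
To pick the restarted fetchers out of a listing like the one above, the start time column can be compared across processes. This is only a sketch under one assumption: the most common start time is the original site start, and every fetcher with a different start time was respawned later. The function name `restarted_fetchers` is made up for illustration.

```python
from collections import Counter

# The ps excerpt from above; columns are UID PID PPID C STIME TTY TIME CMD.
PS_OUTPUT = """\
site1 1481 1431 0 Mar16 ? 00:03:52 python3 /omd/sites/site1/bin/fetcher
site1 1487 1431 0 Mar16 ? 00:06:17 python3 /omd/sites/site1/bin/fetcher
site1 1498 1431 0 Mar16 ? 00:05:20 python3 /omd/sites/site1/bin/fetcher
site1 1499 1431 0 Mar16 ? 00:04:59 python3 /omd/sites/site1/bin/fetcher
site1 1500 1431 0 Mar16 ? 00:05:51 python3 /omd/sites/site1/bin/fetcher
site1 1507 1431 0 Mar16 ? 00:07:04 python3 /omd/sites/site1/bin/fetcher
site1 1509 1431 0 Mar16 ? 00:06:10 python3 /omd/sites/site1/bin/fetcher
site1 1511 1431 0 Mar16 ? 00:41:46 python3 /omd/sites/site1/bin/fetcher
site1 8436 1431 0 Mar20 ? 00:04:45 python3 /omd/sites/site1/bin/fetcher
site1 46673 1431 1 08:43 ? 00:06:23 python3 /omd/sites/site1/bin/fetcher
site1 89353 1431 0 Mar20 ? 00:06:54 python3 /omd/sites/site1/bin/fetcher
"""


def restarted_fetchers(ps_text):
    """Return (pid, start time) for fetchers not started with the majority."""
    rows = [line.split(None, 7) for line in ps_text.splitlines() if line.strip()]
    start_counts = Counter(row[4] for row in rows)
    # Assumption: the most frequent STIME value is the original site start.
    original_start = start_counts.most_common(1)[0][0]
    return [(int(row[1]), row[4]) for row in rows if row[4] != original_start]


print(restarted_fetchers(PS_OUTPUT))
# [(8436, 'Mar20'), (46673, '08:43'), (89353, 'Mar20')]
```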


----------------------------------------
>>>> SOLUTION:
----------------------------------------
- Enable debugging in the global settings (enable debugging of helpers). You should now see a lot of entries in the cmc.log file.
- Wait until another crash has happened (check with the command "dmesg -H -P").
- Once you have the timestamp of the crash, check the cmc.log file around that time.
- In the log file I found that the crash happens after an SNMP query of a UPS (USV) occurred.
- Reconfigure that device to use "classic SNMP" for querying instead of the inline Python Checkmk SNMP.
>> Problem solved.
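
The step of matching the crash timestamp against cmc.log can be sketched as a small filter. Treat this as an illustration only: it assumes each log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp, the sample lines and the host name `ups01` are invented, and the year is a guess because the kernel message shows none. `dmesg -T` prints wall-clock timestamps, which makes the comparison easier.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, crash_time, window=timedelta(seconds=60)):
    """Return the log lines whose leading timestamp is within window of crash_time."""
    hits = []
    for line in log_lines:
        try:
            # Assumption: the first 19 characters are "YYYY-MM-DD HH:MM:SS".
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip continuation lines without a timestamp
        if abs(stamp - crash_time) <= window:
            hits.append(line)
    return hits


# Invented sample lines, not real cmc.log output.
SAMPLE_LOG = [
    "2023-04-04 16:40:12 [7] fetcher: checking host web01",
    "2023-04-04 16:50:48 [7] fetcher: SNMP query for host ups01",
    "    (continuation line without timestamp)",
    "2023-04-04 16:51:05 [4] fetcher: helper died, restarting",
    "2023-04-04 17:10:00 [7] fetcher: checking host db01",
]

# Hypothetical crash time; the year is assumed.
crash = datetime(2023, 4, 4, 16, 51, 0)
for hit in lines_near(SAMPLE_LOG, crash):
    print(hit)
```

The last entries before the crash time are the interesting ones, since they name the host and data source the fetcher was working on when it died.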
