All disk space is consumed even if disk limit is set
Author | Message |
---|---|
Joined: 10 Feb 17 Posts: 6 Credit: 39,117 RAC: 0 |
I'm running BOINC 7.6.33 and Oracle VirtualBox 5.1.14 r112924 (Qt5.6.1) on Ubuntu 16.10. There is a 128GB partition allocated only for BOINC, which is available under /var/lib/boinc-client. I have set the maximum disk usage in boincmgr at Options/Computing Preferences/Disk and Memory/Disk/Use no more than: 40GB. When I start LHC@home, 3..5 tasks are scheduled every 10..15 seconds. For example, the task "CMS Simulation 47.60 (vbox64)" starts, but it doesn't immediately use the 1.7GB of space per Work Unit under /var/lib/boinc-client/slots. It takes time for the WU to allocate that disk space. In the meantime more and more WUs are scheduled for the computer, and when these tasks actually start running, all of the 128GB disk space is consumed even though the 40GB disk limit is set. Finally boinc stops working, and all the WUs are stopped as well, once no disk space is left; the error message in /var/log/boinc.log is "Couldn't write state file: fwrite() failed; giving up". I have the feeling that BOINC doesn't take the real disk need into account when scheduling new WUs: e.g. for CMS Simulation, 1.7GB of disk space should be reserved from the allowed 40GB for each scheduled WU, even if its actual usage hasn't reached 1.7GB yet. I can work around this problem manually if I wait for a couple of tasks to be scheduled and then use the Projects/No new tasks button before the disk space is overallocated. This way some computation can be done, but no new tasks are automatically scheduled to the computer when the WUs are finished. The state below shows the situation. Do you know how to fix this problem properly? root@quinkana3:~# cat /etc/boinc-client/cc_config.xml <!-- This is a minimal configuration file cc_config.xml of the BOINC core client. 
For a complete list of all available options and logging flags and their meaning see: https://boinc.berkeley.edu/wiki/client_configuration --> <cc_config> <log_flags> <task>1</task> <file_xfer>1</file_xfer> <sched_ops>1</sched_ops> </log_flags> </cc_config> root@quinkana3:~# cat /etc/boinc-client/global_prefs_override.xml <global_preferences> <run_on_batteries>0</run_on_batteries> <run_if_user_active>1</run_if_user_active> <run_gpu_if_user_active>1</run_gpu_if_user_active> <suspend_cpu_usage>0.000000</suspend_cpu_usage> <start_hour>0.000000</start_hour> <end_hour>0.000000</end_hour> <net_start_hour>0.000000</net_start_hour> <net_end_hour>0.000000</net_end_hour> <leave_apps_in_memory>0</leave_apps_in_memory> <confirm_before_connecting>1</confirm_before_connecting> <hangup_if_dialed>0</hangup_if_dialed> <dont_verify_images>0</dont_verify_images> <work_buf_min_days>0.010000</work_buf_min_days> <work_buf_additional_days>0.020000</work_buf_additional_days> <max_ncpus_pct>100.000000</max_ncpus_pct> <cpu_scheduling_period_minutes>60.000000</cpu_scheduling_period_minutes> <disk_interval>600.000000</disk_interval> <disk_max_used_gb>40.000000</disk_max_used_gb> <disk_max_used_pct>100.000000</disk_max_used_pct> <disk_min_free_gb>0.000000</disk_min_free_gb> <vm_max_used_pct>75.000000</vm_max_used_pct> <ram_max_used_busy_pct>50.000000</ram_max_used_busy_pct> <ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct> <max_bytes_sec_up>0.000000</max_bytes_sec_up> <max_bytes_sec_down>0.000000</max_bytes_sec_down> <cpu_usage_limit>100.000000</cpu_usage_limit> <daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb> <daily_xfer_period_days>0</daily_xfer_period_days> </global_preferences> root@quinkana3:~# boinccmd --get_host_info timezone: 3600 domain name: quinkana3 IP addr: 192.168.2.10 #CPUS: 112 CPU vendor: GenuineIntel CPU model: 06/55 [Family 6 Model 85 Stepping 2] CPU FP OPS: 2613250214.725667 CPU int OPS: 34274369110.431229 CPU mem BW: 1000000000.000000 OS name: Linux OS version: 
4.8.0-37-generic mem size: 539083247616.000000 cache size: 40370176.000000 swap size: 3016749056.000000 disk size: 125488705536.000000 disk free: 89312739328.000000 root@quinkana3:~# boinccmd --get_tasks ======== Tasks ======== 1) ----------- name: CMS_29065_1487279674.946249_0 WU name: CMS_29065_1487279674.946249 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:26:16 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 391.580000 current CPU time: 399.710000 fraction done: 0.363503 swap size: 2106 MB working set size: 2384 MB estimated CPU time remaining: 41296.808367 2) ----------- name: CMS_29066_1487279674.977954_0 WU name: CMS_29066_1487279674.977954 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:26:16 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 395.630000 current CPU time: 403.510000 fraction done: 0.363426 swap size: 2038 MB working set size: 2384 MB estimated CPU time remaining: 41301.817211 3) ----------- name: CMS_32355_1487280276.285156_0 WU name: CMS_32355_1487280276.285156 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:26:47 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 408.920000 current CPU time: 417.380000 fraction done: 0.363428 swap size: 2102 MB working set size: 2384 MB estimated CPU time remaining: 41301.661496 4) ----------- name: CMS_29064_1487279674.918296_0 WU name: CMS_29064_1487279674.918296 
project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:26:47 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 381.760000 current CPU time: 389.480000 fraction done: 0.363442 swap size: 2044 MB working set size: 2384 MB estimated CPU time remaining: 41300.772620 5) ----------- name: CMS_23363_1487278171.671498_0 WU name: CMS_23363_1487278171.671498 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:27:13 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 396.680000 current CPU time: 404.270000 fraction done: 0.363426 swap size: 2086 MB working set size: 2384 MB estimated CPU time remaining: 41301.817211 6) ----------- name: CMS_25608_1487278773.168180_0 WU name: CMS_25608_1487278773.168180 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:27:23 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 386.910000 current CPU time: 394.700000 fraction done: 0.363441 swap size: 2050 MB working set size: 2384 MB estimated CPU time remaining: 41300.811549 7) ----------- name: CMS_32352_1487280276.171441_0 WU name: CMS_32352_1487280276.171441 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:27:23 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 
415.360000 current CPU time: 424.130000 fraction done: 0.363426 swap size: 2080 MB working set size: 2384 MB estimated CPU time remaining: 41301.817211 8) ----------- name: CMS_25606_1487278773.131620_0 WU name: CMS_25606_1487278773.131620 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:27:44 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 395.890000 current CPU time: 404.460000 fraction done: 0.363426 swap size: 2112 MB working set size: 2384 MB estimated CPU time remaining: 41301.817211 9) ----------- name: CMS_32354_1487280276.258223_0 WU name: CMS_32354_1487280276.258223 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:27:44 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 395.600000 current CPU time: 404.100000 fraction done: 0.363426 swap size: 2092 MB working set size: 2384 MB estimated CPU time remaining: 41301.817211 10) ----------- name: CMS_32353_1487280276.203482_0 WU name: CMS_32353_1487280276.203482 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:28:00 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 393.290000 current CPU time: 402.150000 fraction done: 0.363426 swap size: 2082 MB working set size: 2384 MB estimated CPU time remaining: 41301.817211 11) ----------- name: CMS_25610_1487278773.246978_0 WU name: CMS_25610_1487278773.246978 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 
23:28:00 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 393.480000 current CPU time: 401.670000 fraction done: 0.363458 swap size: 2090 MB working set size: 2384 MB estimated CPU time remaining: 41299.721542 12) ----------- name: CMS_1155_1487280578.765914_0 WU name: CMS_1155_1487280578.765914 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:28:12 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 401.000000 current CPU time: 408.970000 fraction done: 0.363441 swap size: 2112 MB working set size: 2384 MB estimated CPU time remaining: 41300.811549 13) ----------- name: CMS_24478_1487278472.511551_0 WU name: CMS_24478_1487278472.511551 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:28:12 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 388.320000 current CPU time: 396.160000 fraction done: 0.363426 swap size: 2118 MB working set size: 2384 MB estimated CPU time remaining: 41301.817211 14) ----------- name: CMS_32358_1487280276.388765_0 WU name: CMS_32358_1487280276.388765 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:28:27 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 391.700000 current CPU time: 399.670000 fraction done: 0.363459 swap size: 
2032 MB working set size: 2384 MB estimated CPU time remaining: 41299.702077 15) ----------- name: CMS_32357_1487280276.359628_0 WU name: CMS_32357_1487280276.359628 project URL: https://lhcathome.cern.ch/lhcathome/ report deadline: Sat Mar 18 23:28:27 2017 ready to report: no got server ack: no final CPU time: 0.000000 state: downloaded scheduler state: scheduled exit_status: 0 signal: 0 suspended via GUI: no active_task_state: EXECUTING app version num: 4760 checkpoint CPU time: 410.610000 current CPU time: 417.880000 fraction done: 0.363441 swap size: 2016 MB working set size: 2384 MB estimated CPU time remaining: 41300.811549 root@quinkana3:~# boinccmd --version boinccmd, built from BOINC 7.6.33 root@quinkana3:~# boinc --version 7.6.33 x86_64-pc-linux-gnu root@quinkana3:~# lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 16.10 Release: 16.10 Codename: yakkety root@quinkana3:~# |
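To make the arithmetic in the post above concrete, here is a minimal Python sketch that counts executing tasks in the text printed by `boinccmd --get_tasks` and compares the projected disk commitment with the configured limit. The function names are hypothetical helpers, not part of BOINC, and the per-task figures (1.7 GB as reported above, or a larger guess) are inputs you have to supply yourself; BOINC does not report them in this output.

```python
import re


def count_executing(get_tasks_output: str) -> int:
    """Count tasks reported as EXECUTING in `boinccmd --get_tasks` text."""
    return len(re.findall(r"active_task_state:\s*EXECUTING", get_tasks_output))


def projected_usage_gb(n_tasks: int, per_task_gb: float) -> float:
    """Disk space the tasks will need once each has grown to its full size."""
    return n_tasks * per_task_gb


def max_tasks_within_limit(limit_gb: float, per_task_gb: float) -> int:
    """Largest number of tasks whose full-size footprint fits under the limit."""
    return int(limit_gb // per_task_gb)


def report(get_tasks_output: str, limit_gb: float, per_task_gb: float) -> str:
    """One-line summary comparing projected usage with the limit."""
    n = count_executing(get_tasks_output)
    need = projected_usage_gb(n, per_task_gb)
    verdict = "within" if need <= limit_gb else "over"
    return f"{n} executing tasks, about {need:.1f} GB projected, {verdict} the {limit_gb:.0f} GB limit"
```

With the 15 tasks listed above, 1.7 GB per task projects to 25.5 GB, which would still fit under 40 GB; if a task grows to 3 GB, the same 15 tasks project to 45 GB and the limit is already exceeded before any further work is fetched.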
Joined: 14 Jan 10 Posts: 1429 Credit: 9,524,756 RAC: 3,821 |
It seems the right ports are not enabled in your firewall. Read http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use |
Joined: 10 Feb 17 Posts: 6 Credit: 39,117 RAC: 0 |
I'm not sure I can change firewall settings for our network. Is there any way to test if the ports specified in http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use are usable or not? How does the firewall setting influence the CPU load? |
Joined: 10 Feb 17 Posts: 6 Credit: 39,117 RAC: 0 |
Sorry I wanted to ask how does the firewall setting influence the used disk space (not the CPU load)? |
Joined: 14 Jan 10 Posts: 1429 Credit: 9,524,756 RAC: 3,821 |
Sorry I wanted to ask how does the firewall setting influence the used disk space (not the CPU load)? It looks like BOINC only checks the available disk space when requesting new work. A single running CMS task easily uses more than 3GB, so starting many of them eats your free disk space. I'm not quite sure what's happening on your machine, but the VMs seem to run until the point where they should access CERN's Virtual Machine File System, and then fail to get a connection because port 3125 is blocked. |
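If BOINC really only checks disk space at the moment it requests work, the manual "No new tasks" workaround from the first post could be automated from outside the client. The following Python sketch is one possible shape for that, not a tested fix: it decides between the `nomorework` and `allowmorework` operations of `boinccmd --project` based on how much room is left under the limit. The function names and the `reserve` parameter (space kept free for tasks that have started but not yet grown to full size) are assumptions made for illustration.

```python
import shutil
from typing import List

GB = 1000 ** 3


def used_gb(path: str) -> float:
    """Used space, in GB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).used / GB


def choose_action(used: float, limit: float, reserve: float) -> str:
    """Pick the boinccmd project operation for the current disk state.

    `reserve` is a hypothetical safety margin in GB, for example
    (tasks still growing) * (expected size per task).
    """
    return "nomorework" if used + reserve > limit else "allowmorework"


def boinccmd_args(project_url: str, action: str) -> List[str]:
    """Build the argument list for `boinccmd --project URL <action>`."""
    return ["boinccmd", "--project", project_url, action]
```

A cron job or small loop could call `used_gb("/var/lib/boinc-client")`, pass the result to `choose_action`, and hand the list from `boinccmd_args` to `subprocess.run`. Since the partition in this thread is dedicated to BOINC, filesystem usage is a reasonable stand-in for BOINC's own usage; on a shared disk it would not be.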
Joined: 10 Feb 17 Posts: 6 Credit: 39,117 RAC: 0 |
I have captured network traffic between my computer and LHC@home with tcpdump 4.7.4. The measurement was started when boincmgr showed no active tasks and no new tasks were allowed. Then Projects/Allow new tasks was executed, followed by Projects/Update. At this point some new Work Units were scheduled for my computer. After this, Projects/No new tasks was executed. I have analyzed the network traffic with Wireshark 2.0.4. I found successful communication streams for TCP port 413 (which is currently not listed on page http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use ) and TCP port 80. I can confirm that I saw failed connection attempts to TCP port 3125. However, "CMS Simulation 47.60 (vbox64)" keeps running after this, possibly doing no useful computation for 18 hours? OK: lhcathome.cern.ch:413 (128.142.136.216:413) OK: cernvm.cern.ch:80 (128.142.132.139:80) SYN -> RST,ACK: cmsextproxy.cern.ch:3125 (128.142.168.202:3125) SYN -> RST,ACK: cmsextproxy.cern.ch:3125 (128.142.168.203:3125) SYN -> RST,ACK: cmsextproxy.fnal.gov:3125 (131.225.205.133:3125) SYN -> RST,ACK: cmsextproxy.fnal.gov:3125 (131.225.205.134:3125) I have also tried to open TCP connections to port 3125 on these hosts with netcat, and I get "Connection refused" from the remote end. root@quinkana3:~# nc -v cmsextproxy.cern.ch 3125 nc: connect to cmsextproxy.cern.ch port 3125 (tcp) failed: Connection refused nc: connect to cmsextproxy.cern.ch port 3125 (tcp) failed: Connection refused root@quinkana3:~# nc -v nc -v cmsextproxy.cern.ch 3125 nc: port number invalid: cmsextproxy.cern.ch root@quinkana3:~# nc -v cmsextproxy.fnal.gov 3125 nc: connect to cmsextproxy.fnal.gov port 3125 (tcp) failed: Connection refused nc: connect to cmsextproxy.fnal.gov port 3125 (tcp) failed: Connection refused root@quinkana3:~# |
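The netcat checks above can be repeated for several hosts and ports at once with a short script. This is a minimal Python sketch using only the standard `socket` module; the `probe` helper and its result labels are made up for illustration. It separates an actively refused connection (an RST, as in the capture above) from a silent timeout, though the result alone cannot tell whether the refusal comes from the remote host or from a firewall in between.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect and classify the outcome as a short label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or a device on the path) answered with a reset.
        return "refused"
    except socket.timeout:
        # No answer at all within `timeout` seconds, typical of a silent drop.
        return "timeout"
    except OSError as exc:
        # DNS failure, unreachable network, and similar errors.
        return f"error: {exc}"


def probe_all(targets, timeout: float = 3.0) -> dict:
    """Probe every (host, port) pair and return {(host, port): label}."""
    return {(h, p): probe(h, p, timeout) for h, p in targets}
```

For example, `probe_all([("lhcathome.cern.ch", 413), ("cernvm.cern.ch", 80), ("cmsextproxy.cern.ch", 3125), ("cmsextproxy.fnal.gov", 3125)])` covers the hosts and ports named in this post.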