Questions and Answers : Unix/Linux : All disk space is consumed even if disk limit is set

Marton Nemeth

Joined: 10 Feb 17
Posts: 6
Credit: 39,117
RAC: 0
Message 28897 - Posted: 17 Feb 2017, 5:31:37 UTC

I'm running Boinc 7.6.33 and Oracle VirtualBox 5.1.14 r112924 (Qt5.6.1) on Ubuntu 16.10. There is a 128GB partition allocated only for Boinc, which is available under /var/lib/boinc-client. I have set the maximum disk usage in boincmgr at Options/Computing Preferences/Disk and Memory/Disk/Use no more than: 40GB. When I start LHC@home, 3..5 tasks are scheduled every 10..15 seconds. For example, the task "CMS Simulation 47.60 (vbox64)" starts, but it doesn't immediately use the 1.7GB of space per Work Unit under /var/lib/boinc-client/slots; it takes time for the WU to allocate that disk space. In the meantime more and more WUs are scheduled for the computer, and when these tasks actually start running, all of the 128GB disk space is consumed even though the 40GB disk limit is set. Finally, when no disk space is left, boinc stops working, all the WUs are stopped as well, and the error message "Couldn't write state file: fwrite() failed; giving up" appears in /var/log/boinc.log.

I have the feeling that Boinc doesn't take the real disk need into account when scheduling new WUs. For example, for CMS Simulation, 1.7GB of disk space should be reserved from the allowed 40GB for each scheduled WU, even if its actual usage hasn't reached 1.7GB yet.
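
To make the arithmetic behind this concrete, here is a minimal Python sketch. The 1.7GB per task, the 40GB limit and the 112 CPUs are the numbers from this post; the function names are made up for illustration, and this is not how BOINC itself accounts for disk space.

```python
import math

def tasks_within_limit(limit_gb: float, per_task_gb: float) -> int:
    """How many tasks fit if each one reserves its full disk need up front."""
    return math.floor(limit_gb / per_task_gb)

def worst_case_usage_gb(n_tasks: int, per_task_gb: float) -> float:
    """Disk used once every running task has grown to its full size."""
    return n_tasks * per_task_gb

limit_gb = 40.0      # "Use no more than" setting from this post
per_task_gb = 1.7    # observed slot size per CMS work unit
n_cpus = 112         # "#CPUS" reported by boinccmd --get_host_info

print("tasks that fit under the limit:", tasks_within_limit(limit_gb, per_task_gb))
print("worst case with one task per CPU (GB):", worst_case_usage_gb(n_cpus, per_task_gb))
```

With these numbers only 23 tasks fit under the 40GB limit, while one task per CPU would need roughly 190GB, which is more than the whole 128GB partition.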

I can work around this problem manually by waiting for a couple of tasks to be scheduled and then using the Projects/No new tasks button before the disk space is overallocated. This way some computation can be done, but no new tasks are automatically scheduled to the computer when the WUs are finished. The state below shows the situation.

Do you know how to fix this problem properly?
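
The manual workaround could in principle be automated. The following is only a sketch under assumptions, not a proper fix: it assumes the partition is used by Boinc alone (as on this machine), so filesystem usage is a fair proxy for Boinc's usage, and it relies on the boinccmd project operations nomorework and allowmorework, which as far as I know correspond to the No new tasks / Allow new tasks buttons. The 5GB headroom and all function names are hypothetical.

```python
import shutil
import subprocess

BOINC_DIR = "/var/lib/boinc-client"                   # data directory from this post
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"  # project URL from the task list
LIMIT_GB = 40.0                                       # configured disk limit
HEADROOM_GB = 5.0                                     # assumed safety margin (hypothetical)

def used_gb(path: str) -> float:
    """Space used on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).used / 1e9

def work_fetch_action(used: float, limit: float, headroom: float) -> str:
    """Pick the boinccmd project operation for the current disk usage."""
    return "nomorework" if used >= limit - headroom else "allowmorework"

def build_command(action: str) -> list:
    """Command line that applies the chosen operation to the project."""
    return ["boinccmd", "--project", PROJECT_URL, action]

def main() -> None:
    """One check; meant to be run periodically, e.g. from cron."""
    action = work_fetch_action(used_gb(BOINC_DIR), LIMIT_GB, HEADROOM_GB)
    subprocess.run(build_command(action), check=True)
```

Calling main() every minute or so would stop work fetch shortly before the limit is reached and re-enable it once finished WUs have freed space. It does not help with tasks that are already downloaded and still growing, so the headroom would have to be chosen generously.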

root@quinkana3:~# cat /etc/boinc-client/cc_config.xml
<!--
This is a minimal configuration file cc_config.xml of the BOINC core client.
For a complete list of all available options and logging flags and their
meaning see: https://boinc.berkeley.edu/wiki/client_configuration
-->
<cc_config>
<log_flags>
<task>1</task>
<file_xfer>1</file_xfer>
<sched_ops>1</sched_ops>
</log_flags>
</cc_config>
root@quinkana3:~# cat /etc/boinc-client/global_prefs_override.xml
<global_preferences>
<run_on_batteries>0</run_on_batteries>
<run_if_user_active>1</run_if_user_active>
<run_gpu_if_user_active>1</run_gpu_if_user_active>
<suspend_cpu_usage>0.000000</suspend_cpu_usage>
<start_hour>0.000000</start_hour>
<end_hour>0.000000</end_hour>
<net_start_hour>0.000000</net_start_hour>
<net_end_hour>0.000000</net_end_hour>
<leave_apps_in_memory>0</leave_apps_in_memory>
<confirm_before_connecting>1</confirm_before_connecting>
<hangup_if_dialed>0</hangup_if_dialed>
<dont_verify_images>0</dont_verify_images>
<work_buf_min_days>0.010000</work_buf_min_days>
<work_buf_additional_days>0.020000</work_buf_additional_days>
<max_ncpus_pct>100.000000</max_ncpus_pct>
<cpu_scheduling_period_minutes>60.000000</cpu_scheduling_period_minutes>
<disk_interval>600.000000</disk_interval>
<disk_max_used_gb>40.000000</disk_max_used_gb>
<disk_max_used_pct>100.000000</disk_max_used_pct>
<disk_min_free_gb>0.000000</disk_min_free_gb>
<vm_max_used_pct>75.000000</vm_max_used_pct>
<ram_max_used_busy_pct>50.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct>
<max_bytes_sec_up>0.000000</max_bytes_sec_up>
<max_bytes_sec_down>0.000000</max_bytes_sec_down>
<cpu_usage_limit>100.000000</cpu_usage_limit>
<daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb>
<daily_xfer_period_days>0</daily_xfer_period_days>
</global_preferences>
root@quinkana3:~# boinccmd --get_host_info
timezone: 3600
domain name: quinkana3
IP addr: 192.168.2.10
#CPUS: 112
CPU vendor: GenuineIntel
CPU model: 06/55 [Family 6 Model 85 Stepping 2]
CPU FP OPS: 2613250214.725667
CPU int OPS: 34274369110.431229
CPU mem BW: 1000000000.000000
OS name: Linux
OS version: 4.8.0-37-generic
mem size: 539083247616.000000
cache size: 40370176.000000
swap size: 3016749056.000000
disk size: 125488705536.000000
disk free: 89312739328.000000
root@quinkana3:~# boinccmd --get_tasks

======== Tasks ========
1) -----------
name: CMS_29065_1487279674.946249_0
WU name: CMS_29065_1487279674.946249
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:26:16 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 391.580000
current CPU time: 399.710000
fraction done: 0.363503
swap size: 2106 MB
working set size: 2384 MB
estimated CPU time remaining: 41296.808367
2) -----------
name: CMS_29066_1487279674.977954_0
WU name: CMS_29066_1487279674.977954
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:26:16 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 395.630000
current CPU time: 403.510000
fraction done: 0.363426
swap size: 2038 MB
working set size: 2384 MB
estimated CPU time remaining: 41301.817211
3) -----------
name: CMS_32355_1487280276.285156_0
WU name: CMS_32355_1487280276.285156
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:26:47 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 408.920000
current CPU time: 417.380000
fraction done: 0.363428
swap size: 2102 MB
working set size: 2384 MB
estimated CPU time remaining: 41301.661496
4) -----------
name: CMS_29064_1487279674.918296_0
WU name: CMS_29064_1487279674.918296
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:26:47 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 381.760000
current CPU time: 389.480000
fraction done: 0.363442
swap size: 2044 MB
working set size: 2384 MB
estimated CPU time remaining: 41300.772620
5) -----------
name: CMS_23363_1487278171.671498_0
WU name: CMS_23363_1487278171.671498
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:27:13 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 396.680000
current CPU time: 404.270000
fraction done: 0.363426
swap size: 2086 MB
working set size: 2384 MB
estimated CPU time remaining: 41301.817211
6) -----------
name: CMS_25608_1487278773.168180_0
WU name: CMS_25608_1487278773.168180
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:27:23 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 386.910000
current CPU time: 394.700000
fraction done: 0.363441
swap size: 2050 MB
working set size: 2384 MB
estimated CPU time remaining: 41300.811549
7) -----------
name: CMS_32352_1487280276.171441_0
WU name: CMS_32352_1487280276.171441
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:27:23 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 415.360000
current CPU time: 424.130000
fraction done: 0.363426
swap size: 2080 MB
working set size: 2384 MB
estimated CPU time remaining: 41301.817211
8) -----------
name: CMS_25606_1487278773.131620_0
WU name: CMS_25606_1487278773.131620
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:27:44 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 395.890000
current CPU time: 404.460000
fraction done: 0.363426
swap size: 2112 MB
working set size: 2384 MB
estimated CPU time remaining: 41301.817211
9) -----------
name: CMS_32354_1487280276.258223_0
WU name: CMS_32354_1487280276.258223
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:27:44 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 395.600000
current CPU time: 404.100000
fraction done: 0.363426
swap size: 2092 MB
working set size: 2384 MB
estimated CPU time remaining: 41301.817211
10) -----------
name: CMS_32353_1487280276.203482_0
WU name: CMS_32353_1487280276.203482
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:28:00 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 393.290000
current CPU time: 402.150000
fraction done: 0.363426
swap size: 2082 MB
working set size: 2384 MB
estimated CPU time remaining: 41301.817211
11) -----------
name: CMS_25610_1487278773.246978_0
WU name: CMS_25610_1487278773.246978
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:28:00 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 393.480000
current CPU time: 401.670000
fraction done: 0.363458
swap size: 2090 MB
working set size: 2384 MB
estimated CPU time remaining: 41299.721542
12) -----------
name: CMS_1155_1487280578.765914_0
WU name: CMS_1155_1487280578.765914
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:28:12 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 401.000000
current CPU time: 408.970000
fraction done: 0.363441
swap size: 2112 MB
working set size: 2384 MB
estimated CPU time remaining: 41300.811549
13) -----------
name: CMS_24478_1487278472.511551_0
WU name: CMS_24478_1487278472.511551
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:28:12 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 388.320000
current CPU time: 396.160000
fraction done: 0.363426
swap size: 2118 MB
working set size: 2384 MB
estimated CPU time remaining: 41301.817211
14) -----------
name: CMS_32358_1487280276.388765_0
WU name: CMS_32358_1487280276.388765
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:28:27 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 391.700000
current CPU time: 399.670000
fraction done: 0.363459
swap size: 2032 MB
working set size: 2384 MB
estimated CPU time remaining: 41299.702077
15) -----------
name: CMS_32357_1487280276.359628_0
WU name: CMS_32357_1487280276.359628
project URL: https://lhcathome.cern.ch/lhcathome/
report deadline: Sat Mar 18 23:28:27 2017
ready to report: no
got server ack: no
final CPU time: 0.000000
state: downloaded
scheduler state: scheduled
exit_status: 0
signal: 0
suspended via GUI: no
active_task_state: EXECUTING
app version num: 4760
checkpoint CPU time: 410.610000
current CPU time: 417.880000
fraction done: 0.363441
swap size: 2016 MB
working set size: 2384 MB
estimated CPU time remaining: 41300.811549
root@quinkana3:~# boinccmd --version
boinccmd, built from BOINC 7.6.33
root@quinkana3:~# boinc --version
7.6.33 x86_64-pc-linux-gnu
root@quinkana3:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.10
Release: 16.10
Codename: yakkety
root@quinkana3:~#
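
For reference, the two numbers that matter in the output above (how many tasks are executing, and how much of the partition is already used) can be pulled out with a short Python sketch. It assumes the "key: value" layout exactly as pasted in this post; the function names are made up for illustration.

```python
def parse_fields(text: str) -> dict:
    """Collect 'key: value' lines into a dict (later keys overwrite earlier ones)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def disk_used_gb(host_info_output: str) -> float:
    """Used space in GB (10**9 bytes) from 'disk size' and 'disk free'."""
    fields = parse_fields(host_info_output)
    return (float(fields["disk size"]) - float(fields["disk free"])) / 1e9

def count_executing(get_tasks_output: str) -> int:
    """Count tasks whose active_task_state is EXECUTING."""
    count = 0
    for line in get_tasks_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "active_task_state" and value.strip() == "EXECUTING":
            count += 1
    return count

# The two disk lines from the --get_host_info output above.
host_info = "disk size: 125488705536.000000\ndisk free: 89312739328.000000\n"
print(round(disk_used_gb(host_info), 1))
```

For the snapshot in this post that gives 15 executing tasks and about 36.2GB already in use on the partition, i.e. close to the 40GB limit with only 15 of the 112 CPUs busy.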
ID: 28897
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,490,244
RAC: 1,961
Message 28899 - Posted: 17 Feb 2017, 7:01:33 UTC

It seems you don't have the right ports enabled in your firewall.

Read http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use
ID: 28899
Marton Nemeth

Joined: 10 Feb 17
Posts: 6
Credit: 39,117
RAC: 0
Message 28900 - Posted: 17 Feb 2017, 8:49:16 UTC - in response to Message 28899.  

I'm not sure I can change the firewall settings for our network. Is there any way to test whether the ports specified in http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use are usable?

How does the firewall setting influence the CPU load?
ID: 28900
Marton Nemeth

Joined: 10 Feb 17
Posts: 6
Credit: 39,117
RAC: 0
Message 28901 - Posted: 17 Feb 2017, 8:50:17 UTC - in response to Message 28900.  

Sorry, I wanted to ask how the firewall setting influences the used disk space (not the CPU load).
ID: 28901
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,490,244
RAC: 1,961
Message 28906 - Posted: 17 Feb 2017, 13:09:07 UTC - in response to Message 28901.  

Sorry, I wanted to ask how the firewall setting influences the used disk space (not the CPU load).

It looks like BOINC only checks the available disk space when requesting new work.
One running CMS task easily uses >3GB, so starting many of them eats up your free disk space.

I'm not quite sure what's happening on your machine, but the VMs seem to start and run until the moment
they should access CERN's Virtual Machine File System, and then they don't get a connection because port 3125 is blocked.
ID: 28906
Marton Nemeth

Joined: 10 Feb 17
Posts: 6
Credit: 39,117
RAC: 0
Message 28909 - Posted: 18 Feb 2017, 1:38:47 UTC - in response to Message 28906.  

I have captured the network traffic between my computer and LHC@home with tcpdump 4.7.4. The measurement was started when boincmgr showed no active tasks and no new tasks were allowed. Then Projects/Allow new tasks was executed, followed by Projects/Update. At this point some new Work Units were scheduled for my computer. After this, Projects/No new tasks was executed.

I have analyzed the network traffic with Wireshark 2.0.4. I found successful communication streams for TCP port 413 (which is currently not listed on the page http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use ) and TCP port 80. I can confirm that I saw failed connection attempts to TCP port 3125. However, "CMS Simulation 47.60 (vbox64)" keeps running after this, possibly doing no useful computation for 18 hours?

OK: lhcathome.cern.ch:413 (128.142.136.216:413)
OK: cernvm.cern.ch:80 (128.142.132.139:80)
SYN -> RST,ACK: cmsextproxy.cern.ch:3125 (128.142.168.202:3125)
SYN -> RST,ACK: cmsextproxy.cern.ch:3125 (128.142.168.203:3125)
SYN -> RST,ACK: cmsextproxy.fnal.gov:3125 (131.225.205.133:3125)
SYN -> RST,ACK: cmsextproxy.fnal.gov:3125 (131.225.205.134:3125)

I have tried to open a TCP connection to port 3125 on these hosts with NetCat, and I'm getting connection refused from the remote end.

root@quinkana3:~# nc -v cmsextproxy.cern.ch 3125
nc: connect to cmsextproxy.cern.ch port 3125 (tcp) failed: Connection refused
nc: connect to cmsextproxy.cern.ch port 3125 (tcp) failed: Connection refused
root@quinkana3:~# nc -v nc -v cmsextproxy.cern.ch 3125
nc: port number invalid: cmsextproxy.cern.ch
root@quinkana3:~# nc -v cmsextproxy.fnal.gov 3125
nc: connect to cmsextproxy.fnal.gov port 3125 (tcp) failed: Connection refused
nc: connect to cmsextproxy.fnal.gov port 3125 (tcp) failed: Connection refused
root@quinkana3:~#
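
The same check can be scripted for all endpoints at once. Below is a minimal Python sketch using only the standard socket module; the host names and ports are the ones from the capture above, and the function name probe_tcp is made up for illustration. Note that "refused" only means a reset came back: from the client side it cannot be told whether the remote host or a firewall in between sent it, whereas a "timeout" would rather suggest packets being silently dropped.

```python
import socket

# Endpoints observed in the capture above.
ENDPOINTS = [
    ("lhcathome.cern.ch", 413),
    ("cernvm.cern.ch", 80),
    ("cmsextproxy.cern.ch", 3125),
    ("cmsextproxy.fnal.gov", 3125),
]

def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks and similar problems end up here.
        return f"error: {exc}"

def main() -> None:
    """Probe every endpoint once and print one line per result."""
    for host, port in ENDPOINTS:
        print(f"{host}:{port} -> {probe_tcp(host, port)}")
```

Calling main() on the affected machine would be expected to print "open" for the first two endpoints and "refused" for the port 3125 ones if the situation is the same as in the nc output above.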
ID: 28909
