1) Message boards : Number crunching : Did cvmfs download ~150GB of data two days ago? (Message 46567)
Posted 31 Mar 2022 by wujj123456
Did you check whether your squid correctly rejects requests initiated from outside your LAN?
I suspect you intercept all HTTP traffic from inside your LAN directly at the router and force it through squid, right?
This means at least traffic to destination port 80.
Are you aware that some CVMFS/Frontier servers use ports 8000 or 8080?
It would be worth routing them through squid as well.

Yes, the squid proxy only listens on internal interfaces and port 80. I can add a monitoring rule to check how much traffic there is on 8000 and 8080, but from what I see, I don't think there is any major traffic that the current setup fails to capture for vbox WUs.

You may try out the tuning options from my HowTo and reduce this to 256 MB.
This would leave more RAM for other use on the squid box, e.g. for disk cache.
4-10 GB is suggested to be the CVMFS disk cache size.

Good to know. I intend to capture system updates and Steam updates too, which is why it's large. Those are rare enough, though, that most of the time LHC just gets a great hit rate. :-)

If you run Theory vbox each VM will set up its own CVMFS cache (meanwhile old and degraded).
Hence, each task will send out lots of update requests.
They all get lost when the VM shuts down.
Your squid should cover most of them, but it's more efficient to run Theory native and keep the data in the local CVMFS cache on the crunching box.

That's what I thought, and native also consumes much less memory. However, if such big downloads happen often enough, it would change the balance. Hence my question: I'm trying to understand what happened and how likely or how often it could happen again.
2) Message boards : Number crunching : Did cvmfs download ~150GB of data two days ago? (Message 46564)
Posted 31 Mar 2022 by wujj123456
Large downloads happen from time to time although 150 GB within 1 h is very unusual.
CVMFS acts like a cache and tries to use as much data as possible from its local store.

This reminded me of a few interesting details. The server only had 70-80G available space left and it was not filled up AFAIK. There is no way it could have stored 150GB of data for sure. Meanwhile, the Squid cache I configured is 32GB on disk but 99%+ of hit bytes are served from the 4GB memory cache. It doesn't seem that I even need more than 4GB of data from cvmfs, assuming the vbox theory workload is similar to native other than setup.
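
A hit rate "in terms of bytes" like the one mentioned above can be estimated from Squid's access log. The following is a minimal sketch, assuming the default native `squid` log format (result code in the fourth field, reply size in bytes in the fifth). The function name and the sample lines are made up for illustration and are not taken from the actual logs.

```python
def byte_hit_ratio(lines):
    """Return (byte hit ratio, memory byte hit ratio) for Squid access log lines.

    Assumption: default native log format, i.e.
    time elapsed client code/status bytes method URL ident hierarchy type
    """
    total = hit = mem_hit = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        code = fields[3].split("/")[0]  # e.g. TCP_MEM_HIT, TCP_HIT, TCP_MISS
        try:
            size = int(fields[4])
        except ValueError:
            continue  # size field is not a number, ignore the line
        total += size
        if "HIT" in code:
            hit += size
        if "MEM_HIT" in code:
            mem_hit += size
    if total == 0:
        return 0.0, 0.0
    return hit / total, mem_hit / total


# Illustrative, made-up log lines (not real traffic).
SAMPLE = [
    "1648541760.123 12 192.168.1.10 TCP_MEM_HIT/200 900 GET http://example.invalid/a - HIER_NONE/- application/octet-stream",
    "1648541761.456 15 192.168.1.10 TCP_HIT/200 50 GET http://example.invalid/b - HIER_NONE/- application/octet-stream",
    "1648541762.789 210 192.168.1.10 TCP_MISS/200 50 GET http://example.invalid/c - HIER_DIRECT/192.0.2.1 application/octet-stream",
]

overall, from_memory = byte_hit_ratio(SAMPLE)
print(f"byte hit ratio: {overall:.2%}, served from memory: {from_memory:.2%}")
```

In practice the lines would come from the proxy's own access log file, and the two ratios show how much of the traffic the memory cache alone accounts for.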

I'm pretty curious what this download is actually doing. It kinda feels like a bug TBH...
3) Message boards : Number crunching : Did cvmfs download ~150GB of data two days ago? (Message 46563)
Posted 31 Mar 2022 by wujj123456
Thanks for the reply.

For the data cap, I mostly need to understand how much data to allocate for BOINC. Regarding ATLAS, it was running on the other Windows machine and I have set a concurrency limit. Its usage is indeed high but very predictable, and I've set aside enough for that. The server I mentioned here is S8026 in my list of computers, which only runs native Theory. Usually it has pretty low usage, but this download caught me a bit off guard.

I actually have Squid set up directly on my router, and the hit rate is superb, 99%+ in terms of bytes, for my Windows machine running both ATLAS and Theory in vbox. I didn't observe a similar excessive download from the vbox WUs during the same period. Does that mean I might be better off forgoing the native installation and relying fully on Squid caching if I want predictable bandwidth usage?
4) Message boards : Number crunching : Did cvmfs download ~150GB of data two days ago? (Message 46561)
Posted 31 Mar 2022 by wujj123456
I set up native apps for the Theory application, and thus installed cvmfs. I noticed just now that from 2022-03-29 01:16 GMT-7 to 2022-03-29 02:46 GMT-7 (accurate to a minute or two), my server that runs LHC downloaded at full speed for more than an hour, totaling around 150GB of data.
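
For a sense of scale, the average rate implied by those numbers can be worked out in a few lines. This is only back-of-the-envelope arithmetic, assuming decimal gigabytes (1 GB = 10^9 bytes) and a window of exactly 90 minutes. The function name is made up for illustration.

```python
def average_rate(total_bytes, seconds):
    """Return (megabytes per second, megabits per second) for a transfer."""
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


# Figures from the post: ~150 GB between 01:16 and 02:46, i.e. 90 minutes.
# Assumption: decimal units (1 GB = 1e9 bytes).
mb_per_s, mbit_per_s = average_rate(150e9, 90 * 60)
print(f"{mb_per_s:.1f} MB/s, {mbit_per_s:.0f} Mbit/s")
```

That works out to roughly 28 MB/s, or about 222 Mbit/s sustained over the whole window.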

From logging on my router, I can see that all of the traffic came from 2606:4700:3033::6815:48a2, which is a Cloudflare address. Then I retrieved the syslog for my system, and the cvmfs-related logs stood out: https://pastebin.com/rTVx9r3C. s1asgc-cvmfs.openhtc.io resolves to that exact address: https://pastebin.com/YfiR9SKB

Unfortunately I have a data cap from my ISP, so I need to be a bit more careful about such incidents.

I've been running native Theory on the same server for a year or two now, and this is the first time I've noticed such a thing happening. I haven't touched its setup for quite a while, so I am fairly confident nothing has changed on my end.

Was this some one-time big update? A bug? Or is it expected from time to time?
5) Message boards : ATLAS application : VM did not power off when requested (Message 43712)
Posted 25 Nov 2020 by wujj123456
Do you have the same problem of disconnection with the Virtualbox 6.1.12 from the Boinc-Homepage instead of the 6.1.16?

I haven't tried. Once the current pending work finishes I can switch to 6.1.12 and check if it helps.
6) Message boards : ATLAS application : VM did not power off when requested (Message 43699)
Posted 25 Nov 2020 by wujj123456
Nvm, that parsing is just for that line, so it's not wrong. It's not clear why it's not finding all the state transition logs then. It should have parsed line-by-line and returned the last state... :-(
7) Message boards : ATLAS application : VM did not power off when requested (Message 43698)
Posted 25 Nov 2020 by wujj123456
Thanks. I grabbed output of https://lhcathome.cern.ch/lhcathome/result.php?resultid=289380496. VBox.log output: https://pastebin.com/YZyasHCf

The power-off actually executed successfully right away, but the task output still came 5 minutes later saying the VM failed to power off. The task log also had the first state change:
2020-11-24 14:49:26 (13272): VM state change detected. (old = 'PoweredOff', new = 'Running')

Clearly the poll at https://github.com/BOINC/boinc/blob/master/samples/vboxwrapper/vbox_vboxmanage.cpp#L1253 failed to see the state change to "poweredoff", and it doesn't log the state change. From the log message "VM did not power off when requested", it's clear "online" was still set to true after 5 minutes.

I think the bug is in https://github.com/BOINC/boinc/blob/master/samples/vboxwrapper/vbox_common.cpp#L450, read_vm_log function itself.

There is no more "Guest Log" in VBox.log after "02:32:35.088629 VMMDev: Guest Log: *** Success! Shutting down the machine. ***". Then, if we look at the body of the while loop, the first line.find(console) would have found Console: Machine state changed to 'Stopping', which still sets online to true. Then the second call, line_pos = line.find("Guest Log:"), would seek to the end of the file, save the cursor there, and miss all state changes in between.
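
The line-by-line approach described here (scan every line and keep the last state seen) can be sketched in Python as follows. This is not the BOINC vboxwrapper code, just an illustration of the idea. The function name is made up, and the sample log lines are illustrative, modeled on the "Machine state changed to" message quoted above.

```python
import re

# Matches lines such as: Console: Machine state changed to 'Stopping'
STATE_RE = re.compile(r"Machine state changed to '([^']+)'")


def last_vm_state(log_text):
    """Scan the whole log line by line and return the last reported state.

    Returns None if no state change line is found.
    """
    state = None
    for line in log_text.splitlines():
        match = STATE_RE.search(line)
        if match:
            state = match.group(1)  # later lines override earlier ones
    return state


# Illustrative log excerpt; the timestamps on the Console lines are made up.
SAMPLE_LOG = """\
02:32:35.088629 VMMDev: Guest Log: *** Success! Shutting down the machine. ***
02:32:35.100000 Console: Machine state changed to 'Stopping'
02:32:35.900000 Console: Machine state changed to 'PoweredOff'
"""

print(last_vm_state(SAMPLE_LOG))
```

Because no line is skipped based on where the last "Guest Log:" entry sits, a state change that arrives after the final guest message is still picked up.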

I could be wrong, but if not, how is this working on Linux? Is this wrapper code specific to Windows, or is the VBox.log output slightly different on Linux?
8) Message boards : ATLAS application : VM did not power off when requested (Message 43696)
Posted 24 Nov 2020 by wujj123456
Thanks for the code pointer and explanation. Is the kernel log of the VM kept somewhere in the output directory? Since this only happens on Windows, it's probably an issue with VirtualBox. I kinda want to verify whether the VM at least received the shutdown command.

Given the wrapper is BOINC code, I guess there is nothing a project can do. I haven't tried other VM projects on Windows yet. Let me see if this can be reproduced for VMs from other projects too. If so, it's probably something that could be discussed on the BOINC GitHub.
9) Message boards : ATLAS application : VM did not power off when requested (Message 43694)
Posted 24 Nov 2020 by wujj123456
Yes, I realized the start and shutdown time is always this, but I am curious what the VM is doing during those 5 minutes. I mean, if my computer always took more than 5 minutes to shut down and I had to pull the plug each time, I would probably want to figure out why.

Despite semi-related discussion in many threads, I am not able to find details about these specific 5 minutes. If it's not doing anything useful, and given we are terminating the VM at the end anyway, I wonder if we can just shorten the timeout and clean up the VM sooner so we can all get more work done.
10) Message boards : ATLAS application : VM did not power off when requested (Message 43692)
Posted 24 Nov 2020 by wujj123456
It seems that all my ATLAS tasks waste 5 minutes at the end waiting for the VM to power off. For example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=289365725

2020-11-23 15:07:03 (16632): VM Completion File Detected.
2020-11-23 15:07:03 (16632): Powering off VM.
2020-11-23 15:12:05 (16632): VM did not power off when requested.
2020-11-23 15:12:05 (16632): VM was successfully terminated.
2020-11-23 15:12:05 (16632): Deregistering VM. (boinc_683247da3d5f3b86, slot#32)
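
The wait can be measured directly from the timestamps in the log above. The following is a small sketch, assuming the "YYYY-MM-DD HH:MM:SS" prefix shown in these lines. The function names are made up for illustration.

```python
from datetime import datetime

LOG = """\
2020-11-23 15:07:03 (16632): VM Completion File Detected.
2020-11-23 15:07:03 (16632): Powering off VM.
2020-11-23 15:12:05 (16632): VM did not power off when requested.
2020-11-23 15:12:05 (16632): VM was successfully terminated.
"""


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' of a task log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def poweroff_wait_seconds(log_text):
    """Seconds between the power-off request and the give-up message.

    Returns None if either line is missing.
    """
    start = end = None
    for line in log_text.splitlines():
        if "Powering off VM." in line:
            start = parse_timestamp(line)
        elif "VM did not power off when requested." in line:
            end = parse_timestamp(line)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


print(poweroff_wait_seconds(LOG))
```

For the excerpt above this gives 302 seconds, i.e. the 5 minutes plus a couple of seconds.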

I've checked a dozen results so far and they all have the same message at the end. How do I debug why the VM isn't powering off when requested?

Given this doesn't seem to affect results at all, why can't we just terminate the VM directly?

PS: I am on Windows 10 64-bit Version 20H2 (OS Build 19042.630). Virtualbox version is 6.1.16 r140961. BOINC version 7.16.11.
11) Message boards : Sixtrack Application : please, remove non-optimized application SixTrack for 32 bit systems (Message 42642)
Posted 28 May 2020 by wujj123456
I would say just remove non-optimized apps altogether... On my Ryzen 3, which is perfectly capable of avx, I still get lots of non-optimized WUs, which take 50-100% longer to finish for the same credit. (I am using credit/hr as an approximation of efficiency since it's the same app. Feel free to correct the assumption if it's invalid.) I really doubt there are many systems not capable of sse2 these days, and most should be able to do avx too. It's also interesting that all apps are at least sse2 for Linux, and apparently that's not a concern.

I wonder if I could use app_info.xml to force-map the non-optimized app to the avx application? Would it generate different results that fail validation? Has anyone tried that already?
12) Message boards : Theory Application : (Native) Theory - Sherpa looooooong runners (Message 41422)
Posted 29 Jan 2020 by wujj123456
Finally finished: https://lhcathome.cern.ch/lhcathome/result.php?resultid=259641514

===> [runRivet] Mon Jan 20 15:24:51 UTC 2020 [boinc pp jets 8000 800 - sherpa 1.4.1 default 100000 16]

Run time 4 days 12 hours 57 min 5 sec
CPU time 4 days 12 hours 21 min 56 sec

It actually finishes? I have a few of these 1d+ or 2d+ WUs at 100% progress as well. I felt they would never finish...
13) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41410)
Posted 28 Jan 2020 by wujj123456
The BOINC data directory must be mounted inside the container, and with a default installation this is /var/lib/boinc-client/slots. If there are problems mounting /var you could try a different data directory or install BOINC in a different place. For example on my desktop I run boinc-client from my home directory because the root partition is too small.

Thanks for the reply. Looks like it's a bind mount, and I should be able to easily reproduce this without wasting WUs. However, it does seem to work locally, assuming that seeing the error message means the container has been set up properly with the remount.

$ sudo su -l boinc -s /bin/bash -c '/cvmfs/atlas.cern.ch/repo/containers/sw/singularity/x86_64-el7/current/bin/singularity exec --pwd /var/lib/boinc-client/slots/32 -B /cvmfs,/var /cvmfs/atlas.cern.ch/repo/containers/images/singularity/x86_64-centos7.img sh ls'
INFO: Convert SIF file to sandbox...
/usr/bin/ls: /usr/bin/ls: cannot execute binary file
INFO: Cleaning up image...

Now I wonder if it's some setting in the default unit file that came with Ubuntu 19.10: https://pastebin.com/akEe8cyY. I am not that familiar with systemd unit files, but nothing looks suspicious after searching the man page. Clearly the symlink /var/lib/boinc should have been resolved, given all WUs read/write /var/lib/boinc-client/ without a problem. Any ideas where I should look next?
14) Message boards : ATLAS application : error on Atlas native: 195 (0x000000C3) EXIT_CHILD_FAILED (Message 41367)
Posted 27 Jan 2020 by wujj123456

A bit of research on the stderr error message may be significant.
"container creation failed: mount ->/var error: can't remount /var: operation not permitted"
It seems to have something to do with how the local storage is mounted.

I am running into the same issue. Is this /var on the host filesystem? I probably don't want singularity to remount /var on my host system, but if it's trying to remount because of some missing flags, I can probably check what they do and add them so that the remount becomes a no-op and succeeds.


If I can't resolve this, is there a way to disable native ATLAS while allowing native Theory, without refusing ATLAS work entirely?
15) Message boards : Theory Application : Unable to start VM on some WUs (Message 41258)
Posted 14 Jan 2020 by wujj123456
I configured automatic upgrades, but not automatic reboots. I thought the new kernel and components would only take effect after I reboot. Let me turn auto update off to ensure vboxdrv is always in sync with the kernel and see if the results improve.
16) Message boards : Theory Application : Unable to start VM on some WUs (Message 41245)
Posted 14 Jan 2020 by wujj123456
I checked a few failed tasks and they all failed with messages like this.

2020-01-12 15:31:37 (2138):
Command: VBoxManage -q startvm "boinc_74513d880c5d6ae6" --type headless
Exit Code: 1
WARNING: The character device /dev/vboxdrv does not exist.
Please install the virtualbox-dkms package and the appropriate
headers, most likely linux-headers-generic.

You will not be able to start VMs until this problem is fixed.
VBoxManage: error: The virtual machine 'boinc_74513d880c5d6ae6' has terminated unexpectedly during startup with exit code 1 (0x1)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component MachineWrap, interface IMachine
Waiting for VM "boinc_74513d880c5d6ae6" to power on...\

Example failures:

However, the /dev/vboxdrv exists and the mentioned packages are also installed on the host. The host has valid results for same application as well: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10595991&offset=0&show_names=0&state=0&appid=13

I couldn't find any smoking gun as to why it fails some WUs but not others. Could /dev/vboxdrv temporarily become inaccessible for some reason I should be aware of?
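
One way to catch a device node that disappears only temporarily is to poll it and record the result with a timestamp. The following is a rough sketch using only the Python standard library. The function names, the status strings, and the polling interval are made up for illustration, and the check only looks at whether the path exists and is a character device, not at whether the kernel module behind it works.

```python
import os
import stat
import time


def char_device_status(path="/dev/vboxdrv"):
    """Classify a path as 'ok', 'missing', 'not-a-char-device' or an error."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return "missing"
    except OSError as exc:
        return f"error: {exc.strerror}"
    if not stat.S_ISCHR(st.st_mode):
        return "not-a-char-device"
    return "ok"


def poll(path="/dev/vboxdrv", interval=60.0, count=5):
    """Check the device `count` times, `interval` seconds apart.

    Returns a list of (timestamp, status) tuples.
    """
    results = []
    for i in range(count):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        results.append((stamp, char_device_status(path)))
        if i < count - 1:
            time.sleep(interval)
    return results


# Single check; a longer run would call poll() and compare the timestamps
# with the failure times in the task logs.
print(char_device_status())
```

Any entry other than "ok" that lines up with a failed task's start time would point at the device node rather than at the task itself.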
