21)
Message boards :
ATLAS application :
Download sometime between 20 and 50 kBps
(Message 49193)
Posted 15 Jan 2024 by wujj123456 Post: I just checked and I don't see any problem with downloads in the past few days. Consistently the same 100-150 Mbps, just like before. |
22)
Message boards :
ATLAS application :
Thank you and goodbye!
(Message 49181)
Posted 13 Jan 2024 by wujj123456 Post: Given the sporadic nature and different characteristics of each batch, I'm hopeful that the prod jobs are released by humans for actual science. It would be good to confirm for sure. |
23)
Message boards :
ATLAS application :
When cvmfs will be available for Ubuntu 22.04 LTS?
(Message 48168)
Posted 1 Jun 2023 by wujj123456 Post: Which guide did you follow? I ran Ubuntu 22.04 on multiple machines and had no problem getting it working with the simple instructions on the official page: https://cvmfs.readthedocs.io/en/stable/cpt-quickstart.html#debian-ubuntu. They also come with all the recommended configuration, including Cloudflare's openhtc.io settings, so I didn't have to make any additional changes. I even use this .deb installation to get a copy of /etc/cvmfs to mount with the docker installation on my Arch Linux hosts.

Edit: Well, unless something changed in the past few months, since I haven't re-installed Ubuntu for a while.
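For reference, the whole thing on Ubuntu boils down to roughly the following (straight from that quickstart page; the repository list and quota in default.local are just example values of the kind I use, and CVMFS_HTTP_PROXY should point at your squid if you run one):

$ wget https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest_all.deb
$ sudo dpkg -i cvmfs-release-latest_all.deb
$ sudo apt update && sudo apt install -y cvmfs
$ sudo cvmfs_config setup
# /etc/cvmfs/default.local (example values only):
#   CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch,sft.cern.ch
#   CVMFS_QUOTA_LIMIT=4096
#   CVMFS_HTTP_PROXY=DIRECT   # or http://your-squid:3128
$ cvmfs_config probe

If the probe reports OK for each repository, the native apps should find what they need. |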
24)
Message boards :
ATLAS application :
Latest ATLAS jobs getting much larger in download size?
(Message 47904)
Posted 26 Mar 2023 by wujj123456 Post: Given there isn't really new science involved here, unlike what I thought, could a moderator move this thread to the ATLAS forum or just delete it? Thanks. |
25)
Message boards :
Number crunching :
Computer Optimization
(Message 47902)
Posted 25 Mar 2023 by wujj123456 Post: Regarding optimization, I honestly don't think there are hardware-specific tweaks for this project. Since you are using Windows, you might want to set up a squid proxy, which can save a lot of bandwidth while making cvmfs access faster in your VMs (a minimal example is sketched below). Other general optimizations still apply. You have the same Zen 4 CPUs as I do, and the first thing is to make sure your EXPO profile is enabled in UEFI, so the memory runs at the frequency you paid for instead of the default DDR5-4800. The "Curve Optimizer" in the BIOS gets me ~5% more frequency, but it's a trial-and-error process of pushing the curve down as far as it will go. I did it in UEFI, but since you use Windows, you might be able to use AMD's Ryzen Master to do the same without rebooting.

Linux is generally better here if you want to run LHC@home, whose applications are native to Linux. You can set up native cvmfs fairly easily on modern distros and avoid paying the VM overhead. Theory would be better on Linux too, but it currently requires some hacking of the start script to get around the cgroupv2 problem, or disabling cgroupv2 for the whole system. SixTrack has a native app for Windows, so I assume both platforms are equally good there. This is all just my own experience though; I am not a developer here.
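For the squid part, here is a minimal sketch of the kind of config I mean; the LAN range, sizes, and cache path are just examples and not anything official from the project:

# /etc/squid/squid.conf (minimal example)
http_port 3128
acl localnet src 192.168.0.0/16           # adjust to your LAN
http_access allow localnet
http_access deny all
cache_mem 1024 MB                         # memory cache for hot cvmfs objects
maximum_object_size 1024 MB               # default is only 4 MB; large inputs need more
maximum_object_size_in_memory 64 MB
cache_dir ufs /var/spool/squid 32768 16 256   # 32 GB on-disk cache

Then point the BOINC client's HTTP proxy setting at that host and port; as far as I know the vbox apps pick up the client's proxy and route their cvmfs/Frontier traffic through it. |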
26)
Message boards :
Number crunching :
Computer Optimization
(Message 47901)
Posted 25 Mar 2023 by wujj123456 Post: AVX512 is AFAIK not used yet: "executing command: grep -o 'avx2[^ ]*\|AVX2[^ ]*' /proc/cpuinfo" from the ATLAS native log of this user, with a CPU that is at least the same generation as your Ryzen 9 7950X 16-Core: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10821683&offset=0&show_names=0&state=4&appid=

That grep pattern only matches avx2/AVX2, so it won't find avx512 even if it's there. In addition, the output is redirected by the Python code logging these commands anyway, likely because it checks the output to make decisions. The grep command shown should have returned 32 lines of avx2, since the 7950X does support AVX2 and has it in its feature flags:

$ grep -o 'avx2[^ ]*\|AVX2[^ ]*' /proc/cpuinfo | uniq -c
     32 avx2

Regarding avx512, you are probably still right anyway. Given that the setup script only greps for avx2, it likely means avx512 isn't a concern whatsoever. I did some digging since I was curious after your reply, and you happened to point to one of my hosts. :-P The command that's consuming CPU on my host:

/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt /cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasCore/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/bin/athena.py --preloadlib=/cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasExternals/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/lib/libintlc.so.5:/cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasExternals/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/lib/libimf.so runargs.EVNTtoHITS.py SimuJobTransforms/skeleton.EVGENtoHIT_ISF.py

That Python code is most likely just an uninteresting wrapper, and I bet the bulk of the work happens inside the preloaded libraries. avx512 instructions use zmm registers, so I grepped the disassembly for those.
$ objdump -d /cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasExternals/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/lib/libimf.so | grep zmm
$ objdump -d /cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasExternals/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/lib/libintlc.so.5 | grep zmm
   32dc7: 62 f1 fe 48 6f 06       vmovdqu64 (%rsi),%zmm0
   32dd4: 62 d1 7d 48 e7 03       vmovntdq %zmm0,(%r11)
   32dda: 62 f1 fe 48 6f 46 01    vmovdqu64 0x40(%rsi),%zmm0
   32de8: 62 d1 7d 48 e7 43 01    vmovntdq %zmm0,0x40(%r11)
   32def: 62 f1 fe 48 6f 46 02    vmovdqu64 0x80(%rsi),%zmm0
   32dfd: 62 d1 7d 48 e7 43 02    vmovntdq %zmm0,0x80(%r11)
   32e04: 62 f1 fe 48 6f 46 03    vmovdqu64 0xc0(%rsi),%zmm0
   32e12: 62 d1 7d 48 e7 43 03    vmovntdq %zmm0,0xc0(%r11)
   32e2e: 62 f1 fe 48 6f 46 fc    vmovdqu64 -0x100(%rsi),%zmm0
   32e3c: 62 d1 7d 48 e7 43 fc    vmovntdq %zmm0,-0x100(%r11)
   32e43: 62 f1 fe 48 6f 46 fd    vmovdqu64 -0xc0(%rsi),%zmm0
   32e51: 62 d1 7d 48 e7 43 fd    vmovntdq %zmm0,-0xc0(%r11)
   32e58: 62 f1 fe 48 6f 46 fe    vmovdqu64 -0x80(%rsi),%zmm0
   32e66: 62 d1 7d 48 e7 43 fe    vmovntdq %zmm0,-0x80(%r11)
   32e6d: 62 f1 fe 48 6f 46 ff    vmovdqu64 -0x40(%rsi),%zmm0
   32e7b: 62 d1 7d 48 e7 43 ff    vmovntdq %zmm0,-0x40(%r11)
   32f7b: 62 f1 7c 48 10 46 f9    vmovups -0x1c0(%rsi),%zmm0
   32f82: 62 d1 7c 48 29 43 f9    vmovaps %zmm0,-0x1c0(%r11)
   32f89: 62 f1 7c 48 10 46 fa    vmovups -0x180(%rsi),%zmm0
   32f90: 62 d1 7c 48 29 43 fa    vmovaps %zmm0,-0x180(%r11)
   32f97: 62 f1 7c 48 10 46 fb    vmovups -0x140(%rsi),%zmm0
   32f9e: 62 d1 7c 48 29 43 fb    vmovaps %zmm0,-0x140(%r11)
   32fa5: 62 f1 7c 48 10 46 fc    vmovups -0x100(%rsi),%zmm0
   32fac: 62 d1 7c 48 29 43 fc    vmovaps %zmm0,-0x100(%r11)
   32fb3: 62 f1 7c 48 10 46 fd    vmovups -0xc0(%rsi),%zmm0
   32fba: 62 d1 7c 48 29 43 fd    vmovaps %zmm0,-0xc0(%r11)
   32fc1: 62 f1 7c 48 10 46 fe    vmovups -0x80(%rsi),%zmm0
   32fc8: 62 d1 7c 48 29 43 fe    vmovaps %zmm0,-0x80(%r11)
   32fcf: 62 f1 7c 48 10 46 ff    vmovups -0x40(%rsi),%zmm0
   32fd6: 62 d1 7c 48 29 43 ff    vmovaps %zmm0,-0x40(%r11)
   330a9: 62 f1 7c 48 10 06       vmovups (%rsi),%zmm0
   330af: 62 d1 7c 48 11 03       vmovups %zmm0,(%r11)
   330ba: 62 f1 7c 48 10 06       vmovups (%rsi),%zmm0
   330c0: 62 d1 7c 48 11 03       vmovups %zmm0,(%r11)
   330c6: 62 d1 7c 48 10 40 ff    vmovups -0x40(%r8),%zmm0
   330cd: 62 d1 7c 48 11 42 ff    vmovups %zmm0,-0x40(%r10)
   3475f: 62 d2 7d 48 7c c1       vpbroadcastd %r9d,%zmm0
   347b0: 62 d1 7d 48 e7 02       vmovntdq %zmm0,(%r10)
   347b6: 62 d1 7d 48 e7 42 01    vmovntdq %zmm0,0x40(%r10)
   347bd: 62 d1 7d 48 e7 42 02    vmovntdq %zmm0,0x80(%r10)
   347c4: 62 d1 7d 48 e7 42 03    vmovntdq %zmm0,0xc0(%r10)
   347d9: 62 d1 7d 48 e7 42 fc    vmovntdq %zmm0,-0x100(%r10)
   347e0: 62 d1 7d 48 e7 42 fd    vmovntdq %zmm0,-0xc0(%r10)
   347e7: 62 d1 7d 48 e7 42 fe    vmovntdq %zmm0,-0x80(%r10)
   347ee: 62 d1 7d 48 e7 42 ff    vmovntdq %zmm0,-0x40(%r10)
   34888: 62 d1 7c 48 29 42 f9    vmovaps %zmm0,-0x1c0(%r10)
   3488f: 62 d1 7c 48 29 42 fa    vmovaps %zmm0,-0x180(%r10)
   34896: 62 d1 7c 48 29 42 fb    vmovaps %zmm0,-0x140(%r10)
   3489d: 62 d1 7c 48 29 42 fc    vmovaps %zmm0,-0x100(%r10)
   348a4: 62 d1 7c 48 29 42 fd    vmovaps %zmm0,-0xc0(%r10)
   348ab: 62 d1 7c 48 29 42 fe    vmovaps %zmm0,-0x80(%r10)
   348b2: 62 d1 7c 48 29 42 ff    vmovaps %zmm0,-0x40(%r10)
   3494c: 62 d1 7c 48 11 02       vmovups %zmm0,(%r10)
   34957: 62 d1 7c 48 11 02       vmovups %zmm0,(%r10)
   3495d: 62 d1 7c 48 11 40 ff    vmovups %zmm0,-0x40(%r8)

Other than a small chunk of code that's using avx512 for memcpy and memset, there are no avx512 compute instructions AFAIC. It's also not a given these instructions are actually executed even if they are in the library assembly.
PS: There is probably a more direct way of confirming this by instrumenting with tools like perf, but I don't do profiling often enough to know how. :-(
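If someone wants to try, this is roughly what I had in mind; the PID placeholder and symbol name are obviously things you have to fill in yourself, and perf generally needs root (or a relaxed perf_event_paranoid):

# Sample the running athena process for a minute, then inspect the hot code:
$ sudo perf record -p <athena_pid> -- sleep 60
$ sudo perf report                 # find the hot symbols in libimf.so / libintlc.so.5
$ sudo perf annotate <hot_symbol>  # the annotated disassembly shows whether zmm code actually runs

Counting retired 512-bit FP operations with perf stat would be even more direct, but the event names for that are CPU-specific, so I'll leave it at this sketch. |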
27)
Message boards :
ATLAS application :
Latest ATLAS jobs getting much larger in download size?
(Message 47900)
Posted 25 Mar 2023 by wujj123456 Post: Thanks. Sorry, I missed that. I didn't expect that thread to contain the information I was looking for. Hopefully the returned results are still useful even though the WUs weren't intended for BOINC. :-) |
28)
Message boards :
ATLAS application :
Latest ATLAS jobs getting much larger in download size?
(Message 47898)
Posted 25 Mar 2023 by wujj123456 Post: I noticed today that since Mar 22 my BOINC download volume has increased dramatically. After correlating detailed traffic data on my hosts with the BOINC log, I confirmed the traffic comes from ATLAS downloads. Each ATLAS WU now downloads around 1.1 GB of data for the *.pool.root.1 file. They used to be 250 MB each, and I still have some of the smaller ones for comparison. The WU disk usage also reflects the change, likely after decompression.

Example old WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=389847635
Example new WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=390698043

I remember reading about LHC upgrades last year and I wonder if this is a result of the upgrade? It would be interesting to know what was added here. I also wonder if it would make sense to re-introduce long simulation WUs to balance out the network-to-compute ratio, though it's not really causing a problem for me.
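For anyone who wants to check their own hosts, this is the kind of spot check I did on disk (the data directory is the Debian/Ubuntu default; adjust the path if your client lives elsewhere):

# List the ATLAS input files currently on disk and their sizes:
$ ls -lh /var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/ | grep -i 'pool\.root'
# Or total them up:
$ du -ch /var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/*pool.root* 2>/dev/null | tail -1

This is just a rough check; the per-WU numbers above came from correlating router traffic data with the BOINC log, as mentioned. |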
29)
Message boards :
ATLAS application :
No HITS File But Still Granted Credit?
(Message 47637)
Posted 1 Jan 2023 by wujj123456 Post: "In the past, a few users complained about that and suggested not to reward the user in any case of an error."

In this case, does the result show as "Completed and validated" or "Error while computing"? Even if the team decides to grant credit, I would like to get some signal that things went wrong. Unless the user is familiar with the internals of the WU, the result status and credits are the only signals available to us to determine whether anything is off. That's a signal common across all BOINC projects too.

I know this is the ATLAS forum, but I feel my experience with Theory is very relevant to this discussion. The first time I started running native Theory, I thought it was unexpected for some WUs to run very long given the average behavior, so I had a cron job to kill the worker process (e.g., Sherpa, rivetvm.exe, etc.) if it ran for more than 12 (or 24?) hours. I didn't abort the task directly simply because finding the offending long-running process with ps was simpler. The results were "Completed and validated", so I assumed my action had no side effects. If the WUs had failed, I certainly wouldn't have continued doing this. Later I tried the same on another machine running vbox, and killing the vboxwrapper failed the task, which led me to look closer. Finally I came to the forum and soon learnt it's normal for some Theory WUs to run long. Needless to say, I don't kill any processes anymore, but not before I had generated a few dozen bogus results. Thus I prefer some clear way of knowing my results were bad, whether I get credit or not.

"In those cases the project grants credit although it doesn't get its own reward (the HITS file)."

Hmm, even for people going after credits, the fact that we all picked BOINC and a specific project, instead of some pointless workload, should mean the science results are at least remotely relevant to us. I don't know how people would feel about getting credits while not actually helping. I personally would rather get no credit for errors so I can investigate further. |
30)
Message boards :
Theory Application :
Theory native fails with "mountpoint for cgroup not found"
(Message 47620)
Posted 26 Dec 2022 by wujj123456 Post: "Thanks for looking into this. There's a `/cvmfs/grid.cern.ch/vc/containers/runc.new` that also seems to work fine." Perfect, that's an easier patch to maintain. Perhaps someone is already aware of the problem and testing a fix. I can only hope a fix for everyone is coming soon. |
31)
Message boards :
Theory Application :
Theory native fails with "mountpoint for cgroup not found"
(Message 47618)
Posted 26 Dec 2022 by wujj123456 Post: "Just to be sure, from what I can find, none of the LHC@Home application source code is open, right?"

Turns out this question is irrelevant. This specific issue is only with the runc on cvmfs. The runc that came with my distro had no problem starting containers on cgroupv2. So I hacked around in the cranky script used to start native Theory tasks, and it now works without suspend/resume. Note that this is not tested in any environment other than my own, though it should probably work as long as the distro's runc can cope with cgroupv2. I ran two tasks and they both finished fine:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=374799901
https://lhcathome.cern.ch/lhcathome/result.php?resultid=374802093

Note the WARNING about runc in the output, which is what I added in the patch: https://pastebin.com/vpLvagEr. The link will expire in a week in case the patch has undesirable side effects; I can upload a permanent one if an admin approves this. Hopefully just swapping the runc version doesn't have any side effects (like bogus results), but I'd like to get confirmation first.

For the real fix, we may not even need any patch if we can upgrade the runc in cvmfs. I don't know if the one in cvmfs is forked, but it is certainly old:

$ runc -v
runc version 1.1.0-0ubuntu1.1
spec: 1.0.2-dev
go: go1.18.1
libseccomp: 2.5.3
$ /cvmfs/grid.cern.ch/vc/containers/runc -v
runc version
spec: 1.0.0

If we go that route, obviously, the newer runc needs to be tested against other setups to ensure it doesn't break cgroupv1 or any other workload. Suspend/resume on cgroupv2 would need additional work, but cranky already has a test for the cgroup structure. Since cgroupv2 will never have a matching structure, suspend would simply be skipped, same as on cgroupv1 systems without the right cgroup structure.
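The gist of my patch, as a rough sketch (the variable name and the exact insertion point are mine; the real cranky script is more involved, and the pastebin above is what I actually run):

# Prefer the distro's runc when the host is on cgroupv2 (unified hierarchy),
# since the runc shipped on cvmfs is too old to handle it.
RUNC=/cvmfs/grid.cern.ch/vc/containers/runc
if [ -f /sys/fs/cgroup/cgroup.controllers ] && command -v runc >/dev/null 2>&1; then
    echo "WARNING: cgroupv2 detected, using the system runc instead of $RUNC" >&2
    RUNC="$(command -v runc)"
fi
# ... the container is then launched via "$RUNC" instead of the hard-coded cvmfs path.

The cgroup.controllers file only exists on the unified (v2) hierarchy, so cgroupv1 hosts keep using the cvmfs runc unchanged. |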
32)
Message boards :
Theory Application :
Theory native fails with "mountpoint for cgroup not found"
(Message 47617)
Posted 26 Dec 2022 by wujj123456 Post: "Most distros use cgroup v2 and this should not have been taken out of beta with only v1 support. I have nearly 100 failed tasks now just from this application."

Honestly, this isn't fair. The wider adoption of cgroupv2 happened well after this application was released, and it still works in vbox. Ideally, cgroupv2 should have been supported before mainstream distros started to switch over. At this point, I just hope we can get some quick hack in, if the cgroup part is not crucial for the application itself. Even when my system was on cgroupv1, I never bothered to set up suspend and resume. If the current cgroupv2 failure is only about that, I really hope I can just bypass that part.

Just to be sure, from what I can find, none of the LHC@Home application source code is open, right? |
33)
Message boards :
Theory Application :
Theory native fails with "mountpoint for cgroup not found"
(Message 47598)
Posted 21 Dec 2022 by wujj123456 Post: "... for newer kernels, cgroup has changed from V1 to V2."

Sorry for digging out the old thread. I wonder, if I am willing to forgo suspend/resume, whether I could make native Theory work under cgroup v2? I suppose that means I could lose work or even end up with errors occasionally, but I never suspend/pause work on my server and I've configured the task switch time to effectively never switch. So not being able to suspend and resume doesn't seem worth the immediate failures I am getting. Example WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=374269026 |
34)
Message boards :
ATLAS application :
Native Atlas Guide
(Message 47594)
Posted 21 Dec 2022 by wujj123456 Post: "Found the guide here: https://apptainer.org/docs/admin/main/installation.html"

It looks like there are two versions, apptainer and apptainer-suid. Curious: which one did you install?
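For anyone else on Ubuntu, my understanding from that page is that both flavors are packaged, roughly along these lines (the PPA name is what I gathered from the apptainer docs, so please double-check there before relying on it):

$ sudo add-apt-repository -y ppa:apptainer/ppa
$ sudo apt update
$ sudo apt install -y apptainer        # unprivileged build
$ sudo apt install -y apptainer-suid   # or the setuid-root flavor instead

Which of the two native ATLAS actually needs is exactly what I'd like to find out. |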
35)
Message boards :
ATLAS application :
Question for the comment in Ubuntu's boinc-client.service unit file about Atlas
(Message 47587)
Posted 12 Dec 2022 by wujj123456 Post: Well, I have some answer now. Vbox doesn't work with these options on, even for Theory. The WUs error out right away, unable to manage the VM, just like when ProtectSystem is set to strict (the default on Ubuntu 22.04). So regardless of whether these options are specific to native ATLAS or not, I am not going to enable them. 🤣 |
36)
Message boards :
ATLAS application :
Question for the comment in Ubuntu's boinc-client.service unit file about Atlas
(Message 47586)
Posted 10 Dec 2022 by wujj123456 Post: I just realized the boinc-client.service unit file shipped with Ubuntu 22.04 contains the following comments specific to Atlas. I am not aware of other Atlas applications among BOINC projects, so I assume this refers to LHC's ATLAS.

[Service]
Type=simple
ProtectHome=true
ProtectSystem=full
ProtectControlGroups=true
ReadWritePaths=-/var/lib/boinc -/etc/boinc-client
Nice=10
User=boinc
WorkingDirectory=/var/lib/boinc
ExecStart=/usr/bin/boinc
ExecStop=/usr/bin/boinccmd --quit
ExecReload=/usr/bin/boinccmd --read_cc_config
ExecStopPost=/bin/rm -f lockfile
IOSchedulingClass=idle
# The following options prevent setuid root as they imply NoNewPrivileges=true
# Since Atlas requires setuid root, they break Atlas
# In order to improve security, if you're not using Atlas,
# Add these options to the [Service] section of an override file using
# sudo systemctl edit boinc-client.service
#NoNewPrivileges=true
#ProtectKernelModules=true
#ProtectKernelTunables=true
#RestrictRealtime=true
#RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
#RestrictNamespaces=true
#PrivateUsers=true
#CapabilityBoundingSet=
#MemoryDenyWriteExecute=true
#PrivateTmp=true #Block X11 idle detection

Based on my rudimentary understanding of these options, I have a feeling they only apply to native ATLAS. If I only run the vbox version, can I enable these options safely?
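For reference, if it does turn out to be safe, my understanding is that enabling them would look roughly like this (the hardening options are taken straight from the comments above; the drop-in file is what systemctl edit creates):

$ sudo systemctl edit boinc-client.service
# add to the drop-in (e.g. /etc/systemd/system/boinc-client.service.d/override.conf):
[Service]
NoNewPrivileges=true
ProtectKernelModules=true
ProtectKernelTunables=true
RestrictRealtime=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=true
PrivateUsers=true
CapabilityBoundingSet=
MemoryDenyWriteExecute=true
# (PrivateTmp=true left out, since the comment says it blocks X11 idle detection)
$ sudo systemctl restart boinc-client

That's the part I'd like to confirm before flipping the switch. |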
37)
Message boards :
Number crunching :
Did cvmfs download ~150GB of data two days ago?
(Message 46567)
Posted 31 Mar 2022 by wujj123456 Post: "Did you check whether your squid correctly rejects requests initiated from outside your LAN?"

Yes, the squid proxy only listens on internal interfaces, and only for port 80. I can add a monitoring rule to check how much traffic there is on 8000 and 8080, but at least from what I see, I don't think there is major traffic not captured by the current setup for vbox WUs.

"You may try out the tuning options from my HowTo and reduce this to 256 MB."

Good to know. I intend to capture system updates and Steam updates too, which is why it's large. They are rare enough that most of the time LHC is just enjoying a great hit rate. :-)

"If you run Theory vbox each VM will set up its own CVMFS cache (meanwhile old and degraded)."

That's what I thought, and native also consumes much less memory. However, if such big downloads happen often enough, it would change the balance. Hence my question, trying to understand what happened and how likely or often it could happen again. |
38)
Message boards :
Number crunching :
Did cvmfs download ~150GB of data two days ago?
(Message 46564)
Posted 31 Mar 2022 by wujj123456 Post: "Large downloads happen from time to time although 150 GB within 1 h is very unusual."

This reminded me of a few interesting details. The server only had 70-80 GB of available space left and it was not filled up AFAIK. There is no way it could have stored 150 GB of data. Meanwhile, the squid cache I configured is 32 GB on disk, but 99%+ of hit bytes are served from the 4 GB memory cache. It doesn't seem that I even need more than 4 GB of data from cvmfs, assuming the vbox Theory workload is similar to native other than setup. I'm pretty curious what this download was actually doing. It kinda feels like a bug TBH...
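In case it's useful, this is roughly how I get those percentages out of squid (assuming the default native access.log format, where field 4 is the result code and field 5 the byte count, and the Debian/Ubuntu log path):

$ awk '{ total += $5; if ($4 ~ /HIT/) hit += $5; if ($4 ~ /MEM_HIT/) mem += $5 } END { printf "hit bytes: %.1f%%, memory-hit bytes: %.1f%%\n", 100*hit/total, 100*mem/total }' /var/log/squid/access.log

Nothing rigorous, but it's enough to see that the disk cache barely matters for this workload. |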
39)
Message boards :
Number crunching :
Did cvmfs download ~150GB of data two days ago?
(Message 46563)
Posted 31 Mar 2022 by wujj123456 Post: Thanks for the reply. For the data cap, I mostly need to understand how much data to allocate for BOINC. Regarding ATLAS, it was running on the other Windows machine and I have set a concurrency limit. Its usage is indeed high but very predictable, and I've set aside enough for that. The server I mentioned here is S8026 in my list of computers, which only runs native Theory. Usually it has pretty low usage, but this download caught me off guard a bit. I actually have squid set up at my router directly, and the hit rate is superb, like 99%+ in terms of bytes, for my Windows machine running both ATLAS and Theory in vbox. I didn't observe similar excessive downloads during the same period from the vbox WUs. Does that mean I might be better off forgoing the native installation and relying fully on squid caching if I want predictable bandwidth usage? |
40)
Message boards :
Number crunching :
Did cvmfs download ~150GB of data two days ago?
(Message 46561)
Posted 31 Mar 2022 by wujj123456 Post: I set up the native app for the Theory application, and thus installed cvmfs. I noticed just now that from 2022-03-29 01:16 GMT-7 to 2022-03-29 02:46 GMT-7 (accurate to a minute or two), my server that runs LHC downloaded at full speed for more than an hour, totaling around 150 GB of data. From logging on my router, I can see the source of the traffic was all from 2606:4700:3033::6815:48a2, which is a Cloudflare address. Then I retrieved the syslog for my system and the cvmfs-related logs stood out: https://pastebin.com/rTVx9r3C. s1asgc-cvmfs.openhtc.io resolves to that exact address: https://pastebin.com/YfiR9SKB

Unfortunately I have a data cap from my ISP, so I need to be a bit more careful about such incidents. I've been running native Theory on the same server for a year or two now, and this is the first time I noticed such a thing happening. I haven't touched its setup for quite a while, so I am fairly confident nothing changed on my end. Was this some one-time big update? A bug? Or is it expected from time to time?
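For anyone else watching a data cap, the cvmfs client itself keeps counters that should show how much it has fetched; something like the following (the repository name is just an example for a Theory host, and the exact output fields may differ between cvmfs versions):

$ cvmfs_config stat -v sft.cern.ch          # per-repository client stats, including bytes fetched over the network
$ sudo cvmfs_talk -i sft.cern.ch cache size # current local cache usage

That at least narrows down whether a spike went through cvmfs or something else on the box. |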