1) Message boards : ATLAS application : No Tasks from LHC@home (Message 49749)
Posted 16 days ago by wujj123456
Post:
From my experience, you just have to turn off the following Windows features: Virtual Machine Platform and Windows Hypervisor Platform. Unchecking these two should also disable the other features that depend on them, such as Windows Sandbox. Memory integrity still has to be turned off manually IIRC, which is a bit puzzling since it should depend on the Virtual Machine Platform feature.

Ironically, WSL is actually fine so long as you use WSL1, not WSL2. WSL1 does not depend on the Virtual Machine Platform feature. However, the default in Windows 11 is WSL2, so you have to manually set it to WSL1 and convert your images. WSL1 does lose some features, like the ability to mount LUKS volumes, but in return, startup is much faster and IPv6 works. I ended up liking WSL1 more, and I boot a real VM for the use cases WSL1 doesn't support.
2) Message boards : ATLAS application : hits file upload fails immediately (Message 49742)
Posted 20 days ago by wujj123456
Post:
Thank you for fixing this. My pending uploads started draining a few hours ago. Cheers.
3) Message boards : ATLAS application : hits file upload fails immediately (Message 49726)
Posted 21 days ago by wujj123456
Post:
Still failing the same way and still only for those 1.4G uploads while the smaller ones upload just fine.

I saw that some WUs were aborted from the server side two days ago. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=407100028. Does that mean those WUs were generated by mistake and have no science value anyway? If that's the case for all the other big uploads, I feel we might as well just abort them. Personally, I don't care much about credits if the results aren't meaningful anyway. Losing them is better than crashing the upload server all the time.
4) Message boards : ATLAS application : hits file upload fails immediately (Message 49698)
Posted 23 days ago by wujj123456
Post:
It seems that there was a preliminary misconfiguration of the BOINC jobs, and this should be fixed now.

I suppose this means the server won't be configured to accept the big uploads. If so, will the bad WUs that were already sent out be aborted from the server side? Or should we just abort the upload after computation finishes?
5) Message boards : ATLAS application : hits file upload fails immediately (Message 49689)
Posted 26 days ago by wujj123456
Post:
Just to elaborate on my SOCKS proxy, though I believe we've ruled it out already. You are correct that SOCKS doesn't cache. I don't use it for LHC specifically; I route all BOINC traffic through it. I need to do traffic shaping for uploads because my stupid asymmetric cable broadband has abysmal upload speed. :-(

As for squid, I run native and have cvmfs installed on each host with the Cloudflare CDN config. I used to have squid doing transparent caching on the router too, but the hit rate dropped to pretty much nothing after I installed cvmfs locally on each host. So I removed it a long time ago, and I'm pretty sure there is no squid anywhere in my network.
6) Message boards : ATLAS application : hits file upload fails immediately (Message 49685)
Posted 26 days ago by wujj123456
Post:
I'm not using squid though. I did use a SOCKS5 proxy, but bypassing that didn't help either.
7) Message boards : ATLAS application : hits file upload fails immediately (Message 49677)
Posted 27 days ago by wujj123456
Post:
All my stuck uploads are around 1.4G, while the 800-900MB ones upload without a problem at the same time. I suppose those are the 2K-event WUs and the server is not configured to accept files over some threshold?
8) Message boards : ATLAS application : hits file upload fails immediately (Message 49676)
Posted 27 days ago by wujj123456
Post:

There's not really a solution except that huge #events should not be configured by the submitter.
As you can read in another thread there are a couple of volunteers aggressively requesting those tasks contrary to all experience and agreements made in the past.

Well, we want bigger jobs assuming the project server can handle them. If the LHC servers can't handle the upload, then of course the project shouldn't issue such broken WUs. That's no different from the general rule that a project shouldn't release WUs that don't work, big or small. It's just a waste of everyone's time and resources.

If there were past agreements, does that mean whoever is submitting the batches is either new or not aware of the issues? In the meantime, is the only solution to abort the upload and thus fail these finished tasks?
9) Message boards : ATLAS application : 2000 Events Threadripper 3995WX (Message 49663)
Posted 27 Feb 2024 by wujj123456
Post:
Overloading a system most of the time just to make up for the idle time is not a real solution anyway. Fixing the workload so that it doesn't idle is generally the best solution. The 20-minute setup time is with a local cvmfs installation using the Cloudflare CDN. If the workload were pulling lots of data down from cvmfs during that time and were limited by bandwidth or disk, then I could more or less understand the behavior. However, it's not doing anything remotely intensive with network or disk either. The problem is likely either how the setup pulls the necessary data, or how fast cvmfs can serve the data, or both.

But hey, this is the only program on my computer that's running Python 2.7, which reached EOL 4 years ago. This also means it's definitely not using basic optimizations like asyncio to parallelize IO requests. I doubt much care for performance is put into the setup phase, given that the actual calculation is likely done by some C/C++ library.
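
For anyone unfamiliar with what parallelizing IO with asyncio looks like, here is a minimal sketch. It needs Python 3.7+, so it could not run under a 2.7 setup script. The `fetch` function is a made-up stand-in that just sleeps; the names and the delay are assumptions for illustration and say nothing about how the ATLAS setup actually fetches its data.

```python
import asyncio
import time


async def fetch(name, delay=0.1):
    """Hypothetical stand-in for one slow IO request; it only sleeps."""
    await asyncio.sleep(delay)
    return name


async def fetch_all(names, delay=0.1):
    """Start every request at once and wait for all of them together."""
    return await asyncio.gather(*(fetch(n, delay) for n in names))


def timed_fetch(names, delay=0.1):
    """Run fetch_all and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(names, delay))
    return results, time.perf_counter() - start
```

Calling `timed_fetch(["a", "b", "c", "d", "e"])` issues five 0.1 s requests. Done one after another they would cost about 0.5 s, while the concurrent version should take roughly the time of a single request.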
10) Message boards : ATLAS application : 2000 Events Threadripper 3995WX (Message 49641)
Posted 24 Feb 2024 by wujj123456
Post:
Yep, ATLAS long seems to be the perfect answer to break this dilemma.

Another good part about ATLAS long is that it's a separate app. Now I can put a different nthreads in app_config to throw more cores at the bigger problem without worrying about having 8 cores wait for 30 minutes just to do 10 minutes of compute...
11) Message boards : ATLAS application : 2000 Events Threadripper 3995WX (Message 49626)
Posted 24 Feb 2024 by wujj123456
Post:
I personally prefer the bigger jobs. From what I see, each ATLAS WU always has a 20-30 min idle setup time. Having more work per WU is going to help with efficiency quite a bit. It also seems to reduce the network usage on the download side (as seen from the client).
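
A back-of-the-envelope sketch of why a fixed idle setup time hurts small WUs more. The numbers are illustrative only: a 30-minute setup, 10 minutes of compute for a small WU, and a larger WU assumed to need 10 times the compute with the same setup time. That scaling is an assumption, not a measurement.

```python
def cpu_efficiency(setup_min, compute_min):
    """Fraction of the wall-clock time that is spent on actual compute."""
    return compute_min / (setup_min + compute_min)


# Illustrative numbers, not measurements.
small_wu = cpu_efficiency(setup_min=30, compute_min=10)
# Assumption: 10x the events means 10x the compute, setup time unchanged.
big_wu = cpu_efficiency(setup_min=30, compute_min=100)

print(f"small WU: {small_wu:.0%}, big WU: {big_wu:.0%}")
```

Under these assumptions the small WU keeps the cores busy a quarter of the time and the big one about three quarters of the time.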
12) Message boards : ATLAS application : Download sometime between 20 and 50 kBps (Message 49193)
Posted 15 Jan 2024 by wujj123456
Post:
I just checked and I haven't seen any download problems in the past few days. Consistently the same 100-150 Mbps, just like before.
13) Message boards : ATLAS application : Thank you and goodbye! (Message 49181)
Posted 13 Jan 2024 by wujj123456
Post:
Given the sporadic nature and different characteristics of each batch, I'm hopeful that the prod jobs are released by a human for actual science. It would be good to confirm that for sure.
14) Message boards : ATLAS application : When cvmfs will be available for Ubuntu 22.04 LTS? (Message 48168)
Posted 1 Jun 2023 by wujj123456
Post:
Which guide did you follow? I ran Ubuntu 22.04 on multiple machines and had no problem getting it working with the simple instructions on the official page: https://cvmfs.readthedocs.io/en/stable/cpt-quickstart.html#debian-ubuntu. The packages also come with all the recommended configurations, including Cloudflare's openhtc.io settings, so I didn't have to make any additional changes. I even use the .deb installation to get a copy of /etc/cvmfs to mount with the docker installation on my Arch Linux hosts.

Edit: Well, unless something changed in the past few months, since I haven't re-installed Ubuntu for a while.
15) Message boards : ATLAS application : Latest ATLAS jobs getting much larger in download size? (Message 47904)
Posted 26 Mar 2023 by wujj123456
Post:
Given there isn't really any new science involved here as I thought, could a moderator move this thread to the ATLAS forum or just delete it? Thanks.
16) Message boards : Number crunching : Computer Optimization (Message 47902)
Posted 25 Mar 2023 by wujj123456
Post:
Regarding optimization, I honestly don't think there are any hardware-specific ones for the project. Since you are using Windows, you might want to set up a squid proxy, which could save a lot of bandwidth while making cvmfs access faster in your VM.

Other general optimizations still apply. You have the same Zen 4 CPUs as I do, and the first thing is to make sure your EXPO profile is enabled in UEFI, so the memory runs at the frequency you paid for instead of the default DDR5-4800. Then the "Curve Optimizer" in the BIOS got me ~5% more frequency, but it's a trial-and-error process of getting the curve down as much as possible. I did it in UEFI, but given you use Windows, you might be able to use AMD's Ryzen Master to do the same without rebooting.

Linux is generally better here if you want to run LHC, which is native to Linux. You can set up native cvmfs fairly easily on modern distros and avoid paying the VM overhead. Theory would be better on Linux too, but it currently requires either some hacking of the start script to get around the cgroupv2 problem or disabling cgroupv2 for the system. SixTrack has a native app for Windows, so I assume it runs equally well on both.

This is all just my own experience though, and I am not a developer here.
17) Message boards : Number crunching : Computer Optimization (Message 47901)
Posted 25 Mar 2023 by wujj123456
Post:
AVX512 is AFAIK not used yet "executing command: grep -o 'avx2[^ ]*\|AVX2[^ ]*' /proc/cpuinfo" from the Atlas native Log of this user with a CPU that is at least the same Generation as yours Ryzen 9 7950X 16-Core https://lhcathome.cern.ch/lhcathome/results.php?hostid=10821683&offset=0&show_names=0&state=4&appid=

grep -o only prints the matched part of each line, and the pattern only matches avx2, so it won't find avx512 even if it's there. In addition, the output is redirected by the Python code that logs these commands anyway, likely because it checks the output to make decisions. The grep command shown should have returned 32 lines of avx2, since the 7950X does support avx2 and has it in its feature flags.

 $ grep -o 'avx2[^ ]*\|AVX2[^ ]*' /proc/cpuinfo | uniq -c
     32 avx2
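
The same pattern can be tried outside grep. This is a small sketch using Python's `re` module on a shortened, made-up flags line (a real /proc/cpuinfo flags line is much longer), showing that the pattern picks up avx2 and skips the avx512 flags:

```python
import re

# Python equivalent of the grep pattern 'avx2[^ ]*\|AVX2[^ ]*'
PATTERN = re.compile(r"avx2[^ ]*|AVX2[^ ]*")

# Shortened, made-up flags line for illustration only.
flags = "flags : fpu sse4_2 avx avx2 avx512f avx512dq avx512_vnni"

# findall returns only the matched text, like grep -o does.
matches = PATTERN.findall(flags)
print(matches)  # ['avx2']
```

None of the avx512 flag names in this sample contain the substring avx2, so the pattern never matches them.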


Regarding avx512, you are probably still right anyway though. Given that the setup script is only grepping for avx2, it likely means avx512 isn't a concern whatsoever. I did some digging since I was curious after your reply and you happened to point to one of my hosts. :-P

The command that's consuming CPU on my host:

/cvmfs/atlas.cern.ch/repo/sw/software/21.0/sw/lcg/releases/LCG_87/Python/2.7.10/x86_64-slc6-gcc49-opt/bin/python -tt /cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasCore/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/bin/athena.py --preloadlib=/cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasExternals/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/lib/libintlc.so.5:/cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasExternals/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/lib/libimf.so runargs.EVNTtoHITS.py SimuJobTransforms/skeleton.EVGENtoHIT_ISF.py

That Python code is likely just uninteresting wrappers, and I bet the bulk of the work is done inside the loaded libs. avx512 instructions use zmm registers, so I grepped for those.
$ objdump -d /cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasExternals/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/lib/libimf.so | grep zmm
$ objdump -d /cvmfs/atlas.cern.ch/repo/sw/software/21.0/AtlasExternals/21.0.15/InstallArea/x86_64-slc6-gcc49-opt/lib/libintlc.so.5 | grep zmm
   32dc7:       62 f1 fe 48 6f 06       vmovdqu64 (%rsi),%zmm0
   32dd4:       62 d1 7d 48 e7 03       vmovntdq %zmm0,(%r11)
   32dda:       62 f1 fe 48 6f 46 01    vmovdqu64 0x40(%rsi),%zmm0
   32de8:       62 d1 7d 48 e7 43 01    vmovntdq %zmm0,0x40(%r11)
   32def:       62 f1 fe 48 6f 46 02    vmovdqu64 0x80(%rsi),%zmm0
   32dfd:       62 d1 7d 48 e7 43 02    vmovntdq %zmm0,0x80(%r11)
   32e04:       62 f1 fe 48 6f 46 03    vmovdqu64 0xc0(%rsi),%zmm0
   32e12:       62 d1 7d 48 e7 43 03    vmovntdq %zmm0,0xc0(%r11)
   32e2e:       62 f1 fe 48 6f 46 fc    vmovdqu64 -0x100(%rsi),%zmm0
   32e3c:       62 d1 7d 48 e7 43 fc    vmovntdq %zmm0,-0x100(%r11)
   32e43:       62 f1 fe 48 6f 46 fd    vmovdqu64 -0xc0(%rsi),%zmm0
   32e51:       62 d1 7d 48 e7 43 fd    vmovntdq %zmm0,-0xc0(%r11)
   32e58:       62 f1 fe 48 6f 46 fe    vmovdqu64 -0x80(%rsi),%zmm0
   32e66:       62 d1 7d 48 e7 43 fe    vmovntdq %zmm0,-0x80(%r11)
   32e6d:       62 f1 fe 48 6f 46 ff    vmovdqu64 -0x40(%rsi),%zmm0
   32e7b:       62 d1 7d 48 e7 43 ff    vmovntdq %zmm0,-0x40(%r11)
   32f7b:       62 f1 7c 48 10 46 f9    vmovups -0x1c0(%rsi),%zmm0
   32f82:       62 d1 7c 48 29 43 f9    vmovaps %zmm0,-0x1c0(%r11)
   32f89:       62 f1 7c 48 10 46 fa    vmovups -0x180(%rsi),%zmm0
   32f90:       62 d1 7c 48 29 43 fa    vmovaps %zmm0,-0x180(%r11)
   32f97:       62 f1 7c 48 10 46 fb    vmovups -0x140(%rsi),%zmm0
   32f9e:       62 d1 7c 48 29 43 fb    vmovaps %zmm0,-0x140(%r11)
   32fa5:       62 f1 7c 48 10 46 fc    vmovups -0x100(%rsi),%zmm0
   32fac:       62 d1 7c 48 29 43 fc    vmovaps %zmm0,-0x100(%r11)
   32fb3:       62 f1 7c 48 10 46 fd    vmovups -0xc0(%rsi),%zmm0
   32fba:       62 d1 7c 48 29 43 fd    vmovaps %zmm0,-0xc0(%r11)
   32fc1:       62 f1 7c 48 10 46 fe    vmovups -0x80(%rsi),%zmm0
   32fc8:       62 d1 7c 48 29 43 fe    vmovaps %zmm0,-0x80(%r11)
   32fcf:       62 f1 7c 48 10 46 ff    vmovups -0x40(%rsi),%zmm0
   32fd6:       62 d1 7c 48 29 43 ff    vmovaps %zmm0,-0x40(%r11)
   330a9:       62 f1 7c 48 10 06       vmovups (%rsi),%zmm0
   330af:       62 d1 7c 48 11 03       vmovups %zmm0,(%r11)
   330ba:       62 f1 7c 48 10 06       vmovups (%rsi),%zmm0
   330c0:       62 d1 7c 48 11 03       vmovups %zmm0,(%r11)
   330c6:       62 d1 7c 48 10 40 ff    vmovups -0x40(%r8),%zmm0
   330cd:       62 d1 7c 48 11 42 ff    vmovups %zmm0,-0x40(%r10)
   3475f:       62 d2 7d 48 7c c1       vpbroadcastd %r9d,%zmm0
   347b0:       62 d1 7d 48 e7 02       vmovntdq %zmm0,(%r10)
   347b6:       62 d1 7d 48 e7 42 01    vmovntdq %zmm0,0x40(%r10)
   347bd:       62 d1 7d 48 e7 42 02    vmovntdq %zmm0,0x80(%r10)
   347c4:       62 d1 7d 48 e7 42 03    vmovntdq %zmm0,0xc0(%r10)
   347d9:       62 d1 7d 48 e7 42 fc    vmovntdq %zmm0,-0x100(%r10)
   347e0:       62 d1 7d 48 e7 42 fd    vmovntdq %zmm0,-0xc0(%r10)
   347e7:       62 d1 7d 48 e7 42 fe    vmovntdq %zmm0,-0x80(%r10)
   347ee:       62 d1 7d 48 e7 42 ff    vmovntdq %zmm0,-0x40(%r10)
   34888:       62 d1 7c 48 29 42 f9    vmovaps %zmm0,-0x1c0(%r10)
   3488f:       62 d1 7c 48 29 42 fa    vmovaps %zmm0,-0x180(%r10)
   34896:       62 d1 7c 48 29 42 fb    vmovaps %zmm0,-0x140(%r10)
   3489d:       62 d1 7c 48 29 42 fc    vmovaps %zmm0,-0x100(%r10)
   348a4:       62 d1 7c 48 29 42 fd    vmovaps %zmm0,-0xc0(%r10)
   348ab:       62 d1 7c 48 29 42 fe    vmovaps %zmm0,-0x80(%r10)
   348b2:       62 d1 7c 48 29 42 ff    vmovaps %zmm0,-0x40(%r10)
   3494c:       62 d1 7c 48 11 02       vmovups %zmm0,(%r10)
   34957:       62 d1 7c 48 11 02       vmovups %zmm0,(%r10)
   3495d:       62 d1 7c 48 11 40 ff    vmovups %zmm0,-0x40(%r8)

Other than a small chunk of code that's using avx512 for memcpy and memset, there are no avx512 compute instructions AFAICT. It's also not a given that these instructions are actually executed even if they are in the library assembly.

PS: There is probably a more direct way of confirming this by instrumenting with a tool like perf, but I don't do enough profiling to know how. :-(
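
To make the "only memcpy/memset-style moves" reading easier to check, here is a rough sketch that tallies the mnemonics on disassembly lines that mention zmm. The sample lines are copied from the output above; the function names and the list of "data movement" prefixes are my own assumptions, not anything from objdump or ATLAS.

```python
import re
from collections import Counter

# address, colon, hex byte pairs, then the mnemonic
LINE_RE = re.compile(r"^\s*[0-9a-f]+:\s+(?:[0-9a-f]{2}\s)+\s*(\S+)")

# Assumption: these prefixes cover plain loads, stores and broadcasts.
DATA_MOVE_PREFIXES = ("vmov", "vpbroadcast")


def tally_zmm(lines):
    """Count mnemonics on objdump -d lines that reference a zmm register."""
    counts = Counter()
    for line in lines:
        if "zmm" not in line:
            continue
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def only_data_movement(counts):
    """True if every tallied mnemonic starts with a data-movement prefix."""
    return all(name.startswith(DATA_MOVE_PREFIXES) for name in counts)


SAMPLE = [
    "   32dc7:       62 f1 fe 48 6f 06       vmovdqu64 (%rsi),%zmm0",
    "   32dd4:       62 d1 7d 48 e7 03       vmovntdq %zmm0,(%r11)",
    "   32f82:       62 d1 7c 48 29 43 f9    vmovaps %zmm0,-0x1c0(%r11)",
    "   3475f:       62 d2 7d 48 7c c1       vpbroadcastd %r9d,%zmm0",
]

counts = tally_zmm(SAMPLE)
print(counts, only_data_movement(counts))
```

The same function could be fed the full `objdump -d` output line by line. Any arithmetic mnemonic showing up in the tally would suggest real avx512 compute rather than just copies and fills.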
18) Message boards : ATLAS application : Latest ATLAS jobs getting much larger in download size? (Message 47900)
Posted 25 Mar 2023 by wujj123456
Post:
Thanks. Sorry, I missed that. I didn't expect that thread to contain the information I was looking for. Hopefully the returned results are still useful even though the WUs weren't intended for BOINC. :-)
19) Message boards : ATLAS application : Latest ATLAS jobs getting much larger in download size? (Message 47898)
Posted 25 Mar 2023 by wujj123456
Post:
I noticed today that since Mar 22, my BOINC download traffic has increased dramatically. After correlating detailed traffic data on my hosts with the BOINC log, I confirmed that it comes from ATLAS downloads. Each ATLAS WU now downloads around 1.1GB of data for the *.pool.root.1 file. They used to be 250M each, and I still have some of the smaller ones for comparison. The WU disk usage also reflects the change, likely after decompression.
Example old WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=389847635
Example new WU: https://lhcathome.cern.ch/lhcathome/result.php?resultid=390698043

I remember reading about LHC upgrades last year and I wonder if this is the result of the upgrade? Would be interesting to know what's added here. I wonder if it would make sense to re-introduce long simulation WUs to balance out the network and compute ratio, though it's not really causing a problem for me.
20) Message boards : ATLAS application : No HITS File But Still Granted Credit? (Message 47637)
Posted 1 Jan 2023 by wujj123456
Post:
In the past few users complained about that and suggested not to reward the user in any case of an error.
But the project team decided to do it as it is now.

In this case, does the result show as "Completed and validated" or "Error while computing"? Even if the team decides to grant credit, I would like to get some signal that things went wrong. Unless the user is familiar with the internals of a WU, the result status and credits are the only signals available to us to determine if anything is off. Those signals are common across all BOINC projects too.

I know this is the ATLAS forum, but I feel my experience with Theory is very relevant to this discussion. When I first started running native Theory, I thought it was unexpected for some WUs to run very long given the average behavior, so I had a cron job kill the worker process (e.g., Sherpa, rivetvm.exe, etc.) if it ran for more than 12 (or 24?) hours. I didn't abort the task directly simply because finding the offending long-running process with ps is simpler. The results were "Completed and validated", so I assumed my action had no side effects. If the WUs had failed, I certainly wouldn't have continued to do this. Later I tried the same on another machine running vbox, and killing the vboxwrapper failed the task, which led me to look closer. Finally I came to the forum and soon learnt it's normal for some Theory WUs to run long. Needless to say, I haven't killed any processes since, but not before I had generated a few dozen bogus results. Thus I prefer some clear way of knowing my results were bad, whether I get credit or not.

In those cases the project grants credit although it doesn't get it's own reward (the HITS file).

Hmm, even for people going after credits, the fact that we all picked BOINC and a specific project, instead of some pointless workload, should mean the science results are at least remotely relevant to us. I don't know how people would feel about getting credits while not actually helping. I personally would rather get no credit for errors so I can investigate further.



©2024 CERN