1)
Message boards :
ATLAS application :
Failed to execute payload:/bin/bash: Sim_tf.py: command not found
(Message 51029)
Posted 9 Nov 2024 by wujj123456 Post: I got a whole lot of invalid results due to this error across multiple machines starting today. The tasks don't show up as errors, but as invalid. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=416180255 My normal troubleshooting command shows cvmfs is OK on the host. I do occasionally get WUs that actually crunch for a while instead of giving up right away.
$ cvmfs_config probe
Probing /cvmfs/atlas.cern.ch... OK
Probing /cvmfs/atlas-condb.cern.ch... OK
Probing /cvmfs/grid.cern.ch... OK
Probing /cvmfs/cernvm-prod.cern.ch... OK
Probing /cvmfs/sft.cern.ch... OK
Probing /cvmfs/alice.cern.ch... OK
Is this a problem with my setup, or is the latest batch of tasks bad?
2)
Message boards :
ATLAS application :
Did the idle time increase for each task?
(Message 51000)
Posted 2 Nov 2024 by wujj123456 Post: Option 1) works for small tasks, until a new batch of larger tasks hits and they take days to finish, without checkpointing either. Similarly for 2): even if I assume perfect splay, the number of fake CPUs I should set depends on the compute:idle ratio, which in turn depends on task size. (I use avg_ncpus to control the number of WUs on the host instead of directly faking ncpus for the host; the effect is equivalent anyway. See the app_config sketch below.) Given how much job sizes vary from time to time, I'm not keen on constantly monitoring and maintaining the config. So I've been sticking with a third option: limit ATLAS to half of the SMT threads and have other projects fill the rest. That way, at worst I idle the SMT sibling thread of each core rather than having a whole core sit idle. These are all workarounds though. I made the post mostly to confirm it's not a problem on my end. When the workload characteristics suddenly change that much, there is always a chance something actually broke; for example, perhaps some cvmfs config needs to be updated.
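A minimal app_config.xml sketch of the per-app tuning referred to above. The app name "ATLAS", the "native_mt" plan class, the --nthreads command line, and the project directory name are assumptions based on the LHC@home native ATLAS app; check your own client_state.xml for the exact names.
<app_config>
  <app>
    <name>ATLAS</name>
    <!-- cap concurrent ATLAS WUs so other projects can fill the remaining threads -->
    <max_concurrent>2</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>native_mt</plan_class>
    <!-- what BOINC budgets per task; adjust together with --nthreads -->
    <avg_ncpus>4</avg_ncpus>
    <cmdline>--nthreads 4</cmdline>
  </app_version>
</app_config>
Save it as projects/lhcathome.cern.ch_lhcathome/app_config.xml in the BOINC data directory and re-read the config files from the manager.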
3)
Message boards :
ATLAS application :
Did the idle time increase for each task?
(Message 50998)
Posted 2 Nov 2024 by wujj123456 Post: Thanks. At least I still see HITS files produced for most of my tasks, so hopefully they aren't completely broken...
4)
Message boards :
ATLAS application :
Did the idle time increase for each task?
(Message 50996)
Posted 2 Nov 2024 by wujj123456 Post: AFAICT, all ATLAS native tasks have an idle period at the beginning. It used to be around 15-20 minutes even just a week ago, but recently it became 35-40 minutes on all my hosts. Coupled with the smaller WUs, where two cores can finish one in under 2 hours, the machine sits idle pretty frequently. Anyone else seeing the same?
5)
Message boards :
ATLAS application :
Problem of the day ATLAS
(Message 50921)
Posted 25 Oct 2024 by wujj123456 Post: Same here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415205300 I also have results for those 2050-event tasks that succeeded because they are just under 2 GB. Now I'm not sure if I should just abort the others...
6)
Message boards :
ATLAS application :
RDP showing 50, 100, 200, 400 or 2050 Collisions
(Message 50913)
Posted 25 Oct 2024 by wujj123456 Post: I wish we still had ATLAS long so I could configure a different thread count for the really huge ones, like the 2050-event tasks. The ones I've finished took 40+ hours, and there are still many running beyond that time right now. I would have given them 4x or 8x more cores if I had known the size beforehand. With such a big variety in runtime, I can't easily find a balanced thread count that keeps runtime under control without having too many cores sit idle during startup.
7)
Message boards :
ATLAS application :
atlas error
(Message 50253)
Posted 25 May 2024 by wujj123456 Post: Well, you didn't search yourself. :-) @M0CZY's message above is the only one on this forum that shows the exact error messages. I ran into the same issue after upgrading to Ubuntu 24.04. Example failed task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411375194 The fix is to follow the workaround in this report. Execute the two commands as root:
echo "kernel.apparmor_restrict_unprivileged_userns = 0" >/etc/sysctl.d/99-userns.conf
sysctl --system
After that, the apptainer command that failed in the error log should print the host name when executed with normal user privileges. Note that this effectively reverts the tightened user namespace restriction introduced in Ubuntu 24.04. For anyone who knows AppArmor configs better than me, there is likely a more restricted approach that grants the permission only to apptainer.
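A quick way to verify the workaround took effect is to check the sysctl value and re-run the container test as a normal user. This is a sketch: the container path below is an assumed ATLAS image location, and the authoritative command to retry is the exact apptainer line from your own task's error log.
# should now report 0
sysctl kernel.apparmor_restrict_unprivileged_userns
# assumed ATLAS container path; copy the path from your error log if it differs
apptainer exec -B /cvmfs /cvmfs/atlas.cern.ch/repo/containers/fs/singularity/x86_64-centos7 hostname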
8)
Message boards :
ATLAS application :
atlas error
(Message 50181)
Posted 14 May 2024 by wujj123456 Post: I just realized I wasted thousands of tasks and 1.5 TB of the project's bandwidth in the past 20 hours... Oops, very sorry for that; I have paused all work fetch now. My setup was a bit weird, carried over from Arch, where I ran the cvmfs container with configs copied from an Ubuntu VM after installing the deb package there. I guess I can install the proper official packages now that I've switched back to Ubuntu; hopefully that will make sure I always have the recommended configuration from now on. Is this the latest recommended configuration? (Edit: Guess yes. I have since seen tasks successfully find the nightlies repo.) Related note: I wonder if it's possible to have the task fail for basic setup issues, like the missing repo error here, instead of showing a validation error after uploading the result? That way, the BOINC client would automatically back off instead of continuing to fetch work and upload invalid results. I have monitoring for failed jobs on the client side too, but a successfully uploaded result that is marked invalid requires me to check the website periodically. I only noticed this today from my bandwidth monitoring...
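For reference, installing the official CernVM-FS packages on Ubuntu goes roughly like this (a sketch following the upstream CernVM-FS install docs); /etc/cvmfs/default.local still needs the project-recommended repository and proxy settings afterwards.
wget https://ecsft.cern.ch/dist/cvmfs/cvmfs-release/cvmfs-release-latest_all.deb
sudo dpkg -i cvmfs-release-latest_all.deb
sudo apt update
sudo apt install cvmfs cvmfs-config-default
sudo cvmfs_config setup
cvmfs_config probe    # verify the repositories mount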
9)
Message boards :
ATLAS application :
atlas error
(Message 50150)
Posted 9 May 2024 by wujj123456 Post: I got the same invalid results for a bunch of my tasks too. (Example) I think the real problem is this line:
> CVMFS is not available at /cvmfs/atlas-nightlies.cern.ch/repo/sw/logs/lastUpdate
atlas-nightlies.cern.ch is not among the configured repositories, so it's not mounted. Do we need an update to our cvmfs config, or is this perhaps a batch of jobs not meant to be sent out to volunteers?
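If the answer turns out to be a client-side config update, the change would go in /etc/cvmfs/default.local, roughly like the sketch below; the repository list simply mirrors the repos probed in the first post above plus atlas-nightlies, and the proxy line is a placeholder for whatever you already use.
# /etc/cvmfs/default.local (sketch)
CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,atlas-nightlies.cern.ch,grid.cern.ch,cernvm-prod.cern.ch,sft.cern.ch,alice.cern.ch
CVMFS_HTTP_PROXY=DIRECT
Then re-run the setup and probe only the new repo:
sudo cvmfs_config setup
cvmfs_config probe atlas-nightlies.cern.ch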
10)
Message boards :
ATLAS application :
No Tasks from LHC@home
(Message 49749)
Posted 12 Mar 2024 by wujj123456 Post: In my experience, you just have to turn off the following Windows features: Virtual Machine Platform and Windows Hypervisor Platform. Unchecking these two should disable a bunch of other features that depend on them, like Windows Sandbox, etc. Memory integrity still has to be turned off manually IIRC, which is a bit puzzling since it should depend on the Virtual Machine Platform feature. Ironically, WSL is actually fine as long as you use WSL1, not WSL2: WSL1 does not depend on the Virtual Machine Platform feature. However, the default in Windows 11 is WSL2, so you have to manually set the default to WSL1 and convert your images. WSL1 does lose some features, like the ability to mount LUKS volumes, but in return startup is much faster and IPv6 works. I ended up liking WSL1 more and boot a real VM for the use cases WSL1 doesn't support.
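For reference, the feature toggles and the WSL2-to-WSL1 switch described above look roughly like this from an elevated prompt. The DISM feature names and the distro name "Ubuntu" are assumptions; the same features can also be unchecked under "Turn Windows features on or off".
rem disable the hypervisor-backed features (names assumed)
dism.exe /online /Disable-Feature /FeatureName:VirtualMachinePlatform
dism.exe /online /Disable-Feature /FeatureName:HypervisorPlatform
rem make WSL1 the default and convert an existing distro ("Ubuntu" is a placeholder)
wsl --set-default-version 1
wsl --list --verbose
wsl --set-version Ubuntu 1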
11)
Message boards :
ATLAS application :
hits file upload fails immediately
(Message 49742)
Posted 8 Mar 2024 by wujj123456 Post: Thank you for fixing this. I see my pending uploads started draining a few hours ago. Cheers.
12)
Message boards :
ATLAS application :
hits file upload fails immediately
(Message 49726)
Posted 7 Mar 2024 by wujj123456 Post: Still failing the same way, and still only for those 1.4 GB uploads while the smaller ones upload just fine. I saw some WUs were aborted from the server side two days ago. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=407100028. Does that mean those WUs were just mistakenly generated and have no science value anyway? If that's the case for all the other such big uploads, I feel we might as well just abort them. Personally I don't care much about credits if the results aren't meaningful anyway; losing them is better than crashing the upload server all the time.
13)
Message boards :
ATLAS application :
hits file upload fails immediately
(Message 49698)
Posted 5 Mar 2024 by wujj123456 Post:
> It seems that there was a preliminary misconfiguration of the BOINC jobs, and this should be fixed now.
I suppose this means the server won't be configured to accept the big uploads. If so, will the bad WUs already sent out be aborted from the server side? Or should we just abort the upload after the computation finishes?
14)
Message boards :
ATLAS application :
hits file upload fails immediately
(Message 49689)
Posted 2 Mar 2024 by wujj123456 Post: Just to elaborate on my SOCKS proxy, though I believe we've ruled it out already. You are correct that SOCKS doesn't cache. I don't use SOCKS for LHC specifically; I route all BOINC traffic through it because I need to do traffic shaping for uploads, since my stupid asymmetric cable broadband has abysmal upload speed. :-( As for Squid: I run the native app and have cvmfs installed on each host with the Cloudflare CDN config. I used to have Squid doing transparent caching on the router too, but the hit rate dropped to pretty much nothing after I installed cvmfs locally on each host. So I removed it a long time ago, and I'm pretty sure there is no Squid anywhere in my network.
15)
Message boards :
ATLAS application :
hits file upload fails immediately
(Message 49685)
Posted 2 Mar 2024 by wujj123456 Post: I'm not using Squid though. I did use a SOCKS5 proxy, but bypassing it didn't help either.
16)
Message boards :
ATLAS application :
hits file upload fails immediately
(Message 49677)
Posted 1 Mar 2024 by wujj123456 Post: All my stuck uploads are around 1.4 GB, while the 800-900 MB ones upload without a problem at the same time. I suppose those are the 2K-event WUs and the server is not configured to accept files over some threshold?
17)
Message boards :
ATLAS application :
hits file upload fails immediately
(Message 49676)
Posted 1 Mar 2024 by wujj123456 Post:
Well, we want bigger jobs, assuming the project server can handle them. If the LHC servers can't handle the uploads, then of course they shouldn't issue such broken WUs. That's no different from any other project: nobody should release WUs that won't work, big or small; it's just a waste of everyone's time and resources. If there were past agreements, does that mean whoever is submitting these batches is either new or not aware of the issue? In the meantime, is the only solution to abort the upload and thus fail these finished tasks?
18)
Message boards :
ATLAS application :
2000 Events Threadripper 3995WX
(Message 49663)
Posted 27 Feb 2024 by wujj123456 Post: Overloading a system most of the time just to cover the idle time is not a real solution anyway; fixing the workload so it doesn't idle is generally the best solution. The 20-minute setup time is with a local cvmfs installation using the Cloudflare CDN. If, during that time, the workload were pulling lots of data down from cvmfs and were limited by bandwidth or disk, I could more or less understand the behavior. However, it's not doing anything remotely intensive with the network or disk either. The problem is likely either how the setup pulls the necessary data, or how fast cvmfs can serve it, or both. But hey, this is the only program on my computer still running Python 2.7, which reached EOL 4 years ago. That also means it's definitely not using basic optimizations like asyncio to parallelize I/O requests. I doubt much care for performance was put into the setup phase, given that the actual calculation is likely done by some C/C++ library.
19)
Message boards :
ATLAS application :
2000 Events Threadripper 3995WX
(Message 49641)
Posted 24 Feb 2024 by wujj123456 Post: Yep, ATLAS long seems to be the perfect answer to break this dilemma. Another good thing about ATLAS long is that it's a separate app, so I can put a different nthreads in app_config to throw more cores at the bigger problem without worrying about having 8 cores wait for 30 minutes just to do 10 minutes of compute...
20)
Message boards :
ATLAS application :
2000 Events Threadripper 3995WX
(Message 49626)
Posted 24 Feb 2024 by wujj123456 Post: I personally prefer the bigger jobs. From what I see, each ATLAS WU always has a 20-30 minute idle setup time, so having more work per WU helps with efficiency quite a bit. It also seems to reduce the network usage on the download side (from the client).