Message boards : Theory Application : How long may Native-Theory-Tasks run

seanr22a

Joined: 29 Nov 18
Posts: 41
Credit: 2,644,024
RAC: 46
Message 51624 - Posted: 3 Mar 2025, 6:34:49 UTC
Last modified: 3 Mar 2025, 7:03:51 UTC

Old thread but still current.

I learned today from this thread how to check the progress of Theory jobs.

I have two Theory jobs that have been running for almost 6 days that I will need to abort. It's not nice to the project, but it's a total waste to let them continue. Both have been running 24/7 since they were downloaded.

100000 units of work. 36300 units processed after 6 days.

The deadline is in 4 days, so there is no chance they will finish in time.

I've seen other comments in this thread with suggestions for improving the app, but it seems CERN will not spend any time on it unless an actual bug is found.

But I'll pull the environmental card: there are probably hundreds, maybe even thousands, of computers running jobs for LHC Theory. Most of them are standard computers, nothing super expensive, so very few of them have the 7-8 GFLOPS or more of computing power needed to finish this kind of job within the time frame, even when running 24/7. This kind of job shouldn't even get sent to an underpowered computer; it's a huge waste of electricity, because the job will need to be aborted, or BOINC will cancel it automatically when it isn't finished in time.

My suggestions to save the environment:
- CERN knows the computing power of every attached computer. When a client asks for work, check against the client database before sending any job; if the available jobs are too heavy for the client, don't send them.
- Improve the % done counter as suggested earlier. The example job in this post has 100K work units with 36.3K done, so it should show 36.3% done; instead it has shown 100% for many days.
- If possible, make the Theory application multi-threaded.

That was my 'small' wishlist :)
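The arithmetic behind the abort decision can be sketched in a few lines. This is only an illustration of the back-of-the-envelope check described in this post (the event counts and days are the ones quoted above); it is not anything the client or server actually runs:

```python
# Hedged sketch: estimate whether a Theory task can finish before its deadline
# from the event counts visible in runRivet.log. The numbers are the ones
# quoted in this post: 100000 events total, 36300 done after 6 days,
# deadline in 4 days.

def can_finish(events_done, events_total, days_elapsed, days_to_deadline):
    """Return (percent_done, days_needed_for_rest, feasible)."""
    rate = events_done / days_elapsed          # events per day so far
    remaining = events_total - events_done
    days_needed = remaining / rate
    percent = 100.0 * events_done / events_total
    return percent, days_needed, days_needed <= days_to_deadline

percent, days_needed, feasible = can_finish(36300, 100000, 6, 4)
print(f"{percent:.1f}% done, ~{days_needed:.1f} more days needed, feasible: {feasible}")
# → 36.3% done, ~10.5 more days needed, feasible: False
```

The caveat computezrmle raises later in the thread applies here too: if part of the elapsed time was spent in an initialization phase, the observed rate underestimates the real processing speed, so this check can be pessimistic.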

[edit]
The computers I use are Xeons a couple of generations old, with 2-2.5 GFLOPS/thread of computing power. Not so fast by today's standards, but I have many, many cores and a lot of memory, so I like multi-threaded apps.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2679
Credit: 286,774,023
RAC: 79,317
Message 51626 - Posted: 3 Mar 2025, 8:11:15 UTC - in response to Message 51624.  

I have two Theory jobs that have been running almost 6 days that I will need to abort.
(...)
100000 units of work. 36300 units processed after 6 days.

Why abort?
There's 40% walltime buffer left.
Certain tasks run an initialization phase of a few minutes before they start "processing"; others need a couple of days.
You are just comparing "x to be done" with "x done".
But that's not enough, since you don't know how long the task has been working on those "x".

Theory runs thousands of different combinations of "input parameters" and "MC generators".
Runtimes are not really predictable.
Nonetheless, the long-term Theory success rate is better than 95%, and that includes valid long-runners.



The computers I use is a couple of generations old Xeon with 2-2.5Gflops/thread in computing power

So, you run less efficient CPUs and really play the waste card?



If possible make the Theory application multi-threaded

It has been tested.
They don't generate more total throughput but cause many problems.
seanr22a

Joined: 29 Nov 18
Posts: 41
Credit: 2,644,024
RAC: 46
Message 51636 - Posted: 4 Mar 2025, 4:43:16 UTC - in response to Message 51626.  
Last modified: 4 Mar 2025, 4:53:46 UTC

Thank you for your reply!


So, you run less efficient CPUs and really play the waste card?

Which computers get used depends on financial resources - I would very much like to have a 512-core Epyc. Hopefully everything I run for the different BOINC projects does something good, but I don't like wasting CPU time that could be used for something better.

If possible make the Theory application multi-threaded
It has been tested. They don't generate more total throughput but cause many problems.


That one cleared.



Why abort?


My mistake here: I looked at the wrong slot; that one was a normally running job. Looking at the correct slots for the jobs that had been running 6 days, plus one that had been running 4 days (3 jobs in total), I terminated these jobs. All three had this in their stderr output, and the runRivet.log didn't show any progress at all:
18:38:11 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Probing /cvmfs/alice.cern.ch... OK
18:38:11 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Probing /cvmfs/cernvm-prod.cern.ch... OK
18:38:11 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Probing /cvmfs/grid.cern.ch... OK
18:38:11 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Probing /cvmfs/sft.cern.ch... OK
18:38:12 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Excerpt from "cvmfs_config stat": VERSION HOST PROXY
18:38:12 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] 2.12.6.0 http://s1swinburne-cvmfs.openhtc.io:8080/cvmfs/alice.cern.ch DIRECT
18:38:12 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Found 'runc version spec: 1.0.2-dev' at '/cvmfs/grid.cern.ch/vc/containers/runc.new'.
18:38:12 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Minor requirements are missing. Will try to run this task in legacy mode.
18:38:12 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Checking runc.
18:38:12 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Creating container filesystem.
18:38:12 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm4
mkdir: cannot create directory ‘/sys/fs/cgroup/unified’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/unified’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/unified’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/unified’: Read-only file system
mkdir: cannot create directory ‘/sys/fs/cgroup/unified’: Read-only file system
18:38:12 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] Running Container 'runc'.
18:38:12 +07 +07:00 2025-02-25: cranky-0.1.4: [INFO] mcplots runspec: boinc pp jets 13000 520 - sherpa 1.4.3 default 100000 44
06:35:54 +07 +07:00 2025-02-26: cranky-0.1.4: [INFO] Pausing container Theory_2814-3964431-44_2.
../../projects/lhcathome.cern.ch_lhcathome/cranky-0.1.4: line 202: [: missing `]'
../../projects/lhcathome.cern.ch_lhcathome/cranky-0.1.4: line 202: -d: command not found
06:35:54 +07 +07:00 2025-02-26: cranky-0.1.4: [WARNING] Cannot pause container as /sys/fs/cgroup/freezer/boinc/freezer.state or /sys/fs/cgroup/freezer/boinc do not exist.


Most Theory jobs run OK. In the short time I've been running Theory I have 869 finished jobs and 3 I've cancelled (6+6+4 days of wasted CPU time/electricity). Maybe add improved error handling to my wishlist :)

[edit]
Corrected a spelling error
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2679
Credit: 286,774,023
RAC: 79,317
Message 51637 - Posted: 4 Mar 2025, 8:31:11 UTC - in response to Message 51636.  

Maybe add improved error handling to my wishlist

On which level?
There are many more than these:
- OS
- BOINC
- cranky
- runc
- CVMFS
- the various scientific apps combined under "Theory"

First identify/understand on which level a specific error occurs, then you may report this to the responsible team.
The stderr.txt logfile prints lots of information that shows you what the task expects, what it finds instead, and what it does to avoid a simple "failed".

Example
Root cause: Minor requirements are missing.
Error handling: Will try to run this task in legacy mode.

Root cause: your BOINC client requested to pause the task
Error handling: Message "Cannot pause container" (because it's running in legacy mode and your OS configuration doesn't support/allow cgroups v1 'freezer')
seanr22a

Joined: 29 Nov 18
Posts: 41
Credit: 2,644,024
RAC: 46
Message 51638 - Posted: 4 Mar 2025, 13:42:19 UTC - in response to Message 51637.  


On which level?

I understand that this is a complicated environment and a lot of it is beyond my understanding. As a user I do everything I can to make it run well, and for me simple error handling means aborting zombie jobs automatically. It doesn't matter whether it's a user error or an app error: if someone like me can see in the log files that something has gone wrong, it's easy to believe the system also knows it has gone wrong and should terminate the job. Once terminated, a job has error status and it's easy to check the stderr and see what went wrong - most likely user error :)
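As an illustration only, the zombie-job check suggested here could look something like the sketch below: flag a task as possibly stalled when its runRivet.log has not been modified for some time. The slot path, glob pattern, and idle threshold are assumptions made up for this sketch; neither BOINC nor the Theory app does this today:

```python
# Hypothetical watchdog sketch for the "abort zombie jobs" wishlist item.
# Paths and the 12-hour threshold are illustrative assumptions only.
import glob
import os
import time

def looks_stalled(logfile, max_idle_hours=12):
    """True if the log file exists but has not been modified recently."""
    if not os.path.exists(logfile):
        return False                       # no log yet: can't judge
    idle = time.time() - os.path.getmtime(logfile)
    return idle > max_idle_hours * 3600

# Example: scan all BOINC slots (the path varies by installation):
# for log in glob.glob("/var/lib/boinc/slots/*/cernvm/shared/runRivet.log"):
#     if looks_stalled(log):
#         print("possibly stalled:", log)
```

A real implementation would of course need to distinguish a stalled generator from one that legitimately logs nothing for a long stretch, which is exactly why runtimes being unpredictable makes this hard to do server-side.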

Your answers are very much appreciated. The sudoers issue has now been solved; I found https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6075&postid=48978 and I hope it's the correct one.

I can't do anything about the cgroups v1 issue, as all modern Linux distributions use v2. It is possible to revert, but that causes a lot of other issues in the system.

I don't know why BOINC tried to pause the jobs. I don't allow any extra jobs to be downloaded, so it should not need to: <work_buf_min_days>0</work_buf_min_days> and <work_buf_additional_days>0</work_buf_additional_days>. I never pause any jobs manually; these systems run 24/7 with only LHC, no other BOINC projects.

The three jobs I had to cancel, in case you get a minute to check whether there is anything else I should fix:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419879823
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419848816
https://lhcathome.cern.ch/lhcathome/result.php?resultid=419817476

I worked as a sysadmin and software developer 30-35 years ago and have done a lot since. Now retired, I have this as one of my hobbies, so I'm not totally lost with computers :)