Message boards :
CMS Application :
tasks now running unusual long time without CPU usage
Message board moderation
Previous · 1 · 2 · 3 · 4 · Next
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,860,185 RAC: 78,358 |
Again I am noticing the same phenomenon as referred to in the title of this thread:still none of the tasks which started on all of my machines this early afternoon has finished yet. All running without CPU usage meanwhile. Hence, the same problem as 2 days ago still exists. Something is definitely wrong with these tasks :-( |
Send message Joined: 2 May 07 Posts: 2125 Credit: 159,946,272 RAC: 40,442 |
Have two Tasks started this morning. One ended after 25 min., the other seem to go to the end after 11 or 18 hours, only one job inside the Task finished so long. MasterLog 2024-02-08 18:38 104K StartdLog 2024-02-08 18:40 108K finished_0.log 2024-02-08 12:15 39 running.log 2024-02-08 13:45 1.4M Yes, maybe scheduling new job's, or something else is wrong. |
Send message Joined: 14 Jan 10 Posts: 1291 Credit: 8,544,599 RAC: 3,954 |
After the CMS job distribution restarted, I fetched 2 BOINC tasks. Both started well and got their 1st CMS job inside the VM. Meanwhile both tasks are running their 2nd job. |
Send message Joined: 2 May 07 Posts: 2125 Credit: 159,946,272 RAC: 40,442 |
Have two Tasks started this morning. Laufzeit 17 Stunden 22 min. 56 sek. CPU Zeit 1 Stunden 32 min. 17 sek. Prüfungsstatus Gültig Punkte 242.83 |
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,860,185 RAC: 78,358 |
my observation this morning is as follows: the tasks finally finish after about a 4- or 5-fold time compared to before, with credit points about double compared to before. |
Send message Joined: 29 Aug 05 Posts: 1009 Credit: 6,288,638 RAC: 1,161 |
my observation this morning is as follows: On one of my two machines that can currently run BOINC[1], after the first job finishes the "running.log" file is visible via the "show graphics" button bur when I check machine status with "top" (Alt-F3 under "Show VM Console") I see no cmsRun job, but three python images using up an increasing amount of CPU time. Further delving (privileged access required) shows that these images are continually polling the code to update the log files on the Web page. Does anyone else see this? I haven't yet seen it on my other running box as it's so slow that it takes ages (>10 hours) to finish the first cmsRun job. [1] Another Windows box is centrally managed; BOINC thinks it doesn't have Virtualisation capability, and the BIOS is password locked so I can't check. My Rocky Linux machine is currently refusing to get new tasks from the server, reason not yet established. |
Send message Joined: 2 May 07 Posts: 2125 Credit: 159,946,272 RAC: 40,442 |
Only one job inside the CMS Task was running. Seeing after this job in startd.log: 02/09/24 14:46:15 (pid:16904) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad. 02/09/24 14:46:15 (pid:16904) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad. 02/09/24 14:46:15 (pid:16904) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad. |
Send message Joined: 29 Aug 05 Posts: 1009 Credit: 6,288,638 RAC: 1,161 |
Only one job inside the CMS Task was running. I've tried to get the developers interested in correcting those warnings, but without much(any) success. Perhaps it's time to try again. |
Send message Joined: 14 Jan 10 Posts: 1291 Credit: 8,544,599 RAC: 3,954 |
Does anyone else see this?What I have noticed: After a cmsRun has finished the running.log is filled with the complete information normally there (I don't know from where: another disk file or just from memory?) However when a new cmsRun starts the contents of the running.log is copied to finished_X.log and the running.log is empty again - except the line "Running job output should appear here." During the whole cmsRun job no information is added to running.log. |
Send message Joined: 2 May 07 Posts: 2125 Credit: 159,946,272 RAC: 40,442 |
Why is there no stop, when no new job inside the task starts in a limited time (for example half an hour)? |
Send message Joined: 29 Aug 05 Posts: 1009 Credit: 6,288,638 RAC: 1,161 |
Does anyone else see this?What I have noticed: What is supposed to happen -- and used to, as far as I recall -- is that a python script keeps the log files in the Web directory updated with the latest log files (which are hidden to some extent within a container); a sort of posh version of "tail -f". It's these scripts that I see continuing to consume CPU time after the first cmsRun job finishes. However when a new cmsRun starts the contents of the running.log is copied to finished_X.log and the running.log is empty again - except the line "Running job output should appear here." My problem is that no new cmsRun starts after the first completes
And that is one of the changed behaviours. I've raised these matters with our dedicated team, but I'm under the cosh at the moment to get another project finished by month's end, and have been "advised" to reduce my involvement in CMS@Home in the meantime. |
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,860,185 RAC: 78,358 |
Last night, on some of my hosts numerous tasks failed after a few minutes - see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=405976178 stderr starts with: <core_client_version>7.24.1</core_client_version> <![CDATA[ <message> Der Dateiname oder die Erweiterung ist zu lang. (0xce) - exit code 206 (0xce)</message> how come? |
Send message Joined: 29 Aug 05 Posts: 1009 Credit: 6,288,638 RAC: 1,161 |
Last night, on some of my hosts numerous tasks failed after a few minutes - see here: From the looks of the rest of it, you were having some network connection problems. The result output is the concatenation of several logs at the end of the task, so the output is not necessarily in chronological order. |
Send message Joined: 18 Dec 15 Posts: 1693 Credit: 104,860,185 RAC: 78,358 |
From the looks of the rest of it, you were having some network connection problems. The result output is the concatenation of several logs at the end of the task, so the output is not necessarily in chronological order.thank you, Ivan, for the clarification. I should have read the complete stderr text, and not just the first few lines - sorry for that :-( no idea what happened to the network, I'll need to watch this more closely. |
Send message Joined: 2 May 07 Posts: 2125 Credit: 159,946,272 RAC: 40,442 |
02/10/24 08:20:21 (pid:16889) CCBListener: failed to receive message from CCB server vocms0840.cern.ch 02/10/24 08:20:21 (pid:16889) CCBListener: connection to CCB server vocms0840.cern.ch failed; will try to reconnect in 60 seconds. 02/10/24 08:21:21 (pid:16889) CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=[2001-1458-d00-14--b3]-9618+137.138.156.85-9618&alias=vocms0840.cern.ch#12816267 Is this the network issue? |
Send message Joined: 15 Jun 08 Posts: 2435 Credit: 228,110,422 RAC: 123,771 |
Is this the network issue? No. The log reports this: 2024-02-09 23:11:31 (119660): Guest Log: [ERROR] Could not connect to eoscms-ns-ip563.cern.ch on port 1094 |
Send message Joined: 2 May 07 Posts: 2125 Credit: 159,946,272 RAC: 40,442 |
The second job inside the task is starting NOW :-) In Win11pro, this file is not shown! First job finished, last lines: env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-copy -t 2400 -T 2400 -p -v --abort-on-failure file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/2/10/ireid_TC_SLC7_IDR_CMS_Home_240209_153748_6856/SinglePiE50HCAL_pythia8_2018_GenSimFull/0001/0/75fb2315-5801-428e-9706-57e7a51ea486-120-0-logArchive.tar.gz' EXIT_STATUS=$? echo "gfal-copy exit status: $EXIT_STATUS" if [[ $EXIT_STATUS != 0 ]]; then echo "ERROR: gfal-copy exited with $EXIT_STATUS" echo "Cleaning up failed file:" env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-rm -t 600 https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/2/10/ireid_TC_SLC7_IDR_CMS_Home_240209_153748_6856/SinglePiE50HCAL_pythia8_2018_GenSimFull/0001/0/75fb2315-5801-428e-9706-57e7a51ea486-120-0-logArchive.tar.gz ' fi exit $EXIT_STATUS ExitCode : 6 ErrorCode : 60311 ErrorType : GeneralStageOutFailure Traceback: <@---------- WMException End ----------@> This lines new: INFO:root:StepName: logArch1, StepType: LogArchive, with result: 0 INFO:root:MonitorThread: JobEnded INFO:root:MonitorState: Shutdown called INFO:root:Completing task at directory: /srv/job/WMTaskSpace INFO:root:Looking for master report at /srv/job/WMTaskSpace/../../Report.0.pkl INFO:root: found it! INFO:root:Looking for a taskStep report at /srv/job/WMTaskSpace/cmsRun1/Report.pkl INFO:root: found it! INFO:root:Looking for a taskStep report at /srv/job/WMTaskSpace/stageOut1/Report.pkl INFO:root: found it! INFO:root:Looking for a taskStep report at /srv/job/WMTaskSpace/logArch1/Report.pkl INFO:root: found it! INFO:root:Shutting down monitor + cp WMTaskSpace/Report.0.pkl ../ + ls -l WMTaskSpace + ls -l WMTaskSpace/Report.0.pkl WMTaskSpace/__init__.py WMTaskSpace/__pycache__ WMTaskSpace/cmsRun1 WMTaskSpace/logArch1 WMTaskSpace/stageOut1 + set +x |
Send message Joined: 2 May 07 Posts: 2125 Credit: 159,946,272 RAC: 40,442 |
since 12:30 UTC those two tasks running without job's to their end at 18 hours. One had two jobs finished, the other three. |
Send message Joined: 21 Feb 11 Posts: 59 Credit: 543,728 RAC: 13 |
Post here after it is fixed. In all other cases do not post. |
Send message Joined: 2 May 07 Posts: 2125 Credit: 159,946,272 RAC: 40,442 |
Since 4:20 UTC no new job inside the CMS-Task. |
©2024 CERN