Message boards : CMS Application : tasks now running unusual long time without CPU usage
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · 4 · Next

AuthorMessage
Erich56

Send message
Joined: 18 Dec 15
Posts: 1690
Credit: 104,008,123
RAC: 121,908
Message 49465 - Posted: 8 Feb 2024, 17:28:21 UTC - in response to Message 49464.  

Again I am noticing the same phenomenon as referred to in the title of this thread:
the new tasks have now been running longer than the ones from before (several days ago), still not finished, and no longer using CPU for quite some time.
Is this coincidence or is the same problem from day before yesterday back?
still none of the tasks which started on all of my machines this early afternoon has finished yet. All running without CPU usage meanwhile.
Hence, the same problem as 2 days ago still exists. Something is definitely wrong with these tasks :-(
ID: 49465 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2104
Credit: 159,819,191
RAC: 123,837
Message 49466 - Posted: 8 Feb 2024, 17:46:43 UTC - in response to Message 49465.  

Have two Tasks started this morning.
One ended after 25 min., the other seem to go to the end after 11 or 18 hours,
only one job inside the Task finished so long.
MasterLog 2024-02-08 18:38 104K
StartdLog 2024-02-08 18:40 108K
finished_0.log 2024-02-08 12:15 39
running.log 2024-02-08 13:45 1.4M
Yes, maybe scheduling new job's, or something else is wrong.
ID: 49466 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 49467 - Posted: 8 Feb 2024, 18:38:09 UTC - in response to Message 49466.  

After the CMS job distribution restarted, I fetched 2 BOINC tasks.
Both started well and got their 1st CMS job inside the VM.
Meanwhile both tasks are running their 2nd job.
ID: 49467 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2104
Credit: 159,819,191
RAC: 123,837
Message 49473 - Posted: 9 Feb 2024, 4:46:56 UTC - in response to Message 49466.  

Have two Tasks started this morning.
One ended after 25 min., the other seem to go to the end after 11 or 18 hours,
only one job inside the Task finished so long.
MasterLog 2024-02-08 18:38 104K
StartdLog 2024-02-08 18:40 108K
finished_0.log 2024-02-08 12:15 39
running.log 2024-02-08 13:45 1.4M
Yes, maybe scheduling new job's, or something else is wrong.


Laufzeit 17 Stunden 22 min. 56 sek.
CPU Zeit 1 Stunden 32 min. 17 sek.
Prüfungsstatus Gültig
Punkte 242.83
ID: 49473 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1690
Credit: 104,008,123
RAC: 121,908
Message 49474 - Posted: 9 Feb 2024, 6:35:52 UTC

my observation this morning is as follows:

the tasks finally finish after about a 4- or 5-fold time compared to before, with credit points about double compared to before.
ID: 49474 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49480 - Posted: 9 Feb 2024, 13:52:57 UTC - in response to Message 49474.  

my observation this morning is as follows:

the tasks finally finish after about a 4- or 5-fold time compared to before, with credit points about double compared to before.

On one of my two machines that can currently run BOINC[1], after the first job finishes the "running.log" file is visible via the "show graphics" button bur when I check machine status with "top" (Alt-F3 under "Show VM Console") I see no cmsRun job, but three python images using up an increasing amount of CPU time. Further delving (privileged access required) shows that these images are continually polling the code to update the log files on the Web page. Does anyone else see this? I haven't yet seen it on my other running box as it's so slow that it takes ages (>10 hours) to finish the first cmsRun job.

[1] Another Windows box is centrally managed; BOINC thinks it doesn't have Virtualisation capability, and the BIOS is password locked so I can't check. My Rocky Linux machine is currently refusing to get new tasks from the server, reason not yet established.
ID: 49480 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2104
Credit: 159,819,191
RAC: 123,837
Message 49481 - Posted: 9 Feb 2024, 13:58:03 UTC - in response to Message 49480.  

Only one job inside the CMS Task was running.
Seeing after this job in startd.log:
02/09/24 14:46:15 (pid:16904) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad.
02/09/24 14:46:15 (pid:16904) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad.
02/09/24 14:46:15 (pid:16904) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad.
ID: 49481 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49482 - Posted: 9 Feb 2024, 14:07:30 UTC - in response to Message 49481.  

Only one job inside the CMS Task was running.
Seeing after this job in startd.log:
02/09/24 14:46:15 (pid:16904) CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the STARTD ad.
02/09/24 14:46:15 (pid:16904) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_JOB_ATTRS = x509userproxysubject x509UserProxyFQAN x509UserProxyVOName x509UserProxyEmail x509UserProxyExpiration,MemoryUsage,ResidentSetSize,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad.
02/09/24 14:46:15 (pid:16904) slot1: CONFIGURATION PROBLEM: Failed to insert ClassAd attribute STARTD_PARTITIONABLE_SLOT_ATTRS = MemoryUsage,ProportionalSetSizeKb. The most common reason for this is that you forgot to quote a string value in the list of attributes being added to the slot1 ad.

I've tried to get the developers interested in correcting those warnings, but without much(any) success. Perhaps it's time to try again.
ID: 49482 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1280
Credit: 8,496,817
RAC: 2,374
Message 49483 - Posted: 9 Feb 2024, 17:09:33 UTC - in response to Message 49480.  

Does anyone else see this?
What I have noticed:
After a cmsRun has finished the running.log is filled with the complete information normally there (I don't know from where: another disk file or just from memory?)
However when a new cmsRun starts the contents of the running.log is copied to finished_X.log and the running.log is empty again - except the line "Running job output should appear here."
During the whole cmsRun job no information is added to running.log.
ID: 49483 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2104
Credit: 159,819,191
RAC: 123,837
Message 49484 - Posted: 9 Feb 2024, 17:52:11 UTC

Why is there no stop, when no new job inside the task starts in a limited time (for example half an hour)?
ID: 49484 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49487 - Posted: 9 Feb 2024, 22:17:47 UTC - in response to Message 49483.  

Does anyone else see this?
What I have noticed:
After a cmsRun has finished the running.log is filled with the complete information normally there (I don't know from where: another disk file or just from memory?)

What is supposed to happen -- and used to, as far as I recall -- is that a python script keeps the log files in the Web directory updated with the latest log files (which are hidden to some extent within a container); a sort of posh version of "tail -f". It's these scripts that I see continuing to consume CPU time after the first cmsRun job finishes.
However when a new cmsRun starts the contents of the running.log is copied to finished_X.log and the running.log is empty again - except the line "Running job output should appear here."

My problem is that no new cmsRun starts after the first completes

During the whole cmsRun job no information is added to running.log.

And that is one of the changed behaviours. I've raised these matters with our dedicated team, but I'm under the cosh at the moment to get another project finished by month's end, and have been "advised" to reduce my involvement in CMS@Home in the meantime.
ID: 49487 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1690
Credit: 104,008,123
RAC: 121,908
Message 49490 - Posted: 10 Feb 2024, 6:12:35 UTC

Last night, on some of my hosts numerous tasks failed after a few minutes - see here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=405976178

stderr starts with:

<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
Der Dateiname oder die Erweiterung ist zu lang.
(0xce) - exit code 206 (0xce)
</message>

how come?
ID: 49490 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1006
Credit: 6,272,232
RAC: 315
Message 49491 - Posted: 10 Feb 2024, 6:25:49 UTC - in response to Message 49490.  

Last night, on some of my hosts numerous tasks failed after a few minutes - see here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=405976178

stderr starts with:

7.24.1

Der Dateiname oder die Erweiterung ist zu lang.
(0xce) - exit code 206 (0xce)


how come?

From the looks of the rest of it, you were having some network connection problems. The result output is the concatenation of several logs at the end of the task, so the output is not necessarily in chronological order.
ID: 49491 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1690
Credit: 104,008,123
RAC: 121,908
Message 49494 - Posted: 10 Feb 2024, 7:08:44 UTC - in response to Message 49491.  

From the looks of the rest of it, you were having some network connection problems. The result output is the concatenation of several logs at the end of the task, so the output is not necessarily in chronological order.
thank you, Ivan, for the clarification.
I should have read the complete stderr text, and not just the first few lines - sorry for that :-(

no idea what happened to the network, I'll need to watch this more closely.
ID: 49494 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2104
Credit: 159,819,191
RAC: 123,837
Message 49495 - Posted: 10 Feb 2024, 7:25:08 UTC

02/10/24 08:20:21 (pid:16889) CCBListener: failed to receive message from CCB server vocms0840.cern.ch
02/10/24 08:20:21 (pid:16889) CCBListener: connection to CCB server vocms0840.cern.ch failed; will try to reconnect in 60 seconds.
02/10/24 08:21:21 (pid:16889) CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=[2001-1458-d00-14--b3]-9618+137.138.156.85-9618&alias=vocms0840.cern.ch#12816267
Is this the network issue?
ID: 49495 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2413
Credit: 226,589,365
RAC: 130,105
Message 49496 - Posted: 10 Feb 2024, 7:54:48 UTC - in response to Message 49495.  

Is this the network issue?

No.
The log reports this:
2024-02-09 23:11:31 (119660): Guest Log: [ERROR] Could not connect to eoscms-ns-ip563.cern.ch on port 1094
ID: 49496 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2104
Credit: 159,819,191
RAC: 123,837
Message 49499 - Posted: 10 Feb 2024, 8:22:53 UTC - in response to Message 49496.  
Last modified: 10 Feb 2024, 9:18:45 UTC

The second job inside the task is starting NOW :-)

In Win11pro, this file is not shown!
First job finished,
last lines:
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-copy -t 2400 -T 2400 -p -v --abort-on-failure file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/2/10/ireid_TC_SLC7_IDR_CMS_Home_240209_153748_6856/SinglePiE50HCAL_pythia8_2018_GenSimFull/0001/0/75fb2315-5801-428e-9706-57e7a51ea486-120-0-logArchive.tar.gz'
EXIT_STATUS=$?
echo "gfal-copy exit status: $EXIT_STATUS"
if [[ $EXIT_STATUS != 0 ]]; then
echo "ERROR: gfal-copy exited with $EXIT_STATUS"
echo "Cleaning up failed file:"
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-rm -t 600 https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/2/10/ireid_TC_SLC7_IDR_CMS_Home_240209_153748_6856/SinglePiE50HCAL_pythia8_2018_GenSimFull/0001/0/75fb2315-5801-428e-9706-57e7a51ea486-120-0-logArchive.tar.gz '
fi
exit $EXIT_STATUS

ExitCode : 6
ErrorCode : 60311
ErrorType : GeneralStageOutFailure

Traceback:

<@---------- WMException End ----------@>

This lines new:
INFO:root:StepName: logArch1, StepType: LogArchive, with result: 0
INFO:root:MonitorThread: JobEnded
INFO:root:MonitorState: Shutdown called
INFO:root:Completing task at directory: /srv/job/WMTaskSpace
INFO:root:Looking for master report at /srv/job/WMTaskSpace/../../Report.0.pkl
INFO:root: found it!
INFO:root:Looking for a taskStep report at /srv/job/WMTaskSpace/cmsRun1/Report.pkl
INFO:root: found it!
INFO:root:Looking for a taskStep report at /srv/job/WMTaskSpace/stageOut1/Report.pkl
INFO:root: found it!
INFO:root:Looking for a taskStep report at /srv/job/WMTaskSpace/logArch1/Report.pkl
INFO:root: found it!
INFO:root:Shutting down monitor
+ cp WMTaskSpace/Report.0.pkl ../
+ ls -l WMTaskSpace
+ ls -l WMTaskSpace/Report.0.pkl WMTaskSpace/__init__.py WMTaskSpace/__pycache__ WMTaskSpace/cmsRun1 WMTaskSpace/logArch1 WMTaskSpace/stageOut1
+ set +x
ID: 49499 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2104
Credit: 159,819,191
RAC: 123,837
Message 49507 - Posted: 10 Feb 2024, 15:41:14 UTC - in response to Message 49499.  

since 12:30 UTC those two tasks running without job's to their end at 18 hours.
One had two jobs finished, the other three.
ID: 49507 · Report as offensive     Reply Quote
kotenok2000
Avatar

Send message
Joined: 21 Feb 11
Posts: 59
Credit: 543,728
RAC: 42
Message 49521 - Posted: 11 Feb 2024, 17:36:35 UTC

Post here after it is fixed.
In all other cases do not post.
ID: 49521 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2104
Credit: 159,819,191
RAC: 123,837
Message 49537 - Posted: 13 Feb 2024, 6:57:40 UTC

Since 4:20 UTC no new job inside the CMS-Task.
ID: 49537 · Report as offensive     Reply Quote
Previous · 1 · 2 · 3 · 4 · Next

Message boards : CMS Application : tasks now running unusual long time without CPU usage


©2024 CERN