1) Message boards : CMS Application : CMS Tasks Failing (Message 42814)
Posted 8 Jun 2020 by PaoloNasca
Post:
CMS tasks are designed to run 12h.
Tasks hitting this limit will finish the running job and then finish itself.
The 18h you see is a watchdog limit.


Do you mean that CMS tasks need a minimum amount of CPU power?
Let me explain.
The first CMS task that finished successfully ran for nearly 17 hours.
I think my PC isn’t suitable for the CMS@home project.
I would really appreciate your reply.
Thanks
2) Message boards : CMS Application : CMS Tasks Failing (Message 42813)
Posted 8 Jun 2020 by PaoloNasca
Post:
Did you have a look into the Console (ALT-F2) or Graphics (logfile) to see how many events were processed before the task was killed?


I don’t know how to find the information you asked for.
Please give me more details and I’ll be happy and proud to help the development team.
Thanks for your time.
3) Message boards : CMS Application : CMS Tasks Failing (Message 42810)
Posted 8 Jun 2020 by PaoloNasca
Post:
Only one CMS WU ended successfully.

I'd like to share a thought with Ivan and the development team.

The WU will end successfully if the Elapsed Time stays below the Job Duration (64800 sec = 18 hours).

My conclusion is that the VM has to stay in the “Running” state continuously, that is, without any break or suspension.
What do you think about building a VM with two or more CPUs?
The Elapsed Time is inversely proportional to the available CPU power.

Below is an extract from
https://lhcathome.cern.ch/lhcathome/result.php?resultid=276449837

….omitted….
2020-06-07 01:03:35 (12068): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 01:08:35 (12068): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 01:20:09 (12068): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 01:29:16 (12068): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 02:01:05 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 02:01:05 (12068): Status Report: Elapsed Time: '6000.992973'
2020-06-07 02:01:05 (12068): Status Report: CPU Time: '5208.796875'
2020-06-07 03:41:48 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 03:41:48 (12068): Status Report: Elapsed Time: '12000.992973'
2020-06-07 03:41:48 (12068): Status Report: CPU Time: '10897.078125'
2020-06-07 05:22:14 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 05:22:14 (12068): Status Report: Elapsed Time: '18000.992973'
2020-06-07 05:22:14 (12068): Status Report: CPU Time: '16861.296875'
2020-06-07 07:02:59 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 07:02:59 (12068): Status Report: Elapsed Time: '24000.992973'
2020-06-07 07:02:59 (12068): Status Report: CPU Time: '22545.656250'
2020-06-07 08:43:26 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 08:43:26 (12068): Status Report: Elapsed Time: '30001.494833'
2020-06-07 08:43:26 (12068): Status Report: CPU Time: '28544.906250'
2020-06-07 10:09:12 (12068): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 10:16:02 (12068): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 10:30:53 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 10:30:53 (12068): Status Report: Elapsed Time: '36001.494833'
2020-06-07 10:30:53 (12068): Status Report: CPU Time: '34246.296875'
2020-06-07 12:11:26 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 12:11:26 (12068): Status Report: Elapsed Time: '42001.494833'
2020-06-07 12:11:26 (12068): Status Report: CPU Time: '39958.921875'
2020-06-07 13:52:11 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 13:52:11 (12068): Status Report: Elapsed Time: '48001.494833'
2020-06-07 13:52:11 (12068): Status Report: CPU Time: '45994.109375'
2020-06-07 15:32:39 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 15:32:39 (12068): Status Report: Elapsed Time: '54001.494833'
2020-06-07 15:32:39 (12068): Status Report: CPU Time: '48189.859375'
2020-06-07 17:12:49 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 17:12:49 (12068): Status Report: Elapsed Time: '60001.494833'
2020-06-07 17:12:49 (12068): Status Report: CPU Time: '48252.187500'
2020-06-07 18:32:55 (12068): Powering off VM.
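
The comparison between Elapsed Time and Job Duration can be checked directly from the log. Below is a minimal Python sketch, not part of vboxwrapper or BOINC, that parses "Status Report" lines like the ones above. The names parse_status_reports and summarize are hypothetical helpers introduced only for this illustration, and the sample contains just three of the reports from the extract.

```python
import re

# Three of the Status Report groups copied from the extract above.
LOG = """\
2020-06-07 02:01:05 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 02:01:05 (12068): Status Report: Elapsed Time: '6000.992973'
2020-06-07 02:01:05 (12068): Status Report: CPU Time: '5208.796875'
2020-06-07 15:32:39 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 15:32:39 (12068): Status Report: Elapsed Time: '54001.494833'
2020-06-07 15:32:39 (12068): Status Report: CPU Time: '48189.859375'
2020-06-07 17:12:49 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 17:12:49 (12068): Status Report: Elapsed Time: '60001.494833'
2020-06-07 17:12:49 (12068): Status Report: CPU Time: '48252.187500'
"""

STATUS = re.compile(
    r"Status Report: (Job Duration|Elapsed Time|CPU Time): '([\d.]+)'"
)


def parse_status_reports(text):
    """Group the Status Report lines that share a timestamp into one dict."""
    reports = {}
    for line in text.splitlines():
        match = STATUS.search(line)
        if match:
            stamp = line[:19]  # "YYYY-MM-DD HH:MM:SS"
            reports.setdefault(stamp, {})[match.group(1)] = float(match.group(2))
    # ISO-style timestamps sort correctly as plain strings.
    return [dict(values, stamp=stamp) for stamp, values in sorted(reports.items())]


def summarize(report):
    """Return CPU efficiency and the time left before Job Duration is reached."""
    elapsed = report["Elapsed Time"]
    return {
        "stamp": report["stamp"],
        "cpu_efficiency": report["CPU Time"] / elapsed,
        "seconds_left": report["Job Duration"] - elapsed,
    }


if __name__ == "__main__":
    for report in parse_status_reports(LOG):
        print(summarize(report))
```

Reading the numbers in the extract this way, CPU Time advances by only about 62 seconds between the 15:32:39 and 17:12:49 reports while Elapsed Time advances by 6000 seconds, so the VM appears to have been mostly idle in the last hours before it was powered off.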
4) Message boards : CMS Application : CMS Tasks Failing (Message 42809)
Posted 7 Jun 2020 by PaoloNasca
Post:
I'd like to share and comment on what happened to two failed CMS WUs.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=276498495
https://lhcathome.cern.ch/lhcathome/result.php?resultid=276571790

The VM was in the "Running" state for 12 minutes (from 00:14:57 to 00:26:51).
Then the VM was in the "Paused" state for about 18 hours (65476 seconds, from 00:26:51 to 18:38:07).
After that, the VM was in the "Running" state again for 12 minutes.
At 18:50:23, the log shows the error: "Condor ended after 66684 seconds".

I'm really surprised.
Eighteen hours in the "Paused" state makes no sense.
Let me explain.
After 18 hours of inactivity, some procedures could run into time-out issues.

<core_client_version>7.16.7</core_client_version>
<![CDATA[
<message>
Incorrect function.
(0x1) - exit code 1 (0x1)</message>
<stderr_txt>
2020-06-07 00:14:39 (12788): Detected: vboxwrapper 26197
2020-06-07 00:14:39 (12788): Detected: BOINC client v7.7
2020-06-07 00:14:40 (12788): Detected: VirtualBox VboxManage Interface (Version: 5.2.8)
...omitted....
2020-06-07 00:14:57 (12788): VM state change detected. (old = 'PoweredOff', new = 'Running')
...omitted....
2020-06-07 00:19:02 (12788): Guest Log: [DEBUG] HTCondor ping
2020-06-07 00:19:03 (12788): Guest Log: [DEBUG] 0
2020-06-07 00:26:51 (12788): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 18:38:07 (12788): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 18:38:08 (12788): Guest Log: 00:10:40.055696 timesync vgsvcTimeSyncWorker: Radical host time change: 65 481 854 000 000ns (HostNow=1 591 547 868 663 000 000 ns HostLast=1 591 482 386 809 000 000 ns)
2020-06-07 18:38:18 (12788): Guest Log: 00:10:50.056476 timesync vgsvcTimeSyncWorker: Radical guest time change: 65 470 267 403 000ns (GuestNow=1 591 547 878 663 789 000 ns GuestLast=1 591 482 408 396 386 000 ns fSetTimeLastLoop=true )
2020-06-07 18:50:23 (12788): Guest Log: [ERROR] Condor ended after 66684 seconds.
2020-06-07 18:50:23 (12788): Guest Log: [INFO] Shutting Down.
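
The time spent in each VM state can be worked out from the timestamps in the log. Below is a minimal Python sketch that does this arithmetic. The helper state_durations is a hypothetical name introduced for this illustration, and the timestamps are copied from the "VM state change detected" and "Condor ended" lines above.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# VM state changes copied from the stderr extract of the failed task.
EVENTS = [
    ("2020-06-07 00:14:57", "Running"),
    ("2020-06-07 00:26:51", "Paused"),
    ("2020-06-07 18:38:07", "Running"),
]
END = "2020-06-07 18:50:23"  # timestamp of the "Condor ended" line


def state_durations(events, end):
    """Return (state, seconds) for each interval between state changes."""
    stamps = [datetime.strptime(t, FMT) for t, _ in events]
    stamps.append(datetime.strptime(end, FMT))
    return [
        (state, int((stamps[i + 1] - stamps[i]).total_seconds()))
        for i, (_, state) in enumerate(events)
    ]


if __name__ == "__main__":
    for state, seconds in state_durations(EVENTS, END):
        print(f"{state}: {seconds} s")
```

With these timestamps the paused interval comes to 65476 seconds (about 18 hours 11 minutes), close to the "Radical host time change" of roughly 65482 seconds in the guest log. The 66684 seconds in the Condor error is a longer span: counting back from 18:50:23 gives about 00:18:59, near the HTCondor ping at 00:19:02, so that figure appears to be measured from when Condor started rather than from the moment the VM was paused.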
5) Message boards : CMS Application : CMS Tasks Failing (Message 42398)
Posted 10 May 2020 by PaoloNasca
Post:
Today the error is related to Condor:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=272679370

Guest Log: [INFO] CMS application starting. Check log files.
Guest Log: [DEBUG] HTCondor ping
Guest Log: [DEBUG] 0
Guest Log: [ERROR] Condor ended after 1324 seconds.
Guest Log: [INFO] Shutting Down.
6) Message boards : CMS Application : CMS Tasks Failing (Message 42360)
Posted 1 May 2020 by PaoloNasca
Post:
I'm facing the same issue. All WUs from the "CMS Simulation v50.00 (vbox64) windows_x86_64" application fail.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272165131

ERROR: Couldn't read proxy from: /tmp/x509up_u0
globus_credential: Error reading proxy credential
globus_credential: Error reading proxy credential: Couldn't read PEM from bio
OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line
Guest Log: Use -debug for further information.
[ERROR] Could not get an x509 credential
[ERROR] The x509 proxy creation failed.


