Message boards : ATLAS application : Problem of the day ATLAS

Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 857
Credit: 703,166,075
RAC: 138,790
Message 46810 - Posted: 23 May 2022, 16:56:56 UTC
Last modified: 23 May 2022, 16:58:14 UTC

ID: 46810
maeax

Joined: 2 May 07
Posts: 2260
Credit: 175,581,097
RAC: 15,522
Message 46813 - Posted: 24 May 2022, 5:48:49 UTC - in response to Message 46810.  

Yes, I saw this too last week, but only in a small number of ATLAS tasks, along with Guru Meditation errors.
All we can do is watch for ATLAS tasks that fail or run far too long, and delete those tasks.
ID: 46813
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1440
Credit: 9,657,640
RAC: 1,126
Message 46814 - Posted: 24 May 2022, 10:00:25 UTC - in response to Message 46813.  

When you see this happen, you can revive the task:

1. Suspend the task in BOINC with "leave in memory" not selected. The VM will be saved to disk.
2. In VirtualBox Manager:
- delete (discard) the saved state
- start the VM and let it run until the first events are being processed
- stop the VM, saving its state to disk
3. Resume the task in BOINC.
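
For anyone who prefers the command line, the steps above can be sketched roughly as follows. This is only an illustration, not a tested procedure: it assumes `boinccmd` and `VBoxManage` are installed and can see the BOINC-registered VM, that suspending through `boinccmd` behaves like suspending in the BOINC Manager, and the VM name, project URL and task name in the example are placeholders. The script only prints the commands, because the wait for the first events in step 2 has to be judged by hand.

```python
import shlex


def revive_commands(vm_name, project_url, task_name):
    """Return the command lines for the revive procedure, in order.

    Nothing is executed here. Run each command by hand and wait for the
    VM to start processing events before issuing the savestate command.
    """
    return [
        # 1. Suspend the task in BOINC ("leave in memory" must be off so
        #    that the VM state is written to disk).
        ["boinccmd", "--task", project_url, task_name, "suspend"],
        # 2a. Discard the saved state of the VM.
        ["VBoxManage", "discardstate", vm_name],
        # 2b. Start the VM again without opening a window.
        ["VBoxManage", "startvm", vm_name, "--type", "headless"],
        # 2c. After the first events are processing: save the state to disk.
        ["VBoxManage", "controlvm", vm_name, "savestate"],
        # 3. Resume the task in BOINC.
        ["boinccmd", "--task", project_url, task_name, "resume"],
    ]


if __name__ == "__main__":
    # All three values are placeholders (assumptions); substitute your own.
    commands = revive_commands(
        vm_name="boinc_0123456789abcdef",
        project_url="https://lhcathome.cern.ch/lhcathome/",
        task_name="EXAMPLE_TASK_NAME",
    )
    for argv in commands:
        print(shlex.join(argv))
```

The VM name of a running task appears in the task log in the "Deregistering VM" or similar lines; the task name is the one shown in the BOINC Manager.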
ID: 46814
maeax

Joined: 2 May 07
Posts: 2260
Credit: 175,581,097
RAC: 15,522
Message 46822 - Posted: 25 May 2022, 17:48:53 UTC - in response to Message 46814.  
Last modified: 25 May 2022, 17:49:59 UTC

2022-05-25 16:02:06 (11660): Guest Log: Running cvmfs_config stat atlas.cern.ch

2022-05-25 16:02:06 (11660): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE

2022-05-25 16:02:06 (11660): Guest Log: 2.6.3.0 1781 307445734561825742 32288 104734 4 1 1492424 4096000 0 65024 0 0 n/a 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch http://xx.yyy.zzz.aa:3128 1

2022-05-25 16:02:06 (11660): Guest Log: ATHENA_PROC_NUMBER=12

2022-05-25 16:02:06 (11660): Guest Log: *** Starting ATLAS job. (PandaID=5463929576 taskID=29107814) ***

2022-05-25 16:12:56 (11660): VM is no longer is a running state. It is in 'GuruMeditation'.
2022-05-25 16:12:56 (11660): VM state change detected. (old = 'Running', new = 'GuruMeditation')

2022-05-25 16:12:56 (11660): Powering off VM.
2022-05-25 16:12:56 (11660): Deregistering VM. (boinc_d20a7b32445566aa, slot#5)
2022-05-25 16:13:38 (11660): Removing network bandwidth throttle group from VM.
2022-05-25 16:13:39 (11660): Removing VM from VirtualBox.
2022-05-25 16:14:17 (11660): Virtual machine exited.
16:14:27 (11660): called boinc_finish(0)
ID: 46822
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 857
Credit: 703,166,075
RAC: 138,790
Message 46823 - Posted: 25 May 2022, 20:35:21 UTC

ID: 46823
maeax

Joined: 2 May 07
Posts: 2260
Credit: 175,581,097
RAC: 15,522
Message 46917 - Posted: 21 Jun 2022, 22:59:06 UTC

2022-06-21 20:26:53 (52460): Guest Log: 2.6.3.0 1852 307445734561825742 32172 105817 3 1 1492435 4096000 0 65024 0 0 n/a 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch http://xx.xxx.xxx.xx:3128 1
2022-06-21 20:26:53 (52460): Guest Log: ATHENA_PROC_NUMBER=12
2022-06-21 20:26:55 (52460): Guest Log: *** Starting ATLAS job. (PandaID=5497406599 taskID=29339193) ***
2022-06-21 22:02:59 (52460): Status Report: Elapsed Time: '6000.000000'
2022-06-21 22:02:59 (52460): Status Report: CPU Time: '29828.796875'
2022-06-21 23:43:05 (52460): Status Report: Elapsed Time: '12000.000000'
2022-06-21 23:43:05 (52460): Status Report: CPU Time: '66942.609375'
2022-06-22 00:41:18 (52460): Guest Log: *** Job finished ***

Computer ID 10795955 https://lhcathome.cern.ch/lhcathome/result.php?resultid=358429345
Run time 4 hours 18 min 52 sec
CPU time 23 hours 48 min 43 sec
Validate state Valid
Credit 871.48

With 12 CPUs: 4 hours x 12 = 48 hours.
Yet the CPU time is only 23 hours 48 min 43 sec??
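
To put a number on that gap, here is a small sketch of the arithmetic using only the figures from this result page. The function names are made up for illustration, and the calculation says nothing about why the cores were idle; it only measures how much of the available core time was used.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h/min/sec duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(run_time_s, cpu_time_s, n_cores):
    """Fraction of the available core time that was actually used."""
    return cpu_time_s / (run_time_s * n_cores)


run_time = to_seconds(4, 18, 52)    # run time from the result page
cpu_time = to_seconds(23, 48, 43)   # CPU time from the result page
cores = 12                          # ATHENA_PROC_NUMBER=12 in the log

efficiency = cpu_efficiency(run_time, cpu_time, cores)
print(f"core-hours available: {run_time * cores / 3600:.1f}")   # 51.8
print(f"CPU efficiency:       {efficiency:.0%}")                # 46%
print(f"average busy cores:   {cpu_time / run_time:.1f}")       # 5.5
```

So on average fewer than half of the 12 cores were busy over the whole run.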
ID: 46917
maeax

Joined: 2 May 07
Posts: 2260
Credit: 175,581,097
RAC: 15,522
Message 50826 - Posted: 16 Oct 2024, 16:54:19 UTC
Last modified: 16 Oct 2024, 17:52:59 UTC

A native ATLAS task with these timings:
Exit status 0 (0x00000000)
Computer ID 10816264
Run time 7 hours 51 min 44 sec
CPU time 3 hours 47 min 11 sec

Validate state Valid

CentOS 9, native, with all updates applied, including yesterday's.
[2024-10-16 08:02:12] apptainer version 1.3.4-1.el9

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10816264
ID: 50826
maeax

Joined: 2 May 07
Posts: 2260
Credit: 175,581,097
RAC: 15,522
Message 50878 - Posted: 21 Oct 2024, 20:27:26 UTC

Computer ID 10797673
Run time 1 hour 35 min 14 sec
CPU time 1 min 14 sec
Validate state Valid
Credit 648.32

This is the first time I have seen an ATLAS task with a CPU time like this!
ID: 50878
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2605
Credit: 262,154,800
RAC: 133,429
Message 50908 - Posted: 25 Oct 2024, 7:54:37 UTC

Lots of ATLAS tasks are failing due to a missing file "PDGTABLE.MeV".
[2024-10-25 09:36:41] 2024-10-25 07:36:26,733 | INFO     | exeerrordiag: Non-zero return code from EVNTtoHITS (8); Logfile error in log.EVNTtoHITS: "IOError: [Errno 2] No such file or directory: 'PDGTABLE.MeV'"
[2024-10-25 09:36:41] 2024-10-25 07:36:26,733 | INFO     | exitcode: 65
[2024-10-25 09:36:41] 2024-10-25 07:36:26,733 | INFO     | exitmsg: Non-zero return code from EVNTtoHITS (8); Logfile error in log.EVNTtoHITS: "IOError: [Errno 2] No such file or directory: 'PDGTABLE.MeV'"
ID: 50908
maeax

Joined: 2 May 07
Posts: 2260
Credit: 175,581,097
RAC: 15,522
Message 50914 - Posted: 25 Oct 2024, 15:44:46 UTC

On Windows 11 Pro, no ATLAS task starts correctly today.
For example, I found this message at the end of the log file:
<message>
upload failure: <file_xfer_error>
<file_name>JuRNDmXXXO6n9Rq4apOajLDm4fhM0noT9bVorHsSDmgV5KDmspl9qm_0_r1027194911_ATLAS_result</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>

Theory is also having problems, but CMS is running for the moment.
ID: 50914
Harri Liljeroos

Joined: 28 Sep 04
Posts: 745
Credit: 51,965,584
RAC: 31,382
Message 50916 - Posted: 25 Oct 2024, 17:52:42 UTC

ID: 50916
m

Joined: 6 Sep 08
Posts: 118
Credit: 12,876,808
RAC: 4,053
Message 50918 - Posted: 25 Oct 2024, 20:53:47 UTC - in response to Message 50916.  
Last modified: 25 Oct 2024, 20:58:02 UTC

Got one here, too.

I remember that files larger than 2 GB have been a problem in the past; I thought that had been fixed...
ID: 50918
Dark Angel

Joined: 7 Aug 11
Posts: 105
Credit: 26,099,112
RAC: 1,161
Message 50920 - Posted: 25 Oct 2024, 22:44:46 UTC

Multiple "file size too big" errors here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=415199482
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415206927
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415209263
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415209312
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415209313
ID: 50920
wujj123456

Joined: 14 Sep 08
Posts: 52
Credit: 66,850,956
RAC: 31,909
Message 50921 - Posted: 25 Oct 2024, 23:08:55 UTC - in response to Message 50920.  

Same here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415205300

I also have results for those 2050 events that succeeded because they are just under 2GB. Now I'm not sure if I should just abort the others...
ID: 50921
Erich56

Joined: 18 Dec 15
Posts: 1841
Credit: 126,296,188
RAC: 124,680
Message 50922 - Posted: 26 Oct 2024, 5:50:36 UTC

So I am lucky: on one of my old notebooks there's a 50 events task running right now :-)
ID: 50922
ktamail666

Joined: 11 Jul 06
Posts: 6
Credit: 2,915,386
RAC: 1
Message 50925 - Posted: 26 Oct 2024, 11:15:19 UTC
Last modified: 26 Oct 2024, 11:16:55 UTC

As I see it, this limit is set in the work generator and is determined by max_nbytes:
https://boinc.berkeley.edu/trac/wiki/JobTemplates

The source code says the default is 1 GB, so the work generator here probably sets a 2 GB limit.

<output_template>
<file_info>
<name><OUTFILE_0/></name>
<generated_locally/>
<upload_when_present/>
<max_nbytes>32768</max_nbytes>
<url><UPLOAD_URL/></url>
[ <gzip_when_done/> ]
</file_info>

<max_nbytes>
maximum file size. If the actual size exceeds this, the file will not be uploaded, and the job will be marked as an error.

I also found 2 bad native runs:
2029388744 Oct 24 13:08 shared/HITS.pool.root.1
2067339121 Oct 24 11:39 shared/HITS.pool.root.1

https://lhcathome.cern.ch/lhcathome/result.php?resultid=415199150
https://lhcathome.cern.ch/lhcathome/result.php?resultid=415199151
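
A quick check of those two file sizes against the two usual meanings of "2 GB" may help narrow this down. This is only a sketch: the actual max_nbytes value the project uses is not known from this thread, and the constant and function names are made up for illustration.

```python
DECIMAL_2GB = 2 * 10**9   # 2,000,000,000 bytes
BINARY_2GB = 2**31        # 2,147,483,648 bytes (2 GiB)

# Sizes of shared/HITS.pool.root.1 from the two bad native runs above.
observed_sizes = [2029388744, 2067339121]


def classify_size(size_bytes):
    """Say where a file size falls relative to the two candidate limits."""
    if size_bytes > BINARY_2GB:
        return "over 2 GiB"
    if size_bytes > DECIMAL_2GB:
        return "over 2 GB (decimal), under 2 GiB"
    return "under 2 GB (decimal)"


for size in observed_sizes:
    print(f"{size:>12,d} bytes: {classify_size(size)}")
```

Both files are above 2,000,000,000 bytes but below 2^31 bytes, so if a size limit is what rejected them, it would have to be the decimal one.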
ID: 50925
maeax

Joined: 2 May 07
Posts: 2260
Credit: 175,581,097
RAC: 15,522
Message 50926 - Posted: 26 Oct 2024, 11:37:24 UTC - in response to Message 50925.  

A data center network switch is down today.
I saw this on the CERN support site.
I don't know whether this is a reason for the problems here.
ID: 50926
Erich56

Joined: 18 Dec 15
Posts: 1841
Credit: 126,296,188
RAC: 124,680
Message 50929 - Posted: 26 Oct 2024, 13:48:05 UTC - in response to Message 50926.  

DataCenter Networkswitch down today.
Saw this on Cern support site.
Don't know, if this is a reason for the problems here.
Hm, all 3 subprojects are running okay on my hosts so far.
ID: 50929

©2025 CERN