Message boards : ATLAS application : Bad WUs?

Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,327,148
RAC: 26,005
Message 45934 - Posted: 22 Dec 2021, 16:50:12 UTC - in response to Message 45933.  

At QuChemPedIA@home upgrading BOINC to 7.16.20 solved a problem with security certificates.
Tullio
The same was/is true for GPUGRID, among others.
But my problem from earlier today was obviously something different.
Tasks which I downloaded later are working again.
ID: 45934
maeax
Joined: 2 May 07
Posts: 2242
Credit: 173,902,177
RAC: 2,796
Message 45987 - Posted: 4 Jan 2022, 7:22:49 UTC
Last modified: 4 Jan 2022, 8:00:59 UTC

https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10631979
2022-01-04 08:05:54 (8452): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2022-01-04 08:05:54 (8452): Guest Log: 2.6.3.0 1439 307445734561825800 29676 99016 63 1 1491998 4096001 0 65024 0 102 98.0392 533 645 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch DIRECT 1
2022-01-04 08:05:54 (8452): Guest Log: ATHENA_PROC_NUMBER=2
2022-01-04 08:05:54 (8452): Guest Log: *** Starting ATLAS job. (PandaID=5314311303 taskID=27722525) ***
2022-01-04 08:05:54 (8452): Guest Log: *** Job finished ***
2022-01-04 08:05:54 (8452): Guest Log: *** The last 20 lines of the pilot log: ***
2022-01-04 08:05:54 (8452): Guest Log: /usr/bin/time: cannot run ./runpilot2-wrapper.sh: Permission denied

Win11 Pro. I got this message from BOINC:
LHC@home: Notice from BOINC
Missing </app_version> in app_config.xml
04.01.2022 08:45:57
app_config:
<app_config>
<app>
<name>ATLAS</name>
<max_concurrent>1</max_concurrent>
</app>
<app_version>
<app_name>ATLAS</app_name>
<avg_ncpus>2.000000</<avg_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<cmdline>--memory_size_mb 5000</cmdline>
</app_version>
</app_config>
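
The notice "Missing </app_version>" is probably triggered by the malformed closing tag `</<avg_ncpus>` in the file above. Here is a minimal sketch of a check to run before reloading the config. It assumes that Python's standard XML parser is a reasonable stand-in for BOINC's own, more lenient parser; the function name `check_well_formed` is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses as XML, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


# The file exactly as posted, including the stray '<' in the closing tag.
BROKEN = """<app_config>
<app>
<name>ATLAS</name>
<max_concurrent>1</max_concurrent>
</app>
<app_version>
<app_name>ATLAS</app_name>
<avg_ncpus>2.000000</<avg_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<cmdline>--memory_size_mb 5000</cmdline>
</app_version>
</app_config>"""

# Same file with the closing tag repaired.
FIXED = BROKEN.replace("</<avg_ncpus>", "</avg_ncpus>")

if __name__ == "__main__":
    print("posted file:", check_well_formed(BROKEN))
    print("fixed file: ", check_well_formed(FIXED))
```

A file that passes this check can still be rejected by BOINC for other reasons (unknown tags, wrong nesting), so it only catches the purely syntactic mistakes.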
ID: 45987
maeax
Joined: 2 May 07
Posts: 2242
Credit: 173,902,177
RAC: 2,796
Message 45991 - Posted: 4 Jan 2022, 10:32:08 UTC - in response to Message 45987.  

I have replaced app_config.xml with the one from another Win11 Pro machine:
<app_config>
<app>
<name>Theory</name>
<max_concurrent>1</max_concurrent>
<report_results_immediately>1</report_results_immediately>
</app>
<app_version>
<app_name>ATLAS</app_name>
<avg_ncpus>6</avg_ncpus>
<plan_class>vbox64_mt_mcore_atlas</plan_class>
<cmdline>--memory_size_mb 26250</cmdline>
</app_version>
</app_config>
Your app_config.xml file refers to an unknown application 'Theory'. Known applications: 'CMS', 'sixtrack', 'ATLAS'
04.01.2022 11:22:52

I will check again; I don't know why using app_config is so difficult.
ID: 45991
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2531
Credit: 253,722,201
RAC: 41,981
Message 45992 - Posted: 4 Jan 2022, 11:12:41 UTC - in response to Message 45991.  

See:
https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration
Strictly follow the template given there.
<report_results_immediately>1</report_results_immediately> must not be enclosed by <app>...</app>.

BOINC's xml files must be stored using ANSI encoding.
Using unicode would be wrong.
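
A rough way to spot the encoding problem is to look at the raw bytes of the file. The sketch below flags a byte-order mark, NUL bytes (typical for UTF-16) and any non-ASCII byte. It rests on the assumption that a plain-ASCII app_config.xml is always safe; the function name `encoding_problems` is hypothetical.

```python
import codecs

# Byte-order marks that text editors write when saving as "Unicode".
BOMS = {
    "UTF-8 BOM": codecs.BOM_UTF8,
    "UTF-16 LE BOM": codecs.BOM_UTF16_LE,
    "UTF-16 BE BOM": codecs.BOM_UTF16_BE,
}


def encoding_problems(data):
    """Return a list of human-readable findings for the raw file bytes."""
    problems = []
    for name, bom in BOMS.items():
        if data.startswith(bom):
            problems.append("starts with " + name)
    if b"\x00" in data:
        problems.append("contains NUL bytes (typical for UTF-16)")
    if any(byte > 0x7F for byte in data):
        problems.append("contains non-ASCII bytes")
    return problems


if __name__ == "__main__":
    # Replace the sample bytes with open(path, "rb").read() for a real file.
    sample = codecs.BOM_UTF8 + b"<app_config></app_config>"
    print(encoding_problems(sample))
```

An empty list means the file is plain ASCII, which reads the same under every ANSI code page.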



<cmdline>--memory_size_mb 26250</cmdline>

Either a typo or a huge waste of RAM.
6-core ATLAS should be set to 8400 MB.
ATLAS does not run faster with more RAM, but RAM allocated by the VM would be locked by VirtualBox and never given back to the OS until the VM shuts down.
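
The 8400 MB recommended here for 6 cores, and the 10200 MB that vboxwrapper sets for an 8-core VM in a log posted later in this thread, both fit a linear rule of 3000 MB plus 900 MB per core. That rule is inferred from these two data points only, so treat the constants in the sketch below as assumptions rather than project documentation.

```python
BASE_MB = 3000      # assumed fixed part, inferred from the two values in this thread
PER_CORE_MB = 900   # assumed per-core part, inferred the same way


def atlas_memory_mb(ncpus):
    """Estimate the VM memory size in MB for an ATLAS task using ncpus cores."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return BASE_MB + PER_CORE_MB * ncpus


def cmdline_for(ncpus):
    """Build the matching <cmdline> element for app_config.xml."""
    return "<cmdline>--memory_size_mb {}</cmdline>".format(atlas_memory_mb(ncpus))


if __name__ == "__main__":
    for cores in (2, 6, 8):
        print(cores, cmdline_for(cores))
```

Keep `<avg_ncpus>` and the memory value in step: raising the core count without adjusting `--memory_size_mb` is an easy way to end up with an undersized or oversized VM.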


Your app_config.xml file refers to an unknown application 'Theory'.

This BOINC client hasn't got a Theory task since the last project reset.
app_config.xml changes corresponding tags/values in client_state.xml.
The message just mentions that there are no corresponding tags regarding 'Theory'.

Download a Theory task, then reload the config files.
ID: 45992
maeax
Joined: 2 May 07
Posts: 2242
Credit: 173,902,177
RAC: 2,796
Message 45993 - Posted: 4 Jan 2022, 11:27:32 UTC - in response to Message 45992.  

Thanks,
I was testing RAM usage with more CPUs. I will reduce it to the normal level.
ID: 45993
Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,327,148
RAC: 26,005
Message 46002 - Posted: 5 Jan 2022, 11:56:48 UTC - in response to Message 45993.  

The bad WUs which appeared here recently are back :-(
I have had several since last night, e.g.:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=338090618
https://lhcathome.cern.ch/lhcathome/result.php?resultid=338090668

excerpt from stderr:

2022-01-05 02:27:06 (15472): Guest Log: 00:00:10.255596 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 588 212 552 000ns (GuestNow=1 641 346 026 423 484 000 ns GuestLast=1 641 349 614 636 036 000 ns fSetTimeLastLoop=true )

How nice to wake up in the morning just to find out that such WUs were blocking a slot all night long :-(
ID: 46002
Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,327,148
RAC: 26,005
Message 46004 - Posted: 5 Jan 2022, 15:12:37 UTC - in response to Message 46002.  

the bad WUs which appeared here recently are back :-(
I had several ones since last night ...
Meanwhile, I've had quite a number of these faulty tasks spread over all of my machines.

Is it not possible on the project side to stop them instead of sending them out?
ID: 46004
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,051
RAC: 885
Message 46008 - Posted: 6 Jan 2022, 7:42:26 UTC - in response to Message 46002.  

excerpt from stderr:

2022-01-05 02:27:06 (15472): Guest Log: 00:00:10.255596 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 588 212 552 000ns (GuestNow=1 641 346 026 423 484 000 ns GuestLast=1 641 349 614 636 036 000 ns fSetTimeLastLoop=true )

how nice if you wake up in the morning just to find out that such WUs were blocking a slot all night through :-(

This 'Radical guest time change' is not the reason the task hangs around.
You see this time change in every result, including the valid ones, just after the VM has started. Usually a difference of about 1 hour is noted.
You may also see those messages after suspending and resuming a task.
So there must be another reason. I recently saw 1 task sitting idle on my machine, where 5 VMs had started shortly after each other.
I just suspended that task with LAIM off. https://lhcathome.cern.ch/lhcathome/result.php?resultid=338111927
I discarded the snapshot with VirtualBox Manager, started the VM myself outside of BOINC and waited until the athena processes were running. Then I saved the machine state to disk and let BOINC do the rest.
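
To see that the quoted time change really is about one hour, the nanosecond value can be pulled out of the log line and converted. This is a minimal sketch that assumes the line format is exactly the one quoted in this thread; `time_change_seconds` is a made-up helper name.

```python
import re

# Matches e.g. "Radical guest time change: -3 588 212 552 000ns"
PATTERN = re.compile(r"Radical guest time change:\s*(-?\d[\d ]*)ns")


def time_change_seconds(log_line):
    """Return the reported guest time change in seconds, or None if absent."""
    match = PATTERN.search(log_line)
    if match is None:
        return None
    nanoseconds = int(match.group(1).replace(" ", ""))
    return nanoseconds / 1e9


LINE = (
    "2022-01-05 02:27:06 (15472): Guest Log: 00:00:10.255596 timesync "
    "vgsvcTimeSyncWorker: Radical guest time change: -3 588 212 552 000ns "
    "(GuestNow=1 641 346 026 423 484 000 ns GuestLast=1 641 349 614 636 036 000 ns "
    "fSetTimeLastLoop=true )"
)

if __name__ == "__main__":
    seconds = time_change_seconds(LINE)
    print("change: {:.1f} s = {:.1f} min".format(seconds, seconds / 60))
```

The value is just under one hour in magnitude, which matches the "about 1 hour difference" seen in valid results too, so the message alone says nothing about whether a task is stuck.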
ID: 46008
Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,327,148
RAC: 26,005
Message 46009 - Posted: 6 Jan 2022, 8:59:49 UTC - in response to Message 46008.  

These 'Radical guest time change' is not the reason for the task hanging around.
This time change you see in every result, also the valid ones just after the VM started. Mostly about 1 hour difference noted.
You may also see those messages after a suspend and a resume of a task.
So there must be another reason. I saw lately 1 task on my machine idle, where 5 VM's started shortly after each other.
...
Oh, yes, you are right: this "Radical guest time change" message can be seen in every finished task. Thanks for the hint.
On one of my machines I run 5 tasks concurrently, and once in a while it may happen that one starts shortly after another one.
However, yesterday I noticed this problem on two machines where only 1 task is being run, so this would rather rule out the above assumption.
What also catches the eye: when this happens, it happens several times within a period of 1-2 days, and then not for a lengthy time thereafter. In fact, I saw this happen for the first time about a week ago, and never before.
The next time it happened was yesterday, and not only on one particular machine.
So this leads me to the assumption that it has to do with the task itself, i.e. the task is faulty.
ID: 46009
Michal Kinďura
Joined: 27 Jun 08
Posts: 1
Credit: 1,040,399
RAC: 12
Message 46026 - Posted: 8 Jan 2022, 10:42:37 UTC

Hello guys,

I have 4 WUs which have been running for over 14 hours, and the progress bar is at 99.999%. CPU utilization has been around 0% from the beginning.
This morning I cancelled one of the tasks, and the new one that started crunching in its place looks good.

This should be the output for the stuck WU:

    2022-01-07 20:24:06 (24152): Detected: vboxwrapper 26197
    2022-01-07 20:24:06 (24152): Detected: BOINC client v7.7
    2022-01-07 20:24:07 (24152): Detected: VirtualBox VboxManage Interface (Version: 6.1.12)
    2022-01-07 20:24:07 (24152): Successfully copied 'init_data.xml' to the shared directory.
    2022-01-07 20:24:09 (24152): Create VM. (boinc_86d4d189c256c343, slot#1)
    2022-01-07 20:24:10 (24152): Setting Memory Size for VM. (10200MB)
    2022-01-07 20:24:11 (24152): Setting CPU Count for VM. (8)
    2022-01-07 20:24:12 (24152): Setting Chipset Options for VM.
    2022-01-07 20:24:13 (24152): Setting Boot Options for VM.
    2022-01-07 20:24:13 (24152): Setting Network Configuration for NAT.
    2022-01-07 20:24:15 (24152): Enabling VM Network Access.
    2022-01-07 20:24:16 (24152): Disabling USB Support for VM.
    2022-01-07 20:24:16 (24152): Disabling COM Port Support for VM.
    2022-01-07 20:24:17 (24152): Disabling LPT Port Support for VM.
    2022-01-07 20:24:18 (24152): Disabling Audio Support for VM.
    2022-01-07 20:24:18 (24152): Disabling Clipboard Support for VM.
    2022-01-07 20:24:19 (24152): Disabling Drag and Drop Support for VM.
    2022-01-07 20:24:20 (24152): Adding storage controller(s) to VM.
    2022-01-07 20:24:21 (24152): Adding virtual disk drive to VM. (vm_image.vdi)
    2022-01-07 20:24:23 (24152): Adding VirtualBox Guest Additions to VM.
    2022-01-07 20:24:24 (24152): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
    2022-01-07 20:24:25 (24152): forwarding host port 58242 to guest port 80
    2022-01-07 20:24:26 (24152): Enabling remote desktop for VM.
    2022-01-07 20:24:26 (24152): Required extension pack not installed, remote desktop not enabled.
    2022-01-07 20:24:26 (24152): Enabling shared directory for VM.
    2022-01-07 20:24:27 (24152): Starting VM using VBoxManage interface. (boinc_86d4d189c256c343, slot#1)
    2022-01-07 20:24:33 (24152): Successfully started VM. (PID = '7256')
    2022-01-07 20:24:33 (24152): Reporting VM Process ID to BOINC.
    2022-01-07 20:24:33 (24152): Guest Log: BIOS: VirtualBox 6.1.12

    2022-01-07 20:24:33 (24152): Guest Log: CPUID EDX: 0x178bfbff

    2022-01-07 20:24:33 (24152): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63

    2022-01-07 20:24:33 (24152): VM state change detected. (old = 'PoweredOff', new = 'Running')
    2022-01-07 20:24:33 (24152): Detected: Web Application Enabled (http://localhost:58242)
    2022-01-07 20:24:34 (24152): Preference change detected
    2022-01-07 20:24:34 (24152): Setting CPU throttle for VM. (100%)
    2022-01-07 20:24:34 (24152): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 900 seconds))
    2022-01-07 20:24:35 (24152): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032

    2022-01-07 20:24:35 (24152): Guest Log: BIOS: Booting from Hard Disk...

    2022-01-07 20:24:37 (24152): Guest Log: BIOS: KBD: unsupported int 16h function 03

    2022-01-07 20:24:37 (24152): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f

    2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f

    2022-01-07 22:04:39 (24152): Status Report: Elapsed Time: '6000.000000'
    2022-01-07 22:04:39 (24152): Status Report: CPU Time: '9.359375'
    2022-01-07 23:44:43 (24152): Status Report: Elapsed Time: '12000.000000'
    2022-01-07 23:44:43 (24152): Status Report: CPU Time: '9.687500'
    2022-01-08 01:24:48 (24152): Status Report: Elapsed Time: '18000.000000'
    2022-01-08 01:24:48 (24152): Status Report: CPU Time: '9.953125'
    2022-01-08 03:04:52 (24152): Status Report: Elapsed Time: '24000.000000'
    2022-01-08 03:04:52 (24152): Status Report: CPU Time: '10.109375'
    2022-01-08 04:44:57 (24152): Status Report: Elapsed Time: '30000.000000'
    2022-01-08 04:44:57 (24152): Status Report: CPU Time: '10.484375'
    2022-01-08 06:25:01 (24152): Status Report: Elapsed Time: '36000.000000'
    2022-01-08 06:25:01 (24152): Status Report: CPU Time: '10.828125'
    2022-01-08 08:05:05 (24152): Status Report: Elapsed Time: '42000.000000'
    2022-01-08 08:05:05 (24152): Status Report: CPU Time: '11.218750'
    2022-01-08 09:45:09 (24152): Status Report: Elapsed Time: '48000.000000'
    2022-01-08 09:45:09 (24152): Status Report: CPU Time: '11.578125'
    2022-01-08 11:25:14 (24152): Status Report: Elapsed Time: '54000.000000'
    2022-01-08 11:25:14 (24152): Status Report: CPU Time: '11.765625'
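
The Status Report lines above already show the stall: after 54000 s of elapsed time the VM has used less than 12 s of CPU. Here is a minimal sketch of how such a log could be screened automatically, assuming the vboxwrapper line format shown above. The function names and the 1% threshold are arbitrary choices for illustration.

```python
import re

ELAPSED = re.compile(r"Status Report: Elapsed Time: '([\d.]+)'")
CPU = re.compile(r"Status Report: CPU Time: '([\d.]+)'")


def cpu_ratios(log_text):
    """Return CPU time / elapsed time for each pair of Status Report lines."""
    elapsed = [float(value) for value in ELAPSED.findall(log_text)]
    cpu = [float(value) for value in CPU.findall(log_text)]
    return [c / e for e, c in zip(elapsed, cpu) if e > 0]


def looks_stalled(log_text, ncpus=1, threshold=0.01):
    """True if the latest report shows per-core CPU use below the threshold."""
    ratios = cpu_ratios(log_text)
    return bool(ratios) and ratios[-1] / ncpus < threshold


SAMPLE = """
2022-01-08 09:45:09 (24152): Status Report: Elapsed Time: '48000.000000'
2022-01-08 09:45:09 (24152): Status Report: CPU Time: '11.578125'
2022-01-08 11:25:14 (24152): Status Report: Elapsed Time: '54000.000000'
2022-01-08 11:25:14 (24152): Status Report: CPU Time: '11.765625'
"""

if __name__ == "__main__":
    print("stalled:", looks_stalled(SAMPLE, ncpus=8))
```

A healthy multi-core task should report a CPU time well above its elapsed time, so a ratio near zero after the first report is a reasonable cue to inspect or abort the task.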

ID: 46026
maeax
Joined: 2 May 07
Posts: 2242
Credit: 173,902,177
RAC: 2,796
Message 46027 - Posted: 8 Jan 2022, 11:08:11 UTC - in response to Message 46026.  

If you use only the defaults from the ATLAS preferences,
the number of CPUs in combination with the available RAM can cause problems, as can the network (LAN or WiFi).
The first thing to do is reduce the number of CPUs and the number of ATLAS tasks running in parallel.
You can read about this in the ATLAS threads.
ID: 46027
Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,327,148
RAC: 26,005
Message 46109 - Posted: 25 Jan 2022, 19:18:38 UTC - in response to Message 46004.  

the bad WUs which appeared here recently are back :-(
I had several ones since last night ...
meanwhile, I've had quite a number of these faulty tasks spread over all of my machines.
Is it not possible back there to stop them instead of sending them out?
Tasks which error out after about 10-12 minutes are back:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=340549753
ID: 46109
maeax
Joined: 2 May 07
Posts: 2242
Credit: 173,902,177
RAC: 2,796
Message 46110 - Posted: 26 Jan 2022, 9:44:12 UTC - in response to Message 45911.  

The question came from David, so we have to wait for an answer from the ATLAS team.


There were some issues with central databases at CERN around the time the problems were being reported, so that could have been the cause of the stuck or failing WU.

However I see that the vboxwrapper we use is from 2017(!) and should be updated. I have updated the ATLAS app on the LHC-dev project to use the latest version so please give it a try if you have an account there. If it looks good I will update it here, but probably not before the new year.

On dev we are testing a new -native version and a new wrapper for Windows.

Is it possible to transfer it to Production?
ID: 46110
Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,327,148
RAC: 26,005
Message 46454 - Posted: 17 Mar 2022, 20:47:44 UTC

I just noticed that on all my machines most ATLAS tasks are failing.
They do start after download, but there is NO CPU usage, and the VM console cannot be opened. And until these tasks are stopped manually, they run and run and run ...
Is anyone having the same experience?
ID: 46454
Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,327,148
RAC: 26,005
Message 46456 - Posted: 17 Mar 2022, 21:15:14 UTC - in response to Message 46454.  

I just noticed that on all my machines most ATLAS tasks are failing.
They do start after download, but there is NO CPU usage, and the VM console cannot be opened. And until these tasks are stopped manually, they run and run and run ...
Is anyone having the same experience?
So I downloaded Theory tasks, and they also fail: the VM console says right at the beginning: ERROR: Could not source logging functions ...

So there seems to be some kind of glitch somewhere at LHC, or on all of my computers ???
ID: 46456
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 847
Credit: 691,182,997
RAC: 106,505
Message 46458 - Posted: 18 Mar 2022, 6:58:54 UTC - in response to Message 46456.  

I'd say it's on their side; all of my computers show very low utilization.
ID: 46458
maeax
Joined: 2 May 07
Posts: 2242
Credit: 173,902,177
RAC: 2,796
Message 46459 - Posted: 18 Mar 2022, 7:20:29 UTC - in response to Message 46458.  

I see no problems here. 60 tasks are active (ATLAS (19), CMS (25), Theory (16)) with RedHat-CentOS8 squid.
ID: 46459
maeax
Joined: 2 May 07
Posts: 2242
Credit: 173,902,177
RAC: 2,796
Message 46760 - Posted: 8 May 2022, 17:49:37 UTC

This task now has 5 days of runtime with about 450k seconds of CPU time:
<active_task>
<project_master_url>https://lhcathome.cern.ch/lhcathome/</project_master_url>
<result_name>8ReKDmRp050np2BDcpmwOghnABFKDmABFKDmZZzTDmYVTKDmZDbprn_1</result_name>
<checkpoint_cpu_time>456505.700000</checkpoint_cpu_time>
<checkpoint_elapsed_time>457317.885666</checkpoint_elapsed_time>
<fraction_done>0.000000</fraction_done>
<peak_working_set_size>66056192</peak_working_set_size>
<peak_swap_size>84889600</peak_swap_size>
<peak_disk_usage>3209721126</peak_disk_usage>
</active_task>
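
For a quick sanity check of such an entry, the times can be converted to days and compared with `fraction_done`. The sketch below assumes the `<active_task>` fragment has been copied out of client_state.xml exactly as posted above; `summarize_active_task` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SECONDS_PER_DAY = 86400


def summarize_active_task(xml_text):
    """Extract the timing fields of one <active_task> element, in days."""
    task = ET.fromstring(xml_text)
    return {
        "result_name": task.findtext("result_name"),
        "cpu_days": float(task.findtext("checkpoint_cpu_time")) / SECONDS_PER_DAY,
        "elapsed_days": float(task.findtext("checkpoint_elapsed_time")) / SECONDS_PER_DAY,
        "fraction_done": float(task.findtext("fraction_done")),
    }


FRAGMENT = """<active_task>
<project_master_url>https://lhcathome.cern.ch/lhcathome/</project_master_url>
<result_name>8ReKDmRp050np2BDcpmwOghnABFKDmABFKDmZZzTDmYVTKDmZDbprn_1</result_name>
<checkpoint_cpu_time>456505.700000</checkpoint_cpu_time>
<checkpoint_elapsed_time>457317.885666</checkpoint_elapsed_time>
<fraction_done>0.000000</fraction_done>
</active_task>"""

if __name__ == "__main__":
    summary = summarize_active_task(FRAGMENT)
    print("CPU days: {cpu_days:.2f}, elapsed days: {elapsed_days:.2f}, "
          "fraction done: {fraction_done}".format(**summary))
```

Here the CPU time is almost equal to the elapsed time while `fraction_done` is still 0, which suggests roughly one core has been busy the whole time without reporting any progress.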
ID: 46760
Jonathan
Joined: 25 Sep 17
Posts: 99
Credit: 3,425,566
RAC: 0
Message 46761 - Posted: 8 May 2022, 18:33:02 UTC - in response to Message 46760.  

Did you check the terminals inside the VM to see if it is doing anything and what stage it is on?
ID: 46761
Erich56
Joined: 18 Dec 15
Posts: 1811
Credit: 118,327,148
RAC: 26,005
Message 47176 - Posted: 26 Aug 2022, 11:46:22 UTC

For the last few hours, all ATLAS tasks on all of my computers have been failing. They keep running, but with no CPU usage, and VM console_2 shows N/A for each core.
I then aborted the tasks manually.

Examples:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10688539

What's going wrong?
ID: 47176