Message boards :
ATLAS application :
Bad WUs?
Author | Message |
---|---|
Send message Joined: 18 Dec 15 Posts: 1786 Credit: 117,409,521 RAC: 75,222 |
At QuChemPedIA@home, upgrading BOINC to 7.16.20 solved a problem with security certificates. The same was/is true for GPUGRID, among others. But my problem from earlier today was obviously something different. Tasks which I downloaded later are working again. |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,822,225 RAC: 18,477 |
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10631979
2022-01-04 08:05:54 (8452): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2022-01-04 08:05:54 (8452): Guest Log: 2.6.3.0 1439 307445734561825800 29676 99016 63 1 1491998 4096001 0 65024 0 102 98.0392 533 645 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch DIRECT 1
2022-01-04 08:05:54 (8452): Guest Log: ATHENA_PROC_NUMBER=2
2022-01-04 08:05:54 (8452): Guest Log: *** Starting ATLAS job. (PandaID=5314311303 taskID=27722525) ***
2022-01-04 08:05:54 (8452): Guest Log: *** Job finished ***
2022-01-04 08:05:54 (8452): Guest Log: *** The last 20 lines of the pilot log: ***
2022-01-04 08:05:54 (8452): Guest Log: /usr/bin/time: cannot run ./runpilot2-wrapper.sh: Permission denied
Win11 Pro. I get this message from BOINC:
LHC@home: Notice from BOINC
Missing </app_version> in app_config.xml
04.01.2022 08:45:57
app_config:
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>2.000000</<avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
</app_config> |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,822,225 RAC: 18,477 |
I have changed app_config.xml, using the one from another Win11 Pro machine:
<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>1</max_concurrent>
    <report_results_immediately>1</report_results_immediately>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>6</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 26250</cmdline>
  </app_version>
</app_config>
Your app_config.xml file refers to an unknown application 'Theory'. Known applications: 'CMS', 'sixtrack', 'ATLAS'
04.01.2022 11:22:52
I will check again; I don't know why using app_config is so difficult. |
Send message Joined: 15 Jun 08 Posts: 2520 Credit: 252,126,683 RAC: 132,642 |
See: https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration
Strictly follow the template given there.
<report_results_immediately>1</report_results_immediately>
must not be enclosed by <app>...</app>.
BOINC's xml files must be stored using ANSI encoding. Using Unicode would be wrong.
<cmdline>--memory_size_mb 26250</cmdline>
Either a typo or a huge waste of RAM. 6-core ATLAS should be set to 8400 MB. ATLAS does not run faster with more RAM, but RAM allocated by the VM would be locked by VirtualBox and never given back to the OS until the VM shuts down.
Your app_config.xml file refers to an unknown application 'Theory'.
This BOINC client hasn't got a Theory task since the last project reset. app_config.xml changes corresponding tags/values in client_state.xml. The message just mentions that there are no corresponding tags regarding 'Theory'. Download a Theory task, then reload the config files. |
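The two checks discussed in this post (is the file well-formed, and is the requested VM memory sensible for the core count) can be sketched in Python. This is only an illustration: the function names are hypothetical, and the rule `3000 + 900 * ncpus` is an assumption fitted to two values that appear in this thread (8400 MB for 6 cores above, and 10200 MB for the 8-core VM in a log posted further down); it is not taken from official ATLAS documentation. A malformed closing tag such as `</<avg_ncpus>` would make `ET.fromstring` raise `ParseError`.

```python
import xml.etree.ElementTree as ET


def expected_memory_mb(ncpus):
    # Assumption: linear rule fitted to two values seen in this thread
    # (6 cores -> 8400 MB, 8 cores -> 10200 MB). Not an official formula.
    return 3000 + 900 * ncpus


def inspect_app_config(xml_text):
    """Parse app_config.xml text; return one summary dict per <app_version>.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed.
    """
    root = ET.fromstring(xml_text)
    report = []
    for version in root.findall("app_version"):
        ncpus = int(float(version.findtext("avg_ncpus", default="1")))
        words = version.findtext("cmdline", default="").split()
        memory_mb = None
        if "--memory_size_mb" in words:
            idx = words.index("--memory_size_mb")
            if idx + 1 < len(words):
                memory_mb = int(words[idx + 1])
        report.append({
            "app_name": version.findtext("app_name"),
            "ncpus": ncpus,
            "memory_mb": memory_mb,
            "expected_mb": expected_memory_mb(ncpus),
        })
    return report


# The <app_version> section quoted in the post above.
SAMPLE = """<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>6</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 26250</cmdline>
  </app_version>
</app_config>"""

for entry in inspect_app_config(SAMPLE):
    print(entry)
```

Running it on the quoted section reports 26250 MB requested against 8400 MB expected for 6 cores, which matches the "typo or waste of RAM" remark.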
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,822,225 RAC: 18,477 |
Thanks. I was testing RAM usage with more CPUs. I will reduce it to the normal level. |
Send message Joined: 18 Dec 15 Posts: 1786 Credit: 117,409,521 RAC: 75,222 |
The bad WUs which appeared here recently are back :-(
I've had several since last night, e.g.:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=338090618
https://lhcathome.cern.ch/lhcathome/result.php?resultid=338090668
excerpt from stderr:
2022-01-05 02:27:06 (15472): Guest Log: 00:00:10.255596 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 588 212 552 000ns (GuestNow=1 641 346 026 423 484 000 ns GuestLast=1 641 349 614 636 036 000 ns fSetTimeLastLoop=true )
How nice to wake up in the morning and find out that such WUs were blocking a slot all night :-( |
Send message Joined: 18 Dec 15 Posts: 1786 Credit: 117,409,521 RAC: 75,222 |
the bad WUs which appeared here recently are back :-(
Meanwhile, I've had quite a number of these faulty tasks spread over all of my machines. Is it not possible back there to stop them instead of sending them out? |
Send message Joined: 14 Jan 10 Posts: 1411 Credit: 9,434,983 RAC: 10,709 |
excerpt from stderr:
This 'Radical guest time change' is not the reason for the task hanging around. You see this time change in every result, including the valid ones, just after the VM has started. Mostly a difference of about 1 hour is noted. You may also see those messages after a suspend and a resume of a task. So there must be another reason.
I recently saw 1 idle task on my machine, where 5 VMs had started shortly after each other. I just suspended that task with LAIM off. https://lhcathome.cern.ch/lhcathome/result.php?resultid=338111927
I discarded the snapshot with VirtualBox Manager, started the VM myself outside of BOINC until the athena processes were running, saved the machine state to disk and let BOINC do the rest. |
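The "about 1 hour" figure can be checked against the stderr excerpt quoted earlier in the thread. The short Python sketch below just subtracts the two timestamps from that log line; the function name is made up for illustration.

```python
NS_PER_SECOND = 1_000_000_000


def guest_time_jump_seconds(guest_now_ns, guest_last_ns):
    """Signed clock jump in seconds; negative means the guest clock moved back."""
    return (guest_now_ns - guest_last_ns) / NS_PER_SECOND


# Values copied from the "Radical guest time change" line quoted in the thread.
GUEST_NOW_NS = 1_641_346_026_423_484_000
GUEST_LAST_NS = 1_641_349_614_636_036_000

jump = guest_time_jump_seconds(GUEST_NOW_NS, GUEST_LAST_NS)
print(f"jump: {jump:.3f} s, i.e. {jump / 60:.1f} minutes")
```

The difference is -3 588 212 552 000 ns, the same value the log reports, which is just under one hour backwards.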
Send message Joined: 18 Dec 15 Posts: 1786 Credit: 117,409,521 RAC: 75,222 |
These 'Radical guest time change' is not the reason for the task hanging around.
Oh, yes, you are right: this "Radical guest time change" message can be seen in every finished task. Thanks for the hint.
On one of my machines I run 5 tasks concurrently, and once in a while it may happen that one starts shortly after another one. However, yesterday I noticed this problem on two machines where only 1 task is being run, so this would rather rule out the above assumption.
What also catches the eye: when this happens, it happens several times within a period of 1-2 days, and then not for a lengthy time thereafter. In fact, I saw this happen for the first time about a week ago, never before. And the next time it happened was yesterday. And not only on one given machine.
So this would lead me to the assumption that it has to do with the task itself, i.e. the task is faulty. |
Send message Joined: 27 Jun 08 Posts: 1 Credit: 1,040,399 RAC: 185 |
Hello guys, I have 4 WUs which have been running for over 14 hours, and the progress bar is at 99.999%. CPU utilization has been around 0% from the beginning. This morning I canceled one of the tasks, and when the new one started crunching, everything looked good for it. This should be the output for the stuck WU:
2022-01-07 20:24:06 (24152): Detected: BOINC client v7.7
2022-01-07 20:24:07 (24152): Detected: VirtualBox VboxManage Interface (Version: 6.1.12)
2022-01-07 20:24:07 (24152): Successfully copied 'init_data.xml' to the shared directory.
2022-01-07 20:24:09 (24152): Create VM. (boinc_86d4d189c256c343, slot#1)
2022-01-07 20:24:10 (24152): Setting Memory Size for VM. (10200MB)
2022-01-07 20:24:11 (24152): Setting CPU Count for VM. (8)
2022-01-07 20:24:12 (24152): Setting Chipset Options for VM.
2022-01-07 20:24:13 (24152): Setting Boot Options for VM.
2022-01-07 20:24:13 (24152): Setting Network Configuration for NAT.
2022-01-07 20:24:15 (24152): Enabling VM Network Access.
2022-01-07 20:24:16 (24152): Disabling USB Support for VM.
2022-01-07 20:24:16 (24152): Disabling COM Port Support for VM.
2022-01-07 20:24:17 (24152): Disabling LPT Port Support for VM.
2022-01-07 20:24:18 (24152): Disabling Audio Support for VM.
2022-01-07 20:24:18 (24152): Disabling Clipboard Support for VM.
2022-01-07 20:24:19 (24152): Disabling Drag and Drop Support for VM.
2022-01-07 20:24:20 (24152): Adding storage controller(s) to VM.
2022-01-07 20:24:21 (24152): Adding virtual disk drive to VM. (vm_image.vdi)
2022-01-07 20:24:23 (24152): Adding VirtualBox Guest Additions to VM.
2022-01-07 20:24:24 (24152): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2022-01-07 20:24:25 (24152): forwarding host port 58242 to guest port 80
2022-01-07 20:24:26 (24152): Enabling remote desktop for VM.
2022-01-07 20:24:26 (24152): Required extension pack not installed, remote desktop not enabled.
2022-01-07 20:24:26 (24152): Enabling shared directory for VM.
2022-01-07 20:24:27 (24152): Starting VM using VBoxManage interface. (boinc_86d4d189c256c343, slot#1)
2022-01-07 20:24:33 (24152): Successfully started VM. (PID = '7256')
2022-01-07 20:24:33 (24152): Reporting VM Process ID to BOINC.
2022-01-07 20:24:33 (24152): Guest Log: BIOS: VirtualBox 6.1.12
2022-01-07 20:24:33 (24152): Guest Log: CPUID EDX: 0x178bfbff
2022-01-07 20:24:33 (24152): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2022-01-07 20:24:33 (24152): VM state change detected. (old = 'PoweredOff', new = 'Running')
2022-01-07 20:24:33 (24152): Detected: Web Application Enabled (http://localhost:58242)
2022-01-07 20:24:34 (24152): Preference change detected
2022-01-07 20:24:34 (24152): Setting CPU throttle for VM. (100%)
2022-01-07 20:24:34 (24152): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 900 seconds))
2022-01-07 20:24:35 (24152): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2022-01-07 20:24:35 (24152): Guest Log: BIOS: Booting from Hard Disk...
2022-01-07 20:24:37 (24152): Guest Log: BIOS: KBD: unsupported int 16h function 03
2022-01-07 20:24:37 (24152): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f
2022-01-07 22:04:39 (24152): Status Report: Elapsed Time: '6000.000000'
2022-01-07 22:04:39 (24152): Status Report: CPU Time: '9.359375'
2022-01-07 23:44:43 (24152): Status Report: Elapsed Time: '12000.000000'
2022-01-07 23:44:43 (24152): Status Report: CPU Time: '9.687500'
2022-01-08 01:24:48 (24152): Status Report: Elapsed Time: '18000.000000'
2022-01-08 01:24:48 (24152): Status Report: CPU Time: '9.953125'
2022-01-08 03:04:52 (24152): Status Report: Elapsed Time: '24000.000000'
2022-01-08 03:04:52 (24152): Status Report: CPU Time: '10.109375'
2022-01-08 04:44:57 (24152): Status Report: Elapsed Time: '30000.000000'
2022-01-08 04:44:57 (24152): Status Report: CPU Time: '10.484375'
2022-01-08 06:25:01 (24152): Status Report: Elapsed Time: '36000.000000'
2022-01-08 06:25:01 (24152): Status Report: CPU Time: '10.828125'
2022-01-08 08:05:05 (24152): Status Report: Elapsed Time: '42000.000000'
2022-01-08 08:05:05 (24152): Status Report: CPU Time: '11.218750'
2022-01-08 09:45:09 (24152): Status Report: Elapsed Time: '48000.000000'
2022-01-08 09:45:09 (24152): Status Report: CPU Time: '11.578125'
2022-01-08 11:25:14 (24152): Status Report: Elapsed Time: '54000.000000'
2022-01-08 11:25:14 (24152): Status Report: CPU Time: '11.765625' |
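The "Status Report" pairs in this log are enough to spot an idle VM without opening the console: elapsed time keeps growing while CPU time barely moves. Below is a minimal Python sketch of that check. The function names and the 1% threshold are assumptions chosen for illustration, not anything BOINC or the vboxwrapper defines.

```python
import re

# \s+ tolerates a line break between the two words of each label.
ELAPSED_RE = re.compile(r"Status Report: Elapsed\s+Time: '([0-9.]+)'")
CPU_RE = re.compile(r"Status Report: CPU\s+Time: '([0-9.]+)'")


def last_cpu_ratio(stderr_text):
    """Return CPU time / elapsed time from the last status report, or None."""
    elapsed = ELAPSED_RE.findall(stderr_text)
    cpu = CPU_RE.findall(stderr_text)
    if not elapsed or not cpu:
        return None
    return float(cpu[-1]) / float(elapsed[-1])


def looks_stuck(stderr_text, min_ratio=0.01):
    # Assumption: a VM using less than 1% of one core overall is idle.
    ratio = last_cpu_ratio(stderr_text)
    return ratio is not None and ratio < min_ratio


# First and last status reports copied from the log above.
SAMPLE_LOG = """\
2022-01-07 22:04:39 (24152): Status Report: Elapsed Time: '6000.000000'
2022-01-07 22:04:39 (24152): Status Report: CPU Time: '9.359375'
2022-01-08 11:25:14 (24152): Status Report: Elapsed Time: '54000.000000'
2022-01-08 11:25:14 (24152): Status Report: CPU Time: '11.765625'
"""

print("ratio:", last_cpu_ratio(SAMPLE_LOG))
print("stuck:", looks_stuck(SAMPLE_LOG))
```

For this task the VM accumulated under 12 seconds of CPU time in 54000 seconds of wall time, so any reasonable threshold would flag it.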
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,822,225 RAC: 18,477 |
If you use only the defaults from the ATLAS preferences, that can cause problems with the number of CPUs in combination with the usable RAM, and also with the network (LAN or WiFi). The first thing to do is to reduce the number of CPUs and the number of ATLAS tasks running in parallel. You can read about this in the ATLAS threads. |
Send message Joined: 18 Dec 15 Posts: 1786 Credit: 117,409,521 RAC: 75,222 |
Tasks which error out after about 10-12 minutes are back:
the bad WUs which appeared here recently are back :-(
meanwhile, I've had quite a number of these faulty tasks spread over all of my machines.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=340549753 |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,822,225 RAC: 18,477 |
The question came from David, so we have to wait for an answer from the ATLAS team. In dev we are testing a new -native version and a new wrapper for Windows. Is it possible to transfer them to production? |
Send message Joined: 18 Dec 15 Posts: 1786 Credit: 117,409,521 RAC: 75,222 |
I just noticed that on all my machines most ATLAS tasks are failing. They do start after download, but there is NO CPU usage, and the VM console cannot be opened. And until these tasks are stopped manually, they run and run and run ... Is anyone having the same experience? |
Send message Joined: 18 Dec 15 Posts: 1786 Credit: 117,409,521 RAC: 75,222 |
I just noticed that on all my machines most ATLAS tasks are failing.
So I downloaded Theory tasks, and they also fail: the VM console says right at the beginning: ERROR: Could not source logging functions ... So there seems to be some kind of glitch somewhere at LHC, or on all of my computers??? |
Send message Joined: 27 Sep 08 Posts: 831 Credit: 688,603,106 RAC: 139,819 |
I'd say it's on their side; all of my computers are at very low utilization. |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,822,225 RAC: 18,477 |
I see no problem here. 60 tasks active (ATLAS (19), CMS (25), Theory (16)) with a RedHat-CentOS8 squid. |
Send message Joined: 2 May 07 Posts: 2228 Credit: 173,822,225 RAC: 18,477 |
This task now has 5 days of runtime with 450k CPU time:
<active_task>
  <project_master_url>https://lhcathome.cern.ch/lhcathome/</project_master_url>
  <result_name>8ReKDmRp050np2BDcpmwOghnABFKDmABFKDmZZzTDmYVTKDmZDbprn_1</result_name>
  <checkpoint_cpu_time>456505.700000</checkpoint_cpu_time>
  <checkpoint_elapsed_time>457317.885666</checkpoint_elapsed_time>
  <fraction_done>0.000000</fraction_done>
  <peak_working_set_size>66056192</peak_working_set_size>
  <peak_swap_size>84889600</peak_swap_size>
  <peak_disk_usage>3209721126</peak_disk_usage>
</active_task> |
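To put the figures in this `<active_task>` snippet into perspective, a small Python sketch can convert the checkpoint times to days and show them next to `fraction_done`. The function name is hypothetical, and the snippet is parsed on its own here rather than from a full client_state.xml.

```python
import xml.etree.ElementTree as ET

SECONDS_PER_DAY = 86400

# Fields copied from the <active_task> block quoted above.
ACTIVE_TASK = """<active_task>
  <result_name>8ReKDmRp050np2BDcpmwOghnABFKDmABFKDmZZzTDmYVTKDmZDbprn_1</result_name>
  <checkpoint_cpu_time>456505.700000</checkpoint_cpu_time>
  <checkpoint_elapsed_time>457317.885666</checkpoint_elapsed_time>
  <fraction_done>0.000000</fraction_done>
</active_task>"""


def summarize_active_task(xml_text):
    """Return CPU and elapsed time in days plus the reported progress."""
    task = ET.fromstring(xml_text)
    cpu_s = float(task.findtext("checkpoint_cpu_time"))
    elapsed_s = float(task.findtext("checkpoint_elapsed_time"))
    return {
        "result_name": task.findtext("result_name"),
        "cpu_days": cpu_s / SECONDS_PER_DAY,
        "elapsed_days": elapsed_s / SECONDS_PER_DAY,
        "fraction_done": float(task.findtext("fraction_done")),
    }


summary = summarize_active_task(ACTIVE_TASK)
print(f"{summary['cpu_days']:.2f} CPU days, "
      f"{summary['elapsed_days']:.2f} elapsed days, "
      f"fraction_done={summary['fraction_done']}")
```

Unlike the idle VM reported earlier, this task is consuming CPU time almost one-to-one with elapsed time while still reporting zero progress.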
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0 |
Did you check the terminals inside the VM to see if it is doing anything and what stage it is on? |
Send message Joined: 18 Dec 15 Posts: 1786 Credit: 117,409,521 RAC: 75,222 |
For the last few hours, all my ATLAS tasks on all of my computers have been failing. They keep running, but with no CPU usage, and VM console_2 shows N/A for each core. I then aborted the tasks manually. Examples: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10688539 What's going wrong? |
©2024 CERN