Bad WUs?

Author	Message
Erich56 Send message Joined: 18 Dec 15 Posts: 1687 Credit: 103,047,123 RAC: 126,420	Message 45934 - Posted: 22 Dec 2021, 16:50:12 UTC - in response to Message 45933. At QuChemPedIA@home upgrading BOINC to 7.16.20 solved a problem with security certificates. Tullio The same was/is true for GPUGRID, among others. But my problem earlier today was obviously something different. Tasks which I downloaded later are working again. ID: 45934 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2090 Credit: 158,821,277 RAC: 126,578	Message 45987 - Posted: 4 Jan 2022, 7:22:49 UTC Last modified: 4 Jan 2022, 8:00:59 UTC
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10631979
2022-01-04 08:05:54 (8452): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2022-01-04 08:05:54 (8452): Guest Log: 2.6.3.0 1439 307445734561825800 29676 99016 63 1 1491998 4096001 0 65024 0 102 98.0392 533 645 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas.cern.ch DIRECT 1
2022-01-04 08:05:54 (8452): Guest Log: ATHENA_PROC_NUMBER=2
2022-01-04 08:05:54 (8452): Guest Log: * Starting ATLAS job. (PandaID=5314311303 taskID=27722525) *
2022-01-04 08:05:54 (8452): Guest Log: * Job finished *
2022-01-04 08:05:54 (8452): Guest Log: * The last 20 lines of the pilot log: *
2022-01-04 08:05:54 (8452): Guest Log: /usr/bin/time: cannot run ./runpilot2-wrapper.sh: Permission denied
Win11pro. I get this message from BOINC:
LHC@home: Notice from BOINC
Missing </app_version> in app_config.xml
04.01.2022 08:45:57
app_config:
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>2.000000</<avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
</app_config>
ID: 45987 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2090 Credit: 158,821,277 RAC: 126,578	Message 45991 - Posted: 4 Jan 2022, 10:32:08 UTC - in response to Message 45987.
I have changed app_config.xml, using the one from another Win11pro machine.
<app_config>
  <app>
    <name>Theory</name>
    <max_concurrent>1</max_concurrent>
    <report_results_immediately>1</report_results_immediately>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>6</avg_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 26250</cmdline>
  </app_version>
</app_config>
Your app_config.xml file refers to an unknown application 'Theory'. Known applications: 'CMS', 'sixtrack', 'ATLAS'
04.01.2022 11:22:52
I will check again. I don't know why using app_config is this difficult.
ID: 45991 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2401 Credit: 225,455,332 RAC: 123,713	Message 45992 - Posted: 4 Jan 2022, 11:12:41 UTC - in response to Message 45991.
See: https://boinc.berkeley.edu/wiki/Client_configuration#Project-level_configuration
Strictly follow the template given there.
<report_results_immediately>1</report_results_immediately> must not be enclosed by <app>...</app>.
BOINC's xml files must be stored using ANSI encoding. Using unicode would be wrong.
<cmdline>--memory_size_mb 26250</cmdline>
Either a typo or a huge waste of RAM. 6-core ATLAS should be set to 8400 MB. ATLAS does not run faster with more RAM, but RAM allocated by the VM would be locked by VirtualBox and never given back to the OS until the VM shuts down.
Your app_config.xml file refers to an unknown application 'Theory'.
This BOINC client hasn't got a Theory task since the last project reset. app_config.xml changes corresponding tags/values in client_state.xml. The message just mentions that there are no corresponding tags regarding 'Theory'. Download a Theory task, then reload the config files.
ID: 45992 · Reply Quote
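A quick way to catch the kind of error reported in Message 45987 is to run the file through an XML parser before BOINC reads it. The sketch below is only an illustration and not part of BOINC: the function name `xml_problem` and both sample strings are assumptions. The broken sample reproduces the `</<avg_ncpus>` closing tag from the earlier post, and the fixed sample uses the 8400 MB suggested above for 6-core ATLAS. BOINC uses its own parser, so passing this check does not guarantee that BOINC accepts the file.

```python
import xml.etree.ElementTree as ET


def xml_problem(text):
    """Return a description of the first XML syntax error, or None if the text parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


# Hypothetical sample: contains the malformed closing tag from Message 45987.
BROKEN = (
    "<app_config><app_version><app_name>ATLAS</app_name>"
    "<avg_ncpus>2.000000</<avg_ncpus>"
    "</app_version></app_config>"
)

# Hypothetical corrected sample with the closing tag repaired.
FIXED = """<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>6</avg_ncpus>
    <cmdline>--memory_size_mb 8400</cmdline>
  </app_version>
</app_config>"""

if __name__ == "__main__":
    for label, text in (("broken", BROKEN), ("fixed", FIXED)):
        print(label, "->", xml_problem(text) or "well-formed")
```

To check a real file, read it with `open(path).read()` and pass the text to `xml_problem`; the path depends on the local BOINC data directory.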

maeax Send message Joined: 2 May 07 Posts: 2090 Credit: 158,821,277 RAC: 126,578	Message 45993 - Posted: 4 Jan 2022, 11:27:32 UTC - in response to Message 45992. Thanks, I was testing RAM usage with more CPUs. I will reduce it to the normal level. ID: 45993 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1687 Credit: 103,047,123 RAC: 126,420	Message 46002 - Posted: 5 Jan 2022, 11:56:48 UTC - in response to Message 45993. the bad WUs which appeared here recently are back :-( I had several ones since last night, e.g.: https://lhcathome.cern.ch/lhcathome/result.php?resultid=338090618 https://lhcathome.cern.ch/lhcathome/result.php?resultid=338090668 excerpt from stderr: 2022-01-05 02:27:06 (15472): Guest Log: 00:00:10.255596 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 588 212 552 000ns (GuestNow=1 641 346 026 423 484 000 ns GuestLast=1 641 349 614 636 036 000 ns fSetTimeLastLoop=true ) how nice if you wake up in the morning just to find out that such WUs were blocking a slot all night through :-( ID: 46002 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1687 Credit: 103,047,123 RAC: 126,420	Message 46004 - Posted: 5 Jan 2022, 15:12:37 UTC - in response to Message 46002. the bad WUs which appeared here recently are back :-( I had several ones since last night ... meanwhile, I've had quite a number of these faulty tasks spread over all of my machines. Is it not possible back there to stop them instead of sending them out? ID: 46004 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1273 Credit: 8,480,147 RAC: 2,155	Message 46008 - Posted: 6 Jan 2022, 7:42:26 UTC - in response to Message 46002. excerpt from stderr: 2022-01-05 02:27:06 (15472): Guest Log: 00:00:10.255596 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 588 212 552 000ns (GuestNow=1 641 346 026 423 484 000 ns GuestLast=1 641 349 614 636 036 000 ns fSetTimeLastLoop=true ) how nice if you wake up in the morning just to find out that such WUs were blocking a slot all night through :-( This 'Radical guest time change' is not the reason for the task hanging around. You see this time change in every result, including the valid ones, just after the VM has started. Mostly a difference of about 1 hour is noted. You may also see those messages after suspending and resuming a task. So there must be another reason. I recently saw 1 task idle on my machine, where 5 VMs had started shortly after each other. I just suspended that task with LAIM off. https://lhcathome.cern.ch/lhcathome/result.php?resultid=338111927 Discarded the snapshot with VirtualBox Manager. Started the VM myself outside of BOINC until the athenas were running. Saved the machine to disk and let BOINC do the rest. ID: 46008 · Reply Quote
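The size of the time change in the quoted stderr excerpt can be checked with a little arithmetic. The sketch below is a minimal illustration; the helper name `parse_ns` is an assumption, and the two values are copied from the log line above. It subtracts GuestLast from GuestNow and converts the result to minutes, which gives a jump of just under one hour, in line with the "about 1 hour" noted in the reply.

```python
def parse_ns(field):
    """Convert a space-grouped nanosecond value from the log into an int."""
    return int(field.replace(" ", ""))


# Values copied from the 'Radical guest time change' log line.
guest_now = parse_ns("1 641 346 026 423 484 000")
guest_last = parse_ns("1 641 349 614 636 036 000")

delta_ns = guest_now - guest_last      # negative: the guest clock was set back
delta_min = delta_ns / 1e9 / 60        # nanoseconds -> seconds -> minutes

if __name__ == "__main__":
    print(f"{delta_ns} ns = {delta_min:.1f} minutes")
```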

Erich56 Send message Joined: 18 Dec 15 Posts: 1687 Credit: 103,047,123 RAC: 126,420	Message 46009 - Posted: 6 Jan 2022, 8:59:49 UTC - in response to Message 46008. This 'Radical guest time change' is not the reason for the task hanging around. You see this time change in every result, including the valid ones, just after the VM has started. Mostly a difference of about 1 hour is noted. You may also see those messages after suspending and resuming a task. So there must be another reason. I recently saw 1 task idle on my machine, where 5 VMs had started shortly after each other. ... oh, yes, you are right: this "Radical guest time change" message can be seen in every finished task. Thanks for the hint. On one of my machines I run 5 tasks concurrently, and once in a while it may happen that one starts shortly after another one. However: yesterday I noticed this problem on two machines where only 1 task is being run, so this would rather preclude the above assumption. What also catches the eye: when this happens, it happens several times within a period of 1-2 days, and then not for a lengthy time thereafter. In fact, I saw this happen for the first time about a week ago, not any time before. And the next time it happened was yesterday. And not only on one given machine. So this would lead me to the assumption that it has to do with the task itself, i.e. the task is faulty. ID: 46009 · Reply Quote

Michal Kinďura Send message Joined: 27 Jun 08 Posts: 1 Credit: 1,002,557 RAC: 0	Message 46026 - Posted: 8 Jan 2022, 10:42:37 UTC
Hello guys, I have 4 WUs which have been running for over 14 hours with the progress bar at 99.999%. CPU utilization has been around 0% from the beginning. This morning I canceled one of the tasks, and when the new one started crunching, everything looked good for it. This should be the output for the stuck WU:
2022-01-07 20:24:06 (24152): Detected: vboxwrapper 26197
2022-01-07 20:24:06 (24152): Detected: BOINC client v7.7
2022-01-07 20:24:07 (24152): Detected: VirtualBox VboxManage Interface (Version: 6.1.12)
2022-01-07 20:24:07 (24152): Successfully copied 'init_data.xml' to the shared directory.
2022-01-07 20:24:09 (24152): Create VM. (boinc_86d4d189c256c343, slot#1)
2022-01-07 20:24:10 (24152): Setting Memory Size for VM. (10200MB)
2022-01-07 20:24:11 (24152): Setting CPU Count for VM. (8)
2022-01-07 20:24:12 (24152): Setting Chipset Options for VM.
2022-01-07 20:24:13 (24152): Setting Boot Options for VM.
2022-01-07 20:24:13 (24152): Setting Network Configuration for NAT.
2022-01-07 20:24:15 (24152): Enabling VM Network Access.
2022-01-07 20:24:16 (24152): Disabling USB Support for VM.
2022-01-07 20:24:16 (24152): Disabling COM Port Support for VM.
2022-01-07 20:24:17 (24152): Disabling LPT Port Support for VM.
2022-01-07 20:24:18 (24152): Disabling Audio Support for VM.
2022-01-07 20:24:18 (24152): Disabling Clipboard Support for VM.
2022-01-07 20:24:19 (24152): Disabling Drag and Drop Support for VM.
2022-01-07 20:24:20 (24152): Adding storage controller(s) to VM.
2022-01-07 20:24:21 (24152): Adding virtual disk drive to VM. (vm_image.vdi)
2022-01-07 20:24:23 (24152): Adding VirtualBox Guest Additions to VM.
2022-01-07 20:24:24 (24152): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2022-01-07 20:24:25 (24152): forwarding host port 58242 to guest port 80
2022-01-07 20:24:26 (24152): Enabling remote desktop for VM.
2022-01-07 20:24:26 (24152): Required extension pack not installed, remote desktop not enabled.
2022-01-07 20:24:26 (24152): Enabling shared directory for VM.
2022-01-07 20:24:27 (24152): Starting VM using VBoxManage interface. (boinc_86d4d189c256c343, slot#1)
2022-01-07 20:24:33 (24152): Successfully started VM. (PID = '7256')
2022-01-07 20:24:33 (24152): Reporting VM Process ID to BOINC.
2022-01-07 20:24:33 (24152): Guest Log: BIOS: VirtualBox 6.1.12
2022-01-07 20:24:33 (24152): Guest Log: CPUID EDX: 0x178bfbff
2022-01-07 20:24:33 (24152): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2022-01-07 20:24:33 (24152): VM state change detected. (old = 'PoweredOff', new = 'Running')
2022-01-07 20:24:33 (24152): Detected: Web Application Enabled (http://localhost:58242)
2022-01-07 20:24:34 (24152): Preference change detected
2022-01-07 20:24:34 (24152): Setting CPU throttle for VM. (100%)
2022-01-07 20:24:34 (24152): Setting checkpoint interval to 900 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 900 seconds))
2022-01-07 20:24:35 (24152): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2022-01-07 20:24:35 (24152): Guest Log: BIOS: Booting from Hard Disk...
2022-01-07 20:24:37 (24152): Guest Log: BIOS: KBD: unsupported int 16h function 03
2022-01-07 20:24:37 (24152): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=81
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=81
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=82
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=82
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=83
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=83
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=84
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=84
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=85
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=85
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=86
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=86
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=87
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=87
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=88
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=88
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=89
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=89
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8a
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8a
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8b
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8b
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8c
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8c
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8d
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8d
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8e
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8e
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk_ext: function 41, unmapped device for ELDL=8f
2022-01-07 20:24:37 (24152): Guest Log: int13_harddisk: function 02, unmapped device for ELDL=8f
2022-01-07 22:04:39 (24152): Status Report: Elapsed Time: '6000.000000'
2022-01-07 22:04:39 (24152): Status Report: CPU Time: '9.359375'
2022-01-07 23:44:43 (24152): Status Report: Elapsed Time: '12000.000000'
2022-01-07 23:44:43 (24152): Status Report: CPU Time: '9.687500'
2022-01-08 01:24:48 (24152): Status Report: Elapsed Time: '18000.000000'
2022-01-08 01:24:48 (24152): Status Report: CPU Time: '9.953125'
2022-01-08 03:04:52 (24152): Status Report: Elapsed Time: '24000.000000'
2022-01-08 03:04:52 (24152): Status Report: CPU Time: '10.109375'
2022-01-08 04:44:57 (24152): Status Report: Elapsed Time: '30000.000000'
2022-01-08 04:44:57 (24152): Status Report: CPU Time: '10.484375'
2022-01-08 06:25:01 (24152): Status Report: Elapsed Time: '36000.000000'
2022-01-08 06:25:01 (24152): Status Report: CPU Time: '10.828125'
2022-01-08 08:05:05 (24152): Status Report: Elapsed Time: '42000.000000'
2022-01-08 08:05:05 (24152): Status Report: CPU Time: '11.218750'
2022-01-08 09:45:09 (24152): Status Report: Elapsed Time: '48000.000000'
2022-01-08 09:45:09 (24152): Status Report: CPU Time: '11.578125'
2022-01-08 11:25:14 (24152): Status Report: Elapsed Time: '54000.000000'
2022-01-08 11:25:14 (24152): Status Report: CPU Time: '11.765625'
ID: 46026 · Reply Quote
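The Status Report lines in this log make the stall easy to spot: after 54000 seconds of elapsed time the VM has used less than 12 seconds of CPU. The sketch below is one possible way to flag such a task from a stderr excerpt. The function names and the two thresholds (`min_elapsed`, `min_ratio`) are assumptions chosen for illustration, not values used by BOINC or vboxwrapper; only the line format is taken from the log above.

```python
import re

# Matches the two kinds of Status Report lines shown in the log above.
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")


def last_status(log_text):
    """Return (elapsed_seconds, cpu_seconds) from the last Status Report lines."""
    elapsed = cpu = None
    for kind, value in STATUS_RE.findall(log_text):
        if kind == "Elapsed Time":
            elapsed = float(value)
        else:
            cpu = float(value)
    return elapsed, cpu


def looks_stuck(log_text, min_elapsed=3600.0, min_ratio=0.01):
    """Guess whether a task is idle: long elapsed time but almost no CPU time.

    Both thresholds are hypothetical and should be tuned per host.
    """
    elapsed, cpu = last_status(log_text)
    if elapsed is None or cpu is None or elapsed < min_elapsed:
        return False
    return cpu / elapsed < min_ratio


# Last two Status Report lines copied from the post above.
SAMPLE = """2022-01-08 11:25:14 (24152): Status Report: Elapsed Time: '54000.000000'
2022-01-08 11:25:14 (24152): Status Report: CPU Time: '11.765625'"""

if __name__ == "__main__":
    print(last_status(SAMPLE), looks_stuck(SAMPLE))
```

A healthy multi-core ATLAS task would normally show CPU time of the same order as elapsed time or higher, so a ratio this far below 1 is a strong hint that the VM never started its payload.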

maeax Send message Joined: 2 May 07 Posts: 2090 Credit: 158,821,277 RAC: 126,578	Message 46027 - Posted: 8 Jan 2022, 11:08:11 UTC - in response to Message 46026. If you are using only the defaults from the ATLAS preferences, this can cause problems with the number of CPUs in combination with the available RAM, and also with the network (LAN or WiFi). The first thing to do is reduce the number of CPUs and the number of ATLAS tasks running in parallel. You can read about this in the ATLAS thread. ID: 46027 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1687 Credit: 103,047,123 RAC: 126,420	Message 46109 - Posted: 25 Jan 2022, 19:18:38 UTC - in response to Message 46004. the bad WUs which appeared here recently are back :-( I had several ones since last night ... meanwhile, I've had quite a number of these faulty tasks spread over all of my machines. Is it not possible back there to stop them instead of sending them out? tasks which error out after about 10-12 minutes are back: https://lhcathome.cern.ch/lhcathome/result.php?resultid=340549753 ID: 46109 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2090 Credit: 158,821,277 RAC: 126,578	Message 46110 - Posted: 26 Jan 2022, 9:44:12 UTC - in response to Message 45911. The question was coming from David, so we have to wait for an answer from the Atlas-Team. There were some issues with central databases at CERN around the time the problems were being reported, so that could have been the cause of the stuck or failing WU. However I see that the vboxwrapper we use is from 2017(!) and should be updated. I have updated the ATLAS app on the LHC-dev project to use the latest version so please give it a try if you have an account there. If it looks good I will update it here, but probably not before the new year. In dev we are testing a new -native version and a new wrapper for Windows. Is it possible to transfer them to Production? ID: 46110 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1687 Credit: 103,047,123 RAC: 126,420	Message 46454 - Posted: 17 Mar 2022, 20:47:44 UTC I just noticed that on all my machines most ATLAS tasks are failing. They do start after download, but there is NO CPU usage, and the VM console cannot be opened. And until these tasks are stopped manually, they run and run and run ... Is anyone having the same experience? ID: 46454 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1687 Credit: 103,047,123 RAC: 126,420	Message 46456 - Posted: 17 Mar 2022, 21:15:14 UTC - in response to Message 46454. I just noticed that on all my machines most ATLAS tasks are failing. They do start after download, but there is NO CPU usage, and the VM console cannot be opened. And until these tasks are stopped manually, they run and run and run ... Is anyone having the same experience? So I downloaded Theory tasks, and they also fail: the VM console says right at the beginning: ERROR: Could not source logging functions ... So there seems to be some kind of glitch somewhere at LHC, or on all of my computers ??? ID: 46456 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 803 Credit: 650,015,284 RAC: 239,766	Message 46458 - Posted: 18 Mar 2022, 6:58:54 UTC - in response to Message 46456. I'd say it's on their side; all of my computers are at very low utilization. ID: 46458 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2090 Credit: 158,821,277 RAC: 126,578	Message 46459 - Posted: 18 Mar 2022, 7:20:29 UTC - in response to Message 46458. I see no problem here. 60 tasks active (Atlas(19), CMS(25), Theory(16)) with a RedHat-CentOS8 squid. ID: 46459 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2090 Credit: 158,821,277 RAC: 126,578	Message 46760 - Posted: 8 May 2022, 17:49:37 UTC
This task now has 5 days of runtime with about 450k seconds of CPU time:
<active_task>
  <project_master_url>https://lhcathome.cern.ch/lhcathome/</project_master_url>
  <result_name>8ReKDmRp050np2BDcpmwOghnABFKDmABFKDmZZzTDmYVTKDmZDbprn_1</result_name>
  <checkpoint_cpu_time>456505.700000</checkpoint_cpu_time>
  <checkpoint_elapsed_time>457317.885666</checkpoint_elapsed_time>
  <fraction_done>0.000000</fraction_done>
  <peak_working_set_size>66056192</peak_working_set_size>
  <peak_swap_size>84889600</peak_swap_size>
  <peak_disk_usage>3209721126</peak_disk_usage>
</active_task>
ID: 46760 · Reply Quote
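The numbers in this `<active_task>` block can be converted to days to confirm the "5 days" figure. The sketch below is a minimal illustration that parses a shortened copy of the block; the variable names are assumptions, and only three of the original tags are kept. It shows a task that has consumed more than five days of CPU time while `fraction_done` is still zero, which is the opposite pattern to an idle VM.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <active_task> block from the post above.
ACTIVE_TASK = """<active_task>
  <checkpoint_cpu_time>456505.700000</checkpoint_cpu_time>
  <checkpoint_elapsed_time>457317.885666</checkpoint_elapsed_time>
  <fraction_done>0.000000</fraction_done>
</active_task>"""

SECONDS_PER_DAY = 86400

task = ET.fromstring(ACTIVE_TASK)
cpu_days = float(task.findtext("checkpoint_cpu_time")) / SECONDS_PER_DAY
elapsed_days = float(task.findtext("checkpoint_elapsed_time")) / SECONDS_PER_DAY
fraction_done = float(task.findtext("fraction_done"))

if __name__ == "__main__":
    print(f"CPU {cpu_days:.2f} d, elapsed {elapsed_days:.2f} d, done {fraction_done:.0%}")
```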

Jonathan Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,231,282 RAC: 5,477	Message 46761 - Posted: 8 May 2022, 18:33:02 UTC - in response to Message 46760. Did you check the terminals inside the VM to see if it is doing anything and what stage it is on? ID: 46761 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1687 Credit: 103,047,123 RAC: 126,420	Message 47176 - Posted: 26 Aug 2022, 11:46:22 UTC for the last few hours, all my ATLAS tasks on all of my computers are failing. They keep running, but no CPU usage, and VM console_2 shows N/A for each core. I then aborted the task manually. Examples: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10688539 What's going wrong? ID: 47176 · Reply Quote

LHC@home