Message boards : Theory Application : Problem of the day

maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,696,840
RAC: 24,696
Message 48998 - Posted: 8 Dec 2023, 23:19:54 UTC

2023-12-08 14:40:44 (26036): Guest Log: 14:40:45 CET +01:00 2023-12-08: cranky: [ERROR] 'cvmfs_config probe sft.cern.ch' failed.
ID: 48998
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2519
Credit: 250,938,919
RAC: 127,891
Message 49073 - Posted: 27 Dec 2023, 8:41:19 UTC

This Theory native task requested >70 GB RAM before it failed with status code 1:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=403724017

Run time 	38 min 30 sec
CPU time 	25 min 47 sec
Validate state 	Valid
Credit 	39.48
Device peak FLOPS 	7.38 GFLOPS
Application version 	Theory Simulation v300.08 (native_theory)
x86_64-pc-linux-gnu
Peak working set size 	70.16 GB
Peak swap size 	70.74 GB
Peak disk usage 	7.74 MB

08:46:58 CET +01:00 2023-12-27: cranky: [INFO] mcplots runspec: boinc pp jets 13000 15 - phojet 1.12a default 100000 65
.
.
.
09:25:20 CET +01:00 2023-12-27: cranky: [INFO] Container Theory_2673-2291577-65_0 finished with status code 1.
.
.
.
09:25:20 (95067): cranky exited; CPU time 278.290212
ID: 49073
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2519
Credit: 250,938,919
RAC: 127,891
Message 49157 - Posted: 9 Jan 2024, 7:00:33 UTC

This Theory native task requested >38 GB RAM before it failed with status code 1:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=404260387

The process consuming nearly all of those 38 GB was "rivetvm.exe".

Run time 	37 min 19 sec
CPU time 	29 min 43 sec
Application version 	Theory Simulation v300.08 (native_theory)
x86_64-pc-linux-gnu
Peak working set size 	37.96 GB
Peak swap size 	38.40 GB
Peak disk usage 	3.38 MB

07:10:52 CET +01:00 2024-01-09: cranky: [INFO] mcplots runspec: boinc pp jets 13000 900 - phojet 1.12a default 100000 16
.
.
.
07:48:07 CET +01:00 2024-01-09: cranky: [INFO] Container Theory_2687-2542715-16_2 finished with status code 1.
.
.
.
07:48:07 (112994): cranky exited; CPU time 1781.554838
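
To compare failing tasks such as the two reported so far, the runspec line can be split into fields. This is only a minimal sketch: the field names below are an assumption about the order of the runspec arguments and are not stated anywhere in this thread, and `RunSpec` and `parse_runspec` are names made up for the example.

```python
from typing import NamedTuple


class RunSpec(NamedTuple):
    # Field names are assumed for illustration; the thread does not define them.
    mode: str
    beam: str
    process: str
    energy: str
    params: str
    specific: str
    generator: str
    version: str
    tune: str
    nevts: int
    seed: int


def parse_runspec(log_line: str) -> RunSpec:
    """Extract the runspec fields from a cranky 'mcplots runspec:' log line."""
    marker = "mcplots runspec:"
    _, found, tail = log_line.partition(marker)
    if not found:
        raise ValueError("no runspec marker in line")
    fields = tail.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    return RunSpec(*fields[:9], int(fields[9]), int(fields[10]))


lines = [
    "08:46:58 CET +01:00 2023-12-27: cranky: [INFO] mcplots runspec: "
    "boinc pp jets 13000 15 - phojet 1.12a default 100000 65",
    "07:10:52 CET +01:00 2024-01-09: cranky: [INFO] mcplots runspec: "
    "boinc pp jets 13000 900 - phojet 1.12a default 100000 16",
]
specs = [parse_runspec(line) for line in lines]
for spec in specs:
    print(spec.generator, spec.version, spec.process, spec.energy, spec.params)
```

Both lines give phojet 1.12a, pp, jets, 13000, and they differ in the fifth field (15 versus 900) and in the last field.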
ID: 49157
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,696,840
RAC: 24,696
Message 49158 - Posted: 9 Jan 2024, 7:21:51 UTC

mcPlots does show some phojet 1.12a results, but only from one user's work.
ID: 49158
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,696,840
RAC: 24,696
Message 49231 - Posted: 23 Jan 2024, 4:45:14 UTC - in response to Message 49158.  

2024-01-23 05:30:32 (21140): Guest Log: 05:30:38 CET +01:00 2024-01-23: cranky: [INFO] Checking CVMFS.
2024-01-23 05:30:33 (21140): Guest Log: Probing /cvmfs/sft.cern.ch... Failed!
2024-01-23 05:30:33 (21140): Guest Log: 05:30:38 CET +01:00 2024-01-23: cranky: [ERROR] 'cvmfs_config probe sft.cern.ch' failed.
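
When a stderr log is long, it can help to pull out only the cranky lines of one level. The following is a rough sketch that assumes the log format seen in this thread ("cranky: [LEVEL] message", optionally with a version suffix such as "cranky-0.1.4"). The function name `cranky_messages` is made up for the example.

```python
import re

# Matches the cranky part of a log line, e.g.
#   "... cranky: [ERROR] 'cvmfs_config probe sft.cern.ch' failed."
# and also the versioned form "cranky-0.1.4: [INFO] ...".
CRANKY_RE = re.compile(r"cranky(?:-[\d.]+)?: \[(?P<level>[A-Z]+)\] (?P<msg>.*)$")


def cranky_messages(stderr_text: str, level: str = "ERROR") -> list[str]:
    """Return the messages of all cranky lines that have the given level."""
    found = []
    for line in stderr_text.splitlines():
        match = CRANKY_RE.search(line)
        if match and match.group("level") == level:
            found.append(match.group("msg"))
    return found


sample = """\
2024-01-23 05:30:32 (21140): Guest Log: 05:30:38 CET +01:00 2024-01-23: cranky: [INFO] Checking CVMFS.
2024-01-23 05:30:33 (21140): Guest Log: Probing /cvmfs/sft.cern.ch... Failed!
2024-01-23 05:30:33 (21140): Guest Log: 05:30:38 CET +01:00 2024-01-23: cranky: [ERROR] 'cvmfs_config probe sft.cern.ch' failed.
"""
errors = cranky_messages(sample)
print(errors)
```

Lines that do not come from cranky, such as the "Probing ... Failed!" line, are skipped by design.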
ID: 49231
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,696,840
RAC: 24,696
Message 49532 - Posted: 12 Feb 2024, 13:47:13 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=406079579
2024-02-12 14:10:42 (13076): Guest Log: job: htmld=/shared/html/job
2024-02-12 14:10:49 (13076): Guest Log: tar: ./.gitignore: Cannot change ownership to uid 19256, gid 1399: Invalid argument
2024-02-12 14:10:49 (13076): Guest Log: tar: ./alpgen/README: Cannot change ownership to uid 19256, gid 1399: Invalid argument
2024-02-12 14:10:49 (13076): Guest Log: tar: ./alpgen/example_dev_alp/Makefile: Cannot change ownership to uid 19256, gid 1399: Invalid argument
2024-02-12 14:10:49 (13076): Guest Log: tar: ./alpgen/example_dev_alp/Makefile.alpgen: Cannot change ownership to uid 19256, gid 1399: Invalid argument
ID: 49532
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,696,840
RAC: 24,696
Message 49533 - Posted: 12 Feb 2024, 18:55:31 UTC - in response to Message 49532.  

I have stopped running Theory under Win11 Pro.
I am also seeing tasks from other volunteers that do not finish successfully.
ID: 49533
hadron

Joined: 4 Sep 22
Posts: 90
Credit: 15,101,160
RAC: 30,979
Message 49535 - Posted: 13 Feb 2024, 3:01:58 UTC - in response to Message 49231.  

2024-01-23 05:30:32 (21140): Guest Log: 05:30:38 CET +01:00 2024-01-23: cranky: [INFO] Checking CVMFS.
2024-01-23 05:30:33 (21140): Guest Log: Probing /cvmfs/sft.cern.ch... Failed!
2024-01-23 05:30:33 (21140): Guest Log: 05:30:38 CET +01:00 2024-01-23: cranky: [ERROR] 'cvmfs_config probe sft.cern.ch' failed.


The same is happening here, except that I am running everything under VBox.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=406100207
ID: 49535
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,696,840
RAC: 24,696
Message 49536 - Posted: 13 Feb 2024, 5:24:53 UTC - in response to Message 49535.  

2024-02-12 19:56:10 (15040): Guest Log: 19:56:09 CET +01:00 2024-02-12: cranky: [INFO] Container 'runc' finished with status code 0.
2024-02-12 19:56:10 (15040): Guest Log: 19:56:09 CET +01:00 2024-02-12: cranky: [INFO] Preparing output.
2024-02-12 19:56:10 (15040): Guest Log: 19:56:09 CET +01:00 2024-02-12: cranky: [ERROR] No output found.
The task runs normally and there is no problem with CVMFS, but
the output file is not transferred to CERN IT.
ID: 49536
Ryan Munro

Joined: 17 Aug 17
Posts: 81
Credit: 8,410,301
RAC: 4,238
Message 49559 - Posted: 14 Feb 2024, 21:45:27 UTC

All the VBox units seem to be failing here as well.
ID: 49559
hadron

Joined: 4 Sep 22
Posts: 90
Credit: 15,101,160
RAC: 30,979
Message 49580 - Posted: 16 Feb 2024, 9:48:15 UTC

New error today. Nearly 30 of them, all reported just after 0900 UTC 16 Feb:

2024-02-16 03:19:00 (32167): Adding storage controller(s) to VM.
2024-02-16 03:19:00 (32167): Adding virtual disk drive to VM. (Theory_2023_12_13.vdi)
2024-02-16 03:19:05 (32167): Error in deregister parent vdi for VM: -2135228404
Command:
VBoxManage -q closemedium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi" 
Output:
VBoxManage: error: Cannot close medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi' because it has 2 child media
VBoxManage: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium, callee nsISupports
VBoxManage: error: Context: "Close()" at line 1875 of file VBoxManageDisk.cpp

2024-02-16 03:19:05 (32167): Could not create VM
2024-02-16 03:19:05 (32167): ERROR: VM failed to start
2024-02-16 03:19:05 (32167): Powering off VM.
2024-02-16 03:19:05 (32167): Deregistering VM. (boinc_d0135c6cd87fd305, slot#12)
2024-02-16 03:19:05 (32167): Removing network bandwidth throttle group from VM.
2024-02-16 03:19:05 (32167): Removing VM from VirtualBox.


and then from the VM trace log:

2024-02-16 03:19:00 (32167): 
Command: VBoxManage -q storageattach "boinc_d0135c6cd87fd305" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi" 
Exit Code: -2135228409
Output:
VBoxManage: error: Cannot attach medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi': the media type 'MultiAttach' can only be attached to machines that were created with VirtualBox 4.0 or later
VBoxManage: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component SessionMachine, interface IMachine, callee nsISupports
VBoxManage: error: Context: "AttachDevice(Bstr(pszCtl).raw(), port, device, DeviceType_HardDisk, pMedium2Mount)" at line 785 of file VBoxManageStorageController.cpp

2024-02-16 03:19:00 (32167): 
Command: VBoxManage -q closemedium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi" 
Exit Code: -2135228404
Output:
VBoxManage: error: Cannot close medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/Theory_2023_12_13.vdi' because it has 2 child media
VBoxManage: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium, callee nsISupports
VBoxManage: error: Context: "Close()" at line 1875 of file VBoxManageDisk.cpp

Note in particular the line that says "...can only be attached to machines that were created with VirtualBox 4.0 or later", which is very strange because I am running version 7.0.12
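
As a side note, the negative exit codes in the trace are the same values as the hex codes in the "Details" lines, read as signed 32-bit integers. A small sketch of the conversion (the helper name `to_hex32` is made up for this example):

```python
def to_hex32(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex value."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


closemedium_code = to_hex32(-2135228404)    # exit code of the closemedium call
storageattach_code = to_hex32(-2135228409)  # exit code of the storageattach call
print(closemedium_code)
print(storageattach_code)
```

This prints 0x80bb000c and 0x80bb0007, which are the codes shown next to VBOX_E_OBJECT_IN_USE and VBOX_E_INVALID_OBJECT_STATE in the output above.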
ID: 49580
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,696,840
RAC: 24,696
Message 49581 - Posted: 16 Feb 2024, 10:02:49 UTC - in response to Message 49580.  


Note in particular the line that says "...can only be attached to machines that were created with VirtualBox 4.0 or later", which is very strange because I am running version 7.0.12

This line has shown up often in the past.
I think it must be a problem with multiattach.
Only CERN IT (Laurence and the team) can find a solution.
ID: 49581
greg_be

Joined: 28 Dec 08
Posts: 334
Credit: 4,833,247
RAC: 1,897
Message 49602 - Posted: 19 Feb 2024, 19:47:48 UTC
Last modified: 19 Feb 2024, 19:49:42 UTC

What is this?
2024-02-19 16:20:37 (21232): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))
2024-02-19 16:20:45 (21232): Guest Log: 00:57:41.410265 timesync vgsvcTimeSyncWorker: Radical host time change: 2 046 321 000 000ns (HostNow=1 708 356 041 695 000 000 ns HostLast=1 708 353 995 374 000 000 ns)
2024-02-19 16:20:52 (21232): Guest Log: 00:57:51.417034 timesync vgsvcTimeSyncWorker: Radical guest time change: 2 046 347 637 000ns (GuestNow=1 708 356 051 728 222 000 ns GuestLast=1 708 354 005 380 585 000 ns fSetTimeLastLoop=true )
2024-02-19 17:13:25 (21232): Status Report: Job Duration: '864000.000000'
2024-02-19 17:13:25 (21232): Status Report: Elapsed Time: '6004.483085'
2024-02-19 17:13:25 (21232): Status Report: CPU Time: '6404.296875'
2024-02-19 17:36:43 (21232): Guest Log: job: run exitcode=0
2024-02-19 17:36:43 (21232): Guest Log: job: diskusage=4132
2024-02-19 17:36:43 (21232): Guest Log: job: logsize=72 k
2024-02-19 17:36:43 (21232): Guest Log: job: times=
2024-02-19 17:36:43 (21232): Guest Log: 0m0.008s 0m0.012s
2024-02-19 17:36:43 (21232): Guest Log: 128m8.477s 0m45.881s
2024-02-19 17:36:43 (21232): Guest Log: job: cpuusage=7734
2024-02-19 17:36:43 (21232): Guest Log: 17:36:43 CET +01:00 2024-02-19: cranky: [INFO] Container 'runc' finished with status code 0.
2024-02-19 17:36:43 (21232): Guest Log: 17:36:43 CET +01:00 2024-02-19: cranky: [INFO] Preparing output.
2024-02-19 17:36:43 (21232): Guest Log: 17:36:43 CET +01:00 2024-02-19: cranky: [ERROR] No output found.
2024-02-19 17:36:43 (21232): Guest Log: [ERROR] Job Failed
2024-02-19 17:36:43 (21232): Guest Log: [INFO] Shutting Down.
2024-02-19 17:36:43 (21232): VM Completion File Detected.
2024-02-19 17:36:43 (21232): VM Completion Message: Job Failed

Radical host time change???
What burped an hour in to make it fail?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=406272110

You will see it stop, power off, and then restart. I run other projects as well, so this task's time was up for the moment and another project started in its place. Then it restarted, ran for an hour, and died.
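
For reference, the size of the jump reported in the "Radical host time change" line can be checked from its two host timestamps. This is a minimal sketch that only does the arithmetic and says nothing about the cause. `grouped_int` is a helper name made up for the example.

```python
import re
from datetime import timedelta

line = (
    "Radical host time change: 2 046 321 000 000ns "
    "(HostNow=1 708 356 041 695 000 000 ns HostLast=1 708 353 995 374 000 000 ns)"
)


def grouped_int(text: str) -> int:
    """Turn a space-grouped number such as '1 708 356' into an int."""
    return int(text.replace(" ", ""))


# Both timestamps are nanoseconds, written in groups of three digits.
host_now = grouped_int(re.search(r"HostNow=([\d ]+?) ns", line).group(1))
host_last = grouped_int(re.search(r"HostLast=([\d ]+?) ns", line).group(1))

jump_ns = host_now - host_last
jump = timedelta(microseconds=jump_ns // 1000)
print(jump_ns)
print(jump)
```

This prints 2046321000000 and 0:34:06.321000, so the reported jump is roughly 34 minutes.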
ID: 49602
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1409
Credit: 9,325,730
RAC: 9,392
Message 49603 - Posted: 19 Feb 2024, 20:38:41 UTC - in response to Message 49602.  
Last modified: 19 Feb 2024, 20:44:56 UTC

Radical host time change???
What burped an hour in to make it fail?

https://lhcathome.cern.ch/lhcathome/result.php?resultid=406272110

You will see it stop, power off, and then restart. I run other projects as well, so this task's time was up for the moment and another project started in its place. Then it restarted, ran for an hour, and died.

That was not why the task failed. The workunit was created in the period between 12 Feb and about 17 Feb, 23:00 CET.
All tasks created in that period fail, and so do their resends, including resends sent out after that period.

The radical time change is reported because the Linux VM always uses UTC, while your Windows host uses your local time without reference to UTC.
You can prevent this as follows:
On 64-bit Windows, open regedit and browse to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\TimeZoneInformation. Create a new QWORD entry called RealTimeIsUniversal and set its value to 1. Reboot the system. The clock should now be in UTC.
ID: 49603
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2519
Credit: 250,938,919
RAC: 127,891
Message 49788 - Posted: 18 Mar 2024, 12:43:51 UTC

Today I had another phojet task continuously eating up all available RAM.
So far >60 GB RAM within less than 30 min runtime.

runRivet.log shows >35000 lines like this:
0 events processed

and very few lines like this:
Rivet.AnalysisHandler: WARN Sub-event weight list has 2000 elements: are the weight numbers correctly set in the input events?
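
Counting the two kinds of lines in such a log is easy to script. A rough sketch follows; the function name `summarize_rivet_log` and the counter keys are made up for this example, and the sample data is built from the two lines quoted above.

```python
from collections import Counter


def summarize_rivet_log(lines) -> Counter:
    """Count the line types mentioned above in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        text = line.strip()
        if text == "0 events processed":
            counts["zero_events"] += 1
        elif text.startswith("Rivet.AnalysisHandler: WARN Sub-event weight list"):
            counts["weight_warnings"] += 1
        else:
            counts["other"] += 1
    return counts


sample = ["0 events processed\n"] * 3 + [
    "Rivet.AnalysisHandler: WARN Sub-event weight list has 2000 elements: "
    "are the weight numbers correctly set in the input events?\n"
]
counts = summarize_rivet_log(sample)
print(counts["zero_events"], counts["weight_warnings"], counts["other"])
```

For a real log, an open file object can be passed instead of the sample list. Where runRivet.log is located depends on the task's working directory, which is not given here.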
ID: 49788
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2519
Credit: 250,938,919
RAC: 127,891
Message 49789 - Posted: 19 Mar 2024, 21:29:22 UTC
Last modified: 19 Mar 2024, 21:34:10 UTC

Just killed another rogue phojet eating up >60GB RAM.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=408012071

Theory_2687-2528715-1157_1
21:49:00 CET +01:00 2024-03-19: cranky: [INFO] mcplots runspec: boinc pp jets 13000 430 - phojet 1.12a default 100000 1157
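
Spotting such a task before it fills the RAM can be scripted on a Linux host by reading /proc. The sketch below rests on several assumptions: that the process shows up under the name "rivetvm.exe" (the name reported for the January task), that an 8 GiB limit is a sensible threshold (an arbitrary value chosen for illustration), and that reporting is enough, so it does not kill anything. All function names are made up for the example.

```python
import os


def rss_kib(status_text: str) -> int:
    """Return VmRSS in KiB from the text of /proc/<pid>/status (0 if absent)."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0


def process_name(status_text: str) -> str:
    """Return the value of the 'Name:' field, or an empty string."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            parts = line.split(None, 1)
            return parts[1].strip() if len(parts) > 1 else ""
    return ""


def find_large_processes(name: str, limit_gib: float, proc_root: str = "/proc"):
    """Yield (pid, rss_gib) for processes called `name` above the limit."""
    if not os.path.isdir(proc_root):
        return  # not a Linux-style /proc, nothing to scan
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process vanished or is not readable
        if process_name(text) != name:
            continue
        rss_gib = rss_kib(text) / (1024 * 1024)
        if rss_gib > limit_gib:
            yield int(entry), rss_gib


# Assumed process name and an arbitrary limit, see the note above.
for pid, gib in find_large_processes("rivetvm.exe", limit_gib=8.0):
    print(f"pid {pid}: {gib:.1f} GiB resident")
```

Run periodically (for example from cron), this would list matching processes above the limit. Whether to stop the task is left to the operator.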
ID: 49789
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,696,840
RAC: 24,696
Message 49950 - Posted: 15 Apr 2024, 22:35:22 UTC
Last modified: 15 Apr 2024, 22:37:12 UTC

ID: 49950
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,696,840
RAC: 24,696
Message 50035 - Posted: 25 Apr 2024, 7:44:49 UTC - in response to Message 49950.  

cvmfs_config probe sft.cern.ch failed
Theory https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=222398066
ID: 50035
M0CZY

Joined: 27 Apr 24
Posts: 10
Credit: 563,349
RAC: 1,808
Message 50456 - Posted: 26 Jun 2024, 13:40:39 UTC
Last modified: 26 Jun 2024, 13:47:53 UTC

Until recently my native Theory workunits ran fine, but now they are failing with
11:18:08 BST +01:00 2024-06-26: cranky-0.1.4: [INFO] Can't find 'runc'.
11:18:08 BST +01:00 2024-06-26: cranky-0.1.4: [ERROR] Major requirements are missing. Can't run this task.
11:18:08 BST +01:00 2024-06-26: cranky-0.1.4: [INFO] Early shutdown initiated due to previous errors.
11:18:08 BST +01:00 2024-06-26: cranky-0.1.4: [INFO] Cleanup will take a few minutes...
11:30:52 (12643): cranky exited; CPU time 0.362101
11:30:52 (12643): app exit status: 0xce
11:30:52 (12643): called boinc_finish(195)


I installed runc version 1.1.13, but now my workunits are failing with
14:28:08 BST +01:00 2024-06-26: cranky-0.1.4: [INFO] Found a local runc version 1.1.13.
14:28:08 BST +01:00 2024-06-26: cranky-0.1.4: [ERROR] Major requirements are missing. Can't run this task.
14:28:08 BST +01:00 2024-06-26: cranky-0.1.4: [INFO] Early shutdown initiated due to previous errors.
14:28:08 BST +01:00 2024-06-26: cranky-0.1.4: [INFO] Cleanup will take a few minutes...

I've no idea what is wrong here. It's a shame that the stderr.txt doesn't tell you how to fix the problem.
ID: 50456
maeax

Joined: 2 May 07
Posts: 2220
Credit: 173,696,840
RAC: 24,696
Message 50457 - Posted: 26 Jun 2024, 14:38:06 UTC - in response to Message 50456.  

I've no idea what is wrong here. It's a shame that the stderr.txt doesn't tell you how to fix the problem.

Have you searched here for runc or cgroup?
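
While searching, it may help to collect a few basic facts about the host first. The sketch below is not what cranky checks (the thread does not say which requirements it tests). It only looks for a runc binary on PATH and for the cgroup v2 marker file, and the function name `check_prerequisites` is made up for this example.

```python
import os
import shutil


def check_prerequisites(cgroup_root: str = "/sys/fs/cgroup") -> dict:
    """Collect a few host facts that are commonly relevant for running runc."""
    return {
        # Full path of the runc binary found on PATH, or None if there is none.
        "runc_path": shutil.which("runc"),
        # Whether the cgroup mount point exists at all.
        "cgroup_mounted": os.path.isdir(cgroup_root),
        # cgroup v2 (unified hierarchy) exposes this file at the mount root.
        "cgroup_v2": os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers")),
    }


for key, value in check_prerequisites().items():
    print(f"{key}: {value}")
```

Posting this output together with the stderr.txt excerpt gives helpers more to work with than the error lines alone.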
ID: 50457