Message boards :
LHCb Application :
New version v1.05
Author | Message |
---|---|
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
This new version uses openhtc.io for CVMFS and can use multiple cores. |
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
Interesting, how does it work exactly? Is one job calculated by one or more cores (like Atlas), or are several jobs calculated by one core (like Theory)? Does it affect RAM requirements? Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Joined: 15 Jun 08 Posts: 2534 Credit: 254,120,084 RAC: 53,273 |
"Interesting, how does it work exactly?" Like Theory. Each VM runs as many jobs as it has cores configured. The jobs run independently of each other, except that they share data from the VM's CVMFS. As the scientific app is still the same, a 1-core setup behaves like the old app, so the minimum RAM setting is still 2048 MB. Each additional job (core) requests an additional 1300 MB of RAM. The design time limit remains 12 h. The VM will then shut down 10 min after the last job has finished. |
Joined: 18 Dec 15 Posts: 1815 Credit: 118,664,514 RAC: 39,380 |
"Design time limit remains 12 h." I downloaded the new version plus 1 task. The first 1% took 11 minutes, which, if progress is linear, would mean a total processing time of 1100 minutes, or 18+ hours. |
Joined: 14 Jan 10 Posts: 1419 Credit: 9,474,701 RAC: 2,980 |
Correct! "Design time limit remains 12 h." "I downloaded the new version plus 1 task. For the first 1%, it took 11 minutes. Which, if the progress is linear, will end up in a total processing time of 1100 minutes or 18+ hours." As computezrmle wrote, once the job that is running at the 12-hour mark has finished, the VM shuts down about 10 minutes later. If a job is still running more than 6 hours after the 12-hour mark, or keeps running for other reasons, the wrapper stops the VM after 18 hours to avoid wasted crunching. This 18-hour maximum is used to calculate the remaining run time. |
Joined: 18 Dec 15 Posts: 1815 Credit: 118,664,514 RAC: 39,380 |
Hm, actually, the task finished after about 25 minutes. |
Joined: 14 Jan 10 Posts: 1419 Credit: 9,474,701 RAC: 2,980 |
"hm, actually, the task got finished after about 25 minutes" If the VM does not get a new job within 10 minutes of finishing one, it shuts down gracefully and reports a valid task. If the VM has not run a single job, it shuts down after 10 minutes and BOINC reports an error. |
Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0 |
This LHCb task looked to be running fine but, on closer inspection, it stopped doing any work after about 7hrs runtime yet failed to terminate until after the 18hr cutoff. Only 1hr actual CPU time reported of the 18hr Elapsed.
From the end of the stderr.log:
2018-09-18 08:30:33 (9960): Guest Log: [INFO] Condor JobID: 470428.110 in slot1
2018-09-18 08:30:44 (9960): Guest Log: [INFO] Starting pilot in slot1
2018-09-18 08:52:39 (9960): Guest Log: [INFO] Job finished in slot1 with .
2018-09-18 09:04:34 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 09:04:34 (9960): Status Report: Elapsed Time: '24002.510773'
2018-09-18 09:04:34 (9960): Status Report: CPU Time: '3301.468750'
2018-09-18 10:44:40 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 10:44:40 (9960): Status Report: Elapsed Time: '30002.510773'
2018-09-18 10:44:40 (9960): Status Report: CPU Time: '3395.187500'
2018-09-18 12:24:46 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 12:24:46 (9960): Status Report: Elapsed Time: '36002.510773'
2018-09-18 12:24:46 (9960): Status Report: CPU Time: '3487.781250'
2018-09-18 14:04:53 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 14:04:53 (9960): Status Report: Elapsed Time: '42002.510773'
2018-09-18 14:04:53 (9960): Status Report: CPU Time: '3576.890625'
2018-09-18 15:44:59 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 15:44:59 (9960): Status Report: Elapsed Time: '48002.510773'
2018-09-18 15:44:59 (9960): Status Report: CPU Time: '3666.281250'
2018-09-18 17:25:05 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 17:25:05 (9960): Status Report: Elapsed Time: '54002.510773'
2018-09-18 17:25:05 (9960): Status Report: CPU Time: '3756.390625'
2018-09-18 19:05:12 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 19:05:12 (9960): Status Report: Elapsed Time: '60002.510773'
2018-09-18 19:05:12 (9960): Status Report: CPU Time: '3844.156250'
2018-09-18 20:25:16 (9960): Powering off VM.
2018-09-18 20:30:17 (9960): VM did not power off when requested.
2018-09-18 20:30:17 (9960): VM was successfully terminated.
2018-09-18 20:30:17 (9960): Deregistering VM. (boinc_4de38852e270c880, slot#2)
2018-09-18 20:30:17 (9960): Removing network bandwidth throttle group from VM.
2018-09-18 20:30:17 (9960): Removing VM from VirtualBox.
20:30:22 (9960): called boinc_finish(0)
Last few lines of the Machine logs:
09/18/18 09:52:39 (pid:9352) Process exited, pid=9365, status=0
09/18/18 09:54:03 (pid:9352) attempt to connect to <128.142.142.167:9618> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (173 to go).
09/18/18 09:54:44 (pid:9352) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43343818
09/18/18 10:13:39 (pid:9352) CCBListener: failed to receive message from CCB server vccondor01.cern.ch
09/18/18 10:13:39 (pid:9352) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/18/18 10:15:33 (pid:9352) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43343818
I reported on -dev that tasks only had about 1/3 CPU time versus Elapsed but didn't look closer at those logs. Maybe those ones did similar to this one? Going for a nosey to compare.
[Multiple Edits] On those -dev tasks, 1 shows jobs running throughout but only used 85mins of 12hrs and another had 7hrs+ CPU time of 18hrs but 9hrs of no activity. Over there and here, all tasks have run and returned jobs (hopefully successfully) but don't seem to self-terminate when they fail to get subsequent work, so they sit idle for the remaining time. I've set "Switch Between Apps" sufficiently high that tasks should run uninterrupted, so it's not a suspend/resume issue. Hoping this is useful for diagnostics. |
Joined: 15 Jun 08 Posts: 2534 Credit: 254,120,084 RAC: 53,273 |
LHCb should have been working fine since about 15:15 UTC. WUs that started before then may be affected by unspecified issues. |
Joined: 28 Sep 04 Posts: 728 Credit: 49,139,423 RAC: 29,601 |
Got 52 tasks last night. Estimated runtime is 28 minutes each, actual 12-18 hours. Tasks are configured to run using 4 cores. No way to finish them all in time if jobs are continuously available. Anyway I let them run and those not started by deadline will be auto-aborted. |
Joined: 18 Dec 15 Posts: 1815 Credit: 118,664,514 RAC: 39,380 |
I have several LHCb tasks that have been running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes. Is this by design, or are these tasks faulty? |
Joined: 15 Jun 08 Posts: 2534 Credit: 254,120,084 RAC: 53,273 |
Sad to say, it's not really working. The jobs inside the VM are missing some data from or a connection to a backend system at CERN. Thus they stand by until a timeout shuts them down and the VM requests another job. I set my systems to NNT. |
Joined: 18 Dec 15 Posts: 1815 Credit: 118,664,514 RAC: 39,380 |
"Sad to say, it's not really working." We seem to be having bad times with LHC lately: - Theory not working - LHCb not working - No Sixtrack for a long time - only ATLAS seems to function the way it's supposed to. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Sad to say, it's not really working. How do you know they are merely standing by and not doing any work? Are there any such indications in the running.log or finished.log files? |
Joined: 18 Dec 15 Posts: 1815 Credit: 118,664,514 RAC: 39,380 |
"How do you know they are merely standing by and not doing any work?" I guess this should be evidence enough: "I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes." |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
"How do you know they are merely standing by and not doing any work?" "I guess this should be evidence enough: I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes." Yes. But is there anything in the running.log or finished.log files? |
Joined: 15 Jun 08 Posts: 2534 Credit: 254,120,084 RAC: 53,273 |
If you complete the picture with - router monitoring - proxy monitoring (if you have one) - host monitoring - VM's console output (job's starting/ending time, top overview) - ... then you get an idea of what a successful task pattern looks like and how typical errors affect that pattern. |
Joined: 18 Dec 15 Posts: 1815 Credit: 118,664,514 RAC: 39,380 |
I forgot to kill a running LHCb task on one of my PCs, so I checked the properties a minute ago: runtime 12+ hours, processor time about 1 hour. These tasks are definitely faulty. |
Joined: 28 Sep 04 Posts: 728 Credit: 49,139,423 RAC: 29,601 |
"Got 52 tasks last night. Estimated runtime is 28 minutes each, actual 12-18 hours. Tasks are configured to run using 4 cores. No way to finish them all in time if jobs are continuously available. Anyway I let them run and those not started by deadline will be auto-aborted." Referring to my earlier post: as there are now no jobs available for these tasks, the estimated runtime may well turn out to be realistic. The tasks are now all failing after about 21 minutes with EXIT_NO_SUB_TASKS, so maybe I can return them before the deadline. Sadly, most of them will be returned with an error :-( |
Joined: 28 Sep 04 Posts: 6 Credit: 286,154 RAC: 0 |
Weird Things in lhcb 1.05 log:
2018-09-28 22:16:21 (6464): Detected: BOINC client v7.7
2018-09-28 22:16:22 (6464): Detected: VirtualBox VboxManage Interface (Version: 5.2.8)
2018-09-28 22:16:22 (6464): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2018-09-28 22:16:22 (6464): Successfully copied 'init_data.xml' to the shared directory.
2018-09-28 22:16:23 (6464): Create VM. (boinc_58488727ebc83fb2, slot#9)
2018-09-28 22:16:23 (6464): Setting Memory Size for VM. (11148MB)
2018-09-28 22:16:24 (6464): Setting CPU Count for VM. (8)
[snip]
2018-09-28 22:16:33 (6464): Successfully started VM. (PID = '18968')
2018-09-28 22:16:33 (6464): Reporting VM Process ID to BOINC.
2018-09-28 22:16:33 (6464): Guest Log: BIOS: VirtualBox 5.2.8
2018-09-28 22:21:28 (6464): Guest Log: [INFO] Condor JobID: 473308.169 in slot1
2018-09-28 22:21:28 (6464): Guest Log: [INFO] New Job Starting in slot2
2018-09-28 22:21:28 (6464): Guest Log: [INFO] Condor JobID: 473315.147 in slot2
2018-09-28 22:21:42 (6464): Guest Log: [INFO] Starting pilot in slot1
2018-09-28 22:21:42 (6464): Guest Log: [INFO] Starting pilot in slot2
2018-09-28 22:41:27 (6464): Guest Log: [INFO] Job finished in slot1 with .
2018-09-28 22:41:27 (6464): Guest Log: [INFO] Job finished in slot2 with .
2018-09-28 22:41:46 (6464): Guest Log: [IN[FION]F ON]e wN eJwo bJ oSbt aSrttairntgin g iin ns lsolto3t
2018-09-28 22:41:46 (6464): Guest Log: [IN[FION]F ON]e wN eJwo bJ oSbt aSrttairntgin g iin ns lsolto3t4
2018-09-28 22:41:46 (6464): Guest Log: [INFO] Condor JobID: 473325.157 in slot3
2018-09-28 22:41:46 (6464): Guest Log: [INFO] Condor JobID: 473325.159 in slot4
2018-09-28 22:41:56 (6464): Guest Log: [IN[FOI]N FSOt]a rSttiangr tpiinlgo tp iilno ts liont s3lo
2018-09-28 23:01:32 (6464): Guest Log: [INFO] Job finished in slot3 with .
2018-09-28 23:01:32 (6464): Guest Log: [INFO] Job finished in slot4 with .
2018-09-28 23:01:48 (6464): Guest Log: [INFO] New Job Starting in slot6
2018-09-28 23:01:48 (6464): Guest Log: [INFO] New Job Starting in slot1
2018-09-28 23:01:48 (6464): Guest Log: [INFO] Condor JobID: 473325.506 in slot6
2018-09-28 23:01:48 (6464): Guest Log: [INFO] Condor JobID: 473325.502 in slot1
2018-09-28 23:01:58 (6464): Guest Log: [I[NIFNOF]O ]S tSatratritnign gp ipliolto ti ni ns lsolto1t
2018-09-28 23:01:58 (6464): Guest Log: [I[NIFNOF]O ]S tSatratritnign gp ipliolto ti ni ns lsolto1t6
2018-09-28 23:21:28 (6464): Guest Log: [INFO] New Job Starting in slot4
2018-09-28 23:21:28 (6464): Guest Log: [INFO] New Job Starting in slot2
2018-09-28 23:21:29 (6464): Guest Log: [IN[FOI]N FCOo]n dCoorn dJoorb IJDo:b I D4:7 3 342753.382759. 8i8n4 silno ts2l
|
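For anyone who wants to check their own tasks against the numbers in this thread, here is a rough Python sketch. It assumes the RAM rule described above (2048 MB for the first core plus 1300 MB for each additional core) and the "Status Report" line format shown in the stderr.log excerpts. The function names and the regular expression are made up for illustration and are not part of BOINC or vboxwrapper.

```python
import re

# Figures quoted in the thread; treat them as assumptions for v1.05 only.
BASE_RAM_MB = 2048            # minimum RAM for a 1-core VM
EXTRA_RAM_PER_CORE_MB = 1300  # requested for each additional job (core)

def vm_memory_mb(cores: int) -> int:
    """RAM the VM would request for the given number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return BASE_RAM_MB + EXTRA_RAM_PER_CORE_MB * (cores - 1)

# Hypothetical pattern matching the "Status Report" lines shown in stderr.log.
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")

def cpu_efficiency(log_text: str) -> float:
    """CPU time divided by elapsed time, using the last status report in the log."""
    latest = {}
    for key, value in STATUS_RE.findall(log_text):
        latest[key] = float(value)  # later reports overwrite earlier ones
    return latest["CPU Time"] / latest["Elapsed Time"]

if __name__ == "__main__":
    print(vm_memory_mb(8))
    sample = (
        "2018-09-18 19:05:12 (9960): Status Report: Elapsed Time: '60002.510773' "
        "2018-09-18 19:05:12 (9960): Status Report: CPU Time: '3844.156250'"
    )
    print(f"{cpu_efficiency(sample):.1%}")
```

With 8 cores the formula gives 11148 MB, which matches the "Setting Memory Size for VM. (11148MB)" line in the last log. A ratio far below 1 for a single-core VM, as in the sample above, suggests the VM was mostly idle; for a multi-core VM the ratio would have to be divided by the core count to be comparable.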
©2024 CERN