Thread 'New version v1.05'

Author	Message
Laurence Project administrator Project developer Send message Joined: 20 Jun 14 Posts: 431 Credit: 256,248 RAC: 59	Message 36745 - Posted: 17 Sep 2018, 12:34:22 UTC This new version uses openhtc.io for CVMFS and can use multiple cores. ID: 36745 · Reply Quote

djoser Send message Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0	Message 36746 - Posted: 17 Sep 2018, 13:11:34 UTC Interesting, how does it work exactly? Is one job calculated by one or more cores (like Atlas), or are several jobs calculated by one core (like Theory)? Does it affect RAM requirements? Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us ID: 36746 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2755 Credit: 304,271,457 RAC: 116,232	Message 36748 - Posted: 17 Sep 2018, 15:07:41 UTC - in response to Message 36746. Interesting, how does it work exactly? Is one job calculated by one or more cores (like Atlas), or are several jobs calculated by one core (like Theory)? Does it affect RAM requirements? Like Theory. Each VM runs as many jobs as it has cores configured. Jobs run independently of each other, except that they share data from the VM's CVMFS. As the scientific app is still the same, a 1-core setup would behave like the old app. Thus the minimum RAM setting is still 2048 MB. Each additional job (core) requests an additional 1300 MB of RAM. The design time limit remains 12 h. The VM will then shut down 10 min after the last job has finished. ID: 36748 · Reply Quote
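For anyone sizing a host, the RAM rule above can be written down as a small calculation. This is only a sketch of the arithmetic as stated in the post (2048 MB base, plus 1300 MB per additional core); the function and constant names are made up for illustration and are not part of the LHCb app or vboxwrapper.

```python
# Sketch of the RAM rule described above. The names are hypothetical;
# only the two numbers (2048 MB, 1300 MB) come from the post.
BASE_RAM_MB = 2048            # minimum RAM setting for a 1-core VM
EXTRA_RAM_PER_CORE_MB = 1300  # requested for each additional job (core)


def vm_ram_mb(cores: int) -> int:
    """Return the RAM in MB a VM with the given core count would request."""
    if cores < 1:
        raise ValueError("a VM needs at least one core")
    return BASE_RAM_MB + (cores - 1) * EXTRA_RAM_PER_CORE_MB


if __name__ == "__main__":
    for cores in (1, 2, 4, 8):
        print(f"{cores} core(s): {vm_ram_mb(cores)} MB")
```

For 8 cores this gives 11148 MB, which agrees with the "Setting Memory Size for VM. (11148MB)" line in the 8-core log posted later in this thread.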

Erich56 Send message Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 36749 - Posted: 17 Sep 2018, 16:48:51 UTC - in response to Message 36748. Design time limit remains 12 h. I downloaded the new version plus 1 task. For the first 1%, it took 11 minutes. Which, if the progress is linear, will end up in a total processing time of 1100 minutes or 18+ hours. ID: 36749 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 36750 - Posted: 17 Sep 2018, 17:34:25 UTC - in response to Message 36749. Last modified: 17 Sep 2018, 17:36:03 UTC Design time limit remains 12 h. I downloaded the new version plus 1 task. For the first 1%, it took 11 minutes. Which, if the progress is linear, will end up in a total processing time of 1100 minutes or 18+ hours. Correct! As computezrmle wrote, once the job that is running at the 12-hour mark has finished, the VM will be shut down after about 10 minutes. If a job runs for more than 6 hours past that 12-hour mark, or keeps running for other reasons, the wrapper stops the VM after 18 hours to avoid wasted crunching. This 18-hour maximum is used to calculate the remaining run time. ID: 36750 · Reply Quote
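The two numbers in this exchange fit together, and a short calculation shows how. The sketch below assumes the progress bar is simply elapsed time divided by the 18-hour wrapper limit; that is an assumption based on the explanation above, not something taken from the wrapper source. All names are hypothetical.

```python
# Sketch only: assumes progress = elapsed / 18 h limit, as suggested above.
WRAPPER_LIMIT_S = 18 * 3600  # 18-hour cutoff enforced by the wrapper


def projected_total_minutes(elapsed_min: float, fraction_done: float) -> float:
    """Linear extrapolation of total run time from the progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_min / fraction_done


def fraction_of_limit(elapsed_s: float, limit_s: float = WRAPPER_LIMIT_S) -> float:
    """Progress if it is purely time-based against the wrapper limit."""
    return min(elapsed_s / limit_s, 1.0)


if __name__ == "__main__":
    total = projected_total_minutes(11, 0.01)
    print(f"projected: {total:.0f} min ({total / 60:.1f} h)")
    print(f"progress after 11 min: {fraction_of_limit(11 * 60):.1%}")
```

Eleven minutes at 1% extrapolates to about 1100 minutes (roughly 18.3 hours), and 11 minutes out of 18 hours is about 1%. So under this assumption the progress bar says nothing about how much work the jobs have actually done.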

Erich56 Send message Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 36751 - Posted: 17 Sep 2018, 18:30:18 UTC - in response to Message 36750. hm, actually, the task got finished after about 25 minutes ID: 36751 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 36753 - Posted: 18 Sep 2018, 13:09:14 UTC - in response to Message 36751. hm, actually, the task got finished after about 25 minutes If the VM does not get a new job within 10 minutes after finishing one, it will shut down gracefully and report a valid task. If the VM did not run a single job, it will shut down after 10 minutes and BOINC will report an error. ID: 36753 · Reply Quote

Ray Murray Volunteer moderator Send message Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 36759 - Posted: 18 Sep 2018, 19:36:40 UTC Last modified: 18 Sep 2018, 20:08:05 UTC This LHCb task looked to be running fine but, on closer inspection, it stopped doing any work after about 7hrs runtime yet failed to terminate until after the 18hr cutoff. Only 1hr actual CPU time reported of the 18hr Elapsed.

From end of the stderr.log:
2018-09-18 08:30:33 (9960): Guest Log: [INFO] Condor JobID: 470428.110 in slot1
2018-09-18 08:30:44 (9960): Guest Log: [INFO] Starting pilot in slot1
2018-09-18 08:52:39 (9960): Guest Log: [INFO] Job finished in slot1 with .
2018-09-18 09:04:34 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 09:04:34 (9960): Status Report: Elapsed Time: '24002.510773'
2018-09-18 09:04:34 (9960): Status Report: CPU Time: '3301.468750'
2018-09-18 10:44:40 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 10:44:40 (9960): Status Report: Elapsed Time: '30002.510773'
2018-09-18 10:44:40 (9960): Status Report: CPU Time: '3395.187500'
2018-09-18 12:24:46 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 12:24:46 (9960): Status Report: Elapsed Time: '36002.510773'
2018-09-18 12:24:46 (9960): Status Report: CPU Time: '3487.781250'
2018-09-18 14:04:53 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 14:04:53 (9960): Status Report: Elapsed Time: '42002.510773'
2018-09-18 14:04:53 (9960): Status Report: CPU Time: '3576.890625'
2018-09-18 15:44:59 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 15:44:59 (9960): Status Report: Elapsed Time: '48002.510773'
2018-09-18 15:44:59 (9960): Status Report: CPU Time: '3666.281250'
2018-09-18 17:25:05 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 17:25:05 (9960): Status Report: Elapsed Time: '54002.510773'
2018-09-18 17:25:05 (9960): Status Report: CPU Time: '3756.390625'
2018-09-18 19:05:12 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 19:05:12 (9960): Status Report: Elapsed Time: '60002.510773'
2018-09-18 19:05:12 (9960): Status Report: CPU Time: '3844.156250'
2018-09-18 20:25:16 (9960): Powering off VM.
2018-09-18 20:30:17 (9960): VM did not power off when requested.
2018-09-18 20:30:17 (9960): VM was successfully terminated.
2018-09-18 20:30:17 (9960): Deregistering VM. (boinc_4de38852e270c880, slot#2)
2018-09-18 20:30:17 (9960): Removing network bandwidth throttle group from VM.
2018-09-18 20:30:17 (9960): Removing VM from VirtualBox.
20:30:22 (9960): called boinc_finish(0)

Last few lines of the Machine logs:
09/18/18 09:52:39 (pid:9352) Process exited, pid=9365, status=0
09/18/18 09:54:03 (pid:9352) attempt to connect to <128.142.142.167:9618> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (173 to go).
09/18/18 09:54:44 (pid:9352) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43343818
09/18/18 10:13:39 (pid:9352) CCBListener: failed to receive message from CCB server vccondor01.cern.ch
09/18/18 10:13:39 (pid:9352) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/18/18 10:15:33 (pid:9352) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43343818

I reported on -dev that tasks only had about 1/3 CPU time versus Elapsed but didn't look closer at those logs. Maybe those ones did similar to this one? Going for a nosey to compare.

[Multiple Edits] On those -dev tasks, 1 shows jobs running throughout but only used 85mins of 12hrs and another had 7hrs+ CPU time of 18hrs but 9hrs of no activity.
Over there and here, all tasks have run and returned jobs (hopefully successfully) but don't seem to self terminate when they fail to get subsequent work so sit idle for the remaining time. I've set "Switch Between Apps" sufficiently high that Tasks should run uninterrupted so it's not a suspend/resume issue. Hoping this is useful for diagnostics. ID: 36759 · Reply Quote
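To spot an idle VM without reading the log by eye, the "Status Report" lines in the stderr.log excerpt above can be compared pairwise. The following is a rough sketch, not a tool that ships with BOINC: the regular expressions are written against the log format as it appears in this post, and the function names are invented for illustration. The sample text is copied from the first two reports above.

```python
import re

# Two consecutive status reports copied from the stderr.log excerpt above.
SAMPLE = """\
2018-09-18 09:04:34 (9960): Status Report: Elapsed Time: '24002.510773'
2018-09-18 09:04:34 (9960): Status Report: CPU Time: '3301.468750'
2018-09-18 10:44:40 (9960): Status Report: Elapsed Time: '30002.510773'
2018-09-18 10:44:40 (9960): Status Report: CPU Time: '3395.187500'
"""

# Patterns assume the quoting style seen in this post's log.
ELAPSED_RE = re.compile(r"Status Report: Elapsed Time: '([\d.]+)'")
CPU_RE = re.compile(r"Status Report: CPU Time: '([\d.]+)'")


def status_reports(log_text: str) -> list[tuple[float, float]]:
    """Return (elapsed_s, cpu_s) pairs in the order they appear in the log."""
    elapsed = [float(v) for v in ELAPSED_RE.findall(log_text)]
    cpu = [float(v) for v in CPU_RE.findall(log_text)]
    return list(zip(elapsed, cpu))


def interval_cpu_share(reports: list[tuple[float, float]]) -> list[float]:
    """CPU seconds used per wall-clock second between consecutive reports."""
    return [
        (c1 - c0) / (e1 - e0)
        for (e0, c0), (e1, c1) in zip(reports, reports[1:])
    ]


if __name__ == "__main__":
    for share in interval_cpu_share(status_reports(SAMPLE)):
        print(f"CPU share in interval: {share:.1%}")
```

Between the first two reports the VM used under 100 s of CPU in 6000 s of wall-clock time, i.e. a share of under 2%, which is consistent with a VM that is sitting idle rather than crunching. The patterns do not depend on line breaks, so the same functions also work on a log pasted as a single line.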

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2755 Credit: 304,271,457 RAC: 116,232	Message 36760 - Posted: 18 Sep 2018, 20:13:46 UTC - in response to Message 36759. LHCb should work fine since about 15:15 UTC. WUs that started before may be affected by unspecific issues. ID: 36760 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 806 Credit: 66,047,456 RAC: 27,780	Message 36779 - Posted: 19 Sep 2018, 15:30:45 UTC Last modified: 19 Sep 2018, 15:35:30 UTC Got 52 tasks last night. Estimated runtime is 28 minutes each, actual 12-18 hours. Tasks are configured to run using 4 cores. No way to finish them all in time if jobs are continuously available. Anyway I let them run and those not started by deadline will be auto-aborted. ID: 36779 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 36837 - Posted: 23 Sep 2018, 15:15:12 UTC I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes. Is this by design, or are these tasks faulty? ID: 36837 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2755 Credit: 304,271,457 RAC: 116,232	Message 36838 - Posted: 23 Sep 2018, 15:40:45 UTC - in response to Message 36837. Sad to say, it's not really working. The jobs inside the VM are missing some data from or a connection to a backend system at CERN. Thus they stand by until a timeout shuts them down and the VM requests another job. I set my systems to NNT. ID: 36838 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 36839 - Posted: 23 Sep 2018, 16:16:14 UTC - in response to Message 36838. Sad to say, it's not really working. we seem to have bad times with LHC lately:
- Theory not working
- LHCb not working
- No Sixtrack for long time
- only ATLAS seems to function the way it's supposed to.
ID: 36839 · Reply Quote

bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36841 - Posted: 23 Sep 2018, 16:43:49 UTC - in response to Message 36838. Last modified: 23 Sep 2018, 16:51:58 UTC Sad to say, it's not really working. The jobs inside the VM are missing some data from or a connection to a backend system at CERN. Thus they stand by until a timeout shuts them down and the VM requests another job. How do you know they are merely standing by and not doing any work? Are there any such indications in the running.log or finished.log files? ID: 36841 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 36844 - Posted: 23 Sep 2018, 18:15:22 UTC - in response to Message 36841. How do you know they are merely standing by and not doing any work? I guess this should be evidence enough: I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes. ID: 36844 · Reply Quote

bronco Send message Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0	Message 36847 - Posted: 23 Sep 2018, 19:36:16 UTC - in response to Message 36844. How do you know they are merely standing by and not doing any work? I guess this should be evidence enough: I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes. Yes. But in the running.log or finished.log files, anything there? ID: 36847 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2755 Credit: 304,271,457 RAC: 116,232	Message 36859 - Posted: 24 Sep 2018, 16:58:33 UTC - in response to Message 36847. If you complete the picture with
- router monitoring
- proxy monitoring (if you have one)
- host monitoring
- VM's console output (job's starting/ending time, top overview)
- ...
then you get an idea of what a successful task pattern looks like and how typical errors affect this pattern. ID: 36859 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1986 Credit: 162,172,797 RAC: 87,016	Message 36861 - Posted: 24 Sep 2018, 18:02:53 UTC - in response to Message 36859. I forgot to kill a running LHCb task on one of my PCs - so I checked the properties a minute ago: Runtime 12+ hours, processor time about 1 hour. These tasks are definitely faulty. ID: 36861 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 806 Credit: 66,047,456 RAC: 27,780	Message 36866 - Posted: 24 Sep 2018, 19:45:50 UTC - in response to Message 36779. Got 52 tasks last night. Estimated runtime is 28 minutes each, actual 12-18 hours. Tasks are configured to run using 4 cores. No way to finish them all in time if jobs are continuously available. Anyway I let them run and those not started by deadline will be auto-aborted. Referring to my earlier post. As we now haven't any jobs available for these tasks, the predicted estimated runtime may very well be a reality. Now these tasks are all failing after about 21 minutes with EXIT_NO_SUB_TASKS and maybe I can return them before deadline. Sadly with an error on most of them :-( ID: 36866 · Reply Quote

=Lupus= Send message Joined: 28 Sep 04 Posts: 6 Credit: 286,154 RAC: 0	Message 36922 - Posted: 29 Sep 2018, 13:57:22 UTC Last modified: 29 Sep 2018, 13:59:39 UTC Weird Things in lhcb 1.05 log: 2018-09-28 22:16:21 (6464): Detected: vboxwrapper 26197 2018-09-28 22:16:21 (6464): Detected: BOINC client v7.7 2018-09-28 22:16:22 (6464): Detected: VirtualBox VboxManage Interface (Version: 5.2.8) 2018-09-28 22:16:22 (6464): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) 2018-09-28 22:16:22 (6464): Successfully copied 'init_data.xml' to the shared directory. 2018-09-28 22:16:23 (6464): Create VM. (boinc_58488727ebc83fb2, slot#9) 2018-09-28 22:16:23 (6464): Setting Memory Size for VM. (11148MB) 2018-09-28 22:16:24 (6464): Setting CPU Count for VM. (8) [snip] 2018-09-28 22:16:33 (6464): Successfully started VM. (PID = '18968') 2018-09-28 22:16:33 (6464): Reporting VM Process ID to BOINC. 2018-09-28 22:16:33 (6464): Guest Log: BIOS: VirtualBox 5.2.8 so far everything ok. work going. now the work log shows me some output stuttering/text doubling. when the output is malicious the code is malicious... 2018-09-28 22:21:27 (6464): Guest Log: [INFO] New Job Starting in slot1 2018-09-28 22:21:28 (6464): Guest Log: [INFO] Condor JobID: 473308.169 in slot1 2018-09-28 22:21:28 (6464): Guest Log: [INFO] New Job Starting in slot2 2018-09-28 22:21:28 (6464): Guest Log: [INFO] Condor JobID: 473315.147 in slot2 2018-09-28 22:21:42 (6464): Guest Log: [INFO] Starting pilot in slot1 2018-09-28 22:21:42 (6464): Guest Log: [INFO] Starting pilot in slot2 2018-09-28 22:41:27 (6464): Guest Log: [INFO] Job finished in slot1 with . 2018-09-28 22:41:27 (6464): Guest Log: [INFO] Job finished in slot2 with . 
2018-09-28 22:41:46 (6464): Guest Log: [IN[FION]F ON]e wN eJwo bJ oSbt aSrttairntgin g iin ns lsolto3t 2018-09-28 22:41:46 (6464): Guest Log: [IN[FION]F ON]e wN eJwo bJ oSbt aSrttairntgin g iin ns lsolto3t4 2018-09-28 22:41:46 (6464): Guest Log: [INFO] Condor JobID: 473325.157 in slot3 2018-09-28 22:41:46 (6464): Guest Log: [INFO] Condor JobID: 473325.159 in slot4 2018-09-28 22:41:56 (6464): Guest Log: [IN[FOI]N FSOt]a rSttiangr tpiinlgo tp iilno ts liont s3lo 2018-09-28 23:01:32 (6464): Guest Log: [INFO] Job finished in slot3 with . 2018-09-28 23:01:32 (6464): Guest Log: [INFO] Job finished in slot4 with . 2018-09-28 23:01:48 (6464): Guest Log: [INFO] New Job Starting in slot6 2018-09-28 23:01:48 (6464): Guest Log: [INFO] New Job Starting in slot1 2018-09-28 23:01:48 (6464): Guest Log: [INFO] Condor JobID: 473325.506 in slot6 2018-09-28 23:01:48 (6464): Guest Log: [INFO] Condor JobID: 473325.502 in slot1 2018-09-28 23:01:58 (6464): Guest Log: [I[NIFNOF]O ]S tSatratritnign gp ipliolto ti ni ns lsolto1t 2018-09-28 23:01:58 (6464): Guest Log: [I[NIFNOF]O ]S tSatratritnign gp ipliolto ti ni ns lsolto1t6 2018-09-28 23:21:28 (6464): Guest Log: [INFO] New Job Starting in slot4 2018-09-28 23:21:28 (6464): Guest Log: [INFO] New Job Starting in slot2 2018-09-28 23:21:29 (6464): Guest Log: [IN[FOI]N FCOo]n dCoorn dJoorb IJDo:b I D4:7 3 342753.382759. 8i8n4 silno ts2l FYI: https://lhcathome.cern.ch/lhcathome/result.php?resultid=207092319 Ah, by the way, they should run on 8 cores. max they run is 4. =Lupus= ID: 36922 · Reply Quote