Message boards : LHCb Application : New version v1.05

Laurence
Project administrator
Project developer
Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 36745 - Posted: 17 Sep 2018, 12:34:22 UTC

This new version uses openhtc.io for CVMFS and can use multiple cores.

djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 36746 - Posted: 17 Sep 2018, 13:11:34 UTC

Interesting, how does it work exactly?
Is one job calculated by one or more cores (like Atlas), or are several jobs calculated by one core (like Theory)?
Does it affect RAM requirements?
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,874,573
RAC: 137,197
Message 36748 - Posted: 17 Sep 2018, 15:07:41 UTC - in response to Message 36746.  

Interesting, how does it work exactly?
Is one job calculated by one or more cores (like Atlas), or are several jobs calculated by one core (like Theory)?
Does it affect RAM requirements?

Like Theory.

Each VM runs as many jobs as it has cores configured.
The jobs run independently of each other, except that they share data from the VM's CVMFS.
As the scientific app is still the same, a 1-core setup would behave like the old app.

Thus the minimum RAM setting is still 2048 MB.
Each additional job (core) requires an additional 1300 MB of RAM.
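
A minimal sketch of that RAM rule, assuming it scales linearly with the core count. The function name `vm_ram_mb` is made up for illustration and is not part of the LHCb app or vboxwrapper:

```python
BASE_RAM_MB = 2048            # minimum RAM for a 1-core VM, as stated above
RAM_PER_EXTRA_CORE_MB = 1300  # extra RAM per additional job (core), as stated above


def vm_ram_mb(cores: int) -> int:
    """Return the RAM in MB a VM with `cores` cores would need under this rule."""
    if cores < 1:
        raise ValueError("a VM needs at least one core")
    return BASE_RAM_MB + (cores - 1) * RAM_PER_EXTRA_CORE_MB


# Print the requirement for a few typical core counts.
for cores in (1, 2, 4, 8):
    print(f"{cores} core(s): {vm_ram_mb(cores)} MB")
```

For an 8-core VM this gives 11148 MB, which is the memory size that shows up in the 8-core log posted later in this thread.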

Design time limit remains 12 h.
The VM will then shut down 10 min after the last job has finished.

Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,921
RAC: 102,416
Message 36749 - Posted: 17 Sep 2018, 16:48:51 UTC - in response to Message 36748.  

Design time limit remains 12 h.

I downloaded the new version plus 1 task. The first 1% took 11 minutes, which, if progress is linear, would mean a total processing time of 1100 minutes, or 18+ hours.

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 36750 - Posted: 17 Sep 2018, 17:34:25 UTC - in response to Message 36749.  
Last modified: 17 Sep 2018, 17:36:03 UTC

Design time limit remains 12 h.
I downloaded the new version plus 1 task. The first 1% took 11 minutes, which, if progress is linear, would mean a total processing time of 1100 minutes, or 18+ hours.

Correct!
As computezrmle wrote, once the job that is running at the 12-hour mark has finished, the VM will shut down after about 10 minutes.
If a job is still running 6 hours after that 12-hour mark, or the VM keeps running for other reasons, the wrapper stops the VM at 18 hours to avoid wasted crunching.
This 18-hour maximum is what is used to calculate the remaining run time.
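
To spell out the arithmetic, here is a small sketch. It assumes the progress bar is simply elapsed time divided by the 18-hour maximum, which is one reading of the explanation above and not something taken from the wrapper's source. The function names are hypothetical.

```python
HARD_LIMIT_MIN = 18 * 60  # 18 h wrapper cutoff from the post above, in minutes


def projected_total_minutes(elapsed_min: float, percent_done: float) -> float:
    """Linear extrapolation: total run time implied by the progress so far."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return elapsed_min * 100 / percent_done


def minutes_per_percent(limit_min: float = HARD_LIMIT_MIN) -> float:
    """Minutes per 1% if progress is just elapsed time over the limit (assumption)."""
    return limit_min / 100


# The observation from the thread: 1% done after 11 minutes.
total = projected_total_minutes(11, 1)
print(f"{total:.0f} min = {total / 60:.1f} h")
print(f"{minutes_per_percent():.1f} min per 1%")
```

Under that assumption 1% takes about 10.8 minutes, close to the 11 minutes reported, so the extrapolated 18+ hours reflects the cutoff and says little about how long the jobs really take.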

Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,921
RAC: 102,416
Message 36751 - Posted: 17 Sep 2018, 18:30:18 UTC - in response to Message 36750.  

Hm, actually, the task finished after about 25 minutes.

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 36753 - Posted: 18 Sep 2018, 13:09:14 UTC - in response to Message 36751.  

hm, actually, the task got finished after about 25 minutes

If the VM does not get a new job within 10 minutes of finishing one, it shuts down gracefully and reports a valid task.
If the VM did not run a single job, it shuts down after 10 minutes and BOINC reports an error.
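
The two cases can be written down as a tiny decision function. This is only a sketch of the rule as described in this post. The name `vm_outcome` and the result labels are invented for the example and do not come from the wrapper.

```python
IDLE_TIMEOUT_MIN = 10  # idle time before the VM shuts down, from the post above


def vm_outcome(jobs_finished: int, idle_min: float) -> str:
    """Classify a VM as 'running', 'valid' or 'error' (labels are illustrative)."""
    if jobs_finished < 0 or idle_min < 0:
        raise ValueError("counts and times must not be negative")
    if idle_min < IDLE_TIMEOUT_MIN:
        return "running"  # still waiting for the next job
    # Idle timeout reached: valid only if at least one job ran.
    return "valid" if jobs_finished > 0 else "error"


print(vm_outcome(jobs_finished=1, idle_min=10))
print(vm_outcome(jobs_finished=0, idle_min=10))
```

That would fit a task ending after about 25 minutes: one short job, then 10 idle minutes without new work.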

Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 1
Message 36759 - Posted: 18 Sep 2018, 19:36:40 UTC
Last modified: 18 Sep 2018, 20:08:05 UTC

This LHCb task looked to be running fine but, on closer inspection, it stopped doing any work after about 7hrs runtime yet failed to terminate until after the 18hr cutoff. Only 1hr actual CPU time reported of the 18hr Elapsed.

From end of the stderr.log:
2018-09-18 08:30:33 (9960): Guest Log: [INFO] Condor JobID: 470428.110 in slot1

2018-09-18 08:30:44 (9960): Guest Log: [INFO] Starting pilot in slot1

2018-09-18 08:52:39 (9960): Guest Log: [INFO] Job finished in slot1 with .

2018-09-18 09:04:34 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 09:04:34 (9960): Status Report: Elapsed Time: '24002.510773'
2018-09-18 09:04:34 (9960): Status Report: CPU Time: '3301.468750'
2018-09-18 10:44:40 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 10:44:40 (9960): Status Report: Elapsed Time: '30002.510773'
2018-09-18 10:44:40 (9960): Status Report: CPU Time: '3395.187500'
2018-09-18 12:24:46 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 12:24:46 (9960): Status Report: Elapsed Time: '36002.510773'
2018-09-18 12:24:46 (9960): Status Report: CPU Time: '3487.781250'
2018-09-18 14:04:53 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 14:04:53 (9960): Status Report: Elapsed Time: '42002.510773'
2018-09-18 14:04:53 (9960): Status Report: CPU Time: '3576.890625'
2018-09-18 15:44:59 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 15:44:59 (9960): Status Report: Elapsed Time: '48002.510773'
2018-09-18 15:44:59 (9960): Status Report: CPU Time: '3666.281250'
2018-09-18 17:25:05 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 17:25:05 (9960): Status Report: Elapsed Time: '54002.510773'
2018-09-18 17:25:05 (9960): Status Report: CPU Time: '3756.390625'
2018-09-18 19:05:12 (9960): Status Report: Job Duration: '64800.000000'
2018-09-18 19:05:12 (9960): Status Report: Elapsed Time: '60002.510773'
2018-09-18 19:05:12 (9960): Status Report: CPU Time: '3844.156250'
2018-09-18 20:25:16 (9960): Powering off VM.
2018-09-18 20:30:17 (9960): VM did not power off when requested.
2018-09-18 20:30:17 (9960): VM was successfully terminated.
2018-09-18 20:30:17 (9960): Deregistering VM. (boinc_4de38852e270c880, slot#2)
2018-09-18 20:30:17 (9960): Removing network bandwidth throttle group from VM.
2018-09-18 20:30:17 (9960): Removing VM from VirtualBox.
20:30:22 (9960): called boinc_finish(0)

Last few lines of the Machine logs:
09/18/18 09:52:39 (pid:9352) Process exited, pid=9365, status=0
09/18/18 09:54:03 (pid:9352) attempt to connect to <128.142.142.167:9618> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (173 to go).

09/18/18 09:54:44 (pid:9352) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43343818
09/18/18 10:13:39 (pid:9352) CCBListener: failed to receive message from CCB server vccondor01.cern.ch
09/18/18 10:13:39 (pid:9352) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
09/18/18 10:15:33 (pid:9352) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#43343818

I reported on -dev that tasks only had about 1/3 CPU time versus Elapsed but didn't look closer at those logs. Maybe those ones did similar to this one? Going for a nosey to compare.

[Multiple Edits] On those -dev tasks, 1 shows jobs running throughout but only used 85mins of 12hrs and another had 7hrs+ CPU time of 18hrs but 9hrs of no activity. Over there and here, all tasks have run and returned jobs (hopefully successfully) but don't seem to self terminate when they fail to get subsequent work so sit idle for the remaining time. I've set "Switch Between Apps" sufficiently high that Tasks should run uninterrupted so it's not a suspend/resume issue.
Hoping this is useful for diagnostics.
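
For anyone who wants to check their own stderr.log the same way, here is a rough sketch that reads the "Status Report" lines and shows how much CPU time was used between two reports. It assumes the line format shown in the log above. The function name `cpu_efficiency` is made up for the example.

```python
import re

# Matches the "Elapsed Time" and "CPU Time" status lines in the format shown above.
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")


def cpu_efficiency(log_text: str):
    """Return (elapsed_delta, cpu_delta, ratio) for each pair of consecutive reports."""
    elapsed, cpu = [], []
    for label, value in STATUS_RE.findall(log_text):
        (elapsed if label == "Elapsed Time" else cpu).append(float(value))
    rows = []
    for i in range(1, min(len(elapsed), len(cpu))):
        d_elapsed = elapsed[i] - elapsed[i - 1]
        d_cpu = cpu[i] - cpu[i - 1]
        rows.append((d_elapsed, d_cpu, d_cpu / d_elapsed if d_elapsed else 0.0))
    return rows


# Two consecutive reports copied from the log above.
sample = """\
2018-09-18 09:04:34 (9960): Status Report: Elapsed Time: '24002.510773'
2018-09-18 09:04:34 (9960): Status Report: CPU Time: '3301.468750'
2018-09-18 10:44:40 (9960): Status Report: Elapsed Time: '30002.510773'
2018-09-18 10:44:40 (9960): Status Report: CPU Time: '3395.187500'
"""

for d_elapsed, d_cpu, ratio in cpu_efficiency(sample):
    print(f"{d_elapsed:.0f} s elapsed, {d_cpu:.1f} s CPU, {ratio:.1%} busy")
```

For the two reports shown, that is about 94 s of CPU time in 6000 s of elapsed time, so the VM was almost idle.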

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,874,573
RAC: 137,197
Message 36760 - Posted: 18 Sep 2018, 20:13:46 UTC - in response to Message 36759.  

LHCb should have been working fine since about 15:15 UTC.
WUs that started before then may be affected by unspecified issues.

Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,147,992
RAC: 15,989
Message 36779 - Posted: 19 Sep 2018, 15:30:45 UTC
Last modified: 19 Sep 2018, 15:35:30 UTC

Got 52 tasks last night. Estimated runtime is 28 minutes each, actual 12-18 hours. Tasks are configured to run using 4 cores. No way to finish them all in time if jobs are continuously available. Anyway I let them run and those not started by deadline will be auto-aborted.

Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,921
RAC: 102,416
Message 36837 - Posted: 23 Sep 2018, 15:15:12 UTC

I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes.
Is this by design, or are these tasks faulty?

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,874,573
RAC: 137,197
Message 36838 - Posted: 23 Sep 2018, 15:40:45 UTC - in response to Message 36837.  

Sad to say, it's not really working.

The jobs inside the VM are missing some data from or a connection to a backend system at CERN.
Thus they stand by until a timeout shuts them down and the VM requests another job.

I set my systems to NNT (No New Tasks).

Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,921
RAC: 102,416
Message 36839 - Posted: 23 Sep 2018, 16:16:14 UTC - in response to Message 36838.  

Sad to say, it's not really working.

We seem to be having a bad time with LHC lately:

- Theory not working
- LHCb not working
- No Sixtrack for long time

- only ATLAS seems to function the way it's supposed to.

bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36841 - Posted: 23 Sep 2018, 16:43:49 UTC - in response to Message 36838.  
Last modified: 23 Sep 2018, 16:51:58 UTC

Sad to say, it's not really working.

The jobs inside the VM are missing some data from or a connection to a backend system at CERN.
Thus they stand by until a timeout shuts them down and the VM requests another job.

How do you know they are merely standing by and not doing any work? Are there any such indications in the running.log or finished.log files?

Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,921
RAC: 102,416
Message 36844 - Posted: 23 Sep 2018, 18:15:22 UTC - in response to Message 36841.  

How do you know they are merely standing by and not doing any work?

I guess this should be evidence enough:

I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes.

bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36847 - Posted: 23 Sep 2018, 19:36:16 UTC - in response to Message 36844.  

How do you know they are merely standing by and not doing any work?
I guess this should be evidence enough:
I have several LHCb tasks running for more than 3 hours now, and in the properties I notice a processor time of about 20 minutes.


Yes. But in the running.log or finished.log files, anything there?

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,874,573
RAC: 137,197
Message 36859 - Posted: 24 Sep 2018, 16:58:33 UTC - in response to Message 36847.  

If you complete the picture with
- router monitoring
- proxy monitoring (if you have one)
- host monitoring
- VM's console output (job's starting/ending time, top overview)
- ...

then you get an idea of what a successful task pattern looks like and how typical errors affect this pattern.

Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,921
RAC: 102,416
Message 36861 - Posted: 24 Sep 2018, 18:02:53 UTC - in response to Message 36859.  

I forgot to kill a running LHCb task on one of my PCs, so I checked the properties a minute ago: runtime 12+ hours, processor time about 1 hour.

These tasks are definitely faulty.

Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,147,992
RAC: 15,989
Message 36866 - Posted: 24 Sep 2018, 19:45:50 UTC - in response to Message 36779.  

Got 52 tasks last night. Estimated runtime is 28 minutes each, actual 12-18 hours. Tasks are configured to run using 4 cores. No way to finish them all in time if jobs are continuously available. Anyway I let them run and those not started by deadline will be auto-aborted.

Referring to my earlier post: as there are now no jobs available for these tasks, the estimated runtime may well turn out to be accurate. The tasks are now all failing after about 21 minutes with EXIT_NO_SUB_TASKS, so maybe I can return them before the deadline. Sadly with an error on most of them :-(

=Lupus=
Joined: 28 Sep 04
Posts: 6
Credit: 285,954
RAC: 0
Message 36922 - Posted: 29 Sep 2018, 13:57:22 UTC
Last modified: 29 Sep 2018, 13:59:39 UTC

Weird things in the LHCb 1.05 log:

    2018-09-28 22:16:21 (6464): Detected: vboxwrapper 26197
    2018-09-28 22:16:21 (6464): Detected: BOINC client v7.7
    2018-09-28 22:16:22 (6464): Detected: VirtualBox VboxManage Interface (Version: 5.2.8)
    2018-09-28 22:16:22 (6464): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
    2018-09-28 22:16:22 (6464): Successfully copied 'init_data.xml' to the shared directory.
    2018-09-28 22:16:23 (6464): Create VM. (boinc_58488727ebc83fb2, slot#9)
    2018-09-28 22:16:23 (6464): Setting Memory Size for VM. (11148MB)
    2018-09-28 22:16:24 (6464): Setting CPU Count for VM. (8)
    [snip]
    2018-09-28 22:16:33 (6464): Successfully started VM. (PID = '18968')
    2018-09-28 22:16:33 (6464): Reporting VM Process ID to BOINC.
    2018-09-28 22:16:33 (6464): Guest Log: BIOS: VirtualBox 5.2.8



So far everything is OK; work is running.
Now the work log shows some output stuttering/text doubling. When the output is malicious, the code is malicious...

    2018-09-28 22:21:27 (6464): Guest Log: [INFO] New Job Starting in slot1
    2018-09-28 22:21:28 (6464): Guest Log: [INFO] Condor JobID: 473308.169 in slot1
    2018-09-28 22:21:28 (6464): Guest Log: [INFO] New Job Starting in slot2
    2018-09-28 22:21:28 (6464): Guest Log: [INFO] Condor JobID: 473315.147 in slot2
    2018-09-28 22:21:42 (6464): Guest Log: [INFO] Starting pilot in slot1
    2018-09-28 22:21:42 (6464): Guest Log: [INFO] Starting pilot in slot2
    2018-09-28 22:41:27 (6464): Guest Log: [INFO] Job finished in slot1 with .
    2018-09-28 22:41:27 (6464): Guest Log: [INFO] Job finished in slot2 with .
    2018-09-28 22:41:46 (6464): Guest Log: [IN[FION]F ON]e wN eJwo bJ oSbt aSrttairntgin g iin ns lsolto3t
    2018-09-28 22:41:46 (6464): Guest Log: [IN[FION]F ON]e wN eJwo bJ oSbt aSrttairntgin g iin ns lsolto3t4
    2018-09-28 22:41:46 (6464): Guest Log: [INFO] Condor JobID: 473325.157 in slot3
    2018-09-28 22:41:46 (6464): Guest Log: [INFO] Condor JobID: 473325.159 in slot4
    2018-09-28 22:41:56 (6464): Guest Log: [IN[FOI]N FSOt]a rSttiangr tpiinlgo tp iilno ts liont s3lo
    2018-09-28 23:01:32 (6464): Guest Log: [INFO] Job finished in slot3 with .
    2018-09-28 23:01:32 (6464): Guest Log: [INFO] Job finished in slot4 with .
    2018-09-28 23:01:48 (6464): Guest Log: [INFO] New Job Starting in slot6
    2018-09-28 23:01:48 (6464): Guest Log: [INFO] New Job Starting in slot1
    2018-09-28 23:01:48 (6464): Guest Log: [INFO] Condor JobID: 473325.506 in slot6
    2018-09-28 23:01:48 (6464): Guest Log: [INFO] Condor JobID: 473325.502 in slot1
    2018-09-28 23:01:58 (6464): Guest Log: [I[NIFNOF]O ]S tSatratritnign gp ipliolto ti ni ns lsolto1t
    2018-09-28 23:01:58 (6464): Guest Log: [I[NIFNOF]O ]S tSatratritnign gp ipliolto ti ni ns lsolto1t6
    2018-09-28 23:21:28 (6464): Guest Log: [INFO] New Job Starting in slot4
    2018-09-28 23:21:28 (6464): Guest Log: [INFO] New Job Starting in slot2
    2018-09-28 23:21:29 (6464): Guest Log: [IN[FOI]N FCOo]n dCoorn dJoorb IJDo:b I D4:7 3 342753.382759. 8i8n4 silno ts2l
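
One possible explanation for the stuttering (a guess, not confirmed by the project) is that two slots write their messages to the same console at the same time, so the characters get mixed. The sketch below only illustrates that idea. `interleave` and its `lead` parameter are made up for the example.

```python
from itertools import zip_longest


def interleave(first: str, second: str, lead: int = 0) -> str:
    """Mix two strings character by character.

    `first` gets a head start of `lead` characters, then the two writers
    alternate strictly, starting with `second`.
    """
    out = [first[:lead]]
    for ch_second, ch_first in zip_longest(second, first[lead:], fillvalue=""):
        out.append(ch_second + ch_first)
    return "".join(out)


# The two messages that apparently collided in the log above.
line_a = "[INFO] New Job Starting in slot3"
line_b = "[INFO] New Job Starting in slot4"
print(interleave(line_a, line_b, lead=3))
```

With a lead of 3 characters the result starts with `[IN[FION]F ON]e wN eJwo`, the same way the garbled slot3/slot4 line above starts. Further along the real line drifts, so the timing was evidently not perfectly regular.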



FYI: https://lhcathome.cern.ch/lhcathome/result.php?resultid=207092319

Ah, by the way, they should run on 8 cores. The maximum they run is 4.

=Lupus=


©2024 CERN