Message boards : ATLAS application : Confused

keputnam
Joined: 27 Sep 04
Posts: 94
Credit: 3,758,301
RAC: 4,865
Message 43305 - Posted: 4 Sep 2020, 17:44:29 UTC
Last modified: 4 Sep 2020, 17:49:23 UTC

Can anyone explain this to me?


Run Time      CPU Time      Credit
120,886.49    140,595.50       590.26
107,254.38     36,754.30     2,149.71



Also, I have another complaint about the scheduler. I had five WUs cancelled by the server because I would have returned them late. Fair enough.

But it sent me another 15 with an 8-day return date!

There is no way in hell I'll get through them all on time

What gives?
maeax

Joined: 2 May 07
Posts: 1513
Credit: 49,674,385
RAC: 154,627
Message 43306 - Posted: 4 Sep 2020, 19:08:50 UTC - in response to Message 43305.  

Can anyone explain this to me?
Run Time      CPU Time      Credit
120,886.49    140,595.50       590.26
107,254.38     36,754.30     2,149.71

This is the good news:
2020-09-04 08:39:58 (11796): Guest Log: Looking for outputfile HITS.22420244._023297.pool.root.1
2020-09-04 08:39:59 (11796): Guest Log: HITS file was successfully produced

You have restarted this ATLAS task many times, so each time it begins again from the first collision (of 200 in total).
The server aborts tasks because the deadline window is only about three days.
If an ATLAS task is not sent back to the server in time, CERN has to rely on the "Agile Boincers" to finish it quickly. They need a result within a few days, not a week after the task was issued.

You can try an app_config to get only ONE ATLAS task at a time, so you have time to finish it, but don't interrupt the task while it is running.
If that is not possible on your side, please don't run ATLAS, sorry.
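For reference, limiting the client to a single running ATLAS task can be done with an app_config.xml in the LHC@home project directory. A minimal sketch, assuming the app is named ATLAS (check client_state.xml on your own host for the exact app name):

```xml
<!-- app_config.xml: run at most one ATLAS task at a time.
     "ATLAS" is the assumed app name; verify it in client_state.xml. -->
<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
```

After saving the file, use Options -> Read config files in the BOINC Manager (or restart the client) so it takes effect.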
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 43307 - Posted: 4 Sep 2020, 19:32:48 UTC

I'd like to point out that "Cancelled by server" is common on many projects. It only affects tasks you have not yet started, once enough valid results have come back to satisfy the workunit's quorum.
This is a way to reduce wasted computation, and your client will happily fetch other tasks instead.
It happens to my hosts every day on other projects.

This occurs when some hosts have a short "Average turnaround time" while others take much longer. If network activity is set to "always" and you have decent bandwidth, try reducing your work cache: in the computing preferences, lower "Store up to an additional X days of work". I would suggest at most 1 day. Your host reports around 3 days, so many tasks will be cancelled as no longer needed, wasting bandwidth and storage on your host.
If the client handles several projects, you could go even lower, since the other projects act as backups when a server is down or out of work.
keputnam

Joined: 27 Sep 04
Posts: 94
Credit: 3,758,301
RAC: 4,865
Message 43308 - Posted: 4 Sep 2020, 19:57:05 UTC - in response to Message 43306.  

Already have an app_config with max_concurrent set to 1

We took three power hits over about 19 hours, so that would account for three restarts

I think I did one for Windows maintenance, too




As for the aborted tasks and the ridiculous number of new tasks sent:

I haven't changed my relative resource share in over six months, nor added any new projects, and the scheduler SHOULD be smart enough not to send me work I will never complete on time.
keputnam

Joined: 27 Sep 04
Posts: 94
Credit: 3,758,301
RAC: 4,865
Message 43309 - Posted: 4 Sep 2020, 19:59:09 UTC - in response to Message 43307.  

Oh, I realize the purpose, but as far as I can remember this has only ever happened on ATLAS.

And this is my third go-round on this circus.

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 430
Credit: 117,525,067
RAC: 0
Message 43665 - Posted: 21 Nov 2020, 22:59:09 UTC

Just saw this thread.

The scheduler was designed at a time when only single-core WUs existed, and for those it works very well.

It has real problems balancing multi-core WUs; if you want to run these, it may be necessary to help it. At LHC you have the option to say "give me only 1 workunit". This is set up here: https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project

Choose 'Max # jobs' and set it to one or two, or whatever you would like.


Supporting BOINC, a great concept!
Henry Nebrensky

Joined: 13 Jul 05
Posts: 157
Credit: 14,665,461
RAC: 0
Message 43787 - Posted: 4 Dec 2020, 23:27:31 UTC - in response to Message 43665.  

The scheduler has real problems balancing multi-core WUs; if you want to run these, it may be necessary to help it.

I don't have too many problems mixing multi-core ATLAS with Theory or SixTrack on 10617965. Keeping a steady stream of jobs on hand helps. Making changes such as to #Cores will upset the client - I just grit my teeth and wait a couple of days while it sorts itself out.
I find that trying to micro-manage BOINC makes things worse, not better. :(
Jim1348

Joined: 15 Nov 14
Posts: 589
Credit: 21,798,687
RAC: 3,377
Message 43788 - Posted: 5 Dec 2020, 0:11:43 UTC - in response to Message 43787.  

Making changes such as to #Cores will upset the client - I just grit my teeth and wait a couple of days while it sorts itself out.
I find that trying to micro-manage BOINC makes things worse, not better. :(

Right. But sometimes it fixes things. I could only get four native ATLAS tasks at a time on my 12-core Ryzen 3600 until I changed Max # CPUs from 1 to 2.
Now it has sent me four more, for a total of eight. You never know how it will react.
greg_be

Joined: 28 Dec 08
Posts: 294
Credit: 2,486,747
RAC: 1,440
Message 43799 - Posted: 6 Dec 2020, 18:19:14 UTC - in response to Message 43788.  
Last modified: 6 Dec 2020, 18:26:43 UTC

Making changes such as to #Cores will upset the client - I just grit my teeth and wait a couple of days while it sorts itself out.
I find that trying to micro-manage BOINC makes things worse, not better. :(

Right. But sometimes it fixes things. I could only get four native ATLAS at a time for my 12-core Ryzen 3600, until I changed Max # CPUs from 1 to 2.
Now it has sent me four more, for a total of eight. You never know how it will react.



I have a 16-core machine and run a bunch of different projects. So to keep ATLAS from crashing, I set the web preferences to use only 4 cores, and in my app_config I also restrict it to 4 cores. Memory is set to 6600 MB and I have it load 8 jobs into the queue. This works just fine.
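A per-task core cap along those lines can go in the same app_config.xml. This is only a sketch: the app name and plan_class below are assumptions, so check client_state.xml on your host for the exact strings before using it.

```xml
<!-- app_config.xml sketch: cap ATLAS at 4 cores per task.
     The app_name and plan_class values are assumptions; verify
     them against client_state.xml on your own host. -->
<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>4.0</avg_ncpus>
  </app_version>
</app_config>
```

Note the memory limit and queue depth aren't set here: ATLAS memory is controlled via the project's web preferences, and the queue depth follows from the client's work-buffer settings.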



©2022 CERN