Message boards : ATLAS application : Some Validate errors
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 29852 - Posted: 6 Apr 2017, 21:45:51 UTC - in response to Message 29848.  
Last modified: 6 Apr 2017, 21:48:17 UTC

Hi Jim, when I look at your task list, I notice that, curiously, some of the tasks finished at the same time:

Philippe,

I think that just means that they were uploaded at the same time. They actually finished on my machine a few minutes apart, as you can see from the BoincTasks History log:

Ga8LDmCc3FqnSu7Ccp2YYBZmABFKDmABFKDm3INKDmYMSKDmcEz7Ao_0 04:06:47 (03:47:52) 4/6/2017 6:50:43 AM 4/6/2017 7:36:29 AM
JMhKDmE50FqnSu7Ccp2YYBZmABFKDmABFKDm3INKDmAHSKDmK2i9im_0 04:00:06 (03:41:22) 4/6/2017 6:31:38 AM 4/6/2017 7:36:29 AM

So I don't think there is a problem.
Regards
ID: 29852
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,380,053
RAC: 102,119
Message 29854 - Posted: 7 Apr 2017, 5:10:50 UTC

Since last night, none of the finished and uploaded tasks have been validated; on the task page, the "Points" column says "pending".

What's the problem?
ID: 29854
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,380,053
RAC: 102,119
Message 29861 - Posted: 7 Apr 2017, 10:27:06 UTC - in response to Message 29854.  
Last modified: 7 Apr 2017, 10:27:19 UTC

What's the problem?

The problem has obviously resolved itself; all the uploaded tasks have now been validated.
ID: 29861
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 29866 - Posted: 7 Apr 2017, 16:26:41 UTC - in response to Message 29861.  
Last modified: 7 Apr 2017, 16:28:11 UTC

I found my mistake:

the time in the task list refers to the moment of validation by the server.

The WU is downloaded from the server; the sending time is shown in the task list.
The WU starts at a later time in the BOINC client; the start time is given in the log.
The WU is crunched for the elapsed time.
The WU ends at the finish time given in the log.
The WU is uploaded to the server.
The result is stored on the server until the BOINC client next contacts the project (project update).
The WUs are then sorted, grouped by host, and validated at the time shown in the task list.

So it seems normal that several WUs can be validated at the same time, if the lapse between two project updates from the BOINC client to the server is longer than the duration of a WU (on a small host) or covers several delayed WUs (on a big host running many simultaneous tasks).
(Sorry, but I never give up.)
I just wanted to understand whether several WUs being validated at the same time would influence the credit earned under the new credit system, whose rules are dynamic and depend on the size of the hosts and on the particular difficulty of some jobs.
ID: 29866
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 31801 - Posted: 4 Aug 2017, 5:50:22 UTC

Four PCs had problems starting this workunit:

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=73498862
ID: 31801
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 31805 - Posted: 4 Aug 2017, 8:45:47 UTC - in response to Message 31801.  

Four PCs had problems starting this workunit:

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=73498862

One of these machines is my top cruncher, so I'm sure it is not the machine.

It's either a problem with the WU, or with the server backend, or both.


Supporting BOINC, a great concept !
ID: 31805
mrchips

Joined: 16 May 14
Posts: 15
Credit: 7,343,729
RAC: 0
Message 31870 - Posted: 7 Aug 2017, 12:44:47 UTC

ALL my tasks keep failing with a validate error. WHY?
ID: 31870
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,369,412
RAC: 10,065
Message 31871 - Posted: 7 Aug 2017, 12:52:26 UTC - in response to Message 31870.  

ALL my tasks keep failing with a validate error. WHY?

Take a walk through my checklist and pay particular attention to point no. 2.


Supporting BOINC, a great concept !
ID: 31871
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,939,826
RAC: 137,467
Message 31872 - Posted: 7 Aug 2017, 13:38:31 UTC - in response to Message 31870.  

mrchips wrote:
ALL my tasks keep failing with a validate error. WHY?

The recent batch probably needs more than the automatically configured 4200 MB of RAM for a 2-core WU.
You may set 5000 MB via app_config.xml.

Besides that, it is of course good advice to check your system against Yeti's checklist.
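
For illustration, a minimal app_config.xml along those lines could look like the sketch below. The app name, the plan class and the --memory_size_mb vboxwrapper option reflect a typical ATLAS (vbox) setup; treat them as assumptions and copy the exact strings from your own client_state.xml before using it:

<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <!-- assumed plan class; take the exact name from client_state.xml -->
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <!-- run the VM on 2 cores and raise its RAM from the default to 5000 MB -->
    <avg_ncpus>2</avg_ncpus>
    <cmdline>--memory_size_mb 5000</cmdline>
  </app_version>
</app_config>

Save the file in the LHC@home project directory and use Options / Read config files in the BOINC Manager so the new value is picked up.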
ID: 31872
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,380,053
RAC: 102,119
Message 31882 - Posted: 7 Aug 2017, 19:17:46 UTC - in response to Message 31872.  

The recent batch probably needs more than the automatically configured 4200 MB of RAM for a 2-core WU.
You may set 5000 MB via app_config.xml.

Hm, in view of the fact that currently even a 1-core ATLAS WU uses almost 5000 MB of RAM, I would guess that a 2-core WU needs even more than that.
I might try 2-core WUs tomorrow, so we'll see.
ID: 31882
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 31885 - Posted: 7 Aug 2017, 21:14:10 UTC - in response to Message 31882.  
Last modified: 7 Aug 2017, 21:16:54 UTC

The recent batch probably needs more than the automatically configured 4200 MB of RAM for a 2-core WU.
You may set 5000 MB via app_config.xml.

Hm, in view of the fact that currently even a 1-core ATLAS WU uses almost 5000 MB of RAM, I would guess that a 2-core WU needs even more than that.
I might try 2-core WUs tomorrow, so we'll see.

I set the RAM to 5000 MB via app_config.xml and also switched to two CPU cores per task, and have had no problems since then (all errors before that).
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10477864&offset=0&show_names=0&state=0&appid=14

In BoincTasks, the real memory usage shows up as 4200 MB for the two-core task.
ID: 31885
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,939,826
RAC: 137,467
Message 31888 - Posted: 8 Aug 2017, 6:22:05 UTC

As I understand it, there is a RAM usage peak during the initialisation phase of the ATLAS VM.
Some of the startup scripts fail if there is not enough RAM, and a watchdog then shuts down the VM.

During the calculation phase the RAM requirement seems to be much lower.

VMs that are configured to use more CPU cores get enough RAM to also cover the initialisation phase, but
1-core (and sometimes 2-core) VMs need special RAM management, probably during batch generation.

But that's only a guess derived from the error logs.

It would be nice if somebody from the project team could post an explanation, e.g. whether there is a checklist for batch generation or some quality checks.
ID: 31888
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 31890 - Posted: 8 Aug 2017, 6:41:01 UTC - in response to Message 31888.  
Last modified: 8 Aug 2017, 6:51:36 UTC

VMs that are configured to use more CPU cores get enough RAM to also cover the initialisation phase, but
1-core (and sometimes 2-core) VMs need special RAM management, probably during batch generation.

I am trying the single-CPU-core tasks again (still with 5000 MB of memory). The reason is that the two-core tasks do not honor the core reservation that I have set up for my GPU in an app_config.xml for GPU Grid. That is, ATLAS will run four tasks at once while the GPU also uses a core, so that nine cores are being allocated. That results in the GPU running slowly due to CPU starvation, which is not a desirable condition.

If I encounter memory problems with this configuration, I can limit the number of ATLAS tasks via an app_config.xml, but that may lead to scheduling problems, as you probably know, and I would prefer to avoid it. But unless several ATLAS tasks are in the start-up phase at the same time, it should not be a problem with 32 GB of memory.
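
For reference, capping the number of concurrently running ATLAS tasks would be done with a max_concurrent entry in the same app_config.xml; a minimal sketch, assuming the app name is ATLAS as reported by the client:

<app_config>
  <app>
    <name>ATLAS</name>
    <!-- never run more than two ATLAS tasks at the same time -->
    <max_concurrent>2</max_concurrent>
  </app>
</app_config>

Note that max_concurrent only limits how many tasks run at once; the client may still fetch more work than that, which is the kind of scheduling side effect mentioned above.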
ID: 31890
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,380,053
RAC: 102,119
Message 31893 - Posted: 8 Aug 2017, 16:14:15 UTC - in response to Message 31888.  
Last modified: 8 Aug 2017, 16:14:49 UTC

computezrmle wrote:
As I understand it, there is a RAM usage peak during the initialisation phase of the ATLAS VM.
Some of the startup scripts fail if there is not enough RAM, and a watchdog then shuts down the VM.

During the calculation phase the RAM requirement seems to be much lower.

Hm, honestly, I have never observed this, although I have been running 1-core, 2-core and 3-core ATLAS tasks for about a year and a half now.

What I see is a rather quick increase in RAM usage right from the beginning; then the increase flattens out and stays at a pretty even level until the task is finished.
ID: 31893
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,380,053
RAC: 102,119
Message 31894 - Posted: 8 Aug 2017, 16:34:11 UTC - in response to Message 31890.  

Jim1348 wrote:
I am trying the single-CPU-core tasks again (still with 5000 MB of memory). The reason is that the two-core tasks do not honor the core reservation that I have set up for my GPU in an app_config.xml for GPU Grid. That is, ATLAS will run four tasks at once while the GPU also uses a core, so that nine cores are being allocated. That results in the GPU running slowly due to CPU starvation, which is not a desirable condition.

Hm, this sounds interesting.
I am also running LHC tasks and GPUGRID tasks concurrently on 2 of my PCs, but I have never had any problems with CPU core reservation/allocation.

Only for LHC do I use various app_config.xml files (depending on which configuration I'd like to run); for GPUGRID I have never had an app_config.xml.

For example, my main PC:
12-core CPU (= 6 cores + 6 HT), 32 GB RAM.
2 GPUs, with 1 GPUGRID task running on each (completely automatic, no need for an app_config.xml).
Currently, 4 2-core ATLAS tasks are also running (app_config.xml for higher RAM).

This uses a total of 10 of my 12 CPU cores; total CPU usage is shown as ~86%. Total RAM usage (once the ATLAS tasks are running beyond the startup phase): ~22.5 GB.

What surprises me regarding RAM usage is the following:
When I ran 1-core ATLAS tasks (until yesterday), I had to increase the RAM available per task to 5000 MB (via app_config.xml), otherwise the tasks failed after 10-14 minutes. From what I could measure, each task indeed used close to 5000 MB.
Now the 2-core ATLAS tasks use no more than slightly over 5000 MB each, whereas I would have expected (based on how much the 1-core tasks needed) a value between 7000 and 7500 MB. This is really strange.
ID: 31894
HerveUAE
Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 31895 - Posted: 8 Aug 2017, 16:51:26 UTC

The initialisation and end phases of ATLAS tasks are single core, while the calculation phase is multicore. If the highest RAM requirements are from the initialisation phase, then it is logical to have similar RAM requirements for 1-core and 2-core tasks.
We are the product of random evolution.
ID: 31895
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,939,826
RAC: 137,467
Message 31896 - Posted: 8 Aug 2017, 17:15:20 UTC

Erich56 wrote:
What I see is a rather quick increase in RAM usage right from the beginning; then the increase flattens out and stays at a pretty even level until the task is finished.

There are different perspectives:

(1) from outside the VM, if you keep an eye on the host's monitoring apps
(2) from inside the VM, if you look at the top console (which we still don't have in ATLAS)

IIRC, (1) shows the maximum amount of RAM the VM has allocated since startup, as VirtualBox never gives it back to the OS.
I have no evidence for my previous guess; it's all speculation. But it's obvious that giving the VM (much) more RAM avoids errors with the recent batch.


HerveUAE wrote:
The initialisation and end phases of ATLAS tasks are single core, while the calculation phase is multicore. If the highest RAM requirements are from the initialisation phase, then it is logical to have similar RAM requirements for 1-core and 2-core tasks.

Good point.
ID: 31896
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 31897 - Posted: 8 Aug 2017, 17:25:29 UTC - in response to Message 31894.  
Last modified: 8 Aug 2017, 17:38:28 UTC

I am also running LHC tasks and GPUGRID tasks concurrently on 2 of my PCs, but I have never had any problems with CPU core reservation/allocation.

Only for LHC do I use various app_config.xml files (depending on which configuration I'd like to run); for GPUGRID I have never had an app_config.xml.

For example, my main PC:
12-core CPU (= 6 cores + 6 HT), 32 GB RAM.
2 GPUs, with 1 GPUGRID task running on each (completely automatic, no need for an app_config.xml).
Currently, 4 2-core ATLAS tasks are also running (app_config.xml for higher RAM).

This uses a total of 10 of my 12 CPU cores; total CPU usage is shown as ~86%. Total RAM usage (once the ATLAS tasks are running beyond the startup phase): ~22.5 GB.

In your case, it doesn't really matter whether you use an app_config.xml for GPU Grid, since you have four spare cores to feed the two cards.

However, if LHC (or any other project) were allowed to run on all the cores, it would use all 12. By default, GPU Grid does not reserve a whole core for itself, so it would have to run on a partial core, which slows it down. In that case, you would normally use an app_config.xml to reserve a whole core for each GPU for best performance.
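
A minimal sketch of such a reservation for GPU Grid follows; the app name below (acemdlong) is only an example of a GPUGRID application and may differ on your host, so take the real name from client_state.xml:

<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <!-- one task per GPU, and budget one full CPU core for each GPU task -->
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

With cpu_usage set to 1.0, the BOINC scheduler counts a whole core per GPU task, so CPU projects such as ATLAS are kept to the remaining cores.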
ID: 31897
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 33131 - Posted: 25 Nov 2017, 10:18:18 UTC

This workunit has not finished successfully on more than four PCs:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=80187546
ID: 33131
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 33133 - Posted: 25 Nov 2017, 13:36:19 UTC

ID: 33133