Thread 'Tasks download 1.9 GB EVNT files'

Author	Message
computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2748 Credit: 302,673,444 RAC: 73,679	Message 45446 - Posted: 18 Oct 2021, 7:41:52 UTC Got a couple of tasks that download 1.9 GB EVNT files (each!). That's a bit large. ID: 45446 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 160,887,454 RAC: 40,359	Message 45447 - Posted: 18 Oct 2021, 7:56:15 UTC - in response to Message 45446. Same here. Plus, the VDI image can grow to about 5 GB (compared with 3.2 GB so far). Also, these tasks seem to use more RAM. The upload file, however, is about 80 MB, i.e. smaller than the earlier ones. The tasks also have a shorter runtime than the earlier ones. However, my problem is that with my 32 GB RAMDisk I cannot process 4 tasks with 3 cores each simultaneously (BOINC would not even let me download more than 3 tasks), so I might have to switch to 3 tasks with 4 cores each. No big deal, but somehow interesting. ID: 45447 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1553 Credit: 10,078,801 RAC: 1,115	Message 45448 - Posted: 18 Oct 2021, 9:59:33 UTC Last modified: 18 Oct 2021, 10:03:08 UTC Suspending such a task with LAIM off may crash the task, because it can exceed BOINC's slot disk limit of 10.000.000.000 bytes. Tested it with 2 tasks. 1 task grew to 10.979.000.000 bytes and the other task 'only' to 6.827.000.000. First task upload file: 86.400 K ID: 45448 · Reply Quote
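The figures in this post can be checked against the slot limit with a few lines of Python. This is only a minimal sketch: the function name is made up for illustration, and the limit and the two sizes are simply the numbers quoted above, not values read from BOINC.

```python
# Slot disk limit as quoted in the post above (not read from BOINC itself).
SLOT_DISK_LIMIT_BYTES = 10_000_000_000


def exceeds_slot_limit(size_bytes, limit_bytes=SLOT_DISK_LIMIT_BYTES):
    """Return True if a slot directory of this size is over the limit."""
    return size_bytes > limit_bytes


# The two peak slot sizes reported for the test tasks.
observed = {"task 1": 10_979_000_000, "task 2": 6_827_000_000}

for name, size in observed.items():
    margin = SLOT_DISK_LIMIT_BYTES - size  # negative means over the limit
    state = "over" if exceeds_slot_limit(size) else "within"
    print(f"{name}: {size:,} bytes, {state} the limit (margin {margin:,} bytes)")
```

Only the first of the two tasks ends up over the limit, which matches the observation that not every suspended task is at risk.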

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 160,887,454 RAC: 40,359	Message 45449 - Posted: 18 Oct 2021, 10:34:37 UTC Now, none of these tasks with download size 1.9GB are working any longer. 42 seconds after start, they stop, and in the BOINC manager it says "postponed: VM environment needed to be cleaned up". What kind of problem is this now? ID: 45449 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1553 Credit: 10,078,801 RAC: 1,115	Message 45450 - Posted: 18 Oct 2021, 11:24:13 UTC - in response to Message 45449. My 2 tries with normal fast happy end: https://lhcathome.cern.ch/lhcathome/result.php?resultid=330582566 https://lhcathome.cern.ch/lhcathome/result.php?resultid=330582616 ID: 45450 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 160,887,454 RAC: 40,359	Message 45451 - Posted: 18 Oct 2021, 11:57:44 UTC - in response to Message 45449. Now, none of these tasks with download size 1.9GB are working any longer. 42 seconds after start, they stop, and in the BOINC manager it says "postponed: VM environment needed to be cleaned up". What kind of problem is this now? Well, I opened the VirtualBox Manager, and on the left-hand side I noticed quite a number of tasks which had obviously got stuck there, or had not been properly deleted after upload (for whatever reason). I removed them all and downloaded new tasks, and they are working well. ID: 45451 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2748 Credit: 302,673,444 RAC: 73,679	Message 45452 - Posted: 18 Oct 2021, 15:48:57 UTC Reported "peak swap sizes" are very variable. Some examples, from different client instances that all use the same setup: https://lhcathome.cern.ch/lhcathome/result.php?resultid=330582305 34.31 GB (!!) https://lhcathome.cern.ch/lhcathome/result.php?resultid=330581926 2.56 GB Since CMS is currently not running, neither CPU nor RAM is under heavy load. ID: 45452 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 160,887,454 RAC: 40,359	Message 45453 - Posted: 18 Oct 2021, 16:33:21 UTC - in response to Message 45452. Reported "peak swap sizes" are very variable. ... that's interesting, indeed. Maybe it is different with Windows (like in my case) - I now looked up my tasks: in all cases, the value is slightly below 100MB. ID: 45453 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 941 Credit: 781,990,109 RAC: 82,988	Message 45455 - Posted: 18 Oct 2021, 16:47:07 UTC Maybe you get a big peak swap if you quit boinc as you have to save the VM image? The 15 or so I looked through were all less than 100MB. ID: 45455 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 160,887,454 RAC: 40,359	Message 45457 - Posted: 18 Oct 2021, 18:37:26 UTC With my 32 GB RAMDisk, I now cannot even process two 4-core tasks. Only one is working well. Stderr says: "2021-10-18 20:19:07 (532): VM is no longer is a running state. It is in 'lse, errorID=DevATA_DISKFULL message="Host system reported disk full. VM execution is suspended. You can resume after freeing some space" '. 2021-10-18 20:19:07 (532): VM state change detected. (old = 'Running', new = 'lse, errorID=DevATA_DISKFULL message="Host system reported disk full. VM execution is suspended. You can resume after freeing some space" https://lhcathome.cern.ch/lhcathome/result.php?resultid=330604840 No idea how much disk space this new type of ATLAS task now needs. What I also notice: after a task fails, the vm_image.vdi is not deleted from the "slots" folder. Hence, no new tasks can be downloaded, due to lack of space. Seemingly, these new tasks are faulty. I will stop crunching ATLAS for the moment. ID: 45457 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 941 Credit: 781,990,109 RAC: 82,988	Message 45458 - Posted: 18 Oct 2021, 18:48:01 UTC - in response to Message 45457. I have a few WUs that have a 9 GB VM image, so these are bigger. Maybe with a checkpoint they can go over 16 GB? ID: 45458 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2748 Credit: 302,673,444 RAC: 73,679	Message 45459 - Posted: 18 Oct 2021, 20:00:25 UTC - in response to Message 45455. ... peak swap if you quit boinc as you have to save the VM image? The (my) BOINC clients in question are running nothing but ATLAS native. Usually 24/7 without suspend/resume and without a BOINC client restart. ID: 45459 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1984 Credit: 160,887,454 RAC: 40,359	Message 45460 - Posted: 18 Oct 2021, 20:06:25 UTC Further, something must be wrong with the credit calculation: before, a CPU time of about 14.000 seconds earned a credit of around 370; now, for the same amount of time, the credit is around 60 :-( ID: 45460 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 941 Credit: 781,990,109 RAC: 82,988	Message 45461 - Posted: 18 Oct 2021, 22:03:01 UTC - in response to Message 45459. That was my thought, I run the same so there is no suspend or resume, so could be smaller? ID: 45461 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 941 Credit: 781,990,109 RAC: 82,988	Message 45462 - Posted: 18 Oct 2021, 22:04:46 UTC - in response to Message 45460. I assume this is just CreditNew being the way that it is. I get the same sort of numbers: 30k is 280. ID: 45462 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2298 Credit: 179,568,949 RAC: 30,485	Message 45464 - Posted: 18 Oct 2021, 23:28:31 UTC Last modified: 19 Oct 2021, 0:14:27 UTC "Atlas Simulation needs 998,93 MB more disk space. You currently have 8537 MB." A cvmfs_config reload in a CentOS VM cleared it, and the download of the other two of the four Atlas tasks is starting. The 1.9 GByte file is downloaded as well, but there is no new application version of the Atlas-Applet!! The download speed, because of the squid-Proxy, is 0.7 MBit/s instead of 60 MBit/s: 60 min download time!! A raw file instead of a zip?? ID: 45464 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2748 Credit: 302,673,444 RAC: 73,679	Message 45465 - Posted: 19 Oct 2021, 6:47:32 UTC - in response to Message 45464. ... because of the squid-Proxy ... Slow because of Squid? Surely wrong. Those large files are typical one-timers. This means Squid can't take them from its caches. Instead, each of those files must be downloaded from lhcathome-upload.cern.ch. This can be seen in Squid's logfile (-> TCP_MISS:HIER_DIRECT): xxx 3128 - - [19/Oct/2021:08:15:17 +0200] "GET http://lhcathome-upload.cern.ch/lhcathome/download//225/xxx_EVNT.27082874._000014.pool.root.1 HTTP/1.1" 200 2034165623 "-" "BOINC client (x86_64-pc-linux-gnu 7.17.0)" TCP_MISS:HIER_DIRECT Based on my router monitoring I suspect the CERN network can't deliver the files continuously at full speed (it intermittently drops to less than 20 Mbit/s) Nonetheless a download time of 60 min might point out a local bottleneck. ID: 45465 · Reply Quote
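To illustrate how that log line can be read, here is a rough Python sketch. It assumes the field order seen in the quoted line (request, status, size, then the result:hierarchy pair at the end); other Squid logformat settings will look different. The function names and the "HIT" check are illustrative only, and the rates in the loop are just the two speeds mentioned in this thread.

```python
import re

# The log line quoted in the post above (host and file prefix were already masked as "xxx").
LOG_LINE = (
    'xxx 3128 - - [19/Oct/2021:08:15:17 +0200] '
    '"GET http://lhcathome-upload.cern.ch/lhcathome/download//225/'
    'xxx_EVNT.27082874._000014.pool.root.1 HTTP/1.1" 200 2034165623 '
    '"-" "BOINC client (x86_64-pc-linux-gnu 7.17.0)" TCP_MISS:HIER_DIRECT'
)

# Assumption: fields appear in the same order as in the quoted line.
PATTERN = re.compile(
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'.* (?P<result>[A-Z_]+):(?P<hierarchy>[A-Z_]+)$'
)


def parse_line(line):
    """Return the interesting fields of one log line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["size"] = int(fields["size"])
    return fields


def transfer_seconds(size_bytes, mbit_per_s):
    """Ideal transfer time at a constant rate, ignoring protocol overhead."""
    return size_bytes * 8 / (mbit_per_s * 1_000_000)


entry = parse_line(LOG_LINE)
# Simple heuristic: result codes containing "HIT" were answered from the cache.
served_from_cache = "HIT" in entry["result"]
print(entry["result"], entry["hierarchy"], entry["size"], served_from_cache)

for rate in (60, 20):
    minutes = transfer_seconds(entry["size"], rate) / 60
    print(f"{rate} Mbit/s -> {minutes:.1f} min")
```

The result code of the quoted line is a miss fetched directly from the origin server, so the proxy only passes the file through. At a steady rate, even the lower of the two speeds would finish far sooner than 60 minutes.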

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2748 Credit: 302,673,444 RAC: 73,679	Message 45466 - Posted: 19 Oct 2021, 7:14:52 UTC - in response to Message 45461. That was my thought, I run the same so there is no suspend or resume, so could be smaller? Unlike ATLAS vbox, ATLAS native doesn't use VirtualBox (hence no snapshot). ID: 45466 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2298 Credit: 179,568,949 RAC: 30,485	Message 45467 - Posted: 19 Oct 2021, 7:26:01 UTC - in response to Message 45465. Last modified: 19 Oct 2021, 8:12:00 UTC Based on my router monitoring I suspect the CERN network can't deliver the files continuously at full speed (it intermittently drops to less than 20 Mbit/s) Nonetheless a download time of 60 min might point out a local bottleneck. WCG ignores Squid, and all PCs get normal traffic (60 MBit/s). At the moment a new one with 1.89 GByte is downloading, max. speed 21 MBit/s, on this VM: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10694634 What about Frontier?? This CentOS8 VM runs at most 8 WCG ARP tasks or 4 Atlas-VM tasks! NOW 45 min instead of 60 min download. Can this Atlas version be stopped by CERN IT? ID: 45467 · Reply Quote
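For comparison, the average rate implied by the download times in this post can be worked out as follows. This is a back-of-the-envelope sketch only: it assumes "1.89 GByte" means decimal gigabytes and that the stated times cover the whole file, and the function name is made up for illustration.

```python
def average_mbit_per_s(size_bytes, minutes):
    """Average throughput needed to move size_bytes in the given number of minutes."""
    return size_bytes * 8 / (minutes * 60) / 1_000_000


SIZE_BYTES = 1.89e9  # assumption: "1.89 GByte" read as decimal gigabytes

for minutes in (60, 45):
    rate = average_mbit_per_s(SIZE_BYTES, minutes)
    print(f"{minutes} min -> {rate:.1f} Mbit/s on average")
```

If those assumptions hold, both averages come out well below the 21 MBit/s peak mentioned above, which suggests the transfer spends most of its time far under peak speed.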

maeax Send message Joined: 2 May 07 Posts: 2298 Credit: 179,568,949 RAC: 30,485	Message 45468 - Posted: 19 Oct 2021, 11:17:28 UTC - in response to Message 45457. I will stop crunching ATLAS for the moment. +1, as of one hour ago. ID: 45468 · Reply Quote