Thread 'Increased file server capacity'

Author	Message
Nils Volunteer moderator Project administrator Project developer Project tester Send message Joined: 15 Jul 05 Posts: 257 Credit: 6,001,083 RAC: 0	Message 33339 - Posted: 14 Dec 2017, 10:05:32 UTC Since Tuesday evening, we have had intermittent upload failures caused by a combination of two things: a large number of new BOINC hosts joined and, coincidentally, larger ATLAS tasks were introduced at the same time. Our file server capacity has been increased, and backlogged tasks waiting for upload should go through again soon. (Please refer to the ATLAS application and Number crunching forums for more details.) ID: 33339 · Reply Quote

Nils Volunteer moderator Project administrator Project developer Project tester Send message Joined: 15 Jul 05 Posts: 257 Credit: 6,001,083 RAC: 0	Message 33383 - Posted: 15 Dec 2017, 10:10:05 UTC - in response to Message 33339. We will temporarily stop all our file servers to allow a maintenance operation. All uploads will fail for a while, with a different message. ID: 33383 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1992 Credit: 163,190,286 RAC: 101,403	Message 33518 - Posted: 26 Dec 2017, 14:29:48 UTC It seems the situation with failing downloads and uploads has become rather critical by now. Regardless of which LHC sub-project I'd like to run on my 3 PCs, nothing works at all, or it only works after numerous retries over hours :-( It seems that some thorough clean-up should be made at CERN once people are back after the holidays. ID: 33518 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 33520 - Posted: 26 Dec 2017, 14:38:44 UTC - in response to Message 33518. Once I excluded ATLAS, I have had no problems with uploads or downloads on CMS, LHCb or Theory. The only issue seems to be the usual failures on CMS from lack of work, but that is not much of a problem. I await notice of the new hardware upgrade, hopefully next month. ID: 33520 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,948,092 RAC: 8,044	Message 33521 - Posted: 26 Dec 2017, 15:08:06 UTC - in response to Message 33520. Sorry about the current failures with CMS jobs. We seem to be a bit short-handed over Christmas/New Year. :-( ID: 33521 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 33523 - Posted: 26 Dec 2017, 16:07:54 UTC - in response to Message 33521. CMS works often enough (in terms of time) to keep me busy. And best wishes on your trip to Hobart if you go that route. ID: 33523 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,948,092 RAC: 8,044	Message 33530 - Posted: 27 Dec 2017, 15:39:40 UTC - in response to Message 33523. Cheers, Jim; we seem to have CMS jobs in the pipeline again. Not sure about going to Hobart yet, Tom's funeral is set for January 2nd now. ID: 33530 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1992 Credit: 163,190,286 RAC: 101,403	Message 33532 - Posted: 27 Dec 2017, 15:56:41 UTC - in response to Message 33530. "we seem to have CMS jobs in the pipeline again." Thanks for the good news, Ivan. However, when trying to download, it says "no tasks available for CMS simulation" :-( even though the Project Status Page shows the usual 200 "unsent" tasks. ID: 33532 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1992 Credit: 163,190,286 RAC: 101,403	Message 33534 - Posted: 27 Dec 2017, 16:58:08 UTC - in response to Message 33532. "when trying to download, it says 'no tasks available for CMS simulation' :-( Although on the Project Status Page, the usual 200 'unsent' tasks are shown." Finally, after almost endless retries, CMS tasks came in. I've experienced the same with all other sub-projects as well, not to mention ATLAS, for which downloads and uploads hardly work at all any more. Obviously, the whole I/O operation over there is on the verge of breakdown :-( ID: 33534 · Reply Quote

[AF>Amis des Lapins] Phil1966 Send message Joined: 23 Apr 10 Posts: 5 Credit: 1,452,381 RAC: 0	Message 33654 - Posted: 4 Jan 2018, 18:45:38 UTC Hello still a LOT of upload problems :( Best Phil1966 ID: 33654 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,948,092 RAC: 8,044	Message 33656 - Posted: 4 Jan 2018, 19:40:56 UTC - in response to Message 33654. Is anybody else suffering these upload problems? I haven't noticed it myself but I haven't been watching eagle-eyed -- I've other things to watch for, of course. :-0! ID: 33656 · Reply Quote

Toby Broom Volunteer moderator Send message Joined: 27 Sep 08 Posts: 958 Credit: 785,212,318 RAC: 111,278	Message 33657 - Posted: 4 Jan 2018, 19:47:40 UTC I still have 8 ATLAS and some SixTrack. I gave up running ATLAS, so it's probably lower than it could be. ID: 33657 · Reply Quote

AuxRx Send message Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0	Message 33658 - Posted: 4 Jan 2018, 20:16:55 UTC - in response to Message 33656. Last modified: 4 Jan 2018, 20:17:38 UTC Yes, the problem persists. Additionally the project has been hard to reach in the last 24 hours. I keep cancelling short tasks in the hope of alleviating the file fragment issue, but my reliability/prio has already taken a huge hit from buffering work over the holidays. EDIT: I dropped ATLAS as well. ID: 33658 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 305,490,006 RAC: 124,853	Message 33659 - Posted: 4 Jan 2018, 20:49:17 UTC - in response to Message 33656. Last modified: 4 Jan 2018, 21:12:05 UTC
SixTrack: Didn't get any for more than 24h now, although my hosts periodically request them.
ATLAS: Stopped it, as it's very hard to get a task and the upload/reporting also needs several attempts.
CMS: RTS queue is still empty :-(
LHCb: running, but most WUs are very short; this needs to be explained, but unfortunately there has been no comment from the project team for months (guess they need an Ivan clone).
Theory: running, nothing to complain about.
<edit>
Sorry, Theory also has something to complain about: "207 (0x000000CF) EXIT_NO_SUB_TASKS" as well as some:
Guest Log: [DEBUG] Testing connection to Condor server on port 9618
Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
Guest Log: [DEBUG] 1
Guest Log: [ERROR] Could not connect to Condor server on port 9618
Guest Log: [INFO] Shutting Down.
</edit>
ID: 33659 · Reply Quote
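The Guest Log excerpt above shows the VM's `nc` probe to vccondor01.cern.ch on port 9618 timing out. A rough way to repeat that check from the host is sketched below in Python. The function name `can_connect` and the 10-second default timeout are assumptions made for illustration; they are not taken from the project's own scripts.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout.

    This only tests reachability of the port; it does not speak the
    Condor protocol, so a True result does not prove the service is healthy.
    """
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


# Example call, using the host and port named in the Guest Log:
# can_connect("vccondor01.cern.ch", 9618)
```

A False result from several hosts at once would suggest a server-side or network problem rather than a local firewall issue, though it cannot tell the two apart on its own.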

Paul Sands Send message Joined: 28 Sep 04 Posts: 8 Credit: 21,256,126 RAC: 2,899	Message 33660 - Posted: 4 Jan 2018, 21:02:56 UTC - in response to Message 33656. I have 5 transfers that won't upload. 2 of these have gone past deadline. All 5 of them start with LHC_2015_LHC_2015_260........ ID: 33660 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 33661 - Posted: 4 Jan 2018, 21:17:25 UTC My uploads and downloads have been fine on LHCb, CMS and Theory for several days. But I set up a separate machine (without VirtualBox) to handle native ATLAS and sixtrack, and all I get are "no new tasks available" (Ubuntu 16.04). ID: 33661 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 812 Credit: 66,261,331 RAC: 24,414	Message 33663 - Posted: 4 Jan 2018, 23:37:07 UTC - in response to Message 33656. Yes, there are a few sixtrack uploads that have been trying to upload for more than a week now. They have gone past their deadline, but the WUs are still visible on the server and haven't reached their quorum. Only about half of the finished tasks can be uploaded in one go; the others need multiple retries. With ATLAS the situation is about the same. None of the pending uploads are over their deadlines, but considering how many you can crunch during one day, there are more pending uploads. They usually upload in a couple of days. The downloads are also having trouble, but currently none are pending; only one task is waiting to be crunched. I have managed to download about 7 ATLAS tasks per day for two hosts. At the moment I don't do any other subprojects for LHC. ID: 33663 · Reply Quote

Paul Sands Send message Joined: 28 Sep 04 Posts: 8 Credit: 21,256,126 RAC: 2,899	Message 33674 - Posted: 5 Jan 2018, 15:17:56 UTC - in response to Message 33656. All of mine are failing with this error:
LHC@home 1/5/2018 9:47:26 AM [error] Error reported by file upload server: [LHC_2015_LHC_2015_260_BOINC_errors__57__s__62.31_60.32__4.3_4.4__5__1.5_1_sixvf_boinc199598_1_r1950326673_0] locked by file_upload_handler PID=-1
ID: 33674 · Reply Quote

Alessio Mereghetti Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0	Message 33680 - Posted: 6 Jan 2018, 9:33:20 UTC - in response to Message 33674. Last modified: 6 Jan 2018, 9:37:52 UTC I experience similar issues with sixtrack tasks:
1/6/2018 10:04:08 AM | | Project communication failed: attempting access to reference site
1/6/2018 10:04:08 AM | LHC@home | Temporarily failed download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__4_6__5__15_1_sixvf_boinc1784.zip: transient HTTP error
1/6/2018 10:04:08 AM | LHC@home | Backing off 00:04:49 on download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__4_6__5__15_1_sixvf_boinc1784.zip
1/6/2018 10:04:11 AM | | Internet access OK - project servers may be temporarily down.
1/6/2018 10:06:17 AM | LHC@home | Temporarily failed download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__0_2__5__75_1_sixvf_boinc1770.zip: transient HTTP error
1/6/2018 10:06:17 AM | LHC@home | Backing off 00:04:50 on download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__0_2__5__75_1_sixvf_boinc1770.zip
1/6/2018 10:06:18 AM | | Project communication failed: attempting access to reference site
1/6/2018 10:06:20 AM | | Internet access OK - project servers may be temporarily down.
As the problem is on the BOINC server side, I guess that everything will be fixed once our IT experts are back from holidays (please see https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4541&postid=33453) - I think that, as far as downloading new tasks is concerned, one thing that we can do as crunchers is to increase the amount of stored work. ID: 33680 · Reply Quote
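For anyone who wants to tally these failures in their own event log, here is a minimal Python sketch. It assumes each log line has the shape shown in the post above (timestamp, project name, message, separated by pipes) and also accepts a backslash before each pipe, in case a paste escaped them. The names `parse_line` and `count_failed_downloads` are hypothetical helpers, not part of BOINC.

```python
import re
from collections import Counter

# Timestamp, then project and message separated by pipes (optionally "\|").
LINE_RE = re.compile(
    r"^(?P<stamp>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\s*"
    r"\\?\|\s*(?P<project>[^|\\]*?)\s*\\?\|\s*(?P<message>.*)$"
)

# Three lines copied from the post above.
SAMPLE_LOG = [
    "1/6/2018 10:04:08 AM | | Project communication failed: attempting access to reference site",
    "1/6/2018 10:04:08 AM | LHC@home | Temporarily failed download of "
    "workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4"
    "__21__s__62.31_60.32__4_6__5__15_1_sixvf_boinc1784.zip: transient HTTP error",
    "1/6/2018 10:04:11 AM | | Internet access OK - project servers may be temporarily down.",
]


def parse_line(line):
    """Split one log line into (timestamp, project, message), or return None."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("stamp"), match.group("project"), match.group("message")


def count_failed_downloads(lines):
    """Count 'Temporarily failed download' messages per project name."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # Skip lines that do not match the assumed layout.
        _, project, message = parsed
        if message.startswith("Temporarily failed download of"):
            counts[project] += 1
    return counts
```

Calling `count_failed_downloads(SAMPLE_LOG)` counts one failed download for LHC@home; lines with an empty project field are parsed but not counted.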

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 812 Credit: 66,261,331 RAC: 24,414	Message 33682 - Posted: 6 Jan 2018, 9:53:45 UTC - in response to Message 33680. Yep, there is not much we can do but keep on crunching what we can. The problem with downloads seems to be intermittent. The eventual problem will come when the number of stuck uploads reaches 2 x the number of CPU cores; then BOINC will stop requesting new tasks for that project. ID: 33682 · Reply Quote
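The rule of thumb in the post above can be written out as a small check. This is only a sketch of the rule as described there: the function name, the `factor` parameter and the use of "greater than or equal" are assumptions, and the actual BOINC client may apply the comparison differently.

```python
def upload_backlog_blocks_work_fetch(stuck_uploads: int, cpu_cores: int, factor: int = 2) -> bool:
    """Return True once stuck uploads reach factor x CPU cores.

    Mirrors the rule as stated in the post; the real client logic is not
    reproduced here.
    """
    return stuck_uploads >= factor * cpu_cores
```

On a 4-core host, for example, this sketch puts the threshold at 8 stuck uploads, so `upload_backlog_blocks_work_fetch(8, 4)` is true while `upload_backlog_blocks_work_fetch(7, 4)` is not.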