Message boards : News : Increased file server capacity
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 247
Credit: 5,974,599
RAC: 0
Message 33339 - Posted: 14 Dec 2017, 10:05:32 UTC

Since Tuesday evening, we have had intermittent upload failures. A large number of new hosts running BOINC joined, coincidentally, at the same time as larger ATLAS tasks were introduced. Our file server capacity has been increased, and backlogged tasks waiting for upload should go through again soon. (Please refer to the ATLAS application and Number crunching forums for more details.)
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 247
Credit: 5,974,599
RAC: 0
Message 33383 - Posted: 15 Dec 2017, 10:10:05 UTC - in response to Message 33339.  

We will temporarily stop all our file servers to allow a maintenance operation. All uploads will fail for a while, with a different message.
Erich56

Joined: 18 Dec 15
Posts: 1786
Credit: 117,406,250
RAC: 75,160
Message 33518 - Posted: 26 Dec 2017, 14:29:48 UTC

It seems the situation with failing downloads and uploads has become rather critical by now.

Regardless of which LHC sub-project I'd like to run on my 3 PCs - nothing works at all, or only works after numerous retries over hours :-(

It seems that some thorough clean-up should be made at CERN, once people are back after the holidays.
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 33520 - Posted: 26 Dec 2017, 14:38:44 UTC - in response to Message 33518.  

Once I excluded ATLAS, I have had no problems with uploads or downloads on CMS, LHCb or Theory. The only issue seems to be the usual failures on CMS from lack of work, but that is not much of a problem.
I await notice of the new hardware upgrade, hopefully next month.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 29 Aug 05
Posts: 1056
Credit: 7,673,225
RAC: 6,739
Message 33521 - Posted: 26 Dec 2017, 15:08:06 UTC - in response to Message 33520.  

Sorry about the current failures with CMS jobs. We seem to be a bit short-handed over Christmas/New Year. :-(
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 33523 - Posted: 26 Dec 2017, 16:07:54 UTC - in response to Message 33521.  

CMS works often enough (in terms of time) to keep me busy. And best wishes on your trip to Hobart if you go that route.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 29 Aug 05
Posts: 1056
Credit: 7,673,225
RAC: 6,739
Message 33530 - Posted: 27 Dec 2017, 15:39:40 UTC - in response to Message 33523.  

Cheers, Jim; we seem to have CMS jobs in the pipeline again. Not sure about going to Hobart yet, Tom's funeral is set for January 2nd now.
Erich56

Joined: 18 Dec 15
Posts: 1786
Credit: 117,406,250
RAC: 75,160
Message 33532 - Posted: 27 Dec 2017, 15:56:41 UTC - in response to Message 33530.  

> we seem to have CMS jobs in the pipeline again.

Thanks for the good news, Ivan. However, when trying to download, it says "no tasks available for CMS simulation" :-(
Although on the Project Status Page, the usual 200 "unsent" tasks are shown.
Erich56

Joined: 18 Dec 15
Posts: 1786
Credit: 117,406,250
RAC: 75,160
Message 33534 - Posted: 27 Dec 2017, 16:58:08 UTC - in response to Message 33532.  

> when trying to download, it says "no tasks available for CMS simulation" :-(
> Although on the Project Status Page, the usual 200 "unsent" tasks are shown.

Finally, after almost endless retries, CMS tasks came in.
I've experienced the same with all other sub-projects as well, not to mention ATLAS, for which downloads and uploads hardly work at all any more.

Obviously, the whole I/O operation over there is on the verge of breakdown :-(
[AF>Amis des Lapins] Phil1966

Joined: 23 Apr 10
Posts: 5
Credit: 1,349,240
RAC: 0
Message 33654 - Posted: 4 Jan 2018, 18:45:38 UTC

Hello
still a LOT of upload problems :(
Best
Phil1966
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 29 Aug 05
Posts: 1056
Credit: 7,673,225
RAC: 6,739
Message 33656 - Posted: 4 Jan 2018, 19:40:56 UTC - in response to Message 33654.  

Is anybody else suffering these upload problems? I haven't noticed it myself but I haven't been watching eagle-eyed -- I've other things to watch for, of course. :-0!
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 831
Credit: 688,596,166
RAC: 140,083
Message 33657 - Posted: 4 Jan 2018, 19:47:40 UTC

I still have 8 ATLAS and some SixTrack uploads pending. I gave up running ATLAS, so the number is probably lower than it could be.
AuxRx

Joined: 16 Sep 17
Posts: 100
Credit: 1,618,469
RAC: 0
Message 33658 - Posted: 4 Jan 2018, 20:16:55 UTC - in response to Message 33656.  
Last modified: 4 Jan 2018, 20:17:38 UTC

Yes, the problem persists. Additionally the project has been hard to reach in the last 24 hours. I keep cancelling short tasks in the hope of alleviating the file fragment issue, but my reliability/prio has already taken a huge hit from buffering work over the holidays.

EDIT: I dropped ATLAS as well.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2520
Credit: 252,117,883
RAC: 132,344
Message 33659 - Posted: 4 Jan 2018, 20:49:17 UTC - in response to Message 33656.  
Last modified: 4 Jan 2018, 21:12:05 UTC

SixTrack: Didn't get any for more than 24h now although my hosts periodically request them
ATLAS: Stopped it as it's very hard to get a task and the upload/reporting also needs several attempts.
CMS: RTS queue is still empty :-(
LHCb: running but most WUs are very short; this needs to be explained but unfortunately no comment from the project team for months (guess they need an Ivan clone)
Theory: running, nothing to complain about


<edit>

Sorry, Theory also has something to complain about:
"207 (0x000000CF) EXIT_NO_SUB_TASKS"
as well as some:
Guest Log: [DEBUG] Testing connection to Condor server on port 9618
Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
Guest Log: [DEBUG] 1
Guest Log: [ERROR] Could not connect to Condor server on port 9618
Guest Log: [INFO] Shutting Down.

</edit>
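
The guest log above shows the VM testing a TCP connection to the Condor server with nc and timing out. A minimal sketch of the same kind of check from the host side, in Python, might look like the following. The function name can_connect and the 10-second timeout are illustrative assumptions; only the host name and port are taken from the log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Timeouts, refused connections and DNS failures are all OSError subclasses.
        return False
```

Calling can_connect("vccondor01.cern.ch", 9618) from the affected machine would indicate whether the timeout also occurs outside the VM. A False result does not say which side is at fault, only that the port could not be reached in time.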
Paul Sands

Joined: 28 Sep 04
Posts: 6
Credit: 15,190,588
RAC: 3,468
Message 33660 - Posted: 4 Jan 2018, 21:02:56 UTC - in response to Message 33656.  

I have 5 transfers that won't upload. 2 of these have gone past deadline. All 5 of them start with LHC_2015_LHC_2015_260........
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 33661 - Posted: 4 Jan 2018, 21:17:25 UTC

My uploads and downloads have been fine on LHCb, CMS and Theory for several days. But I set up a separate machine (without VirtualBox) to handle native ATLAS and sixtrack, and all I get are "no new tasks available" (Ubuntu 16.04).
Harri Liljeroos

Joined: 28 Sep 04
Posts: 722
Credit: 48,377,245
RAC: 30,138
Message 33663 - Posted: 4 Jan 2018, 23:37:07 UTC - in response to Message 33656.  

Yes, there are a few sixtrack uploads that have been uploading for more than a week now. They have gone past their deadline, but the WUs are still visible on the server and haven't reached their quorum. Only maybe half of the finished tasks can be uploaded in one go; the others need multiple retries.

With Atlas the situation is about the same. None of the pending uploads are over their deadlines, but considering how many you can crunch during one day, there are more pending uploads. They usually upload in a couple of days. The downloads are also having trouble, but currently none are pending; only one is waiting to be crunched. I have managed to download about 7 Atlas tasks per day for two hosts.

At the moment I don't do any other subprojects for LHC.
Paul Sands

Joined: 28 Sep 04
Posts: 6
Credit: 15,190,588
RAC: 3,468
Message 33674 - Posted: 5 Jan 2018, 15:17:56 UTC - in response to Message 33656.  

All of mine are failing with this error:
LHC@home 1/5/2018 9:47:26 AM [error] Error reported by file upload server: [LHC_2015_LHC_2015_260_BOINC_errors__57__s__62.31_60.32__4.3_4.4__5__1.5_1_sixvf_boinc199598_1_r1950326673_0] locked by file_upload_handler PID=-1
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 33680 - Posted: 6 Jan 2018, 9:33:20 UTC - in response to Message 33674.  
Last modified: 6 Jan 2018, 9:37:52 UTC

I experience similar issues with sixtrack tasks:

1/6/2018 10:04:08 AM |  | Project communication failed: attempting access to reference site
1/6/2018 10:04:08 AM | LHC@home | Temporarily failed download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__4_6__5__15_1_sixvf_boinc1784.zip: transient HTTP error
1/6/2018 10:04:08 AM | LHC@home | Backing off 00:04:49 on download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__4_6__5__15_1_sixvf_boinc1784.zip
1/6/2018 10:04:11 AM |  | Internet access OK - project servers may be temporarily down.
1/6/2018 10:06:17 AM | LHC@home | Temporarily failed download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__0_2__5__75_1_sixvf_boinc1770.zip: transient HTTP error
1/6/2018 10:06:17 AM | LHC@home | Backing off 00:04:50 on download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__0_2__5__75_1_sixvf_boinc1770.zip
1/6/2018 10:06:18 AM |  | Project communication failed: attempting access to reference site
1/6/2018 10:06:20 AM |  | Internet access OK - project servers may be temporarily down.


As the problem is on the BOINC server side, I guess that everything will be fixed once our IT experts are back from holidays (please see https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4541&postid=33453). As far as downloading new tasks is concerned, one thing that we can do as crunchers is to increase the amount of stored work.
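
For anyone who prefers a file to the preferences dialog, here is a rough sketch of generating a local override that raises the stored-work buffer. It assumes the client reads a global_prefs_override.xml file from its data directory and that the element names work_buf_min_days and work_buf_additional_days are the relevant ones; please check both against the BOINC client documentation. The values 1.0 and 0.5 days are purely illustrative.

```python
import xml.etree.ElementTree as ET


def build_work_buffer_override(min_days: float, additional_days: float) -> str:
    """Return XML text for a preferences override that sets the work buffer.

    Element names are assumptions based on the BOINC preferences format.
    """
    root = ET.Element("global_preferences")
    # "Store at least N days of work"
    ET.SubElement(root, "work_buf_min_days").text = f"{min_days:.2f}"
    # "Store up to an additional N days of work"
    ET.SubElement(root, "work_buf_additional_days").text = f"{additional_days:.2f}"
    return ET.tostring(root, encoding="unicode")


# Illustrative values only; print the XML rather than writing it anywhere.
print(build_work_buffer_override(min_days=1.0, additional_days=0.5))
```

The output would be saved as global_prefs_override.xml and the client told to re-read its local preferences. A larger buffer is a trade-off: as reported earlier in this thread, buffering work over the holidays already cost one host some reliability/priority.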
Harri Liljeroos

Joined: 28 Sep 04
Posts: 722
Credit: 48,377,245
RAC: 30,138
Message 33682 - Posted: 6 Jan 2018, 9:53:45 UTC - in response to Message 33680.  

Yep, there is not much we can do but keep on crunching what we can. The problem with downloads seems to be intermittent. An eventual problem will come when the number of stuck uploads reaches 2 x the number of CPU cores; then BOINC will stop requesting new tasks for that project.
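
To put numbers on the 2 x CPU cores rule described above, here is a small sketch. The function names are made up for illustration, and whether the real client compares with "reaches" (>=) or "exceeds" (>) is an assumption; the code follows the wording in this post.

```python
def uploads_block_work_fetch(stuck_uploads: int, cpu_cores: int) -> bool:
    """True once stuck uploads reach 2 x CPU cores (the limit as described in this post)."""
    return stuck_uploads >= 2 * cpu_cores


def uploads_until_blocked(stuck_uploads: int, cpu_cores: int) -> int:
    """How many more stuck uploads a host can accumulate before hitting the limit."""
    return max(0, 2 * cpu_cores - stuck_uploads)


# Example: a 4-core host with 5 stuck uploads has room for 3 more.
print(uploads_block_work_fetch(5, 4), uploads_until_blocked(5, 4))
```

On a 4-core host the limit would be 8 stuck uploads, so hosts with few cores would run into it first.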