
Grid Help Desk: Process, Procedure and Practices

 

 

Divya M G ([email protected])


Henrysukumar ([email protected])

Santhosh J ([email protected])

 

Abstract

To improve the quality of user support, the Grid Help Desk provides a
single-point gateway for GARUDA grid users. All operational, usage and
computation-related issues raised by scientists, academicians, system
administrators and network service providers are managed and addressed
through it.

 

The help desk has a unified portal to receive queries from users. The focus is
on providing a suitable and efficient support structure. Users can report to
the help desk by telephone, email or the web portal. All reported issues are
converted into tickets in a web portal named Request Tracker.

 

Maintaining grid availability and reliability is highly challenging in a
federated environment. Suitable tools are deployed to mitigate this, but it is
hard to find a tool that works exactly as needed. Hence, automated scripts
developed in-house are also deployed. We have found that a proper monitoring
methodology, a highly transparent and interactive work culture, and adherence
to quality standards give good results in effective user support. Another key
practice is collecting annual user feedback, which helps to restructure the
Grid Help Desk guidelines and processes. We are proud to say that these
practices have successfully transformed ordinary user support into
extraordinary user support.

 

Help desk operations adhere to the ISO 9001:2015 quality management system
standard. This paper covers the Grid Help Desk operational process,
procedures, practices, experiences, issues and challenges.

 

Key words

GARUDA, Grid computing, Help desk, ISO, Quality management system, Request Tracker

 

1.   Introduction

The demand for High Performance Computing (HPC) support has grown and continues
to grow. It is essential for users such as scientists, researchers and
academicians who need to solve their scientific problems efficiently. HPC may
take the form of cluster, grid, cloud or virtual systems. Apt user support is
essential to enable user applications successfully and quickly. The support
team should have good knowledge of the domain and of the technologies and tools
associated with the HPC resources. Resources include servers, network,
bandwidth, tools, software and libraries. At the GARUDA grid, these HPC support
activities are delivered through the Grid Help Desk (GHD). It is a single web
interface, provided by a web-based portal, that gives users a central point of
access to information on all HPC functions.

 

We consider GHD to have three important objectives. The first is resource
readiness, so that applications can be enabled with all necessary software and
libraries. The second is hand-holding users as they submit jobs on the
resources. The third is attending to the issues reported by users.

 

This paper describes the GHD activities. The HPC support group at the national
grid computing initiative GARUDA [1] functions successfully with well-designed
processes, procedures and practices. The GHD team adheres to International
Organization for Standardization (ISO) standards [2].

 

The rest of the paper is organized as follows. Section 2 presents related work.
Section 3 describes the GARUDA grid architecture. Section 4 discusses Grid Help
Desk activities in detail. Section 5 explains ISO and its impact on the help
desk. Sections 6, 7 and 8 present and discuss results, challenges and lessons
learnt, respectively. Finally, Section 9 concludes the paper.

 

 

 

 

 

 

 

 

2.   Related work

In [3], the authors present the unified customer support activities of the HPC
OneStop team at Sandia National Laboratories (SNL). The paper explains how HPC
OneStop provided a “one stop shop” for users by creating a unified portal for
information access, integrating a single ticketing tool to help collaboration
among the various support groups, and establishing a tiered HPC support
structure.

 

In [4], the authors discuss some aspects of the user support needed for
service-oriented production grids. The key aspects presented in that paper are
legacy code support, dynamic user management, automatic deployment and
monitoring.

 

In [5], the authors discuss grid operations in detail and also explain the
ticketing system.

 

This paper presents the Grid Help Desk process, procedures and practices.

 

3.   GARUDA

GARUDA is India's first national grid initiative, bringing together academic,
scientific and research communities to develop their data- and
compute-intensive applications with guaranteed quality of service. It is a
collection of distributed computational resources comprising computational
nodes, mass storage, and satellite and scientific instruments spread across
India [6]. The entire GARUDA software architecture is shown in Figure 1. The
Grid Help Desk is a component of the “Management and monitoring tools” layer.

 

Figure 1: GARUDA Software Stack

 

4.   Grid Help Desk (GHD)

 

4.1   GHD operations structure

 

GHD has a layered support structure in order to increase the efficiency of user
support. Figure 2 depicts the help desk ecosystem and operations structure. The
teams T1 to T8 have predefined functions. T1 is the GARUDA Grid Operation and
Administration (GGOA) team, responsible for all aspects of grid operations. T2
is the grid research and development team. T3 is the Certification Authority,
responsible for issuing grid host and user certificates [7]. It is called the
Indian Grid Certification Authority (IGCA) and provides X.509 certificates to
support a secure grid computing environment. It is an accredited member of the
APGridPMA (Asia Pacific Grid Policy Management Authority). T4 is the networking
team, which resolves any network-related issues. The GARUDA network is provided
by the National Knowledge Network (NKN). With its multi-gigabit capability, NKN
connects all universities, research institutions, libraries, laboratories,
healthcare and agricultural institutions across the country [8]. T5 consists of
grid support teams from other geographical parts of India; they are part of
GGOA and help to speed up grid user support. T6 is the grid security team,
responsible for making security policies, conducting security audits, analysing
security attacks and so on. T7 is the management board, which in this context
is the project Chief Investigator; its major role is to resolve unresolved
issues. T8 comprises the grid and system administrators, who are responsible
for providing resources for operational activities.

Figure 2: GHD Ecosystem

 

4.2   GHD operations process

 

Users can report to GHD by telephone, email or the web portal. All reported
issues are converted into tickets in a web portal named Request Tracker, which
is described further in Section 4.5. Within 24 hours, the user receives a reply
with the name of the ticket owner and the ticket status. Once the issue is
resolved, the user is notified and receives updates. We always encourage users
to report through the web portal.

 

The GARUDA support teams are grid operations, developer, certification
authority, networking, security, remote support, management and grid
administrator. These support teams are decentralized, so the load is
distributed among them. To bring high quality to user support, all activities
are prioritized. Every support team (T1 to T8) has defined expectations and
requirements, and the relative importance of each team is presented in Figure
2. All reported tickets reach GHD, which is directly operated by GARUDA Grid
Operations and Administration (T1). Any member of this team can take ownership
of a ticket and is often responsible for resolving the issue. The next layer
consists of support teams such as developer, certification authority,
networking, remote support and security. Team T1 escalates unresolved tickets
to these layer 2 teams; depending on the nature of the ticket, it is escalated
to the appropriate team among T2 to T6. If a ticket is not resolved within 7
days, it is reviewed with management (T7) and its status is changed from OPEN
to STALLED. Either further direction is given to the ticket owner to address
the reported issue, or the ticket is re-escalated.
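
The escalation path described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ticket` class, the category keys and the function names are hypothetical and are not taken from the GHD portal; only the team labels (T1 to T7), the OPEN and STALLED statuses and the 7-day review threshold come from the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Layer 2 teams as named in the text (T2 to T6). The category keys are
# illustrative assumptions, not labels used by the GHD portal.
LAYER2_TEAMS = {
    "development": "T2",
    "certificate": "T3",
    "network": "T4",
    "remote_site": "T5",
    "security": "T6",
}

REVIEW_AFTER = timedelta(days=7)  # management review threshold from the text


@dataclass
class Ticket:
    subject: str
    category: str
    opened_at: datetime
    status: str = "OPEN"
    owner_team: str = "T1"  # every ticket first reaches GGOA (T1)
    history: list = field(default_factory=list)


def escalate(ticket: Ticket) -> str:
    """Hand an unresolved ticket from T1 to the matching layer 2 team."""
    # Unknown categories stay with T1 (an assumption of this sketch).
    team = LAYER2_TEAMS.get(ticket.category, "T1")
    ticket.history.append((ticket.owner_team, team))
    ticket.owner_team = team
    return team


def review_if_overdue(ticket: Ticket, now: datetime) -> bool:
    """Mark an OPEN ticket as STALLED once it is 7 days old.

    The owner team is left unchanged; the history entry only records that
    the ticket was taken to management (T7) for review.
    """
    if ticket.status == "OPEN" and now - ticket.opened_at >= REVIEW_AFTER:
        ticket.status = "STALLED"
        ticket.history.append((ticket.owner_team, "T7"))
        return True
    return False
```

In this sketch a network ticket would be routed to T4 by `escalate`, and `review_if_overdue` would change its status to STALLED only when it is called at least seven days after the ticket was opened.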

 

Every Monday, a weekly review meeting is held by audio/video conference with
the other operation centres, which form the remote support team. This meeting
helps participants to share their expertise, discuss any pending issues with
the manager, seek suggestions and so on. The remote support teams work as one
team, and all use the same ticketing system to record and communicate
incidents. All resolved issues are preserved, and key tickets are recorded in a
“Trouble Shooting Guide” for future reference.

 

The initial status of every ticket is NEW. When a GGOA member (T1) opens a NEW
ticket, its status changes to OPEN. Every NEW ticket is brought to OPEN status
within 1 working day. When a ticket is resolved, a report is sent to the
requester, and quality records are maintained. The most positive aspect of GHD
is that users need not worry about where to go for help, even if they are
unable to diagnose their HPC problem themselves. More than 80% of requests are
reported through the web portal, around 5% through email and around 15% by
phone.
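
The ticket life cycle and the one-working-day target for opening a ticket can be illustrated with a small Python sketch. It rests on several assumptions that are not stated in the text: working days are taken to be Monday to Friday, "1 working day" is read as the same clock time on the next working day, and the RESOLVED status and the table of allowed moves are introduced only for illustration. All function names are hypothetical.

```python
from datetime import datetime, timedelta

# NEW, OPEN and STALLED are named in the text; RESOLVED and the exact set of
# allowed moves are assumptions made for this sketch.
ALLOWED = {
    "NEW": {"OPEN"},
    "OPEN": {"STALLED", "RESOLVED"},
    "STALLED": {"OPEN", "RESOLVED"},
    "RESOLVED": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot move a ticket from {current} to {new}")
    return new


def open_deadline(created: datetime) -> datetime:
    """Same clock time on the next working day (Monday to Friday assumed)."""
    deadline = created + timedelta(days=1)
    while deadline.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        deadline += timedelta(days=1)
    return deadline


def opened_on_time(created: datetime, opened: datetime) -> bool:
    """True if the ticket left NEW within one working day of creation."""
    return opened <= open_deadline(created)
```

Under this reading, a ticket created on a Friday afternoon would have until the same time on the following Monday to reach OPEN status.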

 

 

 

 

 

4.3   Statistics

The help desk handles 600 to 700 tickets annually. The average ticket handling
time is 8 hours. All ticket handling operations follow the ISO 9001:2015
Quality Management System. Around 5000 tickets were handled from 2010 to 2017.
A snapshot is shown in Figure 3.

 

Figure 3: Ticket Statistics
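
As a quick consistency check on these figures, the annual range and the cumulative total can be compared directly. The short Python sketch below assumes that "2010 to 2017" is inclusive (eight years) and that the annual range of 600 to 700 tickets applies to every one of those years; neither assumption is stated explicitly in the text.

```python
# Figures quoted in the text.
PER_YEAR_LOW, PER_YEAR_HIGH = 600, 700
REPORTED_TOTAL = 5000

# Assumption: 2010 to 2017 inclusive, i.e. eight calendar years.
years = range(2010, 2018)
n_years = len(years)

total_low = PER_YEAR_LOW * n_years     # lower bound on the cumulative total
total_high = PER_YEAR_HIGH * n_years   # upper bound on the cumulative total
implied_per_year = REPORTED_TOTAL / n_years

print(n_years, total_low, total_high, implied_per_year)
```

With these assumptions the cumulative total should fall between 4800 and 5600 tickets, and the reported figure of around 5000 corresponds to 625 tickets per year, which lies inside the stated annual range.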

 

 

4.5   Request Tracker

 

Request Tracker (RT) [9] is a leading enterprise-grade open source issue
tracking system. Its home page is shown in Figure 4. It is installed [10] and
configured on a Linux platform for bug tracking, help desk ticketing, customer
service and workflow processes. Every reported issue is identified as a
‘Ticket’. Tickets are classified as minor, major or regular, depending on the
type of incident reported. Incidents such as unscheduled jobs, service
failures, login service issues, Indian Grid Certification Authority queries and
document requests are examples of minor tickets. Tickets related to account
creation, job submission, software installation and similar tasks are major
tickets. Tickets related to installation, configuration, mounting and similar
tasks at a remote site are classified as regular tickets. To speed up user
support, various queues are created in the portal, as shown in Figure 5.
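
The classification rule above can be written as a simple lookup. The Python sketch below is an interpretation, not the rule set configured in the portal: it reads the first list of incident types in the paragraph as minor tickets, treats installation, configuration and mounting as regular tickets only when they concern a remote site, and uses hypothetical incident labels and a hypothetical `classify` function. It does not use the Request Tracker API.

```python
# Incident labels are illustrative; the groupings follow one reading of the text.
MINOR = {
    "unscheduled_job",
    "service_failure",
    "login_issue",
    "igca_query",
    "document_request",
}
MAJOR = {"account_creation", "job_submission", "software_installation"}
REMOTE_SITE_WORK = {"installation", "configuration", "mounting"}


def classify(incident: str, remote_site: bool = False) -> str:
    """Return 'minor', 'major', 'regular' or 'unclassified' for an incident."""
    # Work carried out at a remote site is checked first, so that it is not
    # confused with the similarly named local tasks.
    if remote_site and incident in REMOTE_SITE_WORK:
        return "regular"
    if incident in MAJOR:
        return "major"
    if incident in MINOR:
        return "minor"
    return "unclassified"  # anything the text does not cover
```

Returning "unclassified" for unknown incident types is a design choice of this sketch; it leaves such tickets for a help desk member to classify by hand.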